Introduction

Examining the Pittsburgh Metropolitan Statistical Area, this analysis uses historic patterns of development to predict areas that may be developed in the future. This model employs a national Land Cover dataset from 2011 and 2019 indicating development and variables such as highway distances, distance from city center, and distance from developed areas. The model is then utilized to predict where growth might occur and create an allocation procedure for urban growth based on population change, development demand, and supply of land in 2030.

Scenarios

After a model is built, there are two scenarios predicted upon in this analysis. The first scenario is demand-side, based on population projections for the Pittsburgh MSA in 2030. The population projections show where they may be demand for more development. The supply-side scenario is based on the incorporation of a hypothetical new edge city, called Michael’s Place. By adding in a new city center, we see if that changes the potential allocation of development.

Use Case

Predicting growth is tricky, but the efforts are especially important given the impacts of different types of settlements. As the U.S. population is projected to increase over the next 40 or so years, especially in established corridors, it is important to consider where exactly that development may occur. Although this projected development might be included within a ā€œmetroā€ area, a lot of the development can be characterized as suburban, which often maximizes greenfield development, takes up a lot of land, and is auto-oriented. By employing this type of model, one can not only see where land development might occur based on a variety of variables and find methods to incentivize or deincentivize different types of development. This model is a helpful tool for many, but especially urban planners, who are aiding in a lot of decisions about where infrastructure should go, approving development projects, and advocating for incentives for certain types of growth.

Set up

The following chunks of code include necessary setup information like packages and map themes. There are six hard-coded functions in this analysis, descriptions to follow.

options(scipen=999)

library(tidyverse)
library(sf)
library(raster)
library(knitr)
library(kableExtra)
library(tidycensus)
library(tigris)
library(FNN)
#library(QuantPsyc) # JE Note: in R 4.1, QuantPsyc package not available.
library(caret)
library(yardstick)
library(pscl)
library(plotROC) 
library(ggrepel)
library(pROC)
library(grid)
library(gridExtra)
library(viridis)
library(igraph)
library(stargazer)
plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette2 <- c("#A0C1B9","#70A0AF")
palette4 <- c("#A0C1B9","#70A0AF","#497583","#33525B")
palette5 <- c("#DAE7E4","#A0C1B9","#70A0AF","#497583","#33525B")
palette10 <- c("#f7fcf0","#e0f3db","#ccebc5","#a8ddb5","#7bccc4",
               "#4eb3d3","#2b8cbe","#0868ac","#084081","#f7fcf0")
#this function converts a column in to quintiles. It is used for mapping.
quintileBreaks <- function(df,variable) {
    as.character(quantile(df[[variable]],
                          c(.01,.2,.4,.6,.8),na.rm=T))
}

#This function can be used to convert a polygon sf to centroids xy coords.
xyC <- function(aPolygonSF) {
  as.data.frame(
    cbind(x=st_coordinates(st_centroid(aPolygonSF))[,1],
          y=st_coordinates(st_centroid(aPolygonSF))[,2]))
} 

#this function convert a raster to a data frame so it can be plotted in ggplot
rast <- function(inRaster) {
  data.frame(
    xyFromCell(inRaster, 1:ncell(inRaster)), 
    value = getValues(inRaster)) }

#this function aggregates the rasters
aggregateRaster <- function(inputRasterList, theFishnet) {
  #create an empty fishnet with the same dimensions as the input fishnet
  theseFishnets <- theFishnet %>% dplyr::select()
  #for each raster in the raster list
  for (i in inputRasterList) {
  #create a variable name corresponding to the ith raster
  varName <- names(i)
  #convert raster to points as an sf
    thesePoints <-
      rasterToPoints(i) %>%
      as.data.frame() %>%
      st_as_sf(coords = c("x", "y"), crs = st_crs(theFishnet)) %>%
      filter(.[[1]] == 1)
  #aggregate to the fishnet
    thisFishnet <-
      aggregate(thesePoints, theFishnet, length) %>%
      mutate(!!varName := ifelse(is.na(.[[1]]),0,1))
  #add to the larger fishnet
    theseFishnets <- cbind(theseFishnets,thisFishnet)
  }
  #output all aggregates as one large fishnet
   return(theseFishnets)
}

# this function creates spatial lag of development using k-nearest neighbor analysis
nn_function <- function(measureFrom,measureTo,k) {
  #convert the sf layers to matrices
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

# this function creates spatial cross-validation, a method of generalizability where you remove a neighborhood and evaluate against it
spatialCV <- function(dataFrame, uniqueID, dependentVariable, modelName) {

#initialize a data frame 
endList <- list()

#create a list that is all the spatial group unqiue ids in the data frame (ie counties)    
  uniqueID_List <- unique(dataFrame[[uniqueID]])  
  x <- 1
  y <- length(uniqueID_List)
  
#create a counter and while it is less than the number of counties...  
  while(x <= y) 
  {
#call a current county    
    currentUniqueID <- uniqueID_List[x]
#create a training set comprised of units not in that county and a test set of units
#that are that county
    training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
    testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
#create seperate xy vectors
    trainingX <- training[ , -which(names(training) %in% c(dependentVariable))]
    testingX <- testing[ , -which(names(testing) %in% c(dependentVariable))]
    
    trainY <- training[[dependentVariable]]
    testY <- testing[[dependentVariable]]
#Calculate predictions on the test county as part of a data frame including the observed
#outcome and the unique county ID    
   thisPrediction <- 
     data.frame(class = testY,
                probs = predict(modelName, testingX, type="response"),
                county = currentUniqueID) 

#Row bind the predictions to a data farme
   endList <- rbind(endList, thisPrediction)
#iterate counter    
    x <- x + 1 
  } 
#return the final list of counties and associated predictions  
  return (as.data.frame(endList))
}

Data Wrangling

The data wrangling and feature engineering for this analysis is extensive. Data from the National Land Cover Change database, U.S. Census Bureau, and the Western PA Regional Data Center were pulled.

Land Cover Data

Land Cover Data is pulled from USGS National Land Cover Change database (https://www.mrlc.gov/data/nlcd-land-cover-change-index-conus). Land Cover was pulled from 2011 and 2019, to see the 8 year change and to align better with census data analysis (more on that below). Before pulling the raster into R, the raster was clipped to the Pittsburgh MSA in ArcGIs Pro–this was a method to save time as the National Land Cover Change raster is possible too big for R. Beyond that, the rest of the workflow and methodology is included in this markdown. As seen in this below chunk, the Land Cover data was also downsampled. This made the raster cells bigger and we downsampled the data as it not only enables the code to run better, but if not downsampled, this particular model cannot perform.

# Land cover data

setwd("C:/Users/ammar/OneDrive/Desktop/CPLN675/Week1415")


PASEproj <- "+proj=lcc +lat_0=39.3333333333333 +lon_0=-77.75 +lat_1=40.9666666666667 +lat_2=39.9333333333333 +x_0=600000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs"

# THERE ARE WARNINGS ABOUT PROJECTIONS...so then reproject just in case

#lc2011 <- raster("landcovpb/landcovpb11_2.tif")
#lc2011 <- projectRaster(lc2011, crs = crs(PASEproj))

#lc2019 <- raster("landcovpb/landcovpb19_2.tif") 
#lc2019 <- projectRaster(lc2019, crs = crs(PASEproj)) 


# option to downsize the raster
#lc2011 <- aggregate(lc2011, fact = 2, fun = modal)
#lc2019 <- aggregate(lc2019, fact = 2, fun = modal)

#writeRaster(lc2011, filename=file.path("C:/Users/ammar/OneDrive/Desktop/CPLN675/Week1415", "lc11mdds.tif"), format="GTiff", overwrite=TRUE)

#writeRaster(lc2019, filename=file.path("C:/Users/ammar/OneDrive/Desktop/CPLN675/Week1415", "lc19mdds.tif"), format="GTiff", overwrite=TRUE)

lc2011 <- raster("lc11mdds.tif") 
lc2011 <- projectRaster(lc2011, crs = crs(PASEproj))
lc2019 <- raster("lc19mdds.tif") 
lc2019 <- projectRaster(lc2019, crs = crs(PASEproj)) 

The metropolitan-Statistical Area (MSA/metro area) is also pulled for clipping the land cover data. The Pittsburgh MSA is made up of 5 counties, including Allegheny County which is where the City of Pittsburgh is located. The following three chunks pertain to this process.

# MSA data
acs_vars <- c("B02001_001E")
census_api_key("05b9c101eb2ee7dc7abb88140da527ce637ac07f", overwrite = TRUE)
PBMSAPop11 <- get_acs(geography = "tract", 
                        variables = acs_vars, 
                        year = 2011,
                        state = 42, 
                        geometry = TRUE, 
                        output = "wide",
                        county=c("Allegheny", "Armstrong", "Beaver","Butler","Fayette","Washington",
                         "Westmoreland")) %>%
                rename(pop2019 = B02001_001E) %>%
                dplyr::select(-starts_with("B")) %>%
  st_transform(PASEproj)

PBMSAPop19 <- get_acs(geography = "tract", 
                        variables = acs_vars, 
                        year = 2019,
                        state = 42, 
                        geometry = TRUE, 
                        output = "wide",
                        county=c("Allegheny", "Armstrong", "Beaver","Butler","Fayette","Washington",
                         "Westmoreland")) %>%
                rename(pop2019 = B02001_001E) %>%
                dplyr::select(-starts_with("B")) %>% 
  st_transform(PASEproj)
lc_PBMSA_2011_crop <- crop(lc2011, extent(PBMSAPop11))
lc_PBMSA_2019_crop <- crop(lc2019, extent(PBMSAPop19)) 
PBMSA_lc2011 <- raster::mask(lc_PBMSA_2011_crop, PBMSAPop11)
PBMSA_lc2019 <- raster::mask(lc_PBMSA_2019_crop, PBMSAPop19)

Calculating Land Cover Change

To figure out what areas have changed over the 8-year time frame, we have to reclassify the data. The reclassify matrix uses the categorized land cover and essentially assigns developed areas a 1 value and all other values a 0. Using the development change equation, output values that are 1 have changed over time. Thus, development change becomes the dependent variable.

reclassMatrix <- 
  matrix(c(
    0,12,0,
    13,24,1,
    25,Inf,0),
  ncol=3, byrow=T)

developed_2011 <- 
  reclassify(PBMSA_lc2011,reclassMatrix)  

developed_2019 <- 
  reclassify(PBMSA_lc2019,reclassMatrix)

development_change <- developed_2011+developed_2019

#hist(development_change)

development_change[development_change != 1] <- NA

This map shows the results of the above reclassification by displaying raster cells that have changed over the 8 year period and are now developed.

ggplot() +
  geom_sf(data=PBMSAPop11, colour=NA) +
  geom_raster(data=rast(development_change) %>% na.omit, 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
  labs(title="2011-2019 Development Land Use Change", caption ="Figure 1.1") +
  mapTheme

Fishnet Creation

In order to conduct future analysis, we also build a fishnet (or a grid of cells) that cover the Pittsburgh MSA. The fishnet is the basis of this further analysis as it allows us to assign values of various variables in the cells.

PBMSA_fishnet <- 
  st_make_grid(PBMSAPop11, 4000) %>%
  st_sf()

PBMSA_fishnet <-
  PBMSA_fishnet[PBMSAPop11,]

ggplot() +
  geom_sf(data=PBMSA_fishnet) +
  labs(title="Fishnet, 4000 Foot Resolution", caption="Figure 1.2") +
  mapTheme

Once the fishnet is created, we join the changed areas (now developed areas) to the fishnet. Including this dependent variable will allow us to build, train, test, and predict a model.

changePoints <-
  rasterToPoints(development_change) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(PBMSA_fishnet))

fishnet <- 
  aggregate(changePoints, PBMSA_fishnet, sum) %>%
  mutate(layer = ifelse(is.na(layer),0,1),
         layer = as.factor(layer)) #layer is basically lc_change in the rmd
 
ggplot() +
  geom_sf(data=PBMSAPop11) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=layer)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land Cover Development Change", subtitle = "As fishnet centroids") +
  mapTheme

Land Cover in 2011 + 2019

This is a plot showing land cover in the Pittsburgh MSA in 2011. Plot can run, but commented out for sake of time.

#ggplot() +
  #geom_sf(data=PBMSAPop11) +
  #geom_raster(data=rast(PBMSA_lc2011) %>% na.omit %>% filter(value > 0), 
              #aes(x,y,fill=as.factor(value))) +
  #scale_fill_viridis(discrete=TRUE, name ="") +
  #labs(title = "Pittsburgh MSA Land Cover, 2011") +
  #theme(legend.direction="horizontal") +
  #mapTheme

This is a plot showing land cover in the Pittsburgh MSA in 2019. Plot can run, but commented out for sake of time.

#ggplot() +
  #geom_sf(data=PBMSAPop19) +
  #geom_raster(data=rast(PBMSA_lc2019) %>% na.omit %>% filter(value > 0), 
              #aes(x,y,fill=as.factor(value))) +
  #scale_fill_viridis(discrete=TRUE, name ="") +
  #labs(title = "Pittsburgh MSA Land Cover, 2019") +
  #theme(legend.direction="horizontal") +
  #mapTheme

Assigning Land Cover Categories

To refine the analysis further and see which types of lands get developed, we also need to assign the traditional land classes to the land cover.

developed <- PBMSA_lc2011 == 21 | PBMSA_lc2011 == 22 | PBMSA_lc2011 == 23 | PBMSA_lc2011 == 24
forest <- PBMSA_lc2011 == 41 | PBMSA_lc2011 == 42 | PBMSA_lc2011 == 43 
farm <- PBMSA_lc2011 == 81 | PBMSA_lc2011 == 82 
wetlands <- PBMSA_lc2011 == 90 | PBMSA_lc2011 == 95 
otherUndeveloped <- PBMSA_lc2011 == 52 | PBMSA_lc2011 == 71 | PBMSA_lc2011 == 31 
water <- PBMSA_lc2011 == 11

names(developed) <- "developed"
names(forest) <- "forest"
names(farm) <- "farm"
names(wetlands) <- "wetlands"
names(otherUndeveloped) <- "otherUndeveloped"
names(water) <- "water"

This plot displays all of the land classes side by side. As seen from the plot, in 2011, most of the developed area is centered around the City of Pittsburgh–this information will later inform a variable we add for the model. The farmland map is pretty close to opposite that of the developed map, while forest land is covers a large portion of the MSA.

theRasterList <- c(developed,forest,farm,wetlands,otherUndeveloped,water)

aggregatedRasters <-
  aggregateRaster(theRasterList, PBMSA_fishnet) %>%
  dplyr::select(developed,forest,farm,wetlands,otherUndeveloped,water) %>%
  mutate_if(is.numeric,as.factor)

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%    #just to make sure no weird geometries slipped in
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
    geom_sf(data=PBMSAPop11) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("No","Yes"),
                        name = "") +
    labs(title = "Land Cover Types, 2011",
         subtitle = "As fishnet centroids", caption = "Figure 1.3") +
   mapTheme

Creating Other Features for the Model

The following section of code includes feature creation for the independent variables we will include in the model. The following variables are: total population, population growth, highway distance, city boundary distance, and spatial lag of development.

Census Data

To help determine development and development demand, we will include census population variables. The following chunk of code pulls in and the plot shows census total population data for 2011 and 2019.

PBMSAPop2011 <- get_acs(geography = "tract", 
                        variables = "B02001_001E", 
                        year = 2011,
                        state = 42, 
                        geometry = TRUE, 
                        county=c("Allegheny", "Armstrong", "Beaver","Butler","Fayette","Washington",
                         "Westmoreland")) %>%
  rename(pop_2011 = estimate) %>%
  st_transform(st_crs(PBMSAPop11))

PBMSAPop2019 <- get_acs(geography = "tract", 
                        variables = "B02001_001E", 
                        year = 2019,
                        state = 42, 
                        geometry = TRUE, 
                        county=c("Allegheny", "Armstrong", "Beaver","Butler","Fayette","Washington",
                         "Westmoreland")) %>%
  rename(pop_2019 = estimate) %>%
  st_transform(st_crs(PBMSAPop19)) %>%
  st_buffer(-1)

grid.arrange(
ggplot() +
  geom_sf(data = PBMSAPop2011, aes(fill=factor(ntile(pop_2011,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(PBMSAPop2011,"pop_2011"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Pittsburgh MSA: 2011", caption = "Figure 2.1") +
  mapTheme,

ggplot() +
  geom_sf(data = PBMSAPop2019, aes(fill=factor(ntile(pop_2019,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(PBMSAPop2019,"pop_2019"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Pittsburgh MSA: 2019", caption = "Figure 2.1") +
  mapTheme, ncol=2)

Following a similar workflow to when we joined land change points to the fishnet, we must do the same for census population. The following line of code aggregates the census population as census tracts to the fishnet.

PBMSA_fishnet <-
  PBMSA_fishnet %>%
  rownames_to_column("fishnetID") %>% 
  mutate(fishnetID = as.numeric(fishnetID)) %>%
  dplyr::select(fishnetID)

fishnetPopulation11 <-
  st_interpolate_aw(PBMSAPop2011["pop_2011"], PBMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(PBMSA_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(pop_2011 = replace_na(pop_2011,0)) %>%
  dplyr::select(pop_2011)

fishnetPopulation19 <-
  st_interpolate_aw(PBMSAPop2019["pop_2019"], PBMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  rownames_to_column(var = "fishnetID") %>%
  left_join(PBMSA_fishnet %>%
              mutate(fishnetID = as.character(fishnetID)),
            ., by=c("fishnetID"='fishnetID')) %>% 
  mutate(pop_2019 = replace_na(pop_2019,0)) %>%
  dplyr::select(pop_2019)

fishnetPopulation <- 
  cbind(fishnetPopulation11,fishnetPopulation19) %>%
  dplyr::select(pop_2011,pop_2019) %>%
  mutate(pop_Change = pop_2019 - pop_2011)

This plot shows the difference in aggregating the data from a census tract to the fishnet.

grid.arrange(
ggplot() +
  geom_sf(data=PBMSAPop2011, aes(fill=factor(ntile(pop_2011,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=substr(quintileBreaks(PBMSAPop2011,"pop_2011"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Pittsburgh MSA: 2011",
       subtitle="Represented as tracts; Boundaries omitted", caption = "Figure 2.2") +
  mapTheme,

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(pop_2011,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"pop_2011"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Pittsburgh MSA: 2011",
       subtitle="Represented as fishnet gridcells; Boundaries omitted", caption = "Figure 2.2") +
  mapTheme, ncol=2)

Highway Distance

As one of the variables to determine growth, we are including distance from highways. We hypothesize that development will occur nearer to a highway development for ease of accessibility. The following chunk of code pulls in a highway layer from: https://catalog.data.gov/dataset/tiger-line-shapefile-2016-nation-u-s-primary-roads-national-shapefile

PBHighways <- st_read("highway/tl_2016_us_primaryroads.shp") %>%
  st_transform(st_crs(PBMSAPop2011)) %>%
  st_intersection(PBMSAPop2011)


ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=layer),size=1.5) +
  geom_sf(data=PBHighways) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids", caption = "Figure 2.3a") +
  mapTheme

Now we calculate distance from highways and aggregate that to the fishnet. The plot shows the variable we will include in a future model.

fishnetHW <- fishnet %>%
 mutate(uniqueID = as.character(row_number()))

fishnet_centroid <- fishnetHW %>%
 st_centroid()

highway_dist <- fishnet_centroid %>%
 st_distance(PBHighways %>%
 st_transform(st_crs(fishnet_centroid))) %>%
 as.data.frame() %>%
 mutate(uniqueID = as.character(row_number())) %>%
 gather(-uniqueID, key = "variable", value = "value") %>%
  dplyr::select(-variable) %>%
  group_by(uniqueID) %>%
 summarize(highway_dist = min(value))

fishnet <- left_join(fishnetHW, highway_dist)

ggplot() +
  geom_point(data=fishnet, aes(x=xyC(fishnet)[,1], 
                                y=xyC(fishnet)[,2], 
                 colour=factor(ntile(highway_dist,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(highway_dist,"highway_dist"),1,8),
                      name="Quintile\nBreaks") +
  geom_sf(data=PBHighways, colour = "#DD614A") +
  labs(title = "Distance to Highways",
       subtitle = "As fishnet centroids; Highways visualized in red", caption = "Figure 2.3b") +
  mapTheme

A Note on Distance from Past/Existing Development

Another feature we wanted to include in the model is distance from past/existing development. We hypothesize that this is an important variable as historic spatial patterns show that there is some sort of spatial relationship of distance bertween old development vs.Ā new development. For example, in many cases of suburbanization, that pattern often looks radial from a city center. To capture the phenomenon about distance from past development, we include two variables: city boundary distance and spatial lag. City boundary distance relates to the distance from the City of Pittsburgh. We hypothesize that new development will be located outside of the city boundaries, but close to the boundaries (suburbanization) and this variable hopefully captures that pattern. The spatial lag variable included uses k-nearest neighbor analysis to see what new development has popped up next to old development, encapsulating the hypothesis that new development will occur near old development. This may be because of certain amenities (like shopping centers or schools) or infrastructure (like sewer or water lines).

City Boundary Distance

These two chunks of code relate to the process of setting up the city boundary distance variable as described above. The city boundary data is pulled from: https://data.wprdc.org/dataset/pittsburgh-city-boundary

PBBoundary <- st_read("Pittsburgh_City_Boundary/City_Boundary.shp") %>%
  st_transform(st_crs(PBMSAPop2011)) %>%
  st_intersection(PBMSAPop2011)


ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=layer),size=1.5) +
  geom_sf(data=PBBoundary, colour=NA, fill="#DD614A") +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Pittsburgh City Bounds",
       subtitle = "As fishnet centroids", caption = "Figure 2.4a") +
  mapTheme

The plot shows the variable we will include in the model.

fishnetCB <- fishnet %>%
 mutate(uniqueID = as.character(row_number()))

fishnet_centroid1 <- fishnetCB %>%
 st_centroid()

city_dist <- fishnet_centroid1 %>%
 st_distance(PBBoundary %>%
 st_transform(st_crs(fishnet_centroid1))) %>%
 as.data.frame() %>%
 mutate(uniqueID = as.character(row_number())) %>%
 gather(-uniqueID, key = "variable", value = "value") %>%
  dplyr::select(-variable) %>%
  group_by(uniqueID) %>%
 summarize(city_dist = min(value))

fishnet <- left_join(fishnetCB, city_dist)

ggplot() +
  geom_point(data=fishnet, aes(x=xyC(fishnet)[,1], 
                                y=xyC(fishnet)[,2], 
                 colour=factor(ntile(city_dist,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(city_dist,"city_dist"),1,8),
                      name="Quintile\nBreaks") +
  geom_sf(data=PBHighways, colour = "#DD614A") +
  labs(title = "Distance to Pittsburgh City Center",
       subtitle = "As fishnet centroids; Highways visualized in red", caption = "Figure 2.4b") +
  mapTheme

Spatial Lag of Development

The final variable we will is spatial lag of development. As mentioned above, spatial lag is conducted with the k-nearest neighbor statistic and the function has been hard-coded for this analysis. The plot shows distance from past development, or spatial lag and the values to be included in the model.

fishnet$lagDevelopment <-
    nn_function(xyC(fishnet),
                xyC(filter(aggregatedRasters,developed==1)),
                2)

ggplot() +
  geom_sf(data=PBMSAPop11) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
                 colour = factor(ntile(lagDevelopment,5))), size=1.5) +
  scale_colour_manual(values = palette5,
                     labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
                     name="Quintile\nBreaks") +
  labs(title = "Spatial Lag to 2011 Development",
       subtitle = "As fishnet centroids", "Figure 2.5") +
  mapTheme

Supply-Side Scenario

As mentioned above, this analysis creates a prediction based on the model for a demand-side scenario (based off population and past growth) and a supply-side scenario. All of the code above as been preparing the model for the demand-side scenario, yet to streamline the coding process, we aggregate the supply side scenario now and model it later. Our supply-side scenario includes the creation of a new ā€œedge cityā€ called Michael’s Place, PA. An edge city is traditionally defined as a smaller urban area located a number of miles from a main city, and the edge city usually feeds off the development of the main city. A classic example of an edge city is King of Prussia, PA located near Philadelphia. This edge city development, Michael’s Place, would be similar to that of King of Prussia, a large commercial and retail build to attract those in the suburb, but also serve as a regional shopping hub.

New Edge City: Michael’s Place, PA

The geography of Michael’s Place was drawn within the MSA boundaries, but relatively close to the main city of Pittsburgh. Based off the development demand and environmental senstitivity analysis below, we wanted the new edge city to be located in Westmoreland County. The geography was drawn in ArcGIS Pro and imported into R. The following chunks of code show Michael’s Place and a new city boundary distance calculation where Michael’s Place is included.

NEC <- st_read("newedgecity/newedgecity.shp") %>%
  st_transform(st_crs(PBMSAPop2011)) %>%
  st_intersection(PBMSAPop2011)

PBNEC <- st_read("newedgecity/PBnewcitymerge.shp") %>%
  st_transform(st_crs(PBMSAPop2011)) %>%
  st_intersection(PBMSAPop2011)

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=layer),size=1.5) +
  geom_sf(data=PBNEC, fill="#DD614A", colour=NA) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Pittsburgh and Michael's Place Bounds",
       subtitle = "As fishnet centroids", caption = "Figure 2.6a") +
  mapTheme

fishnetPBNEC <- fishnet %>%
 mutate(uniqueID = as.character(row_number()))

fishnet_centroid2 <- fishnetPBNEC %>%
 st_centroid()

PBNEC_dist <- fishnet_centroid2 %>%
 st_distance(PBNEC %>%
 st_transform(st_crs(fishnet_centroid2))) %>%
 as.data.frame() %>%
 mutate(uniqueID = as.character(row_number())) %>%
 gather(-uniqueID, key = "variable", value = "value") %>%
  dplyr::select(-variable) %>%
  group_by(uniqueID) %>%
 summarize(PBNEC_dist = min(value))

fishnet <- left_join(fishnetPBNEC, PBNEC_dist)

ggplot() +
  geom_point(data=fishnet, aes(x=xyC(fishnet)[,1], 
                                y=xyC(fishnet)[,2], 
                 colour=factor(ntile(PBNEC_dist,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(PBNEC_dist,"PBNEC_dist"),1,8),
                      name="Quintile\nBreaks") +
  geom_sf(data=PBHighways, colour = "#DD614A") +
  labs(title = "Distance to Pittsburgh and Michael's Place",
       subtitle = "As fishnet centroids; Highways visualized in red", caption = "Figure 2.6b") +
  mapTheme

The Final Dataset

MSA Counties

Lastly, as another step of data preparation, we clip the vector values of the fishnet one more time to the study area (the metropolitan region) for accuracy.

options(tigris_class = "sf")

studyAreaCounties <- 
  counties("Pennsylvania") %>%
  st_transform(st_crs(PBMSAPop11)) %>%
  dplyr::select(NAME) %>%
  .[st_buffer(PBMSAPop11,-1000), , op=st_intersects]

ggplot() +
  geom_sf(data=studyAreaCounties) +
  labs(title = "Study Area Counties", caption = "Figure 2.7") +
  mapTheme

Final Dataset

Finally, now all of our data wrangling, data cleaning, and feature creation is complete. In this chunk of code, we create a compiled dataset of all of the above for the model.

dat <- 
  cbind(
    fishnet, fishnetPopulation, aggregatedRasters) %>% #no more highway points?
  dplyr::select(layer, developed, forest, farm, wetlands, otherUndeveloped, water,
                pop_2011, pop_2019, pop_Change, highway_dist,city_dist, PBNEC_dist, lagDevelopment) %>%
  st_join(studyAreaCounties) %>%
  mutate(developed19 = ifelse(layer == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0) 

Exploratory Analysis

Next, we explore how these variables impact existing or new development through a series of plots.

Distance/Continuous Variables

This plot shows the three continuous variables, which also happen to be our distance from infrastructure or past development variables. This plot shows that new development is closer to the city boundary, highways, and past development than cells that did not change.

dat %>%
  dplyr::select(highway_dist, city_dist,lagDevelopment,layer) %>%
  gather(Variable, Value, -layer, -geometry) %>%
  ggplot(., aes(layer, Value, fill=layer)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of the Continuous Variables", caption = "Figure 3.1") +
    plotTheme 

Population/Factor Variables

This plot shows the three population, or factor-based, variables. New development seems to occur in places where the population is higher–proving there is a demand for development where there are more people.

dat %>%
  dplyr::select(pop_2011,pop_2019,pop_Change,layer) %>%
  gather(Variable, Value, -layer, -geometry) %>%
  ggplot(., aes(layer, Value, fill=layer)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New Development as a Function of Factor Variables", caption = "Figure 3.2") +
    plotTheme

Table of Land Conversion Rates

This is a table that shows which types of land cover have been developed… NEED MORE INFO THIS PLOT DIDNT COME UP!

dat %>%
  dplyr::select(layer:otherUndeveloped,developed) %>%
  gather(Land_Cover_Type, Value, -layer, -geometry) %>%
   st_set_geometry(NULL) %>%
     group_by(layer, Land_Cover_Type) %>%
     summarize(n = sum(as.numeric(Value))) %>%
     ungroup() %>%
    mutate(Conversion_Rate = paste0(round(100 * n/sum(n), 2), "%")) %>%
    filter(layer == 1) %>%
  dplyr::select(Land_Cover_Type,Conversion_Rate) %>%
  kable(caption = "Figure 3.3") %>% kable_styling(full_width = F)
Figure 3.3
Land_Cover_Type Conversion_Rate
developed 10.03%
farm 23.52%
forest 34.03%
otherUndeveloped 2.39%
wetlands 0.44%

Models + Predictions

The next part of this analysis will cover the models and predictions. We will first model and predict out our demand scenario for both 2019 (evaluate the accuracy of the model) and then predict into the future for 2030.

Splitting the Dataset

This following chunk of code splits the dataset to be trained and tested.

set.seed(3456)
trainIndex <- 
  createDataPartition(dat$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
datTrain <- dat[ trainIndex,]
datTest  <- dat[-trainIndex,]

nrow(dat)

Models

We created 7 variations of models for this analysis exercise. Each model added a variable in. The model that we decided to pursue was model 7, which included: land cover (whether the land was wetlands, forest, farm, other undeveloped), spatial lag of development, population change, highway distance, and city boundary distance.

Model1 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2011, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2011 + 
              pop_2019, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              highway_dist, 
              family="binomial"(link="logit"), data = datTrain) 

Model7 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + highway_dist + city_dist, 
              family="binomial"(link="logit"), data = datTrain) 

#summary(Model7)  

Model 7

Below is a table outlining the results of Model 7. This table indicates that farm, other undeveloped, spatial lag of development, highway distance, and city boundary distance are all significant in explaining development patterns.

stargazer(Model7, type="html", digits=1, title="Linear Model 7 (Figure 4)", out = "Training LM.txt")
Linear Model 7 (Figure 4)
Dependent variable:
layer
wetlands1 0.3
(0.3)
forest1 0.3
(0.2)
farm1 -0.1*
(0.1)
otherUndeveloped1 -0.2**
(0.1)
lagDevelopment -0.000***
(0.000)
pop_Change -0.000
(0.001)
highway_dist -0.000***
(0.000)
city_dist -0.000***
(0.000)
Constant 2.3***
(0.2)
Observations 4,633
Log Likelihood -2,511.0
Akaike Inf. Crit. 5,040.0
Note: p<0.1; p<0.05; p<0.01

Model Evaluation

McFadden

As a method to evaluate the model, we also conducted a McFadden score, or a pseudo-r-squared. All of the McFadden scores are on the lower side, but model 7 has the highest score, meaning it is the best model that we created for explaining variance in land patterns.

modelList <- paste0("Model", 1:7)

map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:7)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model", caption = "Figure 5.1") +
    plotTheme

Histogram of Test Set Predicted Probabilities

This plot depicts a histogram of test set predicted probabilities. The rates of our predicted probabilities are rather high–indicating that our model might be quick to predict that land changed and developed over the 8 year period. As part of the evaluation below, we chose a new probability threshold between 0.5-0.7.

testSetProbs <- 
  data.frame(class = datTest$layer,
             probs = predict(Model7, datTest, type="response")) 
  
ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density", caption = "Figure 5.2") +
  plotTheme

Sensitivity/Specificity Table

Our sensitivity, our the rate of true positives, is quite high as our model is better at predicting developed areas, whereas our specificity is low, meaning our model struggles to predict areas that did not change. That being said, the accuracy of our models at these two thresholds is relatively high, 73 and 77% overall.

options(yardstick.event_first = FALSE)

testSetProbs <- 
  testSetProbs %>% 
  mutate(predClass_50 = as.factor(ifelse(testSetProbs$probs >= 0.5 ,1,0)),
         predClass_70 = as.factor(ifelse(testSetProbs$probs >= 0.7 ,1,0))) 

testSetProbs %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable(caption = "Figure 5.3") %>%
  kable_styling(full_width = F)
Figure 5.3
Variable Sensitivity Specificity Accuracy
predClass_50 0.91 0.34 0.73
predClass_70 0.71 0.69 0.70

Development Predictions based on Threshold

As mentioned, our model is quick to predict new development. This plot shows that when we increase the threshold, our analysis becomes more refined.

predsForMap <-         
  dat %>%
    mutate(probs = predict(Model7, dat, type="response") ,
           Threshold_50_Pct = as.factor(ifelse(probs >= 0.5 ,1,0)),
           Threshold_70_Pct =  as.factor(ifelse(probs >= 0.7 ,1,0))) %>%
    dplyr::select(layer,Threshold_50_Pct,Threshold_70_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")

ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development Predictions - Low Threshold", caption = "Figure 5.4") + 
  mapTheme

Development Predictions based on Confusion Matrix

Similarly, our plot indicates that our model is more often going to predict development, and therefore, we have many incorrect true positives.

ConfusionMatrix.metrics <-
  dat %>%
    mutate(probs = predict(Model7, dat, type="response") ,
           Threshold_50_Pct = as.factor(ifelse(probs >= 0.5 ,1,0)),
           Threshold_70_Pct =  as.factor(ifelse(probs >= 0.7 ,1,0))) %>%
    mutate(TrueP_50 = ifelse(layer  == 1 & Threshold_50_Pct == 1, 1,0),
           TrueN_50 = ifelse(layer  == 0 & Threshold_50_Pct == 0, 1,0),
           TrueP_70 = ifelse(layer  == 1 & Threshold_70_Pct == 1, 1,0),
           TrueN_70 = ifelse(layer  == 0 & Threshold_70_Pct == 0, 1,0)) %>%
    dplyr::select(., starts_with("True")) %>%
    gather(Variable, Value, -geometry) %>%
    st_cast("POLYGON") 

ggplot(data=ConfusionMatrix.metrics) +
  geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1], 
                 y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("Correct","Incorrect"),
                       name="") +
  labs(title="Development Predictions - Low Threshold", caption = "Figure 5.5") + mapTheme

Spatial Cross-Validation

Another method for evaluating our model is seeing how it performs across counties. Our model does the best at performing in Allegheny County (94%) and this may be because there is one variable entirely dedicated to the City of Pittsburgh, which is located in Allegheny County. The model is not as accurate for other counties hovering around 60-70% accuracies, with the lowest being Butler at 58%.

spatialCV_counties <-
  spatialCV(dat,"NAME","layer", Model7) %>%
  mutate(predClass = as.factor(ifelse(probs >= 0.70 ,1,0)))

spatialCV_metrics <-
  spatialCV_counties %>% 
    group_by(county) %>% 
    summarize(Observed_Change = sum(as.numeric(as.character(class))),
              Sensitivity = round(yardstick::sens_vec(class,predClass),2),
              Specificity = round(yardstick::spec_vec(class,predClass),2),
              Accuracy = round(yardstick::accuracy_vec(class,predClass),2)) 

spatialCV_metrics %>%
  kable(caption = "Figure 5.6") %>%
  kable_styling(full_width = F)
Figure 5.6
county Observed_Change Sensitivity Specificity Accuracy
Allegheny 1161 1.00 0.00 0.94
Armstrong 496 0.11 0.98 0.61
Beaver 590 0.88 0.12 0.72
Butler 901 0.54 0.65 0.58
Fayette 690 0.60 0.87 0.73
Washington 1127 0.78 0.33 0.67
Westmoreland 1361 0.67 0.61 0.65

Predicting Land Cover Demand for 2030

As part of this analysis, we want to employ the model to predict into the future. While there are limitations to the model (which will be discussed more further), as explained in the use case, this type of analysis would be incredibly helpful to practitioners to prepare for the future. This line of code makes a new dataset in order to use the 2019 predictions to predict for 2030.

dat1 <-
  dat %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed19 == 1)),2))

Population Projections

As a part of this new model, we will also include population projections for the area. The projections have been pulled from: https://files.dep.state.pa.us/Water/Division%20of%20Planning%20and%20Conservation/2010_2040PopulationProjections.pdf and show growth in all of the Pittsburgh MSA counties, which the highest growth being Allegheny and Westmoreland counties.

countyPopulation_2030 <- 
  data.frame(
   NAME = 
     c("Allegheny", "Armstrong", "Beaver","Butler","Fayette","Washington",
                         "Westmoreland"),
   county_projection_2030 = 
     c(1155460,64823,157895,205865,127240, 213722, 363832)) %>%
   left_join(
     dat %>%
       st_set_geometry(NULL) %>%
       group_by(NAME) %>%
       summarize(county_population_2019 = round(sum(pop_2019))))


countyPopulation_2030 %>%
  gather(Variable,Value, -NAME) %>%
  ggplot(aes(reorder(NAME,-Value),Value)) +
  geom_bar(aes(fill=Variable), stat = "identity", position = "dodge") +
  scale_fill_manual(values = palette2,
                    labels=c("2019","2030"),
                    name="Population") +
  labs(title="Population Change by County: 2019 - 2030",
       x="County", y="Population", caption = "Figure 6.1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme

Development Demand in 2030

Based on our model and the predict function, this plot shows development demand predicted for 2030. This map shows the highest development demand located in Allegheny County, surrounding the City of Pittsburgh. The development demand radiates out from this central county as well.

dat_infill <-
  dat1 %>%
  #calculate population change
    left_join(countyPopulation_2030) %>%
    mutate(proportion_of_county_pop = pop_2019 / county_population_2019,
           pop_2030.infill = proportion_of_county_pop * county_projection_2030,
           pop_Change = round(pop_2030.infill - pop_2019),2) %>%
    dplyr::select(-county_projection_2030, -county_population_2019, 
                  -proportion_of_county_pop, -pop_2030.infill) %>%
  #predict for 2030
    mutate(predict_2030.infill = predict(Model7,. , type="response"))

dat_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat_infill)[,1], y=xyC(dat_infill)[,2], colour = factor(ntile(predict_2030.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill,"predict_2030.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=studyAreaCounties, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2030: Predicted Probabilities", caption = "Figure 6.2") +
  mapTheme

Comparing Predicted Development Demand & Environmental Sensitivity

To further evaluate our predictions, we wanted to explore the relationship between development demand and environmental sensitivity.

2019 Land Cover Data

The plot below mimics the 2011 raster plots, but for 2019. This plots are largely similar to that in 2011.

developed19_1 <- PBMSA_lc2019 == 21 | PBMSA_lc2019 == 22 | PBMSA_lc2019 == 23 | PBMSA_lc2019 == 24
forest19 <- PBMSA_lc2019 == 41 | PBMSA_lc2019 == 42 | PBMSA_lc2019 == 43 
farm19 <- PBMSA_lc2019 == 81 | PBMSA_lc2019 == 82 
wetlands19 <- PBMSA_lc2019 == 90 | PBMSA_lc2019 == 95 
otherUndeveloped19 <- PBMSA_lc2019 == 52 | PBMSA_lc2019 == 71 | PBMSA_lc2019 == 31 
water19 <- PBMSA_lc2019 == 11

names(developed19_1) <- "developed19_1"
names(forest19) <- "forest19"
names(farm19) <- "farm19"
names(wetlands19) <- "wetlands19"
names(otherUndeveloped19) <- "otherUndeveloped19"
names(water19) <- "water19"
theRasterList19 <- c(developed19_1,forest19,farm19,wetlands19,otherUndeveloped19,water19)

dat2 <-
  aggregateRaster(theRasterList19, dat1) %>%
  dplyr::select(developed19_1,forest19,farm19,wetlands19,otherUndeveloped19,water19) %>%
  st_set_geometry(NULL) %>% 
  bind_cols(.,dat1) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat2 %>%
  gather(var,value,developed19_1:water19) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=PBMSAPop11) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land Cover Types, 2019",
         subtitle = "As fishnet centroids", caption = "Figure 7.1") +
   mapTheme

Sensitive Land Cover Lost

From the raster aggregation above though, we can create a plot of sensitive areas lost in the 8 year period. We defined sensitive areas as farmland and wetlands. We used these two classificatons as wetlands should not be build on to protect from flooding and farmland is often thought of as greenfield land (easy for development), but greenfield land should be protected for agricultural use and to mitigate the effects of climate change.

dat2 <-
  dat2 %>%
   mutate(sensitive_lost19 = ifelse(farm == 1 & farm19 == 0 |
                                    wetlands == 1 & wetlands19 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost19))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2011 - 2019",
       subtitle = "As fishnet centroids", caption = "Figure 7.2") +
  mapTheme

#sensitiveRegions <- 
 # raster::clump(wetlands19 + forest19) %>%
  #rasterToPolygons() %>%
  #st_as_sf() %>%
  #group_by(clumps) %>% 
  #summarize() %>%
   # mutate(Acres = as.numeric(st_area(.) * 0.0000229568)) %>%
    #filter(Acres > 3954)  %>%
  #dplyr::select() %>%
  #raster::rasterize(.,emptyRaster) 

#sensitiveRegions[sensitiveRegions > 0] <- 1  
#names(sensitiveRegions) <- "sensitiveRegions"

#dat2 <-
 # aggregateRaster(c(sensitiveRegions), dat2) %>%
  #dplyr::select(sensitiveRegions) %>%
  #st_set_geometry(NULL) %>%
  #bind_cols(.,dat2) %>%
  #st_sf()

#ggplot() +
 # geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitiveRegions))) +
  #scale_colour_manual(values = palette2,
   #                   labels=c("Other","Sensitive Regions"),
    #                  name="") +
  #labs(title = "Sensitive regions",
   #    subtitle = "Continous areas of either wetlands or forests\ngreater than 1 acre") +
  #mapTheme

Summarize by County

Lastly, as part of the development demand/environmental sensitivity analysis, we will compare the counties together. Overall, the counties all have high rates of development demand and high rates of unsuitable land for development. For further analysis, we chose to look into the counties of Allegheny and Westmoreland. We chose Allegheny County as it has the highest development demand and the City of Pittsburgh is located within county boundaries, so this model will help identify build-able lands versus lands a planner may want to protect. We also wanted to look at a county with similar statistics, but no major city, so we explored Westmoreland.

county_specific_metrics <- 
  dat2 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model7, dat2, type="response")) %>%
  #get a count count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(NAME) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(NAME) %>%
  summarize(Total_Farmland = sum(farm19) / max(count),
            Total_Forest = sum(forest19) / max(count),
            Total_Wetlands = sum(wetlands19) / max(count),
            Total_Undeveloped = sum(otherUndeveloped19) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost19) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(countyPopulation_2030 %>% 
            mutate(Population_Change = county_projection_2030 - county_population_2019,
                   Population_Change_Rate = Population_Change / county_projection_2030) %>%
            dplyr::select(NAME,Population_Change_Rate))
county_specific_metrics %>%
  gather(Variable, Value, -NAME, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost","Sensitive_Regions",
                                              ordered = TRUE))) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~NAME, ncol=4) +
    coord_flip() +
    scale_y_continuous(breaks = seq(.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate", caption = "Figure 8") +
    plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")

County Allocation Analysis

As described above, chose two counties to conduct an allocation analysis on–essentially creating a map where development can happen (not in sensitive areas) over a layer of development demand.

Allegheny County

From this analysis, we can see that much of Allegheny is already developed and there are not many areas to be developed that would not impact the environment. This must be considered in the further growth of the Pittsburgh surrounding area. That being said, based on this analysis, there are a few pockets along the river that could be built out more in the future.

allegheny <-
  dat2 %>%
    mutate(Development_Demand = predict(Model7, dat2, type="response")) %>%
    filter(NAME == "Allegheny") 

allegheny_landUse <- rbind(
  filter(allegheny, farm19 == 1 | wetlands19 == 1 ) %>%
  dplyr::select() %>% mutate(Land_Use = "Not Suitable"),
  filter(allegheny, developed19_1 == 1) %>%
  dplyr::select() %>% mutate(Land_Use = "Developed"))

grid.arrange(
ggplot() +
  geom_sf(data=allegheny, aes(fill=factor(ntile(Development_Demand,5))), colour=NA) +
  geom_point(data=allegheny_landUse, aes(x=xyC(allegheny_landUse)[,1], 
                                        y=xyC(allegheny_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  geom_sf(data=st_intersection(PBHighways,filter(studyAreaCounties, NAME=="Allegheny")), size=2) +
  scale_fill_manual(values = palette5, name="Development\nDemand",
                    labels=substr(quintileBreaks(allegheny,"Development_Demand"),1,5)) +
  scale_colour_manual(values = c("black","#DD614A")) + 
  labs(title = "Development Potential, \n2030: Allegheny County", caption = "Figure 9.1") + mapTheme +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)),

ggplot() +
  geom_sf(data=allegheny, aes(fill=factor(ntile(pop_Change,5))), colour=NA) +
  geom_point(data=allegheny_landUse, aes(x=xyC(allegheny_landUse)[,1], 
                                        y=xyC(allegheny_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  geom_sf(data=st_intersection(PBHighways,filter(studyAreaCounties, NAME=="Allegheny")), size=2) +
  scale_fill_manual(values = palette5, name="Population\nChange",
                    labels=substr(quintileBreaks(allegheny,"pop_Change"),1,5)) +
  scale_colour_manual(values = c("black","#DD614A")) + 
  labs(title = "Projected Population, \n2030: Allegheny County", caption = "Figure 9.1") + mapTheme +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)), ncol=2)

Westmoreland County

Based on our environmental sensitivity characteristics, there is a lot of Westmoreland County that should not be developed. This needs to be thought through when considering the trade offs of the environment for development. That being said, there are pockets of areas next to already developed land or a large swath of land to the east that could be developed. It would be interesting to move Michael’s Place to the swath of land in the east to see how the predictions may have changed.

westmoreland <-
  dat2 %>%
    mutate(Development_Demand = predict(Model7, dat2, type="response")) %>%
    filter(NAME == "Westmoreland") 

westmoreland_landUse <- rbind(
  filter(westmoreland, farm19 == 1 | wetlands19 == 1 ) %>%
  dplyr::select() %>% mutate(Land_Use = "Not Suitable"),
  filter(westmoreland, developed19_1 == 1) %>%
  dplyr::select() %>% mutate(Land_Use = "Developed"))

grid.arrange(
ggplot() +
  geom_sf(data=westmoreland, aes(fill=factor(ntile(Development_Demand,5))), colour=NA) +
  geom_point(data=westmoreland_landUse, aes(x=xyC(westmoreland_landUse)[,1], 
                                        y=xyC(westmoreland_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  geom_sf(data=st_intersection(PBHighways,filter(studyAreaCounties, NAME=="Westmoreland")), size=2) +
  scale_fill_manual(values = palette5, name="Development\nDemand",
                    labels=substr(quintileBreaks(westmoreland,"Development_Demand"),1,5)) +
  scale_colour_manual(values = c("black","#DD614A")) + 
  labs(title = "Development Potential, \n2030: Westmoreland County", caption = "Figure 9.2") + mapTheme +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)),

ggplot() +
  geom_sf(data=westmoreland, aes(fill=factor(ntile(pop_Change,5))), colour=NA) +
  geom_point(data=westmoreland_landUse, aes(x=xyC(westmoreland_landUse)[,1], 
                                        y=xyC(westmoreland_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  geom_sf(data=st_intersection(PBHighways,filter(studyAreaCounties, NAME=="Westmoreland")), size=2) +
  scale_fill_manual(values = palette5, name="Population\nChange",
                    labels=substr(quintileBreaks(westmoreland,"pop_Change"),1,5)) +
  scale_colour_manual(values = c("black","#DD614A")) + 
  labs(title = "Projected Population, \n2030: Westmoreland County", caption = "Figure 9.2") + mapTheme +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)), ncol=2)

Supply-Side Scenario: Predicting for 2030 with Michael’s Place

As mentioned above, the previous parts of code relate to predicting development on demand. To stimulate a scenario where we would change the supply amount needed, we built the new edge city of Michael’s Place. The following chunks of code show building a model using Michael’s Place as a new feature within the city boundary distance layer. Overall, from the plots, we do not see that much of a change in development needed around Michael’s Place as this area was already predicted to need development. This may be because we needed more refined characteristics about Michael’s Place (higher population, coding all of the land developed), but either way, this seems like an area in the MSA that should be focused on so development can be approached appropriately.

set.seed(3456)
trainIndex2 <- 
  createDataPartition(dat2$developed19_1, p = .50,
                                  list = FALSE,
                                  times = 1)
datTrain2 <- dat2[ trainIndex,]
datTest2  <- dat2[-trainIndex,]
Model8 <- glm(layer ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + highway_dist + city_dist + PBNEC_dist, 
              family="binomial"(link="logit"), data = datTrain2) 
stargazer(Model8, type="html", digits=1, title="Linear Model 8 (Figure 10.1)", out = "Training LM.txt")
Linear Model 8 (Figure 10.1)
Dependent variable:
layer
wetlands1 0.3
(0.3)
forest1 0.9***
(0.3)
farm1 0.1
(0.1)
otherUndeveloped1 -0.2
(0.1)
lagDevelopment 0.000***
(0.000)
pop_Change 0.001
(0.002)
highway_dist -0.000***
(0.000)
city_dist -0.000***
(0.000)
PBNEC_dist 0.000
(0.000)
Constant 0.4
(0.3)
Observations 4,633
Log Likelihood -2,475.2
Akaike Inf. Crit. 4,970.3
Note: p<0.1; p<0.05; p<0.01
dat3 <-
  dat2 %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed19_1 == 1)),2))
dat_infill3 <-
  dat3 %>%
  #calculate population change
    left_join(countyPopulation_2030) %>%
    mutate(proportion_of_county_pop = pop_2019 / county_population_2019,
           pop_2030.infill = proportion_of_county_pop * county_projection_2030,
           pop_Change = round(pop_2030.infill - pop_2019),2) %>%
    dplyr::select(-county_projection_2030, -county_population_2019, 
                  -proportion_of_county_pop, -pop_2030.infill) %>%
  #predict for 2030
    mutate(predict_2030.infill = predict(Model8,. , type="response"))

grid.arrange(ncol=2,
  ggplot(dat_infill3) +  
  geom_point(aes(x=xyC(dat_infill3)[,1], y=xyC(dat_infill3)[,2], colour = factor(ntile(predict_2030.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill3,"predict_2030.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=studyAreaCounties, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2030: Predicted Probabilities", caption = "Figure 10.2") +
  mapTheme,

  ggplot(dat_infill3) +  
  geom_point(aes(x=xyC(dat_infill3)[,1], y=xyC(dat_infill3)[,2], colour = factor(ntile(predict_2030.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill3,"predict_2030.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=studyAreaCounties, fill=NA, colour="black", size=1) +
    geom_sf(data=NEC, fill = "#DD614A", colour = "#DD614A") +
  labs(title= "Westmoreland County Development Demand \n in 2030:  Predicted Probabilities \n With Michael's Place, PA", caption = "Figure 10.2") +
  mapTheme)

dat_infillwestmoreland <-
  dat3 %>%
  #calculate population change
    left_join(countyPopulation_2030) %>%
    mutate(proportion_of_county_pop = pop_2019 / county_population_2019,
           pop_2030.infill = proportion_of_county_pop * county_projection_2030,
           pop_Change = round(pop_2030.infill - pop_2019),2) %>%
    dplyr::select(-county_projection_2030, -county_population_2019, 
                  -proportion_of_county_pop, -pop_2030.infill) %>%
  #predict for 2030
    mutate(predict_2030.infill = predict(Model8,. , type="response")) %>% filter(NAME == "Westmoreland")

grid.arrange(ncol=2,
  ggplot(dat_infillwestmoreland) +  
  geom_point(aes(x=xyC(dat_infillwestmoreland)[,1], y=xyC(dat_infillwestmoreland)[,2], colour = factor(ntile(predict_2030.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infillwestmoreland,"predict_2030.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=NEC, fill = "NA", colour = "NA") +
  labs(title= "Westmoreland County Development Demand \n in 2030: Predicted Probabilities", caption = "Figure 10.3") +
  mapTheme,

  ggplot(dat_infillwestmoreland) +  
  geom_point(aes(x=xyC(dat_infillwestmoreland)[,1], y=xyC(dat_infillwestmoreland)[,2], colour = factor(ntile(predict_2030.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infillwestmoreland,"predict_2030.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=NEC, fill = "#DD614A", colour = "#DD614A") +
  labs(title= "Westmoreland County Development Demand \n in 2030:  Predicted Probabilities \n With Michael's Place, PA", caption = "Figure 10.3") +
  mapTheme)

Conclusion + Implementation

Predicting future urban growth, while tricky, will be imperative for getting the land development patterns desired and needed. As seen above, the Pittsburgh MSA area is expected to grow between now and 2030 and that development must happen somewhere. Modeling scenarios on both the demand and supply side are important to see development demand both due to population change, but also other development factors like proximity to amenities or infrastructure.

It was also interesting to model Pittsburgh in light of the recent news that Pittsburgh considered and then decided not to annex the Borough of Wilkinsburg. While currently, Pittsburgh is not annexing this outer borough, it may be a conversation revisited in the future. A model like this could help city officials in Pittsburgh understand development demand of an area they might annex to provide insight on what services would be needed for this area, but also what growth and tax base increase potential there is. (https://www.wesa.fm/politics-government/2022-02-02/pittsburgh-city-council-will-vote-down-wilkinsburg-annexation-but-its-not-over-yet)

Further, for urban areas in general, this is an important model to consider in efforts to reduce sprawl. While this model is not intended to predict in fill development and mostly predicts sprawl, it can highlight areas that need interventions to curb sprawl, which is important for the long-term vitality of cities and the mitigation of climate change as sprawl is overall worse for the environment than city living. (https://www.britannica.com/explore/savingearth/urban-sprawl)

Accuracy and Generalizability

While there are major limitations to this analysis, this methodology for urban growth modeling is thought-provoking and could be improved upon. Based off the results of our model currently, the model does not do the best in explaining variance, but using high thresholds, does have a fairly high rate of accuracy, around 70%. That being said, from the analysis, it is clear that our model over-predicts development. In terms of being generalizable, our model predicts the best on the county of Allegheny (94%). This is probably due to the fact that one variables is centered around activity in this county, city boundary distance. That being said, a majority of the counties hovered in a range of 60-70% accuracy, with our lowest county being Butler at 58%. Adding more features pertaining to other counties, like maybe sewer or electric utility lines, could improve their accuracy.

Ways Forward

There is so much more potential for this model. Some suggestions for refinement include: 1. Feature-engineering existing variables for them to be better predictors 2. Inclusion of missing variables, like utility infrastructure or certain amenities 3. More analysis could also be done on the land suitability/environmental sensitivity side as in this case, environmentally sensitive lands only included farmland and wetlands, but some degree forest land should also be included. Slope could also be included as a variable.