Introduction + Use Case + Info Video

We must begin to explore mitigation and adaptation strategies in light of extreme weather events. As climate change worsens, it will directly impact human settlement. Climate change is already causing more severe flooding, and we believe mitigation and adaptation efforts must be informed by observing and predicting floods through data. Predicting where floods will occur allows people, cities, and government agencies to protect environmental, built, and social assets.

Currently, our model is based on two sites. The first site is Calgary, Canada, the country's third largest city, with warm summers and dry, cold winters. Based on available data, we used Calgary to train (identify which features relate to flooding) and test (evaluate the results of) our model. Once we had a relatively accurate, generalizable, and satisfactory model, we predicted flooding for the cities of Portland, OR and Vancouver, WA. We chose Portland and Vancouver because, combined, they resemble Calgary: they are large US cities with a river running through them and a similar climate.

The goal of this exercise is to determine which variables are most prominent in predicting flooding, so that this model can be widely used to help protect assets. Planners may find this analysis useful for understanding how landscape and land cover characteristics relate to flood risk in their own communities.

Check out the markdown and the linked informational video below to learn more about our model: https://www.youtube.com/watch?v=Jtj3mype5xM

Set Up

The following lines of code show how we set up our R Markdown document.

knitr::opts_chunk$set(echo = TRUE)

library(caret)
library(pscl)
library(plotROC)
library(pROC)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(tigris)
library(viridis)
library(gridExtra)
library(dplyr)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(mapview)
library(tidycensus)
library(ggplot2)

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())

palette2 <- c("#981FAC","#FF006A")

Loading and Cleaning Data + ArcGIS Pro Workflow

The following lines of code take data created in ArcGIS Pro for Calgary and Portland-Vancouver and prepare it for our model-building process in R.

The data comes from multiple sources, including a digital elevation model (DEM) and land cover data for each site.

In ArcGIS Pro, we built out 5 variables for our model:

  1. Average Elevation (taken directly from the DEM raster)

  2. Average Distance from Streams (created a streams layer with the ArcGIS hydrology workflow, then used distance accumulation)

  3. Average Flow Accumulation (created with the ArcGIS hydrology workflow)

  4. Average Slope (found using the slope tool)

  5. Number of Impervious Cells (after reclassifying the land cover data, we summed the number of impervious cells)

Our workflow to join those variables to a fishnet of each site was:

  1. Conduct zonal statistics on each variable

  2. Join the results to an already created fishnet covering each site, aggregating the data to a fine-grained grid (a sketch of an equivalent R workflow follows this list)
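For readers without ArcGIS Pro, below is a minimal sketch of the same zonal-statistics step done in R with the raster and sf packages loaded above. The file names dem.tif and fishnet.shp are hypothetical placeholders, not files from this project.

#minimal sketch of a zonal-statistics join in R rather than ArcGIS Pro
#"dem.tif" and "fishnet.shp" are hypothetical placeholder file names
dem     <- raster("dem.tif")      #elevation raster
fishnet <- st_read("fishnet.shp") #fishnet polygons

#zonal statistics: mean of the raster cells under each fishnet cell
fishnet$avgel <- as.numeric(
  raster::extract(dem, as(fishnet, "Spatial"), fun = mean, na.rm = TRUE))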

Calgary Data & Portland-Vancouver Data

setwd("C:/Users/ammar/OneDrive/Desktop/CPLN675/MIDTERM/R_Code/midterm/midterm")

calgaryboundary <- st_read("FE_calgshp/calg_boundary.shp") %>% st_transform(crs = 3400)  #city boundary 
cal_streams <- st_read("FE_calgshp/calg_streams.shp") %>% st_transform(crs = 3400) #streams
cal_avgel <- st_read("FE_calgshp/calg_fishnet_avgel.shp") %>% st_transform(crs = 3400) #avg elevation
cal_fac <- st_read("FE_calgshp/calg_fishnet_fac.shp") %>% st_transform(crs = 3400) #flow accumulation
cal_impervious <- st_read("FE_calgshp/calg_fishnet_imperv.shp") %>% st_transform(crs = 3400) #impervious surface
cal_inundation <- st_read("FE_calgshp/calg_fishnet_inun.shp") %>% st_transform(crs = 3400) #inundation
cal_slope <- st_read("FE_calgshp/calg_fishnet_slope.shp") %>% st_transform(crs = 3400) #slope in degrees
cal_strdist <- st_read("FE_calgshp/calg_fishnet_strdist.shp") %>% st_transform(crs = 3400) #distance from stream
#creating the Calgary dataframe: pull the relevant zonal statistics column
#from each feature-engineered layer, then reattach the fishnet geometry

geometry <- cal_inundation$geometry

#create new inundation without geometry to populate some variables
cal_inundation1 <- cal_inundation %>% st_drop_geometry() 
AREA <- cal_inundation1[, c("zscalgin_3")]
MAJORITY <- cal_inundation1[, c("zscalgin_4")]

#create new feature engineered datasets without geometry to populate some variables
cal_avgel1 <- cal_avgel %>% st_drop_geometry()
avgel <- cal_avgel1[, c("zscalgav_4")] #mean

cal_fac1 <- cal_fac %>% st_drop_geometry()
facMean <- cal_fac1[, c("zscalgfa_3")] #mean

cal_impervious1 <- cal_impervious %>% st_drop_geometry()
impervSum <- cal_impervious1[, c("zscalgim_4")] #sum

cal_slope1 <- cal_slope %>% st_drop_geometry()
slopeMean <- cal_slope1[, c("zscalgsl_4")] #mean

cal_strdist1 <- cal_strdist %>% st_drop_geometry()
strDist <- cal_strdist1[, c("zscalgst_4")] #mean

calgary <- data.frame (AREA, MAJORITY, avgel, facMean, impervSum, slopeMean, strDist, geometry) %>% st_as_sf() %>% st_transform(crs = 3400)
plvcboundary <- st_read("FE_plvcshp/PLVC_bounds.shp") %>% st_transform(crs = 8756)  #city boundary 
plvc_fishnet <- st_read("FE_plvcshp/plvc_fishnet_clear.shp") %>% st_transform(crs = 8756) #fishnet
plvc_streams <- st_read("FE_plvcshp/PLVC_streams.shp") %>% st_transform(crs = 8756) #streams
plvc_avgel <- st_read("FE_plvcshp/PLVC_fishnet_avgel.shp") %>% st_transform(crs = 8756) #avg elevation
plvc_fac <- st_read("FE_plvcshp/plvc_fishnet_fac.shp") %>% st_transform(crs = 8756) #flow accumulation
plvc_impervious <- st_read("FE_plvcshp/plvc_fishnet_imperv.shp") %>% st_transform(crs = 8756) #impervious surface
plvc_slope <- st_read("FE_plvcshp/plvc_fishnet_slope.shp") %>% st_transform(crs = 8756) #slope in degrees
plvc_strdist <- st_read("FE_plvcshp/plvc_fishnet_strdist.shp") %>% st_transform(crs = 8756) #distance from stream
#creating the Portland-Vancouver dataframe in the same way: pull the relevant
#zonal statistics column from each layer, then reattach the fishnet geometry

p_geometry <- plvc_fishnet$geometry

#create new inundation without geometry to populate some variables
p_fishnet1 <- plvc_fishnet %>% st_drop_geometry() 
p_AREA <- p_fishnet1[, c("Area")]

#create new feature engineered datasets without geometry to populate some variables
plvc_avgel1 <- plvc_avgel %>% st_drop_geometry()
p_avgel <- plvc_avgel1[, c("zsplvcav_4")] #mean

plvc_fac1 <- plvc_fac %>% st_drop_geometry()
p_facMean <- plvc_fac1[, c("zsplvcfa_3")] #mean

plvc_impervious1 <- plvc_impervious %>% st_drop_geometry()
p_impervSum <- plvc_impervious1[, c("zsplvcim_4")] #sum

plvc_slope1 <- plvc_slope %>% st_drop_geometry()
p_slopeMean <- plvc_slope1[, c("zsplvcsl_4")] #mean

plvc_strdist1 <- plvc_strdist %>% st_drop_geometry()
p_strDist <- plvc_strdist1[, c("zsplvcst_4")] #mean

plvc <- data.frame (p_AREA, p_avgel, p_facMean, p_impervSum, p_slopeMean, p_strDist, p_geometry) %>% st_as_sf() %>% st_transform(crs = 8756)

plvc <- plvc %>%
  rename(AREA = p_AREA,
         avgel = p_avgel,
         facMean = p_facMean,
         impervSum = p_impervSum,
         slopeMean = p_slopeMean,
         strDist = p_strDist)

Feature Engineering

To better suit the scales of our variables to both sites, we also did some minor feature engineering, categorizing the appropriate variables; the code is below.

calgary <- calgary %>%
  mutate(elevation = case_when(
    avgel <= 1077.99999 ~ "low",
    avgel >= 1078 & avgel <= 1189.99999 ~ "average",
    TRUE ~ "high"))

calgary <- calgary %>%
  mutate(flow = case_when(
    facMean <= 33016.99999 ~ "low",
    facMean >= 33017 & facMean <=66030.99999 ~ "average",
    TRUE ~ "high"))

calgary <- calgary %>%
  mutate(distance = case_when(
    strDist <= 1356.99999 ~ "nearest",
    strDist >= 1357 & strDist <=2673.99999 ~ "near",
    strDist >= 2674 & strDist <=3990.99999 ~ "average",
    strDist >= 3991 & strDist <=5307.99999 ~ "far",
    TRUE ~ "furthest"))

calgary <- calgary %>%
  mutate(slope = case_when(
    slopeMean < 3 ~ "flat",
    TRUE ~ "steep"))

calgary <- calgary %>%
  mutate(pervious = case_when(
    impervSum <= 5 ~ "yes",
    TRUE ~ "no"))

plvc <- plvc %>%
  mutate(elevation = case_when(
    avgel <= 117.99999 ~ "low",
    avgel >= 118 & avgel <= 226.99999 ~ "average",
    avgel >=227 ~ "high"))

plvc <- plvc %>%
  mutate(flow = case_when(
    facMean <= 24519.99999 ~ "low",
    facMean >= 24520 & facMean <=49031.999 ~ "average",
    TRUE ~ "high"))

plvc <- plvc %>%
  mutate(distance = case_when(
    strDist <= 774.99999 ~ "nearest",
    strDist >= 775 & strDist <=1517.99999 ~ "near",
    strDist >= 1518 & strDist <=2260.99999 ~ "average",
    strDist >= 2261 & strDist <=3003.99999 ~ "far",
    TRUE ~ "furthest"))

plvc <- plvc %>%
  mutate(slope = case_when(
    slopeMean < 3 ~ "flat",
    TRUE ~ "steep"))

plvc <- plvc %>%
  mutate(pervious = case_when(
    impervSum <= 5 ~ "yes",
    TRUE ~ "no"))
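The break points above were read off each site's distribution by hand. As a hedged alternative, equal-width categories could be generated programmatically; the sketch below applies cut() to a hypothetical copy of the Calgary dataframe.

#sketch: derive three equal-width elevation categories programmatically
#instead of hard-coding break points (calgary_alt is a hypothetical copy)
calgary_alt <- calgary %>%
  mutate(elevation = cut(avgel,
                         breaks = 3,
                         labels = c("low", "average", "high")))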

Exploratory Analysis

Maps

Below are maps of all our variables. Later in the model, it is shown that average elevation, flow accumulation, distance from stream, and impervious surface are significant (in this case, a p-value below 0.05, weighing heavily on what accounts for flooding).

ggplot() + 
  geom_sf(data=calgary, fill = "dark blue", color = "dark blue") +
  geom_sf(data = calgary, aes(fill=as.factor(MAJORITY)), color = "light blue")+
  scale_fill_manual(values = c("white", "light blue"),
                    labels = c("No Flooding","Flooding"),
                    name = "") +
  labs(title="Observed Flooding in Calgary (Fishnet)") +
  mapTheme

grid.arrange(ncol = 3,
ggplot() +
  geom_sf(data = calgary) +
  geom_sf(data = cal_avgel,
          aes(fill=zscalgav_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans="reverse",
                     name = "Mean") +
  labs(title = "Average \n Elevation  ") +
  mapTheme,

ggplot() +
  geom_sf(data = calgary) +
  geom_sf(data = cal_fac,
          aes(fill=zscalgfa_3), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Flow \n Accumulation") +
  mapTheme,

ggplot() +
  geom_sf(data = calgary) +
  geom_sf(data = cal_slope,
          aes(fill=zscalgsl_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Slope Degrees") +
  mapTheme,

ggplot() +
  geom_sf(data = calgary) +
  geom_sf(data = cal_strdist,
          aes(fill=zscalgst_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Distance From \n Stream") +
  mapTheme,

ggplot() +
  geom_sf(data = calgary) +
  geom_sf(data = cal_impervious,
          aes(fill=zscalgim_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Sum") +
  labs(title = "Impervious \n Surface") +
  mapTheme)

grid.arrange(ncol = 3,
ggplot() +
  geom_sf(data = plvc) +
  geom_sf(data = plvc_avgel,
          aes(fill=zsplvcav_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans="reverse",
                     name = "Mean") +
  labs(title = "Average \n Elevation") +
  mapTheme,

ggplot() +
  geom_sf(data = plvc) +
  geom_sf(data = plvc_fac,
          aes(fill=zsplvcfa_3), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Flow \n Accumulation") +
  mapTheme,

ggplot() +
  geom_sf(data = plvc) +
  geom_sf(data = plvc_slope,
          aes(fill=zsplvcsl_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Slope Degrees") +
  mapTheme,

ggplot() +
  geom_sf(data = plvc) +
  geom_sf(data = plvc_strdist,
          aes(fill=zsplvcst_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Distance \n From Stream") +
  mapTheme,

ggplot() +
  geom_sf(data = plvc) +
  geom_sf(data = plvc_impervious,
          aes(fill=zsplvcim_4), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans = "reverse",
                     name = "Mean") +
  labs(title = "Impervious \n Surface") +
  mapTheme)

Plots

Below are plots of our variables in Calgary, showing average values for flooded and not flooded areas. Notably, flow accumulation is much higher in flooded areas, aligning with our hypothesis that areas collecting more water flood more easily. Impervious surfaces are more frequent in areas that did not flood, potentially contrary to our hypothesis that an inability to drain would cause more flooding. Lastly, areas that did not flood are farther from streams, which aligns with our hypothesis that nearness to a stream increases flooding potential.

calgaryPlotVariables <- 
  calgary %>%
  as.data.frame() %>%
  dplyr::select(MAJORITY,facMean,avgel,slopeMean,strDist,impervSum) %>%
  gather(variable, value, -MAJORITY)

ggplot(calgaryPlotVariables %>%
         group_by(MAJORITY, variable) %>%
         summarize(mean = mean(value))) + 
     geom_bar(aes(as.factor(MAJORITY), 
                  mean, 
                  fill=as.factor(MAJORITY)),
              stat="identity") + 
     facet_wrap(~variable, ncol = 5, scales = "free") +
    labs(title = "Exploratory Analysis of Features", 
          caption="Without Feature Engineer") +
     scale_fill_manual(values = c("light blue", "dark blue"),
                      labels = c("Not Flooded","Flooded"),
                      name = "") +
    labs(x="Preserved", y="Value")

Model Building

After familiarizing ourselves with the variables, we can now begin to build a model. To model the probability of flooding in Calgary, we partition the dataset into a 70/30 split for training and testing.

set.seed(3456)
trainIndex <- createDataPartition(calgary$MAJORITY, p = .70,
                                  list = FALSE,
                                  times = 1)
calgaryTrain <- calgary[ trainIndex,]
calgaryTest  <- calgary[-trainIndex,]

Binomial Model

For this exercise, we used a generalized linear model (glm) with a binomial logit link, i.e., logistic regression. Below is our final iteration of the model, which incorporates all the feature-engineered variables above. Echoing the plots above, the model found the nearest category of distance from streams, flow accumulation at both high and low levels, the pervious surface indicator, and low elevation to be statistically significant.

options(scipen=999)
calgaryModel <- glm(MAJORITY ~ distance + slope + flow + pervious + elevation,
                    family = binomial(link = "logit"), data = calgaryTrain %>%
                      as.data.frame %>%
                      dplyr::select(-AREA, -geometry, -strDist, -facMean, -slopeMean, -impervSum, -avgel))

summary(calgaryModel)
## 
## Call:
## glm(formula = MAJORITY ~ distance + slope + flow + pervious + 
##     elevation, family = binomial(link = "logit"), data = calgaryTrain %>% 
##     as.data.frame %>% dplyr::select(-AREA, -geometry, -strDist, 
##     -facMean, -slopeMean, -impervSum, -avgel))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8717  -0.3969  -0.2776  -0.1258   3.1183  
## 
## Coefficients:
##                    Estimate Std. Error z value         Pr(>|z|)    
## (Intercept)        -2.32979    0.60484  -3.852         0.000117 ***
## distancefar       -14.84497  644.11876  -0.023         0.981613    
## distancefurthest  -15.58523 2142.90731  -0.007         0.994197    
## distancenear        0.09435    0.37325   0.253         0.800449    
## distancenearest     1.74493    0.32081   5.439 0.00000005352677 ***
## slopesteep         -0.01934    0.18481  -0.105         0.916644    
## flowhigh            3.00010    1.12777   2.660         0.007809 ** 
## flowlow            -2.59918    0.49572  -5.243 0.00000015778565 ***
## perviousyes         1.00958    0.14470   6.977 0.00000000000302 ***
## elevationhigh     -14.63870  469.67655  -0.031         0.975136    
## elevationlow        0.68223    0.16818   4.056 0.00004983316554 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2335.6  on 5562  degrees of freedom
## Residual deviance: 1813.5  on 5552  degrees of freedom
## AIC: 1835.5
## 
## Number of Fisher Scoring iterations: 18

Model Validation

The following section uses several visualizations and tools to evaluate the accuracy and generalizability of our model.

Histogram Plot

In this distribution plot, a value near 0 means the model predicts an area did not flood, while a value near 1 means it predicts an area did flood. Strong models have a peak close to 0 for the negatives (no flooding) and a peak close to 1 for the positives (flooding). For our not flooded class, there is clustering toward 0, meaning our model is generally good at predicting places that did not flood. For our flooded class, there is a larger peak near 0 (predicted incorrectly), but there is a small peak near 1 (predicted correctly). Overall, this plot tells us that our model is better at predicting non-flooded areas than flooded areas. This may be because far more land in Calgary did not flood than flooded, and our model was built on this dataset. This does not mean the model cannot be used; it is just a factor to consider in its efficacy.

Based on the distribution above, we lower the classification threshold to 0.1 moving forward.

classProbs <- predict(calgaryModel, calgaryTest, type="response")

hist(classProbs)

testProbs <- data.frame(obs = as.numeric(calgaryTest$MAJORITY),
                        pred = classProbs)

ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) + 
  geom_density() +
  facet_grid(obs ~ .) + 
  xlab("Probability") + 
  geom_vline(xintercept = .5) +
  scale_fill_manual(values = c("light blue", "dark blue"),
                      labels = c("Not Flooded","Flooded"),
                      name = "")+
  plotTheme
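Before settling on the lower threshold, it may help to see how it changes the hard classifications. Below is a quick sketch using the testProbs dataframe defined above, comparing the default 0.5 cutoff to our 0.1 cutoff.

#sketch: hard classifications at the default 0.5 threshold vs our 0.1 threshold
table(obs = testProbs$obs, predicted = as.numeric(testProbs$pred > 0.5))
table(obs = testProbs$obs, predicted = as.numeric(testProbs$pred > 0.1))

Lowering the threshold flags more cells as flooded, trading additional false positives for fewer missed floods, which matters when the cost of missing a flood is high.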

Confusion Matrix

Using our new threshold as a guide, this confusion matrix helps us evaluate our model in terms of true positives, true negatives, false positives, and false negatives. In this case we had:

  1. 47 true positives (we predicted a flood and there was a flood)

  2. 85 false negatives (we predicted no flood but there was a flood)

  3. 2119 true negatives (we predicted no flood and there was no flood)

  4. 133 false positives (we predicted a flood but there was no flood)

In our evaluation, we must keep an eye on the false negatives, as those are areas that flooded without being predicted to and so may have gone unprotected.
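As a sanity check, caret's headline rates below can be recomputed by hand from these counts:

#recompute the headline confusion-matrix rates by hand from the counts above
TP <- 47; FN <- 85; TN <- 2119; FP <- 133
TP / (TP + FN)                   #sensitivity: 47/132 = 0.356
TN / (TN + FP)                   #specificity: 2119/2252 = 0.941
(TP + TN) / (TP + TN + FP + FN)  #accuracy: 2166/2384 = 0.909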

testProbs$predClass  = ifelse(testProbs$pred > .1 ,1,0)

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2119   85
##          1  133   47
##                                           
##                Accuracy : 0.9086          
##                  95% CI : (0.8963, 0.9198)
##     No Information Rate : 0.9446          
##     P-Value [Acc > NIR] : 1.000000        
##                                           
##                   Kappa : 0.2536          
##                                           
##  Mcnemar's Test P-Value : 0.001456        
##                                           
##             Sensitivity : 0.35606         
##             Specificity : 0.94094         
##          Pos Pred Value : 0.26111         
##          Neg Pred Value : 0.96143         
##              Prevalence : 0.05537         
##          Detection Rate : 0.01971         
##    Detection Prevalence : 0.07550         
##       Balanced Accuracy : 0.64850         
##                                           
##        'Positive' Class : 1               
## 

ROC Curve

The ROC curve plots the true positive rate against the false positive rate across thresholds, and the area under the curve (AUC) summarizes performance. Our AUC of 0.83 means the model distinguishes the two classes (flooded/not flooded) well, a good indicator that the model may be sufficient to use.

ggplot(testProbs, aes(d = obs, m = pred)) + 
  geom_roc(n.cuts = 50, labels = FALSE, color = 'blue') + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') 

auc(testProbs$obs, testProbs$pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8349

Cross Validation

We continue to evaluate our model with 100-fold cross-validation. The histogram below shows accuracy across folds clustering around a high value.

ctrl <- trainControl(method = "cv", 
                     number = 100, 
                     savePredictions = TRUE)

cvFit <- train(as.factor(MAJORITY) ~ .,  data = calgary %>% 
                                                as.data.frame() %>%
                                                dplyr::select(-AREA, -geometry, -strDist, -facMean, -slopeMean, -impervSum, -avgel), 
               method="glm", family="binomial",
               trControl = ctrl)

cvFit
ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Accuracy",
       y="Count")+
  plotTheme

Predictions

Now that we have evaluated our model (conclusions below), we will use it to predict on our other site. This is a step in evaluating whether our model is generalizable, meaning whether it can move across space and time and predict well under other conditions. As mentioned above, this exercise uses the cities of Portland and Vancouver. The two code chunks below generate predictions for Calgary and Portland-Vancouver for further analysis.

allPredictions <- 
  predict(cvFit, calgary, type="prob")[,2]
  
calgary <- 
  cbind(calgary,allPredictions) %>%
  mutate(allPredictions = round(allPredictions * 100))

calgary1 <- calgary %>%
  mutate(PredClass = ifelse(allPredictions > 10, 1, 0))

calgary1 <- calgary1 %>%
  mutate(Correct = ifelse(PredClass == MAJORITY, "1", "0"),
         Incorrect = ifelse(PredClass != MAJORITY, "1", "0"))
plvc_allPredictions <- 
  predict(cvFit, plvc, type="prob")[,2] 

plvc <- 
  cbind(plvc,plvc_allPredictions) %>% 
  mutate(plvc_allPredictions = round(plvc_allPredictions * 100))

Spatial Cross Validation

To further validate our model, we conducted a spatial cross-validation. In this code chunk, we aggregate the predicted grid cells up to the neighborhood level and map the average probability of flooding by neighborhood in Calgary. The higher probabilities fall closer to water features, supporting our hypothesis and showing that proximity is a major factor in both flooding and this model.

cal_nhood <- st_read("nhood/geo_export_657fd344-2823-4047-abad-f7d6906b7c1a.shp") %>% st_transform(crs = 3400)
cal_nhood <- st_intersection(calgary1, cal_nhood)
cal_nhood <- cal_nhood %>% mutate(Correct = as.numeric(Correct),
                                  Incorrect = as.numeric(Incorrect))
cal_nhood <- cal_nhood %>% dplyr::select(name, sector, MAJORITY, allPredictions, PredClass, Correct, Incorrect) %>% 
  group_by(name) %>% 
  summarise(meanPrediction = mean(allPredictions),
            accuracy = sum(Correct) / (sum(Correct) + sum(Incorrect)))

grid.arrange(ncol = 2,
ggplot() +
  geom_sf(data = cal_nhood,
          aes(fill=meanPrediction), 
          colour=NA) +
  geom_sf(data=calgary  %>% 
               filter(MAJORITY == 1), 
               fill="gold",colour="NA") +
  scale_fill_viridis(option="mako", trans="reverse") +
  labs(title = "Average Probability of Flooding by Sector with Observed Flooding in Gold") +
  mapTheme,

ggplot() +
  geom_sf(data = cal_nhood)+
  geom_sf(data = cal_nhood,
          aes(fill=accuracy), 
          colour=NA) +
  scale_fill_viridis(option="mako", trans="reverse") +
  labs(title = "Accuracy of Flooding (Average by Neighborhood)") +
  mapTheme)

Maps

Prediction Maps

Below are maps of predicted flooding, first in Calgary and then in Portland-Vancouver.

ggplot() + 
    geom_sf(data=calgary, aes(fill=factor(ntile(allPredictions,5))), 
            colour=NA) +
    scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                      labels=as.character(quantile(calgary$allPredictions,
                                                 c(0.1,.2,.4,.6,.8),
                                                 na.rm=T)),
                      name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
  mapTheme +
   labs(title="Predicted Flooded Areas in Calgary")

ggplot() + 
  geom_sf(data=plvc, aes(fill=factor(ntile(plvc_allPredictions,5))), 
          colour=NA) +
  scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(plvc$plvc_allPredictions,
                                                 c(0.1,.2,.4,.6,.8),
                                                 na.rm=T)),
                    name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
  mapTheme +
  labs(title="Predicted Flooded Areas in Portland-Vancouver")

Calgary Predicted vs Observed Map

This map shows predicted flooding overlaid with observed flooding. Generally, the darker areas (those more likely to flood) overlap with previously observed flooding. This again speaks to the overall high accuracy of our model.

 ggplot() + 
  geom_sf(data=calgary, aes(fill=factor(ntile(allPredictions,5))), colour=NA) +
  scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(calgary$allPredictions,
                                                 c(0.1,.2,.4,.6,.8),
                                                 na.rm=T)),
                    name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
  geom_sf(data=calgary  %>% 
               filter(MAJORITY == 1), 
               fill="gold",colour="NA") +
  mapTheme +
  labs(title="Observed and Predicted Flooded Areas",
       subtitle="Calgary; Observed flooded land in gold; Predicted flooded land in gradient ")

Calgary Confusion Metrics Map

This map explores the spatial distribution of our confusion metrics. Our true positives run along observed flooding and waterways. Our true negatives take up most of the map (as most of Calgary did not flood). As mentioned above, the false negatives are the ones to home in on, as those areas flooded without being predicted to and would need assistance.

calgary %>%
  mutate(confResult=case_when(allPredictions < 50 & MAJORITY==0 ~ "True_Negative",
                              allPredictions >= 50 & MAJORITY==1 ~ "True_Positive",
                              allPredictions < 50 & MAJORITY==1 ~ "False_Negative",
                              allPredictions >= 50 & MAJORITY==0 ~ "False_Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF"),
                    name="Outcomes")+
  labs(title="Confusion Metrics For Calgary") +
  mapTheme

Conclusion

Modeling extreme weather events is only going to become more imperative. This exercise set out not only to build a model that predicts flooding but, more generally, to identify variables that explain flooding potential, so that the approach can be replicated at many scales in many different areas.

Accuracy and Generalizability

Overall, our model is strong. Looking just at Calgary, the model has high accuracy: this is shown in the distribution plots, in a majority of true positives/negatives (correct predictions), and in a high AUC (0.83). All of these indicators support using the model from an accuracy standpoint. In terms of generalizability, we used two main checks. The spatial cross-validation let us see the distribution of predictions across the city of Calgary, and overall, the model predicted well in different neighborhoods with the variables chosen. Our model also performed plausibly when predicting on Portland-Vancouver. One of the biggest limitations of this exercise is that we have not yet “ground-truthed” our predictions in Portland-Vancouver, but they follow patterns similar to Calgary's. Overall, we support this model framework being used.

Ways Forward

There are three major takeaways and limitations to this model:

  1. The model does not take flooding severity into account; it gives only a binary answer of whether an area flooded. A flood reaching 1 foot of undesired water versus 10 feet has very different impacts. Expanding this model to include severity would make it more complex and more useful (if accurate).

  2. As mentioned above, we need to do more exploration of generalizability. If we could compare the predictions in Portland-Vancouver to observed flooding, we could further support or disprove our model. As is, we believe in our output!

  3. If the model is accurate and generalizable as is, Portland-Vancouver needs infrastructural flooding support! We predicted that much of the city is susceptible to flooding, and action should be taken!

Implementation

Implementing built changes to combat environmental issues is challenging because many environmental problems extend past political boundaries; however, climate change is here and must be addressed. For example, Portland and Vancouver should consider working together (rather than separately) on armoring their cities against flooding, since they share a river. This model is only the beginning of important mitigation and adaptation work.