We must begin to explore mitigation and adaptation strategies in light of extreme weather events. As climate change worsens, it will directly affect human settlement. Climate change is already causing more severe flooding, and we believe mitigation and adaptation efforts must be informed by observed and predicted flood data. Predicting where floods will occur allows people, cities, and government agencies to protect environmental, built, and social assets.
Currently, our model is based on two sites. The first site is Calgary, Canada, the country's third largest city, with warm summers and dry, cold winters. Based on available data, we used Calgary to train (identify which features relate to flooding) and test (evaluate the results of) our model. Once we had a relatively accurate, generalizable, and satisfactory model, we predicted flooding for the cities of Portland, OR and Vancouver, WA. We chose Portland and Vancouver because, combined, they are comparable to Calgary: they are large US cities, have a river running through them, and have similar climates.
The goal of this exercise is to determine which variables are most prominent in predicting flooding, so that the model can be widely used to help protect assets. Planners may find this analysis useful for understanding how flood risk is distributed across their own cities.
Check out the markdown and the linked informational video below to learn more about our model: https://www.youtube.com/watch?v=Jtj3mype5xM
The following lines of code show our process to set up our R Markdown document.
knitr::opts_chunk$set(echo = TRUE)
library(caret)
library(pscl)
library(plotROC)
library(pROC)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(tigris)
library(viridis)
library(gridExtra)
library(dplyr)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(mapview)
library(tidycensus)
library(ggplot2)
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.75),
axis.ticks=element_blank())
palette2 <- c("#981FAC","#FF006A")
The following lines of code take data created in ArcGIS Pro for Calgary and Portland-Vancouver and manipulate it for our model-building process in R.
The data come from multiple sources, listed below:
Calgary Open Data: https://data.calgary.ca/ (calgary boundary, calgary DEM)
USA + CA Land Cover: http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2015-landsat-and-rapideye/ (used for impervious surfaces)
USGS 3DEP: https://www.usgs.gov/3d-elevation-program/about-3dep-products-services (Portland-Vancouver DEM)
Portland Open Data: https://gis-pdx.opendata.arcgis.com/ (Portland-Vancouver boundary)
In ArcGIS Pro, we built out 5 variables for our model:
Average Elevation (took average elevation straight from the DEM raster)
Average Distance from Streams (created a streams layer from the ArcGIS hydrology workflow and used distance accumulation)
Average Flow Accumulation (created from the ArcGIS hydrology workflow)
Average Slope (found using the slope tool)
Number of Impervious Cells (after reclassifying the land cover data, we summed the number of impervious cells)
Our workflow to join those variables to a fishnet of the respective sites was:
Conduct zonal statistics on each variable
Join them to an already-created fishnet covering each site, aggregating the data to a fine-grained level (a sketch of the equivalent step in R follows below)
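For readers working outside ArcGIS Pro, here is a minimal sketch of the equivalent zonal-statistics step using the raster package loaded above. The file names dem.tif and fishnet.shp are hypothetical stand-ins for illustration, not files from our project.
#hypothetical zonal statistics in R, mirroring the ArcGIS workflow
dem <- raster("dem.tif") #elevation raster (stand-in file name)
fishnet <- st_read("fishnet.shp") %>% #analysis grid (stand-in file name)
st_transform(crs = projection(dem)) #match the raster's CRS
#mean elevation per fishnet cell, analogous to Zonal Statistics as Table
fishnet$avgel <- raster::extract(dem, as(fishnet, "Spatial"),
fun = mean, na.rm = TRUE)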
setwd("C:/Users/ammar/OneDrive/Desktop/CPLN675/MIDTERM/R_Code/midterm/midterm")
calgaryboundary <- st_read("FE_calgshp/calg_boundary.shp") %>% st_transform(crs = 3400) #city boundary
cal_streams <- st_read("FE_calgshp/calg_streams.shp") %>% st_transform(crs = 3400) #streams
cal_avgel <- st_read("FE_calgshp/calg_fishnet_avgel.shp") %>% st_transform(crs = 3400) #avg elevation
cal_fac <- st_read("FE_calgshp/calg_fishnet_fac.shp") %>% st_transform(crs = 3400) #flow accumulation
cal_impervious <- st_read("FE_calgshp/calg_fishnet_imperv.shp") %>% st_transform(crs = 3400)
cal_inundation <- st_read("FE_calgshp/calg_fishnet_inun.shp") %>% st_transform(crs = 3400)#inundation
cal_slope <- st_read("FE_calgshp/calg_fishnet_slope.shp") %>% st_transform(crs = 3400) #slope in degrees
cal_strdist <- st_read("FE_calgshp/calg_fishnet_strdist.shp") %>% st_transform(crs = 3400) #distance from stream
#creating dataframe placeholders, populated below
AREA <- c("")
MAJORITY <- c("")
avgel <- c("") #average elevation
facMean <- c("") #mean column of cal_fac
impervSum <- c("") #sum column of cal_impervious
slopeMean <- c("") #mean column of cal_slope
strDist <- c("") #zscalgst_4 of cal_strdist
geometry <- c("")
calgary <- data.frame (AREA, MAJORITY, avgel, facMean, impervSum, slopeMean, strDist, geometry)
geometry <- cal_inundation$geometry
#create new inundation without geometry to populate some variables
cal_inundation1 <- cal_inundation %>% st_drop_geometry()
AREA <- cal_inundation1[, c("zscalgin_3")]
MAJORITY <- cal_inundation1[, c("zscalgin_4")]
#create new feature engineered datasets without geometry to populate some variables
cal_avgel1 <- cal_avgel %>% st_drop_geometry()
avgel <- cal_avgel1[, c("zscalgav_4")] #mean
cal_fac1 <- cal_fac %>% st_drop_geometry()
facMean <- cal_fac1[, c("zscalgfa_3")] #mean
cal_impervious1 <- cal_impervious %>% st_drop_geometry()
impervSum <- cal_impervious1[, c("zscalgim_4")] #sum
cal_slope1 <- cal_slope %>% st_drop_geometry()
slopeMean <- cal_slope1[, c("zscalgsl_4")] #mean
cal_strdist1 <- cal_strdist %>% st_drop_geometry()
strDist <- cal_strdist1[, c("zscalgst_4")] #mean
calgary <- data.frame (AREA, MAJORITY, avgel, facMean, impervSum, slopeMean, strDist, geometry) %>% st_as_sf() %>% st_transform(crs = 3400)
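A few quick sanity checks (a sketch) can confirm the assembled fishnet looks right before modeling:
#quick sanity checks on the assembled Calgary fishnet
nrow(calgary) #one row per fishnet cell
st_crs(calgary)$epsg #should report 3400
table(calgary$MAJORITY) #inundation flag: 0 = not flooded, 1 = flooded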
plvcboundary <- st_read("FE_plvcshp/PLVC_bounds.shp") %>% st_transform(crs = 8756) #city boundary
plvc_fishnet <- st_read("FE_plvcshp/plvc_fishnet_clear.shp") %>% st_transform(crs = 8756) #fishnet
plvc_streams <- st_read("FE_plvcshp/PLVC_streams.shp") %>% st_transform(crs = 8756) #streams
plvc_avgel <- st_read("FE_plvcshp/PLVC_fishnet_avgel.shp") %>% st_transform(crs = 8756) #avg elevation
plvc_fac <- st_read("FE_plvcshp/plvc_fishnet_fac.shp") %>% st_transform(crs = 8756) #flow accumulation
plvc_impervious <- st_read("FE_plvcshp/plvc_fishnet_imperv.shp") %>% st_transform(crs = 8756)
plvc_slope <- st_read("FE_plvcshp/plvc_fishnet_slope.shp") %>% st_transform(crs = 8756) #slope in degrees
plvc_strdist <- st_read("FE_plvcshp/plvc_fishnet_strdist.shp") %>% st_transform(crs = 8756) #distance from stream
#creating dataframe placeholders, populated below
p_AREA <- c("")
p_avgel <- c("") #average elevation
p_facMean <- c("") #mean column of plvc_fac
p_impervSum <- c("") #sum column of plvc_impervious
p_slopeMean <- c("") #mean column of plvc_slope
p_strDist <- c("") #zsplvcst_4 of plvc_strdist
p_geometry <- c("")
plvc <- data.frame (p_AREA, p_avgel, p_facMean, p_impervSum, p_slopeMean, p_strDist, p_geometry)
p_geometry <- plvc_fishnet$geometry
#create new inundation without geometry to populate some variables
p_fishnet1 <- plvc_fishnet %>% st_drop_geometry()
p_AREA <- p_fishnet1[, c("Area")]
#create new feature engineered datasets without geometry to populate some variables
plvc_avgel1 <- plvc_avgel %>% st_drop_geometry()
p_avgel <- plvc_avgel1[, c("zsplvcav_4")] #mean
plvc_fac1 <- plvc_fac %>% st_drop_geometry()
p_facMean <- plvc_fac1[, c("zsplvcfa_3")] #mean
plvc_impervious1 <- plvc_impervious %>% st_drop_geometry()
p_impervSum <- plvc_impervious1[, c("zsplvcim_4")] #sum
plvc_slope1 <- plvc_slope %>% st_drop_geometry()
p_slopeMean <- plvc_slope1[, c("zsplvcsl_4")] #mean
plvc_strdist1 <- plvc_strdist %>% st_drop_geometry()
p_strDist <- plvc_strdist1[, c("zsplvcst_4")] #mean
plvc <- data.frame (p_AREA, p_avgel, p_facMean, p_impervSum, p_slopeMean, p_strDist, p_geometry) %>% st_as_sf() %>% st_transform(crs = 8756)
plvc <- plvc %>%
rename(AREA = p_AREA,
avgel = p_avgel,
facMean = p_facMean,
impervSum = p_impervSum,
slopeMean = p_slopeMean,
strDist = p_strDist)
To better suit the scales of our variables to both sites, we also did some minor feature engineering, categorizing the appropriate variables; the code is below.
calgary <- calgary %>%
mutate(elevation = case_when(
avgel <= 1077.99999 ~ "low",
avgel >= 1078 & avgel <= 1189.99999 ~ "average",
TRUE ~ "high"))
calgary <- calgary %>%
mutate(flow = case_when(
facMean <= 33016.99999 ~ "low",
facMean >= 33017 & facMean <=66030.99999 ~ "average",
TRUE ~ "high"))
calgary <- calgary %>%
mutate(distance = case_when(
strDist <= 1356.99999 ~ "nearest",
strDist >= 1357 & strDist <=2673.99999 ~ "near",
strDist >= 2674 & strDist <=3990.99999 ~ "average",
strDist >= 3991 & strDist <=5307.99999 ~ "far",
TRUE ~ "furthest"))
calgary <- calgary %>%
mutate(slope = case_when(
slopeMean < 3 ~ "flat",
TRUE ~ "steep"))
calgary <- calgary %>%
mutate(pervious = case_when(
impervSum <= 5 ~ "yes",
TRUE ~ "no"))
plvc <- plvc %>%
mutate(elevation = case_when(
avgel <= 117.99999 ~ "low",
avgel >= 118 & avgel <= 226.99999 ~ "average",
avgel >=227 ~ "high"))
plvc <- plvc %>%
mutate(flow = case_when(
facMean <= 24519.99999 ~ "low",
facMean >= 24520 & facMean <=49031.999 ~ "average",
TRUE ~ "high"))
plvc <- plvc %>%
mutate(distance = case_when(
strDist <= 774.99999 ~ "nearest",
strDist >= 775 & strDist <=1517.99999 ~ "near",
strDist >= 1518 & strDist <=2260.99999 ~ "average",
strDist >= 2261 & strDist <=3003.99999 ~ "far",
TRUE ~ "furthest"))
plvc <- plvc %>%
mutate(slope = case_when(
slopeMean < 3 ~ "flat",
TRUE ~ "steep"))
plvc <- plvc %>%
mutate(pervious = case_when(
impervSum <= 5 ~ "yes",
TRUE ~ "no"))
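The breakpoints above were chosen by hand for each site. A more portable alternative, sketched below (and not what we ran), derives the bins from each site's own quantiles, avoiding hard-coded values like 1356.99999. The column name distance_q is hypothetical.
#sketch: quantile-based bins instead of hand-coded breakpoints
calgary <- calgary %>%
mutate(distance_q = cut(strDist,
breaks = quantile(strDist, probs = seq(0, 1, 0.2), na.rm = TRUE),
labels = c("nearest", "near", "average", "far", "furthest"),
include.lowest = TRUE))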
Below are maps of all our variables. Later in the model, average elevation, flow accumulation, distance from stream, and impervious surfaces prove significant (in this case, a p-value below 0.05, weighing heavily on what accounts for flooding).
ggplot() +
geom_sf(data=calgary, fill = "dark blue", color = "dark blue") +
geom_sf(data = calgary, aes(fill=as.factor(MAJORITY)), color = "light blue")+
scale_fill_manual(values = c("white", "light blue"),
labels = c("No Flooding","Flooding"),
name = "") +
labs(title="Observed Flooding in Calgary (Fishnet)") +
mapTheme
grid.arrange(ncol = 3,
ggplot() +
geom_sf(data = calgary) +
geom_sf(data = cal_avgel,
aes(fill=zscalgav_4),
colour=NA) +
scale_fill_viridis(option="mako", trans="reverse",
name = "Mean") +
labs(title = "Average \n Elevation ") +
mapTheme,
ggplot() +
geom_sf(data = calgary) +
geom_sf(data = cal_fac,
aes(fill=zscalgfa_3),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Flow \n Accumulation") +
mapTheme,
ggplot() +
geom_sf(data = calgary) +
geom_sf(data = cal_slope,
aes(fill=zscalgsl_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Slope Degrees") +
mapTheme,
ggplot() +
geom_sf(data = calgary) +
geom_sf(data = cal_strdist,
aes(fill=zscalgst_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Distance From \n Stream") +
mapTheme,
ggplot() +
geom_sf(data = calgary) +
geom_sf(data = cal_impervious,
aes(fill=zscalgim_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Sum") +
labs(title = "Impervious \n Surface") +
mapTheme)
grid.arrange(ncol = 3,
ggplot() +
geom_sf(data = plvc) +
geom_sf(data = plvc_avgel,
aes(fill=zsplvcav_4),
colour=NA) +
scale_fill_viridis(option="mako", trans="reverse",
name = "Mean") +
labs(title = "Average \n Elevation") +
mapTheme,
ggplot() +
geom_sf(data = plvc) +
geom_sf(data = plvc_fac,
aes(fill=zsplvcfa_3),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Flow \n Accumulation") +
mapTheme,
ggplot() +
geom_sf(data = plvc) +
geom_sf(data = plvc_slope,
aes(fill=zsplvcsl_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Slope Degrees") +
mapTheme,
ggplot() +
geom_sf(data = plvc) +
geom_sf(data = plvc_strdist,
aes(fill=zsplvcst_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Mean") +
labs(title = "Distance \n From Stream") +
mapTheme,
ggplot() +
geom_sf(data = plvc) +
geom_sf(data = plvc_impervious,
aes(fill=zsplvcim_4),
colour=NA) +
scale_fill_viridis(option="mako", trans = "reverse",
name = "Sum") +
labs(title = "Impervious \n Surface") +
mapTheme)
Below are plots of our variables in Calgary, showing average values for flooded and not-flooded areas. Notably, flow accumulation is much higher in flooded areas, aligning with our hypothesis that areas collecting more water flood more easily. Impervious surfaces are more frequent in not-flooded areas, potentially contrary to our hypothesis that an inability to drain would cause more flooding. Lastly, areas that did not flood are farther from streams, which aligns with our hypothesis that proximity to streams increases flooding potential.
calgaryPlotVariables <-
calgary %>%
as.data.frame() %>%
dplyr::select(MAJORITY,facMean,avgel,slopeMean,strDist,impervSum) %>%
gather(variable, value, -MAJORITY)
ggplot(calgaryPlotVariables %>%
group_by(MAJORITY, variable) %>%
summarize(mean = mean(value))) +
geom_bar(aes(as.factor(MAJORITY),
mean,
fill=as.factor(MAJORITY)),
stat="identity") +
facet_wrap(~variable, ncol = 5, scales = "free") +
labs(title = "Exploratory Analysis of Features",
caption="Without Feature Engineering") +
scale_fill_manual(values = c("light blue", "dark blue"),
labels = c("Not Flooded","Flooded"),
name = "") +
labs(x="Preserved", y="Value")
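The same group means can also be rendered as a table with knitr::kable; a sketch using the kableExtra package loaded above:
#sketch: group means as a formatted table rather than a bar chart
calgaryPlotVariables %>%
group_by(MAJORITY, variable) %>%
summarize(mean = mean(value)) %>%
kable(digits = 2) %>%
kable_styling()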
After familiarizing ourselves with the variables, we can begin to build a model. To start testing the probability of flooding in Calgary, we partition the dataset into a 70/30 split to train and test on.
set.seed(3456)
trainIndex <- createDataPartition(calgary$MAJORITY, p = .70,
list = FALSE,
times = 1)
calgaryTrain <- calgary[ trainIndex,]
calgaryTest <- calgary[-trainIndex,]
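Because createDataPartition stratifies on MAJORITY, the flooded share should be nearly identical in both splits; a quick check (sketch):
prop.table(table(calgaryTrain$MAJORITY)) #share flooded vs. not flooded in training set
prop.table(table(calgaryTest$MAJORITY)) #should closely match the training share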
For this exercise, we used a binomial logistic regression fit with R's glm. Below is our final iteration and what we believe to be the best version of our model. The model incorporates all of the feature-engineered variables above. Echoing the plots above, the model found the nearest distance-from-stream category; flow accumulation at both high and low levels; impervious surfaces; and low elevation to be statistically significant.
options(scipen=999)
calgaryModel <- glm(MAJORITY ~ distance + slope + flow + pervious + elevation,
family="binomial"(link="logit"), data = calgaryTrain %>%
as.data.frame %>%
dplyr::select(-AREA, -geometry, -strDist, -facMean, -slopeMean, -impervSum, -avgel))
summary(calgaryModel)
##
## Call:
## glm(formula = MAJORITY ~ distance + slope + flow + pervious +
## elevation, family = binomial(link = "logit"), data = calgaryTrain %>%
## as.data.frame %>% dplyr::select(-AREA, -geometry, -strDist,
## -facMean, -slopeMean, -impervSum, -avgel))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8717 -0.3969 -0.2776 -0.1258 3.1183
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32979 0.60484 -3.852 0.000117 ***
## distancefar -14.84497 644.11876 -0.023 0.981613
## distancefurthest -15.58523 2142.90731 -0.007 0.994197
## distancenear 0.09435 0.37325 0.253 0.800449
## distancenearest 1.74493 0.32081 5.439 0.00000005352677 ***
## slopesteep -0.01934 0.18481 -0.105 0.916644
## flowhigh 3.00010 1.12777 2.660 0.007809 **
## flowlow -2.59918 0.49572 -5.243 0.00000015778565 ***
## perviousyes 1.00958 0.14470 6.977 0.00000000000302 ***
## elevationhigh -14.63870 469.67655 -0.031 0.975136
## elevationlow 0.68223 0.16818 4.056 0.00004983316554 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2335.6 on 5562 degrees of freedom
## Residual deviance: 1813.5 on 5552 degrees of freedom
## AIC: 1835.5
##
## Number of Fisher Scoring iterations: 18
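Because glm coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret; for instance, exp(1.74) is roughly 5.7, meaning cells nearest a stream have about 5.7 times the odds of flooding relative to the baseline distance category. A one-line sketch:
exp(coef(calgaryModel)) #convert log-odds coefficients to odds ratios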
The following section uses several visualizations and tools to evaluate the accuracy and generalizability of our model.
In this distribution plot, 0 means an area did not flood and 1 means an area did flood. A strong model will show predicted probabilities peaking near 0 for non-flooded observations and near 1 for flooded ones. For our not-flooded observations, predictions cluster toward 0, meaning our model is generally good at identifying places that did not flood. For our flooded observations, there is a larger peak near 0 (cases we would predict incorrectly) and a smaller peak near 1 (cases predicted correctly). Overall, this plot tells us the model is better at predicting non-flooded areas than flooded areas, likely because far more of Calgary's land did not flood, and the model was trained on that imbalanced dataset. This does not mean the model cannot be used; it is simply a factor to consider in its efficacy.
Based on the distribution above, we lower the classification threshold to 0.1 moving forward.
classProbs <- predict(calgaryModel, calgaryTest, type="response")
hist(classProbs)
testProbs <- data.frame(obs = as.numeric(calgaryTest$MAJORITY),
pred = classProbs)
ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) +
geom_density() +
facet_grid(obs ~ .) +
xlab("Probability") +
geom_vline(xintercept = .5) +
scale_fill_manual(values = c("light blue", "dark blue"),
labels = c("Not Flooded","Flooded"),
name = "")+
plotTheme
Using our new threshold as a guide, this confusion matrix helps us evaluate our model in terms of true positives, false positives, true negatives, and false negatives. In this case we had: 1. 47 true positives (we predicted a flood and there was a flood) 2. 133 false positives (we predicted a flood and there was no flood) 3. 2119 true negatives (we predicted no flood and there was no flood) 4. 85 false negatives (we predicted no flood and there was a flood)
In our evaluation, we must keep an eye on the false negatives, as those are areas that flooded but would not have been flagged for protection.
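The 0.1 cutoff was chosen by eye from the density plot. A more systematic way to pick it (a sketch, not part of our original workflow) is to sweep candidate thresholds and compare sensitivity and specificity at each:
#sketch: sweep classification thresholds over the test set
thresholds <- seq(0.05, 0.5, by = 0.05)
threshSweep <- sapply(thresholds, function(t) {
pred <- ifelse(testProbs$pred > t, 1, 0)
c(threshold = t,
sensitivity = sum(pred == 1 & testProbs$obs == 1) / sum(testProbs$obs == 1),
specificity = sum(pred == 0 & testProbs$obs == 0) / sum(testProbs$obs == 0))
})
t(threshSweep) #one row per candidate threshold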
testProbs$predClass = ifelse(testProbs$pred > .1 ,1,0)
caret::confusionMatrix(reference = as.factor(testProbs$obs),
data = as.factor(testProbs$predClass),
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2119 85
## 1 133 47
##
## Accuracy : 0.9086
## 95% CI : (0.8963, 0.9198)
## No Information Rate : 0.9446
## P-Value [Acc > NIR] : 1.000000
##
## Kappa : 0.2536
##
## Mcnemar's Test P-Value : 0.001456
##
## Sensitivity : 0.35606
## Specificity : 0.94094
## Pos Pred Value : 0.26111
## Neg Pred Value : 0.96143
## Prevalence : 0.05537
## Detection Rate : 0.01971
## Detection Prevalence : 0.07550
## Balanced Accuracy : 0.64850
##
## 'Positive' Class : 1
##
The ROC curve plots the true positive rate against the false positive rate, and the area under the curve (AUC) summarizes it. Our AUC of 0.83 indicates that the model discriminates well between the flooded and not-flooded classes, a good sign that it may be sufficient to use.
ggplot(testProbs, aes(d = obs, m = pred)) +
geom_roc(n.cuts = 50, labels = FALSE, color = 'blue') +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey')
auc(testProbs$obs, testProbs$pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8349
We continue to evaluate our model with 100-fold cross-validation. The histogram below shows fold accuracies clustering around a high value, suggesting the model performs consistently across subsets of the data.
ctrl <- trainControl(method = "cv",
number = 100,
savePredictions = TRUE)
cvFit <- train(as.factor(MAJORITY) ~ ., data = calgary %>%
as.data.frame() %>%
dplyr::select(-AREA, -geometry, -strDist, -facMean, -slopeMean, -impervSum, -avgel),
method="glm", family="binomial",
trControl = ctrl)
cvFit
ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Accuracy",
y="Count")+
plotTheme
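To summarize the spread shown in the histogram, the mean and standard deviation of accuracy across the 100 folds can be read straight from the resample table (a sketch):
mean(cvFit$resample$Accuracy) #average fold accuracy
sd(cvFit$resample$Accuracy) #spread across folds; smaller suggests more consistent performance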
Now that we have evaluated our model (conclusions below), we will use it to predict on our other site. This is a step toward evaluating whether our model is generalizable, meaning whether it can move across space and time and predict well under other conditions. As mentioned above, this exercise uses the cities of Portland and Vancouver. The two code chunks below predict on Calgary and on Portland-Vancouver for further analysis.
allPredictions <-
predict(cvFit, calgary, type="prob")[,2]
calgary <-
cbind(calgary,allPredictions) %>%
mutate(allPredictions = round(allPredictions * 100))
calgary1 <- calgary %>%
mutate(PredClass = ifelse(allPredictions > 10, 1, 0))
calgary1 <- calgary1 %>%
mutate(Correct = ifelse(PredClass == MAJORITY, "1", "0"),
Incorrect = ifelse(PredClass != MAJORITY, "1", "0"))
plvc_allPredictions <-
predict(cvFit, plvc, type="prob")[,2]
plvc <-
cbind(plvc,plvc_allPredictions) %>%
mutate(plvc_allPredictions = round(plvc_allPredictions * 100))
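If hard flood/no-flood classes were wanted for Portland-Vancouver, the same 10% cutoff used for Calgary could be applied (a sketch; the maps below use quintile breaks instead):
plvc <- plvc %>%
mutate(PredClass = ifelse(plvc_allPredictions > 10, 1, 0)) #same 10% cutoff as Calgary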
As further validation of our model, we conducted a form of spatial cross-validation: aggregating the predicted grid cells up to the neighborhood level. In this code chunk, we built maps that display the average probability of flooding by sector/neighborhood in Calgary. The higher probabilities sit closer to water features, supporting our hypothesis and showing that proximity is a major factor in both flooding and this model.
cal_nhood <- st_read("nhood/geo_export_657fd344-2823-4047-abad-f7d6906b7c1a.shp") %>% st_transform(crs = 3400)
cal_nhood <- st_intersection(calgary1, cal_nhood)
cal_nhood <- cal_nhood %>% mutate(Correct = as.numeric(Correct),
Incorrect = as.numeric(Incorrect))
cal_nhood <- cal_nhood %>% dplyr::select(name, sector, MAJORITY, allPredictions, PredClass, Correct, Incorrect, -geometry) %>%
group_by(name) %>%
summarise(meanPrediction = mean(allPredictions),
accuracy = sum(Correct) / (sum(Correct) + sum(Incorrect)))
grid.arrange(ncol = 2,
ggplot() +
geom_sf(data = cal_nhood,
aes(fill=accuracy),
colour=NA) +
geom_sf(data=calgary %>%
filter(MAJORITY == 1),
fill="gold",colour="NA") +
scale_fill_viridis(option="mako", trans="reverse") +
labs(title = "Average Probability of Flooding by Sector with Observed Flooding in Gold") +
mapTheme,
ggplot() +
geom_sf(data = cal_nhood)+
geom_sf(data = cal_nhood,
aes(fill=accuracy),
colour=NA) +
scale_fill_viridis(option="mako", trans="reverse") +
labs(title = "Accuracy of Flooding (Average by Neighborhood)") +
mapTheme)
Below are maps of predicted flood probability, first for Calgary and then for Portland-Vancouver.
ggplot() +
geom_sf(data=calgary, aes(fill=factor(ntile(allPredictions,5))),
colour=NA) +
scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
labels=as.character(quantile(calgary$allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
mapTheme +
labs(title="Predicted Flooded Areas in Calgary")
ggplot() +
geom_sf(data=plvc, aes(fill=factor(ntile(plvc_allPredictions,5))),
colour=NA) +
scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
labels=as.character(quantile(plvc$plvc_allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
mapTheme +
labs(title="Predicted Flooded Areas in Portland-Vancouver")
This map shows predicted flooding overlaid with observed flooding. Generally, darker areas (those more likely to flood) coincide with previously observed flooding, which again speaks to the overall accuracy of our model.
ggplot() +
geom_sf(data=calgary, aes(fill=factor(ntile(allPredictions,5))), colour=NA) +
scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
labels=as.character(quantile(calgary$allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
geom_sf(data=calgary %>%
filter(MAJORITY == 1),
fill="gold",colour="NA") +
mapTheme +
labs(title="Observed and Predicted Flooded Areas",
subtitle="Calgary; Observed flooded land in gold; Predicted flooded land in gradient ")
This is a map of our confusion metrics, exploring their distribution in space. Our true positives run along observed flooding and waterways. Our true negatives take up most of the map (as most of Calgary did not flood). As mentioned above, the false negatives are what we should home in on, as those areas flooded but would not have received assistance.
calgary %>%
mutate(confResult=case_when(allPredictions < 50 & MAJORITY==0 ~ "True_Negative",
allPredictions >= 50 & MAJORITY==1 ~ "True_Positive",
allPredictions < 50 & MAJORITY==1 ~ "False_Negative",
allPredictions >= 50 & MAJORITY==0 ~ "False_Positive")) %>%
ggplot()+
geom_sf(aes(fill = confResult), color = "transparent")+
scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF"),
name="Outcomes")+
labs(title="Confusion Metrics For Calgary") +
mapTheme
Modeling extreme weather events will only become more imperative. This exercise set out not only to build a model that predicts flooding but, more generally, to identify variables that explain flooding potential, so the approach can be replicated at many scales in many different areas.
Overall, our model is strong. Looking just at Calgary, the model has high accuracy rates, shown in the distribution plots, a majority of true positives/negatives (correct predictions), and a high AUC (0.83). All of these indicators support using the model from an accuracy standpoint. In terms of generalizability, we used two main checks. The spatial cross-validation let us see the distribution of predictions across Calgary; overall, the model predicted well in different neighborhoods given the variables chosen. Our model also responded well when predicting on Portland-Vancouver. One of the biggest limitations of this exercise is that we have not yet “ground-truthed” our predictions in Portland-Vancouver, but they follow patterns similar to Calgary's. Overall, we are supportive of this model framework being used.
There are three major takeaways or limitations to this model. 1. The model does not take flooding severity into account; it gives only a binary answer of whether an area flooded. A flood reaching 1 foot of undesired water versus 10 feet has very different impacts. Expanding this model to include severity would make it more complex and more useful (if accurate). 2. As mentioned above, we need to do more exploration of generalizability. If we could compare the predictions in Portland-Vancouver to observed flooding, we could further support or disprove our model. As is, we believe in our output! 3. If the model is accurate and generalizable as is, Portland-Vancouver needs infrastructural flooding support! We predicted that much of the area is susceptible to flooding, and action should be taken!
Implementing built changes to combat environmental issues is challenging because many environmental problems extend past political boundaries; however, climate change is here and must be addressed. For example, Portland and Vancouver should consider working together (rather than separately) on armoring their cities against flooding, since they share a river. This model is only the beginning of important mitigation and adaptation work.