Introduction + Use Case

Our firm was contracted by New Jersey Transit to create a model-based dashboard that would aid NJT in their Capital Improvements Program (CIP) decision-making by analyzing systematic delays. Unfortunately, New Jersey Transit has frequent issues with delayed trains, but there are plans to spend more money on infrastructure improvements that will make the system run more smoothly. In 2019, and for the second year in a row, NJT Trains broke down more than any other system in the county (according to the Federal Transit Administration). In January 2021, NJT ordered 25 new dual-powered locomotives that are in the process of being distributed and in 2023, 113 new multi-level rail cars will arrive to replace the 40-year old previous fleet. The mechanical failure rate of the new cars is a fraction of the previous fleet’s. Our model-based dashboard sets out to help NJT track where delays are the longest and prioritize lines that are experiencing frequent delays with the most profit potential by predicting delays into the future for CIP planning. Currently, the model focuses on the 5 lines that go directly into NYC. Check out the further report and model below:

Project Pitch + Dashboard Mockups

See our project pitch video at this link: https://www.youtube.com/watch?v=o9lDDhZ5rvc

Project Pitch Summary

The user of this dashboard will hopefully be a NJT employee who works in logistics and does capital improvements planning. Our project pitch includes 3 wireframes of our potential dashboard. The first wireframe is an interactive map that will be colored by delay times. The NJT employee users will be able to select the line and timeframe to see which lines are struggling most with delays. The second wireframe will present summarized tabular data, where a NJT employee can explore the line, time, avg delay, and ridership. The third and last wireframe is a CIP planning calculator. Once inputting relevant information like the line, timeframe, and intensity of repair, the dashboard will deduce costs spent, but also costs gained from consistent and solid service leaving a net value.

Methodology

Our methodology for this analysis can be categorized in three different processes: data collection + manipulation, model building, and evaluation ## Data Collection + Manipulation - Pull in relevant NJT data (both training and test time periods) - Filter and clean data to be most fitting and specific (ie. lines going into NYC, removing outliers) - Pull in our external variable, weather data, and clean the variable - Create our ride panel, or spatial time-series, dataset that is crucial to the following analysis - Feature engineer variables (manipulate them to perform the best in the model) - Conduct exploratory analysis to see the most significant variables to include in the model ## Model Building - Create a baseline regression model with minimal variables for comparison sake - Run iterations of models to see which produces the best results (Model 4) ## Evaluation - Once the models are completed, run an evaluation of the baseline and best-suited regression that includes charting and summary statistics - Run validation by line to get more nuanced results for policy recommendations

Set Up

Below is the setup for the analysis, including packages necessary and color palettes.

#knitr::opts_chunk$set(echo=TRUE, warning=FALSE, messages=FALSE, results = 'hide', include=TRUE, fig.keep = 'all')

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(mapview)
library(tidycensus)
library(ggcorrplot)
library(RColorBrewer)
library(stargazer)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"


palette5 <- c("#580E44","#47A6FF","#F5853E","#B61D8E","#004F99")
palette2 <- c("#F5853E","#B61D8E")
palettechloro <-c("#fef0d9", "#fdcc8a", "#fc8d59", "#e34a33", "#b30000")

Importing + Filtering the Data

The code chunk below imports the New Jersey Transit data and completes the initial, necessary cleaning. Data from September and October 2019 is used to train on. These two months were chosen under the pretense that they would illustrate regular ridership patterns. November 2019 data will be predicted on. NJT data is filtered for lines going directly into New York City, for which there are 5. Time intervals, especially lines running every hour, are also created for later analysis.

setwd("C:/Users/ammar/OneDrive/Desktop/CPLN592/Final/MUSA_508_Final")

Sep19 <- read_csv("September2019.csv") #training data

Oct19 <- read_csv("October2019.csv") #training data

Nov19 <- read_csv("November2019.csv") #data to predict on

AllData <- rbind(Sep19, Oct19, Nov19) #merging them all together

AllData <- #filtering for NJT
  AllData %>%
  filter(type == "NJ Transit" ) 

AllData <- #filtering for lines going directly into NYC
  AllData %>%
  filter(line == "Montclair-Boonton" | line == "Gladstone Branch" | line == "Morristown Line" | line == "Northeast Corrdr" |line == "No Jersey Coast" ) 

# fixing time data
AllData <- #not sure about actual_time #copied from the book but cut out some of the code
  AllData %>% 
  mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  na.omit()

NJT Context Map

Below is a map of the lines selected for prediction analysis.

NJTRoutes <- 
  st_read("https://opendata.arcgis.com/datasets/e6701817be974795aecc7f7a8cc42f79_0.geojson") %>%
  st_transform('ESRI:102711') %>%
  filter(LINE_CODE == "MC" | LINE_CODE == "GL" | LINE_CODE == "ME" | LINE_CODE == "NE" |LINE_CODE == "NC") 

NJTStations <- 
  st_read("https://opendata.arcgis.com/datasets/4809dada94c542e0beff00600ee930f6_0.geojson") %>%
  st_transform('ESRI:102711')  %>%
  filter(LINE_CODE == "MC" | LINE_CODE == "GL" | LINE_CODE == "ME" | LINE_CODE == "NE" |LINE_CODE == "NC") 

census_api_key("05b9c101eb2ee7dc7abb88140da527ce637ac07f", overwrite = TRUE)

ddACS19_5 <- load_variables(year = 2019, dataset = "acs5", cache = TRUE)

#pulling in total population, # in labor force, # of commuters by commuter rail

vars19 <- 
  c(totalpop = "B01003_001", 
    laborcount = "B23025_002", 
    commuters = "B08301_013")

censustracts <-  
  get_acs(geography = "tract", variables = vars19, 
                year=2019, state=34, output = "wide", geometry=T) %>% 
  st_transform('ESRI:102711')

# Map 
ggplot() + 
  geom_sf(data = censustracts, color = "grey 89", fill = "transparent") +
  geom_sf(data = NJTRoutes, color = "#F5853E") +
  geom_sf(data = NJTStations, color = "#B61D8E") +
  labs(title = "Chosen NJ Transit Routes",
       caption = "Figure 1") +
  mapTheme(title_size = 11) + theme(legend.position = "none")

Weather Data

The main external variable for this analysis is weather data. Weather has been known to significantly impact transit flow and decision making. The weather attributes pulled for this analysis are temperature, precipitation, and wind-speed. The Newark airport was used for analysis has it is central to the lines chosen. Weather data will be further analyzed throughout the code.

ymd_hms("2019-10-01 15:48:00")
## [1] "2019-10-01 15:48:00 UTC"
weather.Data <- 
  riem_measures(station = "EWR", date_start = "2019-09-01", date_end = "2019-11-30")

weather.Panel <-  
  weather.Data %>%
    mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
    replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Percipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

Creating The Ride Panel

The two biggest lenses of this analysis are spatial and temporal. That being said, a time-series needs to be created for the following analysis. The following chunk of code is the process of creating a ride panel, or an observation of every space/time combinations. Since, our use case is geared toward long-range CIP planning and long delays, we are interested in having our ride panel relate to ā€œtripsā€ or the total completion of a train line within a certain hour. Therefore, each entry in our ride panel corresponds to how many times a line runs to completion, every hour of our analysis period. Our last step in creating our ride panel is to remove outlier delays, in this case delays that are over 3 hours.

ride.template <- 
  filter(AllData, week %in% c(35:48))

length(unique(ride.template$interval60)) * length(unique(ride.template$line)) #10920
## [1] 10920
study.panel <- 
  expand.grid(interval60 = unique(ride.template$interval60), 
              line = unique(ride.template$line))

nrow(study.panel) #10920
## [1] 10920
trip.count <- 
  ride.template %>%
  group_by(line) %>%
  mutate(max_stops = max(stop_sequence)) #creating a max sequence

ride.panel <- #train id and hour. find max stop sequence 
  trip.count %>%
    mutate(Trip_Counter = 1) %>%
      group_by(interval60, line, train_id, max_stops) %>%
      summarize(Trip_Count = sum(Trip_Counter, na.rm=T),
                delay_minutes = sum(delay_minutes, na.rm=T)) %>%
          mutate(week = week(interval60),
                 dotw = wday(interval60, label = TRUE))
## `summarise()` has grouped output by 'interval60', 'line', 'train_id'. You can override using the `.groups` argument.
ride.panel <- merge(ride.panel, weather.Panel, by="interval60") #adding all of our data with weather

ride.panel <- #filtering for delays less than 3 hours
  ride.panel %>%
  filter(delay_minutes < 180) 

Feature Engineering

We now conduct feature engineering to work on the significance of our variables for the model. The feature engineering includes creating delay lags, peak/non peak timing, week day or weekend indicators, and finally weather analysis.

# Time Lag
ride.panel <- 
  ride.panel %>% 
    arrange(line, interval60) %>% 
    group_by(line) %>% 
    mutate(lag1Hours = dplyr::lag(delay_minutes,1), # 1hour
           lag2Hours = dplyr::lag(delay_minutes,2),# 6hour
           lag3Hours = dplyr::lag(delay_minutes,3), # 12hour
           lag6Hours = dplyr::lag(delay_minutes,6), # 24hour
           lag24Hours = dplyr::lag(delay_minutes,24), # 48hour
           lag168Hours = dplyr::lag(delay_minutes,168))  %>%  # 1week
   ungroup()


# Peak 
ride.panel$time <- format(as.POSIXct(ride.panel$interval60), #Getting time column
                    format = "%H:%M:%S")

ride.panel <-
  ride.panel %>% 
  mutate(Peak = case_when(time >= "06:00:00" & time <= "10:00:00" ~ "AM",
                          time >= "16:00:00" & time <= "19:00:00" ~ "PM",
                          TRUE ~ "Non-Peak"))

# Day
ride.panel <-
  ride.panel %>%
  mutate(Day = case_when(
    dotw == "Mon" |dotw == "Tue" | dotw == "Wed" | dotw == "Thu"  | dotw == "Fri" ~ "Weekday",
    dotw == "Sat" |dotw == "Sun" ~ "Weekend"))

# Train ID
ride.panel <-
  ride.panel %>%
  mutate(TrainID = case_when(train_id == "0427" | train_id == "7220" | train_id == "6635" | train_id == "6919" | train_id == "3847" | train_id == "6621" | train_id == "0882" | train_id == "3825" | train_id == "6619" | train_id == "6640" | train_id == "6674" | train_id == "6611" | train_id == "6631" | train_id == "0439" | train_id == "6279" | train_id == "7273" | train_id == "0421" | train_id == "3861" | train_id == "4753" | train_id == "6295" ~ "Offender",
                             TRUE ~ "Non-Offender"))

ride.panel <- 
  ride.panel %>%
  mutate(Offender = case_when(TrainID == "Offender" ~ 1,
                              TrainID == "Non-Offender" ~ 0))

# Weather
ride.panel <-
  ride.panel %>%
  mutate(Weather = case_when(Temperature < 70 ~ "Cold",
                             Temperature >= 70 ~ "Hot"))

Splitting the Data

In order to run the model and conduct predictions, we must split the data. To do this, we assign all data in September and October as our train set and November data as our test set.

ride.Train <- filter(ride.panel, week < 45)
ride.Test <- filter(ride.panel, week >= 45)

Exploratory Analysis

Our feature engineering and model creation was built on the following analysis. This analysis section includes a wealth of visuals relating delays to the various features of the model. - Figures 2.1a/b/c chart weather over the 3-month period. - Figure 2.2 shows delay minutes as a function of temperature - Figure 2.3 shows delay minutes as a function of various categorical variables - Figure 2.4a/b charts lines and delay minutes of data from Sept & Oct 2019 - Figure 2.5 depicts delay minutes per day over each week - Figure 2.6 shows delay minutes as a function of the spatial lags - Figure 2.6 is a histogram of log(delays) - Figure 2.7 is a correlation plot of numeric variables in the model - Figure 2.8 depicts delay minutes as a function of max stops (how many stops on a line)

Highlights of our exploratory analysis are: - The NJ Coast and Morristown line have the longest summary of delays on the training set - The 2, 3, and 6 hour spatial lags seem to effect delays the most - Many of our numeric variables are not colinear, which is a good sign for model building - Longer delay minutes are correlated with more stops on a line - Longer delays occur in hotter weather - Longer delays happen in PM peak service times (4-7 PM) - Delays have a cyclical pattern, often occurring during weekend service

# charting weather over the 3 month period
grid.arrange(top = "Weather Data - NYC & North Jersey - September to November 2019",
  ggplot(weather.Panel, aes(interval60,Percipitation)) + geom_line() + 
    labs(title="Percipitation", x="Hour", y="Percipitation", caption = "Figure 2.1a") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed", caption = "Figure 2.1b") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature", caption = "Figure 2.1c") + plotTheme())

# Weather EA
ride.panel %>%
  group_by(interval60) %>% 
  summarize(DM = mean(delay_minutes),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, DM)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE, color = "#F5853E") +
    facet_wrap(~week, ncol=3) + 
    labs(title="Delay minutes as a fuction of Temperature by week",
         x="Temperature", y="Mean Trip Count", 
         caption = "Figure 2.2") +
    plotTheme()
## `geom_smooth()` using formula 'y ~ x'

# Categorical Vars
ride.panel %>%
  dplyr::select(delay_minutes, dotw, Peak, Day, Weather) %>%
  gather(Variable, Value, -delay_minutes) %>% 
   ggplot(aes(Value, delay_minutes)) +
     geom_bar(position = "dodge", stat = "summary", fill = "#F5853E", fun.y = "mean") +
     facet_wrap(~Variable, ncol = 1, scales = "free") +
     labs(title = "Delay minutes as a Function of\nCategorical Variables", y = "Delay Minutes", 
          caption="Figure 2.3") +
     plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

# charting delays by line
ride.Train %>% #This is here just t see delays. we could takeit out
  group_by(line) %>% 
  summarise(Delay = sum(delay_minutes)) %>%  kable(caption = "Lines by Delays (Figure 2.4a)") %>%
  kable_styling("striped",full_width = F) 
Lines by Delays (Figure 2.4a)
line Delay
Gladstone Branch 122510.1
Montclair-Boonton 162401.8
Morristown Line 308008.7
No Jersey Coast 308429.4
Northeast Corrdr 287148.7
group_by(ride.Train, line) %>% #Delay by line
  ggplot(aes(reorder(line, delay_minutes, FUN = max), 
             log(delay_minutes), fill = line)) +
    geom_col() + coord_flip() +
    scale_fill_manual(values = palette5) +
    labs(x = "Line", y = "Delay Minutes", title= "Delay minutes by line", 
         caption = "Figure 2.4b") +
    plotTheme() + theme(legend.position = "none") 

# Trip Count Serial Autocorrelation EA
mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 

spooky   <- as.POSIXct("2019-10-31 01:00:00 UTC")

rbind( #change this?
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>%
    group_by(Legend, interval60) %>% 
      summarize(DM = sum(delay_minutes)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, DM, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        geom_vline(xintercept = spooky, linetype = "dotted") +
        geom_vline(data = mondays, aes(xintercept = monday)) +
        labs(title="Delay minutes by week: September-November 2019",
             subtitle="Dotted line for Halloween", 
             x="Day", y="Delay Minutes", caption = "Figure 2.5") +
        plotTheme() + theme(panel.grid.major = element_blank()) 
## `summarise()` has grouped output by 'Legend'. You can override using the `.groups` argument.

# Time Lags
plotData.lag <-
  filter(as.data.frame(ride.panel), week == 42) %>%
  dplyr::select(starts_with("lag"), delay_minutes) %>%
  gather(Variable, Value, -delay_minutes) %>%
  mutate(Variable = fct_relevel(Variable, "lag1Hour","lag2Hours","lag3Hours", "lag6hours", "lag24hours", "lag168hours"))

correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, delay_minutes)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#F5853E") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Delay minutes as a Function of Time Lag",
       subtitle="One week in October 2019", 
       caption = "Figure 2.6") +
  plotTheme()
## `geom_smooth()` using formula 'y ~ x'

# Bar Plot
options(scipen=999)

ggplot(ride.panel, aes(x = log(delay_minutes))) + labs(y = "Count", x = "Log(Delay Minutes)", title = "Histogram of Delays", caption = "Figure 2.6") + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Correlation Matrix  
numericVars <- 
  select_if(ride.panel, is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation Across Numeric Variables", caption = "Figure 2.7") 

# Max Stops
ride.panel %>% 
  dplyr::select(delay_minutes, max_stops) %>% 
  gather(Variable, Value, -delay_minutes) %>% 
   ggplot(aes(Value, delay_minutes)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#F5853E") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Delay minutes as a function of max stops", 
          caption="Figure 2.8") +
     plotTheme()
## `geom_smooth()` using formula 'y ~ x'

Regression Models

The code below are the 5 iterations of our predictive model. Regression 1 is the baseline regression with no feature engineered or overly descriptive variables included. Regression 4 is our best-suited model and includes all feature engineered variables. Regression 4 has an adjusted r-squared value of 34%. This means that the model explains 34% of variation in the data. While we wish this number were to be higher, it is an immense improvement from the baseline regression, which is about 3%. A major factor that helped our model was conducting a log on our delay minutes. Prior to the log, our delay minutes variable was incredibly left-skewed (meaning lots of small delays). The process of logging our delay minutes allowed for the data to be more normally-distributed, so that the model could perform better. Later in the code, we exponentiate delay minutes, so it will read as a unit of minutes again.

Regression Model Summaries

Below is a regression summary of our best-fit model. This model includes: day of the week, the line, peak hours, all of our spatial lays, weather analysis, and number of trips the line takes in an hour.

#---Polished LM table of training dataset
stargazer(test.reg.linear4, type="html", digits=1, title="Linear Model of Training Dataset (Figure 3)", out = "Training LM.txt")
Linear Model of Training Dataset (Figure 3)
Dependent variable:
log(delay_minutes)
hour(interval60) 0.02***
(0.002)
dotw.L 0.01
(0.02)
dotw.Q 0.4***
(0.03)
dotw.C 0.02
(0.02)
dotw4 0.2***
(0.02)
dotw5 -0.01
(0.02)
dotw6 0.1***
(0.02)
lineMontclair-Boonton 0.4***
(0.1)
lineMorristown Line 0.2***
(0.05)
lineNo Jersey Coast 0.3***
(0.05)
lineNortheast Corrdr 0.4***
(0.05)
PeakNon-Peak -0.6***
(0.1)
PeakPM -0.1**
(0.1)
Offender 0.4***
(0.04)
lag1Hours 0.002***
(0.000)
lag2Hours 0.002***
(0.000)
lag3Hours 0.002***
(0.000)
lag6Hours 0.002***
(0.000)
lag24Hours -0.000
(0.000)
lag168Hours 0.000*
(0.000)
Temperature -0.004***
(0.001)
Wind_Speed 0.01***
(0.002)
Percipitation 0.04
(0.03)
Trip_Count 0.3***
(0.002)
lineMontclair-Boonton:PeakNon-Peak 0.2***
(0.1)
lineMorristown Line:PeakNon-Peak 0.1*
(0.1)
lineNo Jersey Coast:PeakNon-Peak 0.5***
(0.1)
lineNortheast Corrdr:PeakNon-Peak 0.3***
(0.1)
lineMontclair-Boonton:PeakPM -0.05
(0.1)
lineMorristown Line:PeakPM 0.01
(0.1)
lineNo Jersey Coast:PeakPM -0.03
(0.1)
lineNortheast Corrdr:PeakPM 0.03
(0.1)
Constant 0.002
(0.1)
Observations 49,471
R2 0.3
Adjusted R2 0.3
Residual Std. Error 1.6 (df = 49438)
F Statistic 813.1*** (df = 32; 49438)
Note: p<0.1; p<0.05; p<0.01

Validation

Validation by Regression

Below are regression validations by time (both delay minutes and weeks). It is clear in these models that our featured engineered, or ā€œbest(-suited) regression,ā€ predicts the best. The overall error of the best regression is a 16 minute MAE. Figure 4.2 depicts how much better the best regression predicts as it follows the variation of the observed delays much better.

# Validation by Regression
ride.Test.weekNest <- 
  as.data.frame(ride.Test) %>%
  nest(-week) 

model_pred <- function(dat, fit)
  {
   pred <- exp(predict(fit, newdata = dat))
}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(BaselineRegression = map(.x = data, fit = test.reg.linear1, .f = model_pred),
           BestRegression = map(.x = data, fit = test.reg.linear4, .f = model_pred))

week_predictions <- week_predictions %>%  
    gather(Regression, Prediction, -data, -week) %>% 
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean),
           sd_AE = map_dbl(Absolute_Error, sd))

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette2) +
    labs(title = "Mean Absolute Errors by model specification and week", caption = "Figure 4.1") +
  plotTheme()

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60)) %>%
  dplyr::select(interval60, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed delay minutes by hourly interval", 
           x = "Hour", y= "Delay Minutes", caption = "Figure 4.2") +
  plotTheme()
## `summarise()` has grouped output by 'Regression', 'Variable'. You can override using the `.groups` argument.

Validation By Line

As another step of our validation process, we wanted to explore how our best-suited regression performs for each of the 5 individual lines selected. We conducted cross-validation by line (Figure 5.1) and while this table may not be entirely useful, it does explore how our highest MAE is slightly above 17.5. Figure 5.2 depicts how well our best-suited regression lines up with observed delays on each line. Our best-suited regression follows the variation on the Gladstone Branch and NJ Coast lines well. It misses from variation on the Montclair-Boonton and Morristown lines and underpredicts on the Northeast Corridor line. Looking across Figure 5.3 (avg delay times by line) and 5.6 (total predicted delay minutes in November), we notice that Morristown Line has the second highest avg. delay times and has the longest total delay minutes.

# Validation by Line
wk45 <- unnest(week_predictions[5,])
wk46 <- unnest(week_predictions[6,])
wk47 <- unnest(week_predictions[7,])
wk48 <- unnest(week_predictions[8,])

test <- rbind(wk45, wk46, wk47, wk48)

ggplot(test, aes(x=MAE)) + 
  geom_histogram(fill = "#F5853E") +
  facet_wrap(~line, ncol=1) +
  labs(title = "Cross Validation Tests in Mean Average Error", caption = "Figure 5.1") +
  plotTheme()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

test %>%  
  gather(Variable, Value, Prediction, Observed, -line, -interval60) %>%
    group_by(line, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~line, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed Delay Minutes", 
           subtitle = "By line by hourly interval",
           x = "Hour", y= "Delay Minutes", caption = "Figure 5.2") +
      plotTheme()
## `summarise()` has grouped output by 'line', 'Variable'. You can override using the `.groups` argument.

test %>% 
  group_by(line) %>%
  summarize(MAE = mean(MAE, na.rm = T),
            AvgDelay = mean(delay_minutes, na.rm=T)) %>%
  kable(title = "MAE and average delay time by Line", caption = "MAE and Avg. Delay Time by Line (Figure 5.3)") %>%
  kable_styling("striped",full_width = F)
MAE and Avg. Delay Time by Line (Figure 5.3)
line MAE AvgDelay
Gladstone Branch 16.68238 26.13473
Montclair-Boonton 16.71962 27.83047
Morristown Line 16.71751 26.11905
No Jersey Coast 16.71918 23.43239
Northeast Corrdr 16.72242 18.23488
test %>% # TRAIN ID
  group_by(line, train_id) %>%
  summarize(AvgDelay = mean(delay_minutes, na.rm=T)) %>%
  filter(AvgDelay > 58.5) %>%
  kable(title = "Train ID and average delay time by Line", caption = "Top 20 Train IDs with Highest Delays by Avg Delay Time (Figure 5.4)") %>% kable_styling("striped",full_width = F) %>% scroll_box(width = "800px", height = "250px")
## `summarise()` has grouped output by 'line'. You can override using the `.groups` argument.
Top 20 Train IDs with Highest Delays by Avg Delay Time (Figure 5.4)
line train_id AvgDelay
Gladstone Branch 0427 77.79259
Gladstone Branch 0429 63.69722
Gladstone Branch 0434 63.84352
Gladstone Branch 0707 68.73333
Montclair-Boonton 6227 64.05370
Morristown Line 6674 123.68871
Morristown Line 6676 61.05440
Morristown Line 6677 62.23194
Morristown Line 6927 61.62500
No Jersey Coast 3201 65.48561
No Jersey Coast 4748 60.75714
No Jersey Coast 4760 60.79286
No Jersey Coast 7201 95.90625
No Jersey Coast 7264 64.18333
No Jersey Coast 7265 65.10714
No Jersey Coast 7273 59.11905
Northeast Corrdr 3800 76.67000
Northeast Corrdr 3897 90.13864
Northeast Corrdr 7800 79.17500
Northeast Corrdr 7853 67.98571
#Adding delay during peak time 
test_peakdelay <- 
  test %>% 
  group_by(line) %>%  
  filter(Peak == "AM" | Peak == "PM")

test_AMdelay <- # filtering for AM speak
  test %>%
  group_by(line) %>%
  filter(Peak == "AM")

test_AMdelay <- # gives you predicted avg MORNING delay time during peak by line
test_AMdelay %>% 
  group_by(line) %>%
  mutate(MorningDelay = mean(Prediction))

test_PMdelay <- # repeated the process for the evening
  test %>%
  group_by(line) %>%
  filter(Peak == "PM")

test_PMdelay <- # predicted EVENING delay time during peak by line
test_PMdelay %>% 
  group_by(line) %>%
  mutate(EveningDelay = mean(Prediction))

test_delays <- rbind(test_AMdelay, test_PMdelay) # brought them together

test_delays %>% 
  group_by(line) %>%
  summarize(MorningDelay = mean(MorningDelay, na.rm=T),
            EveningDelay = mean(EveningDelay, na.rm=T)) %>%
  kable(title = "Average delays during peak time by Line", caption = "Avg Delays during Peak Time by Line (Figure 5.5)") %>%
  kable_styling("striped",full_width = F)
Avg Delays during Peak Time by Line (Figure 5.5)
line MorningDelay EveningDelay
Gladstone Branch 28.82356 32.85413
Montclair-Boonton 23.44744 29.49824
Morristown Line 19.61743 19.22449
No Jersey Coast 12.69883 15.49335
Northeast Corrdr 8.32281 10.35879
group_by(test, line) %>% #Delay by line
  ggplot(aes(reorder(line, Prediction, FUN = max), 
             delay_minutes, fill = line)) +
    geom_col() + coord_flip() +
    scale_fill_manual(values = palette5) +
    labs(x = "Line", y = "Delay Minutes", title= "Predicted Delay minutes by line", caption = "Figure 5.6") +
    plotTheme() + theme(legend.position = "none")

Final Context Map

Our final figure is a draft mockup of our first wireframe. Our hopes are that lines can be color coded by their delay times for future analysis.

NovPrediction <- test %>%
  group_by(line) %>%
  summarize(totaldelay = sum(Prediction))

RoutesPredictions <- cbind(NJTRoutes, NovPrediction)

ggplot() + 
  geom_sf(data = censustracts, color = "grey 89", fill = "transparent") +
  geom_sf(data = RoutesPredictions, aes(color = as.factor(totaldelay)), show.legend = "delay times") +
  scale_color_manual(name = "Sum Delay Minutes", values = palettechloro) +
  labs(title = "Chosen NJ Transit Routes: November 2019 Predicted Delays",
       caption = "Figure 6.1", fill = "line") +
  mapTheme(title_size = 11) + theme(legend.position = "right")

Conclusion

Overall, we believe that our use case frame, methodology, and analysis offer important insights to New Jersey Transit. As noted in the Introduction, NJT is addressing their system problems with new equipment and our dashboard will aid them in assigning where the new engines and cars would be best suited over long-term delays. Our methodology is a good basis for this analysis, even though our models sometimes fall short (more on that below). Despite our models having some shortcomings, our analysis definitely shows clear recommendations on which lines need more attention.

Accuracy + Generalizability

Again, our model is trained off of September and October 2019 data and predicts delay times for November 2019. Our best-suited model, on a high-level, is overall accurate and generalizable. Since our use case is high-level, we want to be thinking about the longest and most-inconvenient delays. That being said, our MAE on average is around 16 minutes. Ideally, we would like this value to be lower, but for a high-level view, it can suffice. Our model is generalizble to an extent as it predicts moderately well on the November 2019 dataset. But as seasons change, we would need to re-evaluate the model to consider more systematic patterns of ridership and delays that could scale across longer time intervals. That being said, our model is very NJT specific, so it would probably not generalize well on other transit lines. Our methodology could be adopted for other transit lines.

How to Improve

We have identified some features that would improve both the methodology and results of our model. Training on more months and further predicting on more months may have been better suited for our use case. As our use case is about long-term CIP planning and ideally we would have liked to be predicting 6 months out, this might help with that cause. In terms of variables for our model, we would like to consider weather averages. Predicting the weather by hour by day is virtually an impossible task and historic averages would be more realistic for our model. There were also a few variables we needed to feature engineer better-including max stops as it relates to delay minutes and most importantly, train ID which is the individual set of train cars that run on a while. Train ID was a very significant variable, however, since there are so many, it could not be validated on appropriately. Lastly, extra data that could have improved our model are: ridership numbers, employees on a car, other trains (like Amtrak) running on the same tracks.

Policy Recommendations

As hinted at in our validation by line, our overall recommendation is that the Morristown Line is in desperate need of attention to mitigate delay times. That line had the second highest average delay time and the total most predicted delay minutes. That being said, we also believe that all of these lines require improvements, at some interval. For example, in our exploratory analysis, the Northeast Corridor was the biggest offender of delays. All of these lines see a high concentration of ridership and are incredibly important to the NJT system, and therefore, improvements to achieve perfect service (minimal delays) could increase ticket sales and consistency across ridership.

Authors Note

There was some difficulty in exporting this analysis as an R Markdown as options would either show results or hide tables. Please excuse unnecessary warnings and results in the markdown, especially in the Import Data section.