1 Data

1.1 A rich dataset from the airport of Zurich

glimpse(flights)
## Observations: 223,699
## Variables: 34
## $ date                    <date> 2017-01-01, 2017-01-01, 2017-01-01, 2...
## $ effective_time          <dttm> 2017-01-01 12:48:20, 2017-01-01 22:33...
## $ planed_time             <dttm> 2017-01-01 12:40:00, 2017-01-01 22:30...
## $ diff_in_secs            <int> 500, 190, 573, 673, 691, -103, -128, -...
## $ airline_code            <fct> 4T, 4T, 4T, 4T, 4U, 4U, 4U, 4U, 4U, 4U...
## $ airline_name            <fct> Belair Airlines AG, Belair Airlines AG...
## $ flightnr                <fct> 4T2094, 4T2095, 4T2352, 4T2353, 4U762,...
## $ start_landing           <fct> Starting, Landing, Starting, Landing, ...
## $ airplane_type           <fct> A320, A320, A320, A320, A319, A319, A3...
## $ origin_destination_code <fct> HRG, HRG, PRN, PRN, CGN, CGN, CGN, CGN...
## $ origin_destination_name <fct> Hurghada International Airport, Hurgha...
## $ airport_type            <fct> large_airport, large_airport, large_ai...
## $ distance_km             <dbl> 3146.9403, 3146.9403, 1122.4708, 1122....
## $ iso_country             <fct> EG, EG, XK, XK, DE, DE, DE, DE, DE, DE...
## $ iso_region              <fct> EG-BA, EG-BA, XK-01, XK-01, DE-NW, DE-...
## $ municipality            <fct> Hurghada, Hurghada, Prishtina, Prishti...
## $ continent               <fct> AF, AF, EU, EU, EU, EU, EU, EU, EU, EU...
## $ schengen                <fct> Non-Schengen, Non-Schengen, Non-Scheng...
## $ lightnings_hour_n       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lightnings_hour_f       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ winddir_h               <int> 208, 172, 197, 193, 180, 169, 217, 180...
## $ windspeed_avg_h         <dbl> 8.6, 9.7, 9.7, 3.6, 7.9, 7.6, 4.3, 4.3...
## $ windspeed_peak_h        <dbl> 15.5, 15.5, 16.9, 6.5, 15.5, 14.0, 10....
## $ global_rad_avg_h        <int> 253, 3, 158, 3, 2, 2, 29, 81, 312, 253...
## $ airpres                 <dbl> 1021.2, 1022.0, 1020.8, 1021.5, 1021.9...
## $ precip                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ sunshine_dur_min        <int> 60, 0, 60, 0, 0, 0, 0, 0, 60, 60, 0, 0...
## $ temp_avg                <dbl> 0.1, -2.6, 0.3, -4.6, -4.6, -3.4, -4.0...
## $ temp_min                <dbl> -0.3, -2.7, 0.1, -5.4, -5.5, -4.3, -4....
## $ temp_max                <dbl> 0.3, -2.5, 0.5, -3.4, -4.2, -2.8, -3.8...
## $ rel_humid               <dbl> 72.8, 93.1, 72.4, 94.3, 93.6, 93.6, 93...
## $ delayed                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hour                    <fct> 12, 22, 13, 18, 19, 20, 8, 8, 12, 13, ...
## $ month                   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
  • All flights from the year 2017 (?): 223,699 observations of 34 variables
  • Hourly weather data
  • Expected and actual arrival/departure times
  • Additional information about the flights (airline, airplane-type, etc.)
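
The first row of the output above suggests that diff_in_secs is the actual time minus the planned time, in seconds. A minimal sketch of that relationship, using only the values shown for the first observation (the time zone "UTC" is an assumption and does not affect the difference):

```r
# Values copied from the first row of the glimpse() output.
# The time zone is an assumption; only the difference matters here.
effective_time <- as.POSIXct("2017-01-01 12:48:20", tz = "UTC")
planed_time    <- as.POSIXct("2017-01-01 12:40:00", tz = "UTC")

# Positive values mean the flight was later than planned.
diff_in_secs <- as.integer(difftime(effective_time, planed_time, units = "secs"))
diff_in_secs
```

This reproduces the value 500 listed for the first observation.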

2 EDA

2.1 Distribution of the differences between expected and actual departure is skewed with extremely long tails

  • The distribution of the difference in seconds has many extreme values and also appears to be skewed. As expected, it is more skewed for starting flights than for landing flights.
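
One way to put a number on the skewness per flight direction is the moment-based sample skewness. The sketch below runs on simulated toy data, not on the real flights table; the helper sample_skewness and the data frame toy are hypothetical names introduced for illustration.

```r
# Moment-based sample skewness: third central moment over variance^1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated toy data (an assumption, NOT the real flights table).
set.seed(2017)
toy <- data.frame(
  start_landing = rep(c("Starting", "Landing"), each = 1000),
  diff_in_secs  = c(rexp(1000, rate = 1 / 600) - 120,  # long right tail
                    rnorm(1000, mean = 0, sd = 300))   # roughly symmetric
)

# Skewness separately for starting and landing flights.
skew_by_direction <- tapply(toy$diff_in_secs, toy$start_landing, sample_skewness)
skew_by_direction
```

On the real data the same call would be tapply(flights$diff_in_secs, flights$start_landing, sample_skewness).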

2.2 Seasonal and daily patterns visible

  • The distribution of the difference in seconds across the months of the year shows clear seasonal patterns. The summer and winter holiday seasons are associated with larger differences in seconds.

  • Similar periodic patterns are visible in the distribution of the difference in seconds across the hours of the day. More delays occur in the morning than at lunchtime, in the afternoon, or in the evening.
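
A compact way to check such patterns numerically is to summarise the difference in seconds per month and per hour. Because the distribution has long tails, the sketch uses the median rather than the mean. The toy data frame is hypothetical; only the column names are taken from the dataset.

```r
# Hypothetical toy data using the column names of the flights table.
toy <- data.frame(
  month        = factor(c(1, 1, 7, 7)),
  hour         = factor(c(8, 13, 8, 13)),
  diff_in_secs = c(100, 300, 600, 800)
)

# Median difference per month and per hour (robust to extreme values).
by_month <- aggregate(diff_in_secs ~ month, data = toy, FUN = median)
by_hour  <- aggregate(diff_in_secs ~ hour,  data = toy, FUN = median)
by_month
by_hour
```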

2.3 Some airlines have a higher occurrence of delays than others

  • The distributions of the difference in seconds conditioned on airline clearly show that some airlines have more and longer delays than others. As a possible explanation, we hypothesize that low-cost airlines (e.g. Air Berlin) try to minimize the time on the ground for monetary reasons. We also think that higher security standards could explain the more frequent delays of other airlines (e.g. El Al Israel).

  • Delay differences between airplane types are also visible. We hypothesize that the airlines operate different fleets, so some airplane types are not uniformly represented across airlines. We therefore think that the delays associated with certain airlines propagate to the airplane types.
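
The fleet hypothesis could be checked with a cross-tabulation of airline against airplane type: if a type is flown almost exclusively by one airline, its delay pattern cannot be separated from that airline's. A minimal sketch on hypothetical toy rows (only the column names come from the dataset):

```r
# Hypothetical toy rows using the column names of the flights table.
toy <- data.frame(
  airline_code  = c("4T", "4T", "4U", "4U", "4U"),
  airplane_type = c("A320", "A320", "A319", "A319", "A320")
)

# Counts of flights per airline (rows) and airplane type (columns).
fleet <- table(toy$airline_code, toy$airplane_type)

# Share of each airplane type flown by each airline (columns sum to 1).
fleet_share <- prop.table(fleet, margin = 2)
fleet_share
```

A column dominated by a single airline indicates that airline and airplane type are confounded for that type.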

2.4 Unsurprisingly, weather variables are correlated

  • Dimension reduction methods, such as PCA, could be useful
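
As an illustration of how PCA could condense correlated weather variables, the sketch below simulates three strongly correlated temperature columns plus an independent wind column and inspects the share of variance per component. The data are simulated (an assumption); only the column names mirror the dataset.

```r
# Simulated weather data (an assumption, NOT the real measurements).
set.seed(1)
n <- 200
temp_avg <- rnorm(n, mean = 10, sd = 8)
weather <- data.frame(
  temp_avg        = temp_avg,
  temp_min        = temp_avg - abs(rnorm(n, sd = 1)),
  temp_max        = temp_avg + abs(rnorm(n, sd = 1)),
  windspeed_avg_h = rnorm(n, mean = 8, sd = 3)
)

# Centre and scale, because the variables have different units.
pca <- prcomp(weather, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained, 3)
```

In this toy setting the first component absorbs most of the variance because the three temperature columns carry nearly the same information.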

3 Models

3.1 Linear models don’t fit the data well

# Formula: delay in seconds explained by weather, distance, month, and hour
model <- as.formula(diff_in_secs ~ lightnings_hour_n + lightnings_hour_f 
                         + windspeed_avg_h + windspeed_peak_h + global_rad_avg_h
                         + airpres + precip + sunshine_dur_min + temp_avg + temp_min 
                         + temp_max + rel_humid + distance_km + winddir_h + month + hour)
# Fit on starting flights only and report the share of variance explained
fit_linear <- lm(model, data = subset(flights, start_landing == "Starting"))
summary(fit_linear)$r.squared
## [1] 0.0291839

  • We cannot explain the many extreme observations with the available covariates
  • We hypothesized that the extreme delays are due to factors such as strikes or political events for which we don’t have data available
  • Robust statistical methods could probably help in this situation
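
The text does not name a specific robust method. One possible choice, shown here purely as an assumption, is Huber M-estimation via rlm() from the MASS package (a recommended package shipped with standard R installations). The sketch uses simulated data in which 5% of the observations are contaminated with extreme delays, and compares ordinary least squares with the robust fit.

```r
library(MASS)  # provides rlm(); assumed to be installed

# Simulated data (an assumption): true slope 10, true intercept 60.
set.seed(42)
n <- 500
windspeed <- runif(n, 0, 20)
delay <- 60 + 10 * windspeed + rnorm(n, sd = 30)

# Contaminate 5% of the observations with extreme delays.
extreme <- sample(n, 25)
delay[extreme] <- delay[extreme] + 10000

fit_ols <- lm(delay ~ windspeed)                  # sensitive to the extremes
fit_rob <- rlm(delay ~ windspeed, maxit = 100)    # Huber M-estimator (default psi)

# Compare the coefficient estimates of both fits.
rbind(ols = coef(fit_ols), robust = coef(fit_rob))
```

The robust fit downweights the contaminated observations, so its coefficients stay close to the values used in the simulation, whereas the least-squares estimates are pulled towards the extremes.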

3.2 Logistic regression models (dichotomize outcome)

# Formula: binary delay indicator explained by the same covariates as in 3.1
model <- as.formula(delayed ~ lightnings_hour_n + lightnings_hour_f 
                   + windspeed_avg_h + windspeed_peak_h + global_rad_avg_h
                   + airpres + precip + sunshine_dur_min + temp_avg + temp_min 
                   + temp_max + rel_humid + distance_km + winddir_h + month + hour)
# Logistic regression on starting flights only
fit_logist <- glm(model, data = subset(flights, start_landing == "Starting"), family = "binomial")

  • Classification performance was not good: accuracy was about 90%, but this merely matches the share of non-delayed flights
  • Classification of delays is especially poor (much worse than classification of non-delays)
  • The model nevertheless shows some interesting associations between some of the covariates and delays
  • Judged by likelihood rather than by classification performance, the model with the weather covariates is still significantly better
  • We also fitted a Bayesian logistic regression model that turned out very similar to this one
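
The contrast between likelihood and classification performance can be illustrated with a likelihood ratio test between nested logistic models. The sketch below uses simulated data with a rare outcome; the variable hour_peak, the effect sizes, and the 0.5 cutoff are assumptions made for illustration, not the analysis from this report.

```r
# Simulated data (an assumption): delays are rare and depend on windspeed.
set.seed(7)
n <- 2000
hour_peak <- rbinom(n, 1, 0.3)
windspeed <- runif(n, 0, 20)
delayed   <- rbinom(n, 1, plogis(-3 + 0.5 * hour_peak + 0.1 * windspeed))

# Nested models: without and with the weather covariate.
fit_base    <- glm(delayed ~ hour_peak,             family = binomial)
fit_weather <- glm(delayed ~ hour_peak + windspeed, family = binomial)

# Likelihood ratio test: does the weather covariate improve the fit?
lrt <- anova(fit_base, fit_weather, test = "Chisq")
lrt

# Accuracy at a 0.5 cutoff versus always predicting "not delayed".
accuracy  <- mean(as.integer(fitted(fit_weather) > 0.5) == delayed)
base_rate <- mean(delayed == 0)
c(accuracy = accuracy, base_rate = base_rate)
```

When delays are rare, a covariate can improve the likelihood clearly while the accuracy at a fixed cutoff stays close to the share of non-delays, because few or no predicted probabilities cross the cutoff.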

3.3 Some of the model-predictions plotted (keeping other covariates fixed)

3.4 Random Forest Classification

## 
## Call:
##  randomForest(formula = model_form, data = starting_flights, ntree = 500,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.26%
## Confusion matrix:
##        0    1 class.error
## 0 100468 1169  0.01150172
## 1   9191 1081  0.89476246

  • A random forest with the same input covariates does not reduce the error rate further either. The weather covariates probably cannot improve prediction performance
  • The model is especially bad at predicting a delay (it’s much better at predicting a non-delay), which is an indication that there are some other unobserved covariates that might be responsible for the delays
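
The numbers in the confusion matrix above can be turned into a few summary metrics. The sketch re-enters the four printed counts by hand (rows are observed classes, columns are predicted classes) and compares the out-of-bag error with a trivial baseline that always predicts "not delayed"; the variable names are introduced for illustration.

```r
# Counts copied from the printed confusion matrix (rows: observed, cols: predicted).
conf <- matrix(c(100468, 1169,
                   9191, 1081),
               nrow = 2, byrow = TRUE,
               dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

# Overall error rate of the forest (matches the reported OOB estimate).
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Error rate of always predicting class 0 (= share of delayed flights).
baseline_error <- sum(conf["1", ]) / sum(conf)

# Share of actual delays that were detected, and share of predicted
# delays that were actual delays.
sensitivity <- conf["1", "1"] / sum(conf["1", ])
precision   <- conf["1", "1"] / sum(conf[, "1"])

round(c(oob_error = oob_error, baseline_error = baseline_error,
        sensitivity = sensitivity, precision = precision), 4)
```

With these counts the baseline error (about 9.18%) is slightly below the forest's out-of-bag error (9.26%), and only about 10.5% of the actual delays are detected.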

4 What could be done in the future