1 Data

1.1 A rich dataset from the airport of Zurich

glimpse(flights)

## Observations: 223,699
## Variables: 34
## $ date                    <date> 2017-01-01, 2017-01-01, 2017-01-01, 2...
## $ effective_time          <dttm> 2017-01-01 12:48:20, 2017-01-01 22:33...
## $ planed_time             <dttm> 2017-01-01 12:40:00, 2017-01-01 22:30...
## $ diff_in_secs            <int> 500, 190, 573, 673, 691, -103, -128, -...
## $ airline_code            <fct> 4T, 4T, 4T, 4T, 4U, 4U, 4U, 4U, 4U, 4U...
## $ airline_name            <fct> Belair Airlines AG, Belair Airlines AG...
## $ flightnr                <fct> 4T2094, 4T2095, 4T2352, 4T2353, 4U762,...
## $ start_landing           <fct> Starting, Landing, Starting, Landing, ...
## $ airplane_type           <fct> A320, A320, A320, A320, A319, A319, A3...
## $ origin_destination_code <fct> HRG, HRG, PRN, PRN, CGN, CGN, CGN, CGN...
## $ origin_destination_name <fct> Hurghada International Airport, Hurgha...
## $ airport_type            <fct> large_airport, large_airport, large_ai...
## $ distance_km             <dbl> 3146.9403, 3146.9403, 1122.4708, 1122....
## $ iso_country             <fct> EG, EG, XK, XK, DE, DE, DE, DE, DE, DE...
## $ iso_region              <fct> EG-BA, EG-BA, XK-01, XK-01, DE-NW, DE-...
## $ municipality            <fct> Hurghada, Hurghada, Prishtina, Prishti...
## $ continent               <fct> AF, AF, EU, EU, EU, EU, EU, EU, EU, EU...
## $ schengen                <fct> Non-Schengen, Non-Schengen, Non-Scheng...
## $ lightnings_hour_n       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lightnings_hour_f       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ winddir_h               <int> 208, 172, 197, 193, 180, 169, 217, 180...
## $ windspeed_avg_h         <dbl> 8.6, 9.7, 9.7, 3.6, 7.9, 7.6, 4.3, 4.3...
## $ windspeed_peak_h        <dbl> 15.5, 15.5, 16.9, 6.5, 15.5, 14.0, 10....
## $ global_rad_avg_h        <int> 253, 3, 158, 3, 2, 2, 29, 81, 312, 253...
## $ airpres                 <dbl> 1021.2, 1022.0, 1020.8, 1021.5, 1021.9...
## $ precip                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ sunshine_dur_min        <int> 60, 0, 60, 0, 0, 0, 0, 0, 60, 60, 0, 0...
## $ temp_avg                <dbl> 0.1, -2.6, 0.3, -4.6, -4.6, -3.4, -4.0...
## $ temp_min                <dbl> -0.3, -2.7, 0.1, -5.4, -5.5, -4.3, -4....
## $ temp_max                <dbl> 0.3, -2.5, 0.5, -3.4, -4.2, -2.8, -3.8...
## $ rel_humid               <dbl> 72.8, 93.1, 72.4, 94.3, 93.6, 93.6, 93...
## $ delayed                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hour                    <fct> 12, 22, 13, 18, 19, 20, 8, 8, 12, 13, ...
## $ month                   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

All flights from the year 2017 (?): 223,699 observations of 34 variables
Hourly weather data
Expected and actual arrival/departure times
Additional information about the flights (airline, airplane-type, etc.)

2 EDA

2.1 Distribution of the differences between expected and actual departure is skewed with extremely long tails

The distribution of the difference in seconds is skewed and has many extreme values. As expected, the difference in seconds is more skewed for starting flights than for landing flights.
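
The skewness and the long tails can be quantified per flight direction. The following is only a minimal sketch on simulated data: the data frame `sim`, the simulated distributions, and the helper `skewness()` are assumptions for illustration, not the analysis behind the original plots.

```r
# Hypothetical stand-in for the real data: simulated differences in seconds.
set.seed(1)
n <- 5000
sim <- data.frame(
  start_landing = rep(c("Starting", "Landing"), each = n),
  diff_in_secs  = c(rlnorm(n, meanlog = 6, sdlog = 1) - 300,  # long right tail
                    rnorm(n, mean = 0, sd = 300))             # roughly symmetric
)

# Moment-based sample skewness (close to 0 for a symmetric distribution).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Skewness and upper-tail quantiles per flight direction.
skew_by_dir <- tapply(sim$diff_in_secs, sim$start_landing, skewness)
tail_by_dir <- tapply(sim$diff_in_secs, sim$start_landing,
                      quantile, probs = c(0.5, 0.95, 0.99))
print(skew_by_dir)
print(tail_by_dir)
```

With the real data, `sim` would be replaced by `flights`; a large gap between the median and the 99% quantile is what "extremely long tails" looks like in numbers.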

2.2 Seasonal and daily patterns visible

Looking at the distribution of the difference in seconds across the months of the year, one can clearly see seasonal patterns. The summer and winter holiday seasons are associated with larger differences in seconds.
Similar periodic patterns are visible when looking at the distribution of the difference in seconds across the hours of the day. More delays occur in the morning than at lunchtime, in the afternoon, or in the evening.
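
Such patterns can also be summarised numerically by grouping on `month` and `hour`. This is a sketch on simulated data; the summer effect built into `sim` is an assumption made purely for illustration.

```r
# Hypothetical stand-in for the real data.
set.seed(2)
n <- 20000
sim <- data.frame(
  month = factor(sample(1:12, n, replace = TRUE), levels = 1:12),
  hour  = factor(sample(6:22, n, replace = TRUE), levels = 6:22)
)
# Assumed effect, for illustration only: larger differences in July and August.
summer <- sim$month %in% c("7", "8")
sim$diff_in_secs <- rexp(n, rate = 1 / 400) + ifelse(summer, 200, 0) - 200

# Medians are used instead of means because of the long tails.
median_by_month <- aggregate(diff_in_secs ~ month, data = sim, FUN = median)
median_by_hour  <- aggregate(diff_in_secs ~ hour,  data = sim, FUN = median)
print(median_by_month)
print(median_by_hour)
```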

2.3 Some airlines have a higher occurrence of delays than others

Looking at the distributions of the difference in seconds conditioned on airline, it’s clearly visible that some airlines have more and longer delays than others. As possible explanations, we hypothesize that low-cost airlines (e.g. Air Berlin) try to minimize time on the ground for monetary reasons, and that higher security standards could explain the more frequent delays of other airlines (e.g. El Al Israel).
Delay differences between airplane types are also visible. We hypothesize that airlines operate different fleets, so airplane types are not uniformly represented across airlines. That’s why we think the delays associated with certain airlines propagate to the airplane types.

2.4 Unsurprisingly weather variables are correlated

Dimension reduction methods, such as PCA, could be useful
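
A PCA on the weather covariates could look roughly like this. It is a sketch under assumptions: the data frame `weather` is simulated, and it only borrows some column names from `flights`.

```r
# Hypothetical, simulated weather data with strongly correlated columns.
set.seed(3)
n <- 1000
temp_avg <- rnorm(n, mean = 10, sd = 8)
weather <- data.frame(
  temp_avg  = temp_avg,
  temp_min  = temp_avg - abs(rnorm(n, sd = 1)),
  temp_max  = temp_avg + abs(rnorm(n, sd = 1)),
  rel_humid = 80 - 1.5 * temp_avg + rnorm(n, sd = 5),
  airpres   = rnorm(n, mean = 1015, sd = 8)
)

# Pairwise correlations show the redundancy between the covariates.
print(round(cor(weather), 2))

# PCA on centred and scaled variables (the units differ between columns).
pca <- prcomp(weather, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# The first two components could replace the original covariates.
scores <- as.data.frame(pca$x[, 1:2])
```

The component scores are uncorrelated by construction, which removes the collinearity, but each component mixes several weather variables and is harder to interpret.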

3 Models

3.1 Linear models don’t fit the data well

model <- as.formula(diff_in_secs ~ lightnings_hour_n + lightnings_hour_f 
                         + windspeed_avg_h + windspeed_peak_h + global_rad_avg_h
                         + airpres + precip + sunshine_dur_min + temp_avg + temp_min 
                         + temp_max + rel_humid + distance_km + winddir_h + month + hour)
fit_linear <- lm(model, data = subset(flights, start_landing == "Starting"))
summary(fit_linear)$r.squared

## [1] 0.0291839

We cannot explain the many extreme observations with the available covariates
We hypothesized that the extreme delays are due to factors such as strikes or political events for which we don’t have data available
Robust statistical methods could probably help in this situation
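
One candidate robust method is Huber M-estimation via `MASS::rlm`. This is our choice for illustration, not something that was fitted for this analysis. The sketch uses simulated data in which a few extreme delays, standing in for unobserved events, are deliberately placed at high wind speeds.

```r
library(MASS)  # ships with R as a recommended package

# Hypothetical, simulated data.
set.seed(4)
n <- 2000
sim <- data.frame(windspeed_avg_h = runif(n, 0, 30))
# Assumed true effect: 10 seconds of delay per unit of wind speed.
sim$diff_in_secs <- 100 + 10 * sim$windspeed_avg_h + rnorm(n, sd = 120)
# A few extreme delays, placed at high wind speed for illustration only.
extreme <- sample(which(sim$windspeed_avg_h > 20), 60)
sim$diff_in_secs[extreme] <- sim$diff_in_secs[extreme] + 10000

fit_ols    <- lm(diff_in_secs ~ windspeed_avg_h, data = sim)
fit_robust <- rlm(diff_in_secs ~ windspeed_avg_h, data = sim)  # Huber M-estimator

# Compare the estimated coefficients of both fits.
coefs <- rbind(ols = coef(fit_ols), robust = coef(fit_robust))
print(coefs)
```

The robust fit limits the influence of the extreme observations on the coefficients. It does not explain those observations, so the low R² of the linear model would not be fixed by this alone.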

3.2 Logistic regression models (dichotomize outcome)

model <- as.formula(delayed ~ lightnings_hour_n + lightnings_hour_f 
                   + windspeed_avg_h + windspeed_peak_h + global_rad_avg_h
                   + airpres + precip + sunshine_dur_min + temp_avg + temp_min 
                   + temp_max + rel_humid + distance_km + winddir_h + month + hour)
fit_logist <- glm(model, data = subset(flights, start_landing == "Starting"), family = "binomial")

Classification performance was not good: accuracy is about 90%, but that is just the share of non-delayed flights
Especially classification of delays is bad (much worse than classification of non-delays)
But the model shows some interesting associations between some of the covariates and delays
Compared on likelihood rather than classification performance, the model with the weather covariates is still significantly better than the model without them
We also fitted a Bayesian logistic regression model that turned out very similar to this one
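
The difference between the two comparisons can be reproduced on a toy example. This is a sketch on simulated data: the data frame `sim`, the assumed weak wind effect, and the 0.5 cutoff are illustration choices, not the settings of the original analysis.

```r
# Hypothetical, simulated data with a rare outcome (about 10% delays).
set.seed(5)
n <- 20000
sim <- data.frame(
  hour            = factor(sample(6:22, n, replace = TRUE)),
  windspeed_avg_h = runif(n, 0, 30)
)
# Assumed weak weather effect on the probability of a delay.
p <- plogis(-2.8 + 0.04 * sim$windspeed_avg_h)
sim$delayed <- rbinom(n, size = 1, prob = p)

fit_base    <- glm(delayed ~ hour, data = sim, family = "binomial")
fit_weather <- glm(delayed ~ hour + windspeed_avg_h, data = sim,
                   family = "binomial")

# Accuracy at a 0.5 cutoff versus always predicting "not delayed".
pred        <- as.integer(predict(fit_weather, type = "response") > 0.5)
accuracy    <- mean(pred == sim$delayed)
baseline    <- mean(sim$delayed == 0)
sensitivity <- sum(pred == 1 & sim$delayed == 1) / sum(sim$delayed == 1)
print(c(accuracy = accuracy, baseline = baseline, sensitivity = sensitivity))

# Likelihood-ratio test of the nested models.
lrt <- anova(fit_base, fit_weather, test = "Chisq")
print(lrt)
```

In such a setting the likelihood-ratio test can clearly favour the weather model while accuracy stays at the baseline, because the fitted probabilities rarely or never cross the cutoff.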

3.3 Some of the model predictions plotted (keeping other covariates fixed)

3.4 Random Forest Classification

## 
## Call:
##  randomForest(formula = model_form, data = starting_flights, ntree = 500,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.26%
## Confusion matrix:
##        0    1 class.error
## 0 100468 1169  0.01150172
## 1   9191 1081  0.89476246

A random forest with the same input covariates cannot decrease the error rate further either. The weather covariates probably cannot improve prediction performance
The model is especially bad at predicting a delay (it’s much better at predicting a non-delay), which is an indication that there are some other unobserved covariates that might be responsible for the delays
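
The rates can be recomputed from the confusion matrix printed above. The counts are copied from the output; the variable names are ours.

```r
# OOB confusion matrix from the randomForest output
# (rows: true class, columns: predicted class).
conf <- matrix(c(100468, 1169,
                   9191, 1081),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("0", "1"), predicted = c("0", "1")))

n_total     <- sum(conf)
error_rate  <- 1 - sum(diag(conf)) / n_total      # overall OOB error
baseline    <- sum(conf["1", ]) / n_total         # error when always predicting 0
sensitivity <- conf["1", "1"] / sum(conf["1", ])  # share of delays detected
specificity <- conf["0", "0"] / sum(conf["0", ])  # share of non-delays detected

rates <- round(c(error_rate = error_rate, baseline = baseline,
                 sensitivity = sensitivity, specificity = specificity), 4)
print(rates)
```

On these counts, always predicting "not delayed" would give an error rate of about 9.18%, marginally below the forest's 9.26%, and only about 10.5% of the delayed flights are detected.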

4 What could be done in the future

Try to find data about strikes and similar events that might explain extreme delays
Do not dichotomize the delay time; instead, transform it or use more robust methods
Use dimension reduction techniques like PCA to get fewer covariates and then regress the outcome on them
Use more complex models (boosting, random forests, neural networks) to improve predictive performance (at the cost of interpretability)

Twist 2018: Airtraffic Challenge
