Flight Delay Prediction - Milestone Report
Huiting Yu

Progress Overview:
Data preprocessing is almost done. From the large dataset, I have chosen the range of flight data to use (one carrier and a set of airports over a specific period of time). Seven flight attributes that seem most relevant to the length of a delay are selected as predictors. However, how to represent these attributes in numerical form needs more thought. Future experiments and results may indicate that the current representations of the flight attributes should change.
As for the implementation of the machine learning algorithm, only linear regression with regularization has been implemented so far. The results are poor and not acceptable. Further improvements to both the learning algorithm and the data preprocessing are needed.

Data preprocessing
The dataset on http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time has flight on-time data covering more than 20 years. Since airline performance usually varies significantly over the years, I choose the flights in one month (January 2012) to study. The performance of different carriers also varies a lot, so I choose all the flights operated by American Airlines in that month as my dataset. The original dataset involves almost all airports in the U.S. To decrease the complexity of the program, I choose only flights involving the top ten busiest airports in the U.S., which are ATL, ORD, LAX, DFW, DEN, JFK, IAH, SFO, LAS, PHX. After the filtering, the final dataset I use in this project has more than 8000 flight entries in total. More than 1300 of these flights are classified as delayed (by whether the delay time is larger than 15 minutes).
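The filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout and the field names (carrier, year, month, origin, dest, delay) are hypothetical, "AA" is assumed to be the carrier code used for American Airlines, and "involving" is read here as both endpoints being in the top ten.

```python
# Hypothetical record layout: one dict per flight. The field names are
# assumptions for illustration, not the column names of the online dataset.
TOP_TEN = {"ATL", "ORD", "LAX", "DFW", "DEN", "JFK", "IAH", "SFO", "LAS", "PHX"}
DELAY_THRESHOLD_MIN = 15  # a flight counts as delayed above this many minutes


def select_flights(flights, carrier="AA", year=2012, month=1):
    """Keep one carrier's flights for one month between top-ten airports.

    Assumption: both origin and destination must be in TOP_TEN.
    """
    return [
        f for f in flights
        if f["carrier"] == carrier
        and f["year"] == year
        and f["month"] == month
        and f["origin"] in TOP_TEN
        and f["dest"] in TOP_TEN
    ]


def is_delayed(flight):
    """Classify a flight as delayed if its delay exceeds the threshold."""
    return flight["delay"] > DELAY_THRESHOLD_MIN


# Made-up sample rows, only to exercise the functions.
sample = [
    {"carrier": "AA", "year": 2012, "month": 1, "origin": "ORD", "dest": "LAX", "delay": 25},
    {"carrier": "AA", "year": 2012, "month": 1, "origin": "DFW", "dest": "JFK", "delay": 5},
    {"carrier": "AA", "year": 2012, "month": 1, "origin": "ORD", "dest": "BOS", "delay": 40},
    {"carrier": "UA", "year": 2012, "month": 1, "origin": "ORD", "dest": "LAX", "delay": 30},
]
selected = select_flights(sample)
delayed = [f for f in selected if is_delayed(f)]
```

In this made-up sample, the BOS flight and the non-AA flight are dropped, and only the first remaining flight exceeds the 15-minute threshold.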

The eight flight attributes I am interested in, that is, the seven predictors plus the delay itself (as they appear in the online dataset, not yet processed):

Attributes	Description
Origin	Airport code
Dest	Airport code
Scheduled departure time	hhmm
Air time	duration of the flight, in minutes
Distance	The distance from Origin airport to Dest airport, in miles
Day of week	from 1 to 7, integers
Day of Month	from 1 to 31, integers
Delay	delay of flight, in minutes

The attributes that need to be processed:
Origin/Dest: the airport code needs to be represented in numerical form. A number reflecting the overall performance of the airport is probably useful for predicting flight delay, so I replace the airport code with the average delay time of the airport, computed over the whole dataset I have chosen. The figure below shows the average departure delay (red) and average arrival delay (blue) of the ten airports involved in the dataset. On the x-axis, 0-9 represent the 10 airports; the y-axis is the delay time.
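A minimal sketch of this encoding is given here. It assumes that the origin code is replaced by the airport's average departure delay and the destination code by its average arrival delay; the field names (origin, dest, dep_delay, arr_delay) and the sample numbers are hypothetical.

```python
from collections import defaultdict


def average_delay_by_airport(flights, airport_key, delay_key):
    """Return {airport code: mean delay} over all given flights."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for f in flights:
        totals[f[airport_key]] += f[delay_key]
        counts[f[airport_key]] += 1
    return {airport: totals[airport] / counts[airport] for airport in totals}


# Made-up rows; the field names are assumptions for illustration.
flights = [
    {"origin": "ORD", "dest": "LAX", "dep_delay": 20.0, "arr_delay": 10.0},
    {"origin": "ORD", "dest": "DFW", "dep_delay": 10.0, "arr_delay": 0.0},
    {"origin": "LAX", "dest": "ORD", "dep_delay": 4.0, "arr_delay": 6.0},
]

# Assumption: origin uses departure delay, destination uses arrival delay.
origin_avg = average_delay_by_airport(flights, "origin", "dep_delay")
dest_avg = average_delay_by_airport(flights, "dest", "arr_delay")

# Replace each airport code with the corresponding numerical feature.
encoded = [
    {"origin_avg": origin_avg[f["origin"]], "dest_avg": dest_avg[f["dest"]]}
    for f in flights
]
```

Because the averages are computed over the same flights that are later used for training, each feature partly contains the target value; computing the averages on a separate set of flights would avoid this.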

Scheduled departure time: the original hhmm value is divided by 100, keeping only the integer part, to preserve the hour, since the minute seems to have much less influence on the prediction than the hour.
Day of Month: the days that are holidays (in January 2012) are labeled 1; all other days are labeled 0. I changed the day-of-month number to a holiday label because whether a day is a holiday matters much more for flight delay.
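These two transformations can be sketched as below. The report does not list which days were treated as holidays, so the set used here (New Year's Day, its observed Monday, and Martin Luther King Jr. Day) is an assumption. The function names are hypothetical, and the departure time is assumed to be stored as an integer in hhmm form.

```python
# Assumption: the holiday set is not given in the report. These are the
# U.S. federal holidays in January 2012 (Jan 1, observed Jan 2, and Jan 16).
HOLIDAYS_JAN_2012 = {1, 2, 16}


def departure_hour(hhmm):
    """Keep only the hour of an integer hhmm time, e.g. 1435 -> 14."""
    return hhmm // 100


def holiday_label(day_of_month, holidays=HOLIDAYS_JAN_2012):
    """Return 1 if the day of the month is a holiday, otherwise 0."""
    return 1 if day_of_month in holidays else 0
```

For example, departure_hour(1435) gives 14 and departure_hour(5) gives 0 (five minutes past midnight), while holiday_label(16) gives 1 and holiday_label(17) gives 0.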

Implementation of Algorithm
So far only linear regression with regularization has been implemented. The result is very poor: the average prediction error is more than 100 minutes. There might be some bugs in the code. But aside from bugs, the large error may have something to do with the following:
1. The selected set of flight attributes and their representation.
Some attributes may have very little relation to the delay time, such as whether the day is a holiday. Some attributes may not be represented well. For example, I believe the departure time is a significant predictor of flight delay, but the hour itself matters less than whether the time falls in "busy hours".
2. The method implemented doesn't weight the attributes.
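For reference, linear regression with regularization can be sketched as below. The report does not say which penalty was used, so this assumes an L2 (ridge) penalty solved in closed form, with the intercept left unpenalized. It is not the project code: the function names, the parameter lam, and the toy data are all hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(rows, targets, lam):
    """Return [intercept, w1, ..., wd] minimizing
    sum of squared errors + lam * sum(w_j ** 2), intercept unpenalized."""
    design = [[1.0] + list(r) for r in rows]
    d = len(design[0])
    # Normal equations: (X^T X + lam * I') theta = X^T y,
    # where I' is the identity with a zero in the intercept position.
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(d)] for i in range(d)]
    for j in range(1, d):
        xtx[j][j] += lam
    xty = [sum(x[i] * y for x, y in zip(design, targets)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(theta, row):
    """Predict the target for one feature row."""
    return theta[0] + sum(w * v for w, v in zip(theta[1:], row))


# Toy data lying exactly on y = 2 + 3x, only to exercise the functions.
rows = [[0.0], [1.0], [2.0], [3.0]]
targets = [2.0, 5.0, 8.0, 11.0]
theta_plain = fit_ridge(rows, targets, lam=0.0)
theta_ridge = fit_ridge(rows, targets, lam=10.0)
```

With lam=0 the fit recovers the line exactly, and with lam=10 the slope shrinks toward zero. Since the penalty acts on the raw weights, attributes on very different scales (distance in miles versus a 0/1 holiday label) are penalized unevenly unless they are rescaled first.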

Possible improvements
1. The representations of the attributes might be changed to see whether better results can be obtained. Different selections of attributes might also be tried.
2. Locally weighted regression will be tried.
3. Other regression methods should be investigated and tried.
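As a starting point for item 2, locally weighted regression with a single input feature can be sketched as follows. This is an illustration under assumptions, not a planned implementation: it uses a Gaussian weighting kernel with a hypothetical bandwidth parameter tau, and it fits a weighted least-squares line around each query point.

```python
import math


def lwr_predict(xs, ys, query, tau):
    """Predict y at `query` from a line fitted by weighted least squares,
    with Gaussian weights of bandwidth tau centered on the query point."""
    weights = [math.exp(-((x - query) ** 2) / (2.0 * tau ** 2)) for x in xs]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        # Degenerate case (all weight on one x value): use the weighted mean.
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * query


# Toy data, only to exercise the function.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
linear_ys = [1.0, 3.0, 5.0, 7.0, 9.0]    # exactly y = 2x + 1
curved_ys = [0.0, 1.0, 4.0, 9.0, 16.0]   # y = x ** 2

on_line = lwr_predict(xs, linear_ys, query=2.5, tau=1.0)
local_fit = lwr_predict(xs, curved_ys, query=2.0, tau=0.3)
global_like = lwr_predict(xs, curved_ys, query=2.0, tau=100.0)
```

A small tau makes the prediction follow nearby points (local_fit stays close to 4 on the curved data), while a very large tau approaches an ordinary global line fit (global_like is close to 6). A new line is fitted for every query, so prediction cost grows with the size of the training set.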