Yi Li
This coding challenge is designed to test your skill and intuition about real-world data. For the challenge, we will use data collected by the New York City Taxi and Limousine Commission about “Green” Taxis. Green Taxis (as opposed to yellow ones) are taxis that are not allowed to pick up passengers inside the densely populated areas of Manhattan. We will use the trip record data from September 2015 (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
Number of rows: 1494926; Number of columns: 21
When trip distance > 1 mile, the count decreases as the trip distance increases; when trip distance < 1 mile, the count increases as the trip distance increases. When 1 mile < trip distance < 40 miles, the relation between trip distance (TD) and $\log(\text{count})$ is almost linear. In other words, we can represent it as $TD = b - a\log(\text{count})$.
In reality, it makes sense that most taxi trips are short trips (around 1-2 miles). For long trips, it would be expensive to take a taxi, and people would probably choose public transportation like the metro or bus to save money, while for short trips, people probably prefer a taxi to save time.
The median trip distance is always lower than the mean trip distance; the mean and median trip distances have similar trends, and we can separate the day into three parts:
* trip distance appears to be the longest during 4am-7am;
* trip distance appears to be the shortest during 7am-9pm;
* trip distance appears to be medium during 9pm-4am.
I chose the JFK Airport trips, identified by RateCodeID = 2. After removing suspicious records that seem wrong, I get the following results:
Number of transactions of trips that originate or terminate at JFK Airport: 3414.
The average total amount of these transactions is: $61.28.
The fare amount of JFK trips is a fixed value: $52.
The main causes of the differing total amounts are the tolls amount and the tip amount. I then checked whether the tip amount has something to do with the hour of the day.
It is interesting that in most hours, the median tip amount is zero.
One peak of high tip amounts is around 5:00-8:00, and the other is around 14:00-15:00. We also find that 14:00-16:00 is the busiest period of the day, so people probably give high tips because of heavy traffic.
Define tip rate as the percentage of tip amount over the total amount.
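In formula form, as computed later in the code: $\text{tip rate} = 100 \times \text{Tip\_amount} / \text{Total\_amount}$.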
According to the handbook, tip amount is automatically populated for credit card tips. Cash tips are not included. Thus, for this question, I choose to only use the trips paid by credit card to build the predictive model.
Data Cleaning & Data Exploration:
Remove trips whose trip distance, trip duration, or average trip speed is too large or negative; remove trips whose total amount is less than the initial charge of $2.50.
Plot histograms and scatter plots to explore the relation between the features and the target trip rate.
About 13.4% of trips (paid by credit card) have zero tip.
Feature Engineering:
Create features like trip duration, average trip speed, and trip pickup hour. Since the fare amount is the only time-and-distance-based charge, I combined the remaining charges (everything except the fare amount and the tip amount) into one variable called "other_amount".
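Concretely, as computed in the modeling code below: other_amount = Extra + MTA_tax + Tolls_amount + improvement_surcharge.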
Models & Results:
I split the original cleaned dataset into a training dataset and a test dataset (80% : 20%).
I first built a linear regression model as a benchmark.
Then I built a regression random forest model and got better results.
Since the random forest regression model gives better results, we should use it to predict the tip rate of trips.
If I had more time, I would probably try building an autoencoder (a neural network model) to see if I can get better results on this problem.
Define the average speed over the course of a trip as Trip Distance / Trip Duration; its unit is miles per hour (mph).
From the density plot of the distributions of the average trip speeds in each week, we can see that the average trip speeds of week 37, week 38 and week 40 are slightly slower than the ones of week 36 and week 39. The ANOVA test shows that the population means of the average trip speed of these five weeks are significantly not the same.
I checked NYC historical weather in September 2015 and found that it was raining on Sep. 10th (in week 37) and Sep. 30th (in week 40). This probably explains why the average trip speeds in week 37 and week 40 are slightly slower than the ones in week 36 and week 39. I did not find any reasonable explanation for why the average trip speed in week 38 is slow.
The average trip speed during 21:00 - 7:00 is higher than that during 7:00 - 21:00. I used a two-sample t-test to support this.
## Environment: Python 3
## Import Packages
import os
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import warnings
warnings.simplefilter(action='ignore')
• Programmatically download and load into your favorite analytical tool the trip data for September 2015.
• Report how many rows and columns of data you have loaded.
## Download Data
if not os.path.exists('green_tripdata_2015-09.csv'):  # download only if the file is missing
    url = 'https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv'
    response = urllib.request.urlopen(url)
    content = response.read()
    with open('green_tripdata_2015-09.csv', 'wb') as f:
        f.write(content)
## Load Data
data = pd.read_csv('green_tripdata_2015-09.csv')
print('Number of rows:', data.shape[0])
print('Number of columns:', data.shape[1])
## Explore data
print(data.columns)
print(data.head(3))
print(data.describe())
From the above description, we can notice that some problems exist in this raw dataset:
1. Some records' pickup/dropoff longitude/latitude is 0
2. Some records' payment amounts are negative
3. Some records' trip type is missing
4. The whole column 'Ehail_fee' is empty
## Since the whole column "Ehail_fee" is NaN, we can drop that column
data.drop('Ehail_fee', axis=1, inplace=True)
## check if the records whose longitude/latitude is 0 have something to do with server connection
print('Pickup location is (0,0):')
print(data['Store_and_fwd_flag'][data['Pickup_longitude']==0].value_counts())
print('\n')
print('Dropoff location is (0,0):')
print(data['Store_and_fwd_flag'][data['Dropoff_longitude']==0].value_counts())
It seems that this problem is not related to "store and forward flag".
• Plot a histogram of the number of the trip distance (“Trip Distance”).
• Report any structure you find and any hypotheses you have about that structure.
print('Trip distance description (mile):')
print(data['Trip_distance'].describe())
## check the top 10 trip distance
print('The top 10 longest trip distances:')
print(sorted(data['Trip_distance'], reverse = True)[:10])
## The longest distance 603.1 is extraordinary and suspicious.
## Let's check the records with top 2 trip distance.
data[data['Trip_distance']>200]
The record with trip distance 246.28 seems normal. However, the record with trip distance 603.1 only took 11 minutes, which implies a speed of about 3,290 mph, so the value must be wrong! Remove this record and plot the histogram.
## Remove the outlier
print('Removing outliers ...')
data = data[data['Trip_distance'] < 250]
## Remove trip distance = 0
print('Removing trips whose trip distance = 0 ...')
data = data[data['Trip_distance'] > 0]
print('Remaining number of records:', data.shape[0])
## To make the figures clear, let's only see the ones with trip distance < some threshold.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize = (15,10))
data['Trip_distance'][data['Trip_distance'] < 40].hist(ax=ax1, bins=50)
ax1.set_title('Histogram of The Number of The Trip Distance (<40)')
ax1.set_xlabel('Trip Distance')
ax1.set_ylabel('Count')
data['Trip_distance'][data['Trip_distance'] < 40].hist(ax=ax2, bins=50)
ax2.set_yscale('log')
ax2.set_title('Histogram of The Number of The Trip Distance (<40)')
ax2.set_xlabel('Trip Distance')
ax2.set_ylabel('Count')
data['Trip_distance'][data['Trip_distance'] < 5].hist(ax=ax3, bins=50)
ax3.set_title('Histogram of The Number of The Trip Distance (<5)')
ax3.set_xlabel('Trip Distance')
ax3.set_ylabel('Count')
data['Trip_distance'].hist(ax=ax4, bins=200)
ax4.set_yscale('log')
ax4.set_title('Histogram of The Number of The Trip Distance (All)')
ax4.set_xlabel('Trip Distance')
ax4.set_ylabel('Count')
From the above figures, we can see that:
When trip distance > 1 mile, the count decreases as the trip distance increases; when trip distance < 1 mile, the count increases as the trip distance increases.
In reality, it makes sense that most taxi trips are short trips (around 1-2 miles). For long trips, it would be expensive to take a taxi, and people would probably choose public transportation like the metro or bus to save money, while for short trips, people probably prefer a taxi to save time.
Hypotheses:
When 1 mile < trip distance < 40 miles, $\log(\text{count})$ is almost linear in the trip distance (TD), i.e., $TD = b - a\log(\text{count})$; equivalently, the count decays roughly exponentially as the trip distance grows.
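A rough way to check this hypothesis is to fit $\log(\text{count})$ as a linear function of trip distance. The sketch below is illustrative only; the 50 bins and the 1-40 mile range are assumptions carried over from the histograms above.
## fit log(count) ~ slope*TD + intercept on the 1-40 mile range
counts, edges = np.histogram(
    data['Trip_distance'][(data['Trip_distance'] > 1) & (data['Trip_distance'] < 40)], bins=50)
centers = (edges[:-1] + edges[1:]) / 2
mask = counts > 0  # drop empty bins to avoid log(0)
slope, intercept = np.polyfit(centers[mask], np.log(counts[mask]), 1)
print('log(count) = %.3f * TD + %.3f' % (slope, intercept))  # slope should be negative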
• Report mean and median trip distance grouped by hour of day.
## There are only two time-related variables in this dataset:
## 'lpep_pickup_datetime' and 'Lpep_dropoff_datetime'.
## extract the hour of day from pickup datetime and dropoff datetime
data['pickup_hour'] = pd.DatetimeIndex(data['lpep_pickup_datetime']).hour
data['dropoff_hour'] = pd.DatetimeIndex(data['Lpep_dropoff_datetime']).hour
## group by hour, get the mean
mean_dist_pickup = data['Trip_distance'].groupby(data['pickup_hour']).mean()
mean_dist_dropoff = data['Trip_distance'].groupby(data['dropoff_hour']).mean()
## group by hour, get the median
median_dist_pickup = data['Trip_distance'].groupby(data['pickup_hour']).median()
median_dist_dropoff = data['Trip_distance'].groupby(data['dropoff_hour']).median()
## create a dataframe to store them
df = pd.concat([mean_dist_pickup, mean_dist_dropoff,
median_dist_pickup, median_dist_dropoff], axis=1)
df.columns = ['mean_dist_pickup', 'mean_dist_dropoff',
'median_dist_pickup', 'median_dist_dropoff']
## visualize the dataframe
ax = df.plot(title='Mean & Median Trip Distance Grouped by Hour of Day')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
ax.set_xticks(np.arange(24))
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Trip Distance")
From the above figure, we can see that
the trip distance grouped by hour of day calculated from pickup and dropoff are similar;
the median trip distance is always lower than the mean trip distance;
the mean and median trip distances have similar trends, and we can separate the day into three parts:
* trip distance appears to be the longest during 4am-7am;
* trip distance appears to be the shortest during 7am-9pm;
* trip distance appears to be medium during 9pm-4am.
• We’d like to get a rough sense of identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criteria, the average fare, and any other interesting characteristics of these trips.
# According to http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml,
# the initial charge is $2.50, so remove trips whose total amount < 2.5.
print('Removing trips whose total amount < $2.5 ...')
data = data[data['Total_amount'] > 2.5]
print('Remaining number of records:', data.shape[0])
## According to the document http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
## RateCodeID 2 represents JFK Airport, and RateCodeID 3 represents Newark Airport
## Here I choose to only consider the trips originate or terminate at JFK Airport
print('Number of transactions of trips that originate or terminate at JFK Airport: %d.'
% data[data['RateCodeID']==2].shape[0])
print('The average fare amount of these transactions is: $%.2f.'
% data['Fare_amount'][data['RateCodeID']==2].mean())
print('The average total amount of these transactions is: $%.2f. \n'
% data['Total_amount'][data['RateCodeID']==2].mean())
## Check the fare of JFK Airports
print('Fare amount:')
print(data['Fare_amount'][data['RateCodeID']==2].value_counts())
print('\nExtra amount:')
print(data['Extra'][data['RateCodeID']==2].value_counts())
print('\nMTA tax amount:')
print(data['MTA_tax'][data['RateCodeID']==2].value_counts())
print('\nImprove_surcharge:')
print(data['improvement_surcharge'][data['RateCodeID']==2].value_counts())
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (15,4))
data['Tolls_amount'][data['RateCodeID']==2].hist(ax=ax1, bins=100)
ax1.set_title('Tolls Amount Distribution')
data['Tip_amount'][data['RateCodeID']==2].hist(ax=ax2, bins=100)
ax2.set_title('Tip Amount Distribution')
The fare amount of JFK trips is a fixed value: $52.
The main causes of the differing total amounts are the tolls amount and the tip amount.
Let's check whether the tip amount has something to do with the hour of the day.
## Let's check whether the tip amount has something to do with the hour of a day.
## create a subset of JFK trips
df_jfk = data[data['RateCodeID']==2]
## Since most trips' pickup hour and dropoff hour are the same, we only use the pickup hour here.
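## Quick sanity check (an illustrative sketch, not part of the original analysis):
## the fraction of JFK trips whose pickup and dropoff fall in the same hour
print('Share of JFK trips with equal pickup and dropoff hour: %.1f%%'
      % (100 * (df_jfk['pickup_hour'] == df_jfk['dropoff_hour']).mean()))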
## calculate the mean and median tip amount group by pickup hour of a day
mean_tip_jfk = df_jfk['Tip_amount'].groupby(df_jfk['pickup_hour']).mean()
median_tip_jfk = df_jfk['Tip_amount'].groupby(df_jfk['pickup_hour']).median()
mean_tip_jfk.name = 'mean_tip'
median_tip_jfk.name = 'median_tip'
## plot
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,4))
mean_tip_jfk.plot(ax=ax1)
median_tip_jfk.plot(ax=ax1)
lines, labels = ax1.get_legend_handles_labels()
ax1.legend(lines, labels, loc='best')
ax1.set_xticks(np.arange(24))
ax1.set_title("Mean & Median Tip Amount of JFK Trips Grouped by Hour of Day")
ax1.set_xlabel("Hour of Day")
ax1.set_ylabel("Tip Amount")
## plot
df_jfk['pickup_hour'].value_counts().plot.bar(ax=ax2)
ax2.set_title("Count of Trips Grouped by Hour of Day")
ax2.set_xlabel("Hour of Day")
ax2.set_ylabel("Trip Count")
From the above figure, we can conclude that the tip amount does have something to do with the hour of the day.
• Build a derived variable for tip as a percentage of the total fare.
• Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). Provide an estimate of performance using an appropriate sample, and show your work.
## Before building the model, let's create new features: trip duration and average speed
## Create a new feature: trip duration
data['duration'] = pd.to_datetime(data['Lpep_dropoff_datetime']) - \
pd.to_datetime(data['lpep_pickup_datetime'])
data['duration'] = data['duration'] / np.timedelta64(1, 'h') # convert to hours
print('Trip duration description (hours):')
print(data['duration'].describe()[3:])
## it is abnormal that a trip's duration is more than 5 hours, so remove them
print('Remove trips whose duration > 5 hours and keep trips whose duration > 0 ...')
data = data[data['duration'] < 5]
data = data[data['duration'] > 0]
print('Remaining number of records: ', data.shape[0])
data['duration'].hist(bins=100)
plt.suptitle('Histogram of Trip Duration (hour)')
print('Define average speed over the course of a trip as Trip Distance/Trip Duration.')
data['avg_speed'] = data['Trip_distance']/data['duration']
print('Average speed description (mph):')
print(data['avg_speed'].describe()[3:])
# Remove trips whose avg speed > 80 mph
print('Remove trips whose avg speed > 80 mph')
data = data[data['avg_speed']<80]
print('Remaining number of records: ', data.shape[0])
data['avg_speed'].hist(bins=100)
plt.suptitle('Histogram of Trip Average Speed (mph)')
According to the handbook, tip amount is automatically populated for credit card tips. Cash tips are not included. Thus, for this question, I choose to only use the trips paid by credit card to build the predictive model.
## check the relationship between payment type and tip
ax = plt.subplot()
data.boxplot('Tip_amount','Payment_type', ax=ax)
ax.set_title('')
ax.set_ylabel('Tip Amount')
ax.set_xlabel('Payment (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown)')
From the figure above, we can see that the average tip amount of trips paid by credit card is greater than zero, while the others are zero. This makes sense since the tip amount is automatically populated for credit card tips, while cash tips are not included. So, let's only keep the trips paid by credit card.
# create a subset called df_tip with payment type = 1 (credit card)
print('Keep trips paid by credit card ...')
df_tip = data[data['Payment_type'] == 1].copy()  # copy to avoid chained-assignment issues when adding columns
print('Remaining number of records: ', df_tip.shape[0])
## Accoring to the handbook, fare amount is the time-and-distance fare calculated by the meter.
## Let's integrate other fare amount EXCEPT TIP AMOUNT into one variable called "other_amount".
df_tip['other_amount'] = df_tip['Extra'] + df_tip['MTA_tax'] + \
df_tip['Tolls_amount'] + df_tip['improvement_surcharge']
print('Other amount description ($):')
print(df_tip['other_amount'].describe()[3:])
# create a variable "tip rate", that is, the percentage of tip amount over the total amount
print('Define tip rate as the percentage of tip amount over the total amount. \n')
df_tip['tip_rate'] = 100*df_tip['Tip_amount']/df_tip['Total_amount']
print('Tip rate description (%):')
print(df_tip['tip_rate'].describe()[3:])
print('\n')
# check the number of trips without tip
print('Number of trips with zero tips: ', df_tip[df_tip['tip_rate']==0].shape[0])
print('The percentage of trips with zero tips: ',
round(100*df_tip[df_tip['tip_rate']==0].shape[0]/df_tip.shape[0],2), '%')
It is interesting that about 13.4% of the trips in this subset have a zero tip amount.
## More exploration about the relation between features and target
fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2, figsize=(15,10))
df_tip.boxplot('tip_rate','Passenger_count', ax=ax1)
ax1.set_title('')
ax1.set_ylabel('Tip Rate')
ax1.set_xlabel('Number of Passengers')
df_tip.boxplot('tip_rate','Trip_type ', ax=ax2)
ax2.set_title('')
ax2.set_ylabel('Tip Rate')
ax2.set_xlabel('Trip Type (1 = Street-hail, 2 = Dispatch)')
df_tip.boxplot('tip_rate','RateCodeID', ax=ax3)
ax3.set_title('')
ax3.set_ylabel('Tip Rate')
ax3.set_xlabel('RateCodeID (1=Standard, 2=JFK, 3=Newark, 4==Nassau/Westchester, 5=Negotiated)')
df_tip.boxplot('tip_rate','pickup_hour', ax=ax4)
ax4.set_title('')
ax4.set_ylabel('Tip Rate')
ax4.set_xlabel('Pickup Hour')
It seems that the tip rate distributions for 1 to 6 passengers are similar. However, for 7-9 passengers the distributions change dramatically, and this is caused by not having enough records (see the group-size check after these observations). According to the handbook, the number of passengers is a driver-entered value, which explains why some values are zero: the driver did not enter it correctly. Since there is no strong evidence that the number of passengers affects tip rate, I am not going to use this feature for the following predictive model.
It seems that street-hail trips tend to have higher tip rate than dispatch trips.
It seems that standard trips and JFK trips have higher tip rate than others.
It seems that pickup hour does affect tip rate.
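As a quick check (a sketch, not in the original notebook), the group sizes behind the passenger-count boxplots confirm how few 7-9 passenger trips there are:
## count trips per driver-entered passenger count
print(df_tip['Passenger_count'].value_counts().sort_index())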
fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2, figsize=(15,10))
ax1.scatter(df_tip['Trip_distance'], df_tip['tip_rate'])
ax1.set_xlabel('Trip Distance')
ax1.set_ylabel('Tip Rate (%)')
ax1.set_title('Tip Rate vs Trip Distance')
ax2.scatter(df_tip['duration'], df_tip['tip_rate'])
ax2.set_xlabel('Trip Duration')
ax2.set_ylabel('Tip Rate (%)')
ax2.set_title('Tip Rate vs Trip Duration')
ax3.scatter(df_tip['Fare_amount'], df_tip['tip_rate'])
ax3.set_xlabel('Fare Amount')
ax3.set_ylabel('Tip Rate (%)')
ax3.set_title('Tip Rate vs Fare Amount')
ax4.scatter(df_tip['other_amount'], df_tip['tip_rate'])
ax4.set_xlabel('Other Amount')
ax4.set_ylabel('Tip Rate (%)')
ax4.set_title('Tip Rate vs Other Amount')
In the following part, I will first build a linear regression model as a benchmark. Then I will build a regression random forest model to see if we can get better results.
## Choose features
# numerical features
X_num = ['Trip_distance', 'duration','Fare_amount', 'other_amount', 'pickup_hour']
# one-hot encoding for categorical feature Trip_type
df_tip['trip_type_1'] = (df_tip['Trip_type '] == 1.0).astype(int)
df_tip['trip_type_2'] = (df_tip['Trip_type '] == 2.0).astype(int)
# one-hot encoding for categorical feature RateCodeID
df_tip['rate_code_1'] = (df_tip['RateCodeID'] == 1).astype(int)
df_tip['rate_code_2'] = (df_tip['RateCodeID'] == 2).astype(int)
df_tip['rate_code_3'] = (df_tip['RateCodeID'] == 3).astype(int)
df_tip['rate_code_4'] = (df_tip['RateCodeID'] == 4).astype(int)
df_tip['rate_code_5'] = (df_tip['RateCodeID'] == 5).astype(int)
# final categorical features
X_cat = ['trip_type_1', 'trip_type_2',
'rate_code_1', 'rate_code_2', 'rate_code_3', 'rate_code_4', 'rate_code_5']
# only use numerical features as X for linear regression model
X_lm = df_tip[X_num]
# target y
y = df_tip['tip_rate']
## Use statsmodels since it provides p-values of coefficients automatically
import statsmodels.api as sm
from scipy import stats
est = sm.OLS(y, sm.add_constant(X_lm))
print(est.fit().summary())
The p-value of the trip duration coefficient is larger than 0.05, so remove this variable and run the linear regression model again.
## remove the column of trip duration and run the model again
X_lm = X_lm[['Trip_distance','Fare_amount','other_amount','pickup_hour']]
est2 = sm.OLS(y, sm.add_constant(X_lm))
print(est2.fit().summary())
Now the coefficients of all four variables are statistically significant (p < 0.05). Next, use the package "sklearn" to split the dataset into a training dataset and a test dataset (80% : 20%), and calculate the MAE and RMSE for them.
## Split data into a training dataset and a test dataset (8:2)
X_train, X_test, y_train, y_test = train_test_split(X_lm, y, test_size=0.2, random_state=0)
## Linear Regression Model
lm = LinearRegression()  # note: 'normalize=True' was removed in newer sklearn and did not affect predictions
lm.fit(X_train, y_train)
df_feat_impt = pd.DataFrame(lm.coef_, X_lm.columns, columns=['Coefficient'])
df_feat_impt.plot.barh()
## results on the training dataset
scores_MAE = cross_val_score(lm, X_train, y_train, cv=10, scoring='neg_mean_absolute_error')
scores_MSE = cross_val_score(lm, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
print('Cross validation (10-fold) on the training dataset:')
print("Mean Absolute Error (MAE): %0.2f" % (-scores_MAE).mean())
print("Root Mean Squared Error (RMSE): %0.2f" % np.sqrt(-scores_MSE).mean())
print('\n')
## results on the test dataset
y_pred_test = lm.predict(X_test)
print('Results on the test dataset:')
print("Mean Absolute Error (MAE): %0.2f" % metrics.mean_absolute_error(y_test, y_pred_test))
print("Root Mean Squared Error (RMSE): %0.2f" % np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
The training results calculated by 10-fold cross validation: MAE: 5.44, RMSE: 7.35.
The test results: MAE: 5.42, RMSE: 7.34.
Tip rate has a positive correlation with trip distance and other amount (total amount - tip amount - fare amount). Tip rate has a negative correlation with fare amount.
We have already explored the relation between the tip rate and the hour of day, and we have found that their relation is not linear. It makes sense that the coefficient of the variable "pickup hour" is so close to zero.
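One possible remedy, sketched below but not used in the models above, is a cyclical sine/cosine encoding of the pickup hour, which would let a linear model capture the daily periodicity:
## cyclical encoding of pickup hour (illustrative sketch; not among the original features)
df_tip['hour_sin'] = np.sin(2 * np.pi * df_tip['pickup_hour'] / 24)
df_tip['hour_cos'] = np.cos(2 * np.pi * df_tip['pickup_hour'] / 24)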
# use both numerical and categorical features as X for regression random forest model
X_feat = X_cat + X_num
X = df_tip[X_feat]
# target y
y = df_tip['tip_rate']
## Split data into a training dataset and a test dataset (8:2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
## Random Forest Regression Model
regressor = RandomForestRegressor(max_depth=8)
regressor.fit(X_train, y_train)
df_feat_impt = pd.DataFrame(regressor.feature_importances_, X.columns, columns=['Importance'])
df_feat_impt.plot.barh()
## results on the training dataset
scores_MAE = cross_val_score(regressor, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
scores_MSE = cross_val_score(regressor, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print('Cross validation (5-fold) on the training dataset:')
print("Mean Absolute Error (MAE): %0.2f" % (-scores_MAE).mean())
print("Mean Squared Error (MSE): %0.2f" % np.sqrt(-scores_MSE).mean())
print('\n')
## results on the test dataset
y_pred_test = regressor.predict(X_test)
print('Results on the test dataset:')
print("Mean Absolute Error (MAE)%0.2f" % metrics.mean_absolute_error(y_test, y_pred_test))
print("Root Mean Squared Error (RMSE): %0.2f" % np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
The results of this random forest regression model are better than those of the linear regression model: its root mean squared error (RMSE) and mean absolute error (MAE) are slightly lower. So, we should use this model to predict the tip rate of trips.
The training results calculated by 5-fold cross validation: MAE: 5.40, RMSE: 7.20.
The test results: MAE: 5.37, RMSE: 7.18.
Fare amount and other amount are the top two features affecting tip rate in this random forest regression model, and this corresponds with the results of the linear regression model.
Duration, trip distance, and pickup hour also affect the tip rate.
It is interesting that negotiated trips (RateCodeID=5) affect tip rate.
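As a minimal usage sketch (the feature values below are invented for illustration, not drawn from the data), the fitted random forest can score a single hypothetical trip:
## predict the tip rate of one hypothetical standard-rate, street-hail trip
example = pd.DataFrame([{
    'trip_type_1': 1, 'trip_type_2': 0,
    'rate_code_1': 1, 'rate_code_2': 0, 'rate_code_3': 0,
    'rate_code_4': 0, 'rate_code_5': 0,
    'Trip_distance': 2.5, 'duration': 0.25, 'Fare_amount': 11.0,
    'other_amount': 1.3, 'pickup_hour': 18}])[X_feat]  # match the training column order
print('Predicted tip rate: %.1f%%' % regressor.predict(example)[0])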
• Build a derived variable representing the average speed over the course of a trip.
At the beginning of question 4, we have already defined the average speed over the course of a trip as Trip Distance/Trip Duration, and its unit is miles per hour (mph).
• Can you perform a test to determine if the average trip speeds are materially the same in all weeks of September? If you decide they are not the same, can you form a hypothesis regarding why they differ?
# create the variable "week" extracted from trip datetime
data['week'] = pd.DatetimeIndex(data['lpep_pickup_datetime']).week
print("Count of trips in each week:")
print(data['week'].value_counts())
# draw density plot of avg speed of each week
fig, ax = plt.subplots(1,1,figsize=(10,6))
data['avg_speed'][data['week']==36].plot.kde()
data['avg_speed'][data['week']==37].plot.kde()
data['avg_speed'][data['week']==38].plot.kde()
data['avg_speed'][data['week']==39].plot.kde()
data['avg_speed'][data['week']==40].plot.kde()
lines = ax.get_legend_handles_labels()[0]
labels = ['Week 36', 'Week 37', 'Week 38', 'Week 39', 'Week 40']
ax.legend(lines, labels, loc='best')
ax.set_xlim([0,50])
ax.set_xlabel("Average Speed")
ax.set_ylabel("Density")
ax.set_title("Density Plot of Average Speed in Each Week")
From the above figure, we can see that the average trip speeds of week 37, week 38 and week 40 are slightly slower than the ones of week 36 and week 39. Next, I will perform an ANOVA test to see if the population means of the average trip speed of these five weeks are the same.
Note: From the above figure, we can see that the distributions of the average trip speed are not normal. However, the sample size of each group (each week) is large enough that we can still use an ANOVA test in this case.
## Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
## The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.
## The test is applied to samples from two or more groups, possibly with differing sizes.
from scipy.stats import f_oneway
print('ANOVA test:')
print(f_oneway(data['avg_speed'][data['week']==36],
data['avg_speed'][data['week']==37],
data['avg_speed'][data['week']==38],
data['avg_speed'][data['week']==39],
data['avg_speed'][data['week']==40]))
print("Mean average speed of week 36: %0.2f mph" % data['avg_speed'][data['week']==36].mean())
print("Mean average speed of week 37: %0.2f mph" % data['avg_speed'][data['week']==37].mean())
print("Mean average speed of week 38: %0.2f mph" % data['avg_speed'][data['week']==38].mean())
print("Mean average speed of week 39: %0.2f mph" % data['avg_speed'][data['week']==39].mean())
print("Mean average speed of week 40: %0.2f mph" % data['avg_speed'][data['week']==40].mean())
The ANOVA test shows that the p-value is very small and the statistic is very large. Thus, the population means of the average trip speed of these five weeks are significantly not the same.
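As a robustness check (a sketch, not part of the original analysis), a non-parametric Kruskal-Wallis test avoids the normality concern noted above and can be applied to the same five groups:
## non-parametric alternative to the one-way ANOVA above
from scipy.stats import kruskal
print(kruskal(*[data['avg_speed'][data['week'] == w] for w in range(36, 41)]))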
I checked NYC historical weather in September 2015 and found that it was raining on Sep. 10th (in week 37) and Sep. 30th (in week 40). This probably explains why the average trip speeds in week 37 and week 40 are slightly slower than the ones in week 36 and week 39. I did not find any reasonable explanation for why the average trip speed in week 38 is slow.
# create the variable "pickup time" extracted from trip pickup datetime,
# only use "hour" and "minute", not "second"
data['pickup_time'] = pd.DatetimeIndex(data['lpep_pickup_datetime']).hour + \
pd.DatetimeIndex(data['lpep_pickup_datetime']).minute/60
ax = plt.subplot()
data['avg_speed'].groupby(data['pickup_time']).mean().plot()
data['avg_speed'].groupby(data['pickup_hour']).mean().plot()
data['avg_speed'].groupby(data['pickup_time']).median().plot()
data['avg_speed'].groupby(data['pickup_hour']).median().plot()
lines = ax.get_legend_handles_labels()[0]
labels = ['mean avg speed (pickup minute)', 'mean avg speed (pickup hour)',
'median avg speed (pickup minute)', 'median avg speed (pickup hour)']
ax.legend(lines, labels, loc='best')
ax.set_xlim([0,24])
ax.set_xticks(np.arange(24))
ax.set_xlabel("Time of Day")
ax.set_ylabel("Average Speed (mph)")
ax.set_title("Average Speed of Time of Day")
Hypothesis: The average trip speed during 21:00 - 7:00 is higher than that during 7:00 - 21:00.
We can use a two-sample t-test to test this.
from scipy.stats import ttest_ind
hour_7to21 = data[(data['pickup_time'] >= 7) & (data['pickup_time'] < 21)]
hour_21to7 = data[(data['pickup_time'] >= 21) | (data['pickup_time'] < 7)]  # disjoint from the first group
print('Two-sample t-test for average speed during 7:00 - 21:00 and 21:00 - 7:00.')
print(ttest_ind(hour_7to21['avg_speed'], hour_21to7['avg_speed']))
The p-value is essentially zero, and the absolute t-statistic is very large. This indicates that the average trip speed during 7:00 - 21:00 and the one during 21:00 - 7:00 are significantly different.
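A variant worth noting (a sketch, not part of the original analysis): Welch's t-test drops the equal-variance assumption, which is safer when the two groups differ in size and spread.
## Welch's t-test: the same comparison without assuming equal variances
print(ttest_ind(hour_7to21['avg_speed'], hour_21to7['avg_speed'], equal_var=False))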