Yi Li
In this challenge, I fitted and evaluated regression tree models to predict the price-per-area of a parcel of land.
1. The best results of baseline models using gdd, slope, and the other attributes loaded from the shapefile:
RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435
2. To improve the results, I added a new feature, convexness, defined as the parcel's area divided by the area of the parcel's convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel. The best results of the models with convexness added:
RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693
3. Next, I added 65 land cover features to the models. Each land cover feature represents one land cover type, and its value is the count of pixels of that type within the parcel. The best results of the models with the land cover features added:
RMSE (test): 14.93993752104971
MAE (test): 11.58687418331668
Conclusions: Adding the convexness feature and the land cover features improves the results of the regression tree model.
Each parcel in the shapefile has properties including gdd (growing degree days) and slope which are relevant for predicting land value. (Don't worry about the units in which these scalar attributes are measured: this is a simulated dataset, not real data.) As a warmup exercise, define price_per_area as the tot_price (total price) divided by the parcel's area, and show us a histogram of price_per_area, which will be the predicted variable (the "Y") for your valuation models.
import geopandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
gdf = geopandas.read_file('parcels_exam.shp')
gdf.head(2)
# gdf.describe()
gdf.info()
# define price_per_area
gdf['price_per_area'] = gdf['tot_price']/gdf['area']
gdf.head(2)
# draw histogram
gdf['price_per_area'].hist(bins=30)
plt.title('Histogram of Price Per Area')
plt.xlabel('Price per Area')
plt.ylabel('Count')
In other words, predict price_per_area using gdd, slope, and the other attributes you loaded in part (1).
from pandas.plotting import scatter_matrix
scatter_matrix(gdf[['price_per_area', 'gdd', 'slope', 'water', 'ponding', 'organic', 'road']],
               alpha=0.2, figsize=(12, 6), diagonal='kde');
These features do not follow a normal distribution, so regression tree models are more suitable than linear regression models in this case.
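As a quick numeric check of this visual impression, we can look at the sample skewness of each column (an illustrative addition; a normal distribution has skewness 0, so large absolute values indicate strong asymmetry).
# quick check: sample skewness of each column (0 for a normal distribution)
cols = ['price_per_area', 'gdd', 'slope', 'water', 'ponding', 'organic', 'road']
print(gdf[cols].skew())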
# select features and target for modeling
X = gdf[['gdd', 'slope', 'water', 'ponding', 'organic', 'road']].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)
# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)
# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))
print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
The results of "Regressor 2" are better than those of "Regressor 1".
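Rather than comparing only depths 3 and 4, one could also pick max_depth by cross-validation. The following is a minimal sketch, not part of the original analysis; it assumes scikit-learn >= 0.22 for the 'neg_root_mean_squared_error' scorer.
# illustrative sketch: cross-validated sweep over max_depth
# (assumes scikit-learn >= 0.22 for this scoring string)
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': list(range(2, 11))}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=5)
search.fit(X_train, y_train)
print('best max_depth:', search.best_params_['max_depth'])
print('best CV RMSE:', -search.best_score_)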
Hint: Parcels of land tend to be more valuable when they are "more convex". Can you construct a continuous feature that encodes "convex-ness"?
Hint: Use Shapely for manipulating vectors and shapes.
For example, you can use Shapely's object.convex_hull to easily get a parcel's convex hull.
Before defining a feature to measure "convex-ness", let's look at some examples.
import shapely
i=6
gdf['geometry'][i]
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
i=3
gdf['geometry'][i]
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
i=1
gdf['geometry'][i]
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
We can define convexness as the parcel's area divided by the area of the parcel's convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel.
# define convexness
gdf['convexness'] = gdf['area'] / gdf['geometry'].apply(lambda x: x.convex_hull.area)
gdf.head()
# select features and target for modeling
X = gdf[['gdd', 'slope', 'water', 'ponding', 'organic', 'road', 'convexness']].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)
# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)
# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))
print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
The results of "Regressor 1" are better than those of "Regressor 2".
The best results of previous baseline models:
RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435
To improve the results, we add a new feature, convexness, defined as the parcel's area divided by the area of the parcel's convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel.
The best results of the models with the feature convexness:
RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693
Conclusions: Adding the convexness feature improves the results of the regression tree model, reducing test RMSE from 15.83 to 15.65 and test MAE from 12.87 to 12.16.
Careful, this is a spatial join, not a traditional database join! How does model performance improve relative to the models you built in (2) and (3)? How did you encode the CDL features?
Hint: Use rasterio.mask.mask to find the CDL pixels that cover a parcel shape, which will be useful for figuring out what crops were grown on the parcel. If you get stuck on this step, you can get partial credit by giving us summary statistics describing the CDL raster. For example, how many pixels are there in the raster? How many are coded as almonds? The values in the CDL raster are integer land cover codes; see cdl_mapping.csv for what each code represents.
import rasterio
with rasterio.open('CDL_2017_clip_20180808132906_1114279202.tif') as src:
    array = src.read()
print('shape:', array.shape)  # (band, height, width)
print('land cover code matrix: \n', array[0])
print('unique land cover codes: \n', np.unique(array[0]))
print('count of unique land cover codes:', len(np.unique(array[0])))
To encode the CDL features, we can add 65 new features to the dataset. Each new feature represents one land cover type, and its value is the count of pixels of that type within the parcel (row).
# feature names
names = [str(i) for i in np.unique(array[0])]
# initialize these code features to zero
for i in names:
    gdf['code_'+i] = 0
print('shape:', gdf.shape)
print("There are {}*{} = {} pixcels in the raster."\
.format(array.shape[1], array.shape[2], array.shape[1]*array.shape[2]))
# code 75 represents almonds
print("There are {} pixcels are coded as almonds (land cover code = 75)."\
.format((array[0] == 75).sum()))
Use rasterio.mask.mask to find the CDL pixels that cover a parcel shape.
from rasterio.mask import mask
with rasterio.open('CDL_2017_clip_20180808132906_1114279202.tif') as src:
    for i in range(gdf.shape[0]):
        # find the CDL pixels that cover this parcel shape
        out_image, out_transform = mask(src, [gdf['geometry'][i]], crop=True)
        # land cover codes as keys, pixel counts as values
        keys, values = np.unique(out_image, return_counts=True)
        # update the count columns for this parcel
        for j in range(len(keys)):
            gdf.loc[i, 'code_'+str(keys[j])] = values[j]
gdf.head()
print("There are {} pixcels are coded as almonds in these 500 parcels."\
.format(gdf['code_75'].sum()))
gdf.columns
# select features and target for modeling
feature_cols = (['gdd', 'slope', 'water', 'ponding', 'organic', 'road', 'convexness']
                + ['code_'+i for i in names])  # the 65 land cover count features
X = gdf[feature_cols].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)
# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)
# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))
print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
The results of "Regressor 1" are better than those of "Regressor 2".
1. The best results of baseline models using gdd, slope, and the other attributes loaded from the shapefile:
RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435
2. To improve the results, we added a new feature, convexness, defined as the parcel's area divided by the area of the parcel's convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel. The best results of the models with convexness added:
RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693
3. Next, we added 65 land cover features to the models. Each land cover feature represents one land cover type, and its value is the count of pixels of that type within the parcel. The best results of the models with the land cover features added:
RMSE (test): 14.93993752104971
MAE (test): 11.58687418331668
Conclusions: Adding the convexness feature and the land cover features improves the results of the regression tree model.
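As a sanity check on these conclusions, one can inspect which features the final tree actually splits on. The following is a minimal sketch, assuming feature_cols is the column list used to build X in the land cover model and regr_1 is the fitted depth-3 tree from that step.
# illustrative sketch: top features by importance in the final tree
importances = pd.Series(regr_1.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))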