Summary

Yi Li

In this challenge, I fitted and evaluated regression tree models to predict the price-per-area of a parcel of land.

1. The best results of the baseline models using gdd, slope, and the other attributes loaded from the shapefile:

RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435

2. To improve the results, I added a new feature, convexness, defined as the parcel's area divided by the area of its convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel. The best results of the models with the convexness feature:

RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693


3. Next, I added 65 land cover features to the models. Each land cover feature represents one land cover type, and its value is the count of pixels of that type within the parcel. The best results of the models with the land cover features:

RMSE (test): 14.93993752104971
MAE (test): 11.58687418331668

Conclusions: Adding the convexness feature and the land cover features improves the results of the regression tree model.

1. Load the parcels and extract their scalar attributes.

Each parcel in the shapefile has properties including gdd (growing degree days) and slope which are relevant for predicting land value. (Don't worry about the units in which these scalar attributes are measured: this is a simulated dataset, not real data.) As a warmup exercise, define price_per_area as the tot_price (total price) divided by the parcel's area, and show us a histogram of price_per_area, which will be the predicted variable (the "Y") for your valuation models.

In [1]:
import geopandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

gdf = geopandas.read_file('parcels_exam.shp')
gdf.head(2)
Out[1]:
_id area water ponding organic slope gdd road tot_price geometry
0 0 212202.777883 0.470132 0.716075 0.287991 0.383462 0.749170 0.878452 2.096351e+07 POLYGON ((-2126055.914236194 1782007.052652219...
1 1 115054.634691 0.311945 0.398221 0.209844 0.186193 0.944372 0.739551 6.928590e+06 (POLYGON ((-2125540.672597182 1742579.32796026...
In [2]:
# gdf.describe()
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
_id          500 non-null int64
area         500 non-null float64
water        500 non-null float64
ponding      500 non-null float64
organic      500 non-null float64
slope        500 non-null float64
gdd          500 non-null float64
road         500 non-null float64
tot_price    500 non-null float64
geometry     500 non-null object
dtypes: float64(8), int64(1), object(1)
memory usage: 39.1+ KB
In [3]:
# define price_per_area 
gdf['price_per_area'] = gdf['tot_price']/gdf['area']
gdf.head(2)
Out[3]:
_id area water ponding organic slope gdd road tot_price geometry price_per_area
0 0 212202.777883 0.470132 0.716075 0.287991 0.383462 0.749170 0.878452 2.096351e+07 POLYGON ((-2126055.914236194 1782007.052652219... 98.79
1 1 115054.634691 0.311945 0.398221 0.209844 0.186193 0.944372 0.739551 6.928590e+06 (POLYGON ((-2125540.672597182 1742579.32796026... 60.22
In [4]:
# draw histogram
gdf['price_per_area'].hist(bins=30)
plt.title('Histogram of Price Per Area')
plt.xlabel('Price')
plt.ylabel('Count')
Out[4]:
Text(0,0.5,'Count')

2. Build and validate a baseline valuation model using only the scalar attributes.

In other words, predict price_per_area using gdd, slope, and the other attributes you loaded in part (1).

In [5]:
from pandas.plotting import scatter_matrix

scatter_matrix(gdf[['price_per_area', 'gdd', 'slope', 'water', 'ponding', 'organic', 'road']], 
               alpha=0.2, figsize=(12, 6), diagonal='kde');

These features do not follow a normal distribution, and the scatter matrix shows no clear linear relationship with the target, so regression tree models are more suitable than linear regression models in this case.
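The normality claim above can be spot-checked with a formal test. Below is a minimal, self-contained sketch using SciPy's Shapiro-Wilk test; the uniform sample is a synthetic stand-in for a feature column (in the notebook you would pass e.g. `gdf['price_per_area']` instead):

```python
# Shapiro-Wilk normality test on a synthetic stand-in for one feature column.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.uniform(0, 1, size=500)  # stand-in for a feature column

stat, p_value = stats.shapiro(sample)
print(f'Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4g}')
# A small p-value (e.g. < 0.05) is evidence against normality.
```

A small p-value rejects the null hypothesis of normality, which is consistent with what the scatter matrix suggests visually.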

Simple Baseline Model by Using Decision Tree Regressor

In [6]:
# select features and target for modeling
X = gdf[['gdd', 'slope', 'water', 'ponding', 'organic', 'road']].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)

# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))

print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
Regressor 1:
RMSE (train): 16.277120968353113
RMSE (test): 16.332843095162076
MAE (test): 13.043087891890753

Regressor 2:
RMSE (train): 15.149164183922451
RMSE (test): 15.828194738595942
MAE (test): 12.868690747768436

The results of "Regressor 2" are better than those of "Regressor 1".
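Rather than comparing two fixed depths by hand, a small cross-validation sweep can select max_depth. This is a hedged sketch on synthetic data; `X_demo` and `y_demo` are stand-ins for the notebook's `X_train` and `y_train`:

```python
# Pick max_depth for a DecisionTreeRegressor by 5-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(400, 6))                       # stand-in features
y_demo = 100 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(0, 5, size=400)

scores = {}
for depth in range(2, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # negative RMSE: higher (closer to zero) is better
    cv = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring='neg_root_mean_squared_error')
    scores[depth] = cv.mean()

best_depth = max(scores, key=scores.get)
print('best max_depth:', best_depth)
```

The same loop applied to the real training split would make the depth-3 vs depth-4 comparison less dependent on a single train/test split.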

3. Calculate features from parcel geometries (polygons) and incorporate these into the model.

  • Hint: Parcels of land tend to be more valuable when they are "more convex". Can you construct a continuous feature that encodes "convex-ness"?

  • Hint: Use Shapely for manipulating vectors and shapes. For example, you can use Shapely's object.convex_hull to easily get a parcel's convex hull.

Before defining a feature to measure "convex-ness", let's look at some examples.

Example 1: Convex Parcel

In [7]:
import shapely

i=6
gdf['geometry'][i]
Out[7]:
In [8]:
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
area of original parcel: 235248.49204056288
area of the convex hull of the original parcel: 235248.49204056288
ratio: 1.0
Out[8]:

Example 2: Non-Convex Parcel

In [9]:
i=3
gdf['geometry'][i]
Out[9]:
In [10]:
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
area of original parcel: 204126.94794101038
area of the convex hull of the original parcel: 297176.44146139
ratio: 0.6868880552482527
Out[10]:

Example 3: Non-Convex Parcel

In [11]:
i=1
gdf['geometry'][i]
Out[11]:
In [12]:
print('area of original parcel:', gdf['geometry'][i].area)
print('area of the convex hull of the original parcel:', gdf['geometry'][i].convex_hull.area)
print('ratio:', gdf['geometry'][i].area/gdf['geometry'][i].convex_hull.area)
gdf['geometry'][i].convex_hull
area of original parcel: 115054.6346905522
area of the convex hull of the original parcel: 214637.86950689883
ratio: 0.5360407040699504
Out[12]:

We can define convexness as the parcel's area divided by the area of its convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel.

In [13]:
# define convexness
gdf['convexness'] = gdf['area'] / gdf['geometry'].apply(lambda x: x.convex_hull.area)
gdf.head()
Out[13]:
_id area water ponding organic slope gdd road tot_price geometry price_per_area convexness
0 0 212202.777883 0.470132 0.716075 0.287991 0.383462 0.749170 0.878452 2.096351e+07 POLYGON ((-2126055.914236194 1782007.052652219... 98.79 1.000000
1 1 115054.634691 0.311945 0.398221 0.209844 0.186193 0.944372 0.739551 6.928590e+06 (POLYGON ((-2125540.672597182 1742579.32796026... 60.22 0.536041
2 2 308655.040738 0.555649 0.009240 0.833038 0.984329 0.703495 0.181631 3.277608e+07 POLYGON ((-2091681.964114653 1748159.626105954... 106.19 1.000000
3 3 204126.947941 0.277596 0.128861 0.392676 0.956406 0.187131 0.903984 2.003098e+07 POLYGON ((-2089203.134130352 1764248.622634042... 98.13 0.686888
4 4 337687.699386 0.374245 0.324405 0.680116 0.795535 0.503934 0.296242 2.981107e+07 POLYGON ((-2087013.985739365 1758534.117891314... 88.28 1.000000
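As a quick sanity check on the definition, here is a self-contained toy example; the L-shaped polygon below is illustrative, not from the dataset:

```python
# Convexness ratio on a toy L-shaped polygon.
from shapely.geometry import Polygon

# L-shape: a 2x2 square with a 1x1 corner removed -> area 3,
# while its convex hull has area 3.5
l_shape = Polygon([(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)])
convexness = l_shape.area / l_shape.convex_hull.area
print(convexness)  # 3 / 3.5 ~ 0.857, i.e. noticeably less than 1
```

A fully convex polygon (such as the square itself) would score exactly 1, matching Example 1 above.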

Rebuild Model

In [14]:
# select features and target for modeling
X = gdf[['gdd', 'slope', 'water', 'ponding', 'organic', 'road', 'convexness']].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)

# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))

print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
Regressor 1:
RMSE (train): 14.993250377845005
RMSE (test): 15.652147157379183
MAE (test): 12.163080177963696

Regressor 2:
RMSE (train): 13.63532898587361
RMSE (test): 15.996193730402316
MAE (test): 12.246639227968412

The results of "Regressor 1" are better than those of "Regressor 2".

Summary

The best results of the previous baseline models:

RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435

To improve the results, we added a new feature, convexness, defined as the parcel's area divided by the area of its convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel.

The best results of the models with the feature convexness:

RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693

Conclusions: Adding the convexness feature improves the results of the regression tree model.

4. Join CDL information to parcels and incorporate crop type or other land cover features into the model.

Careful, this is a spatial join, not a traditional database join! How does model performance improve relative to the models you built in (2) and (3)? How did you encode the CDL features?

  • Hint: Rasterio for working with raster datasets. For example, you can use rasterio.mask.mask to find the CDL pixels that cover a parcel shape, which will be useful for figuring out what crops were grown on the parcel.

If you get stuck on this step, you can get partial credit by giving us summary statistics describing the CDL raster. For example, how many pixels are there in the raster? How many are coded as almonds? The values in the CDL raster are integer land cover codes; see cdl_mapping.csv for what each code represents.

In [15]:
import rasterio

with rasterio.open('CDL_2017_clip_20180808132906_1114279202.tif') as src:
    array = src.read()
    print('shape:', array.shape) # band, height, width
    print('land cover code matrix: \n', array[0]) # land cover code matrix
    print('unique land cover code: \n', np.unique(array[0])) # unique land cover code
    print('count of land cover code:', len(np.unique(array[0])))
shape: (1, 1506, 2918)
land cover code matrix: 
 [[ 21  21  21 ... 152 176   0]
 [ 21  21  21 ... 176 176   0]
 [ 21  21  21 ... 176 176   0]
 ...
 [ 75  75  75 ...  75  75   0]
 [ 75  75  75 ...  75  75   0]
 [ 75  75  75 ...  36  24   0]]
unique land cover code: 
 [  0   1   2   3   4  21  23  24  27  28  33  36  37  42  44  48  49  54
  57  59  61  66  67  69  71  72  74  75  76  77 111 121 122 123 124 131
 141 142 143 152 176 190 195 204 205 206 208 209 211 212 213 214 216 217
 218 220 221 224 225 226 227 236 237 238 242]
count of land cover code: 65

To encode the CDL features, we can add 65 new features to the dataset. Each new feature represents one land cover type, and its value is the count of pixels of that type within the parcel (row).
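The count encoding below relies on np.unique with return_counts=True. A minimal self-contained sketch of that step, where the toy array stands in for the masked pixel output of one parcel:

```python
# Count pixels per land cover code with np.unique(..., return_counts=True).
import numpy as np

# Toy stand-in for the output of rasterio.mask.mask on one parcel.
parcel_pixels = np.array([[75, 75, 21],
                          [75, 21,  0]])

codes, counts = np.unique(parcel_pixels, return_counts=True)
print(dict(zip(codes.tolist(), counts.tolist())))
# -> {0: 1, 21: 2, 75: 3}
```

Each (code, count) pair then fills the corresponding `code_*` column for that parcel.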

In [16]:
# feature names
names = [str(i) for i in np.unique(array[0])]

# initialize these code features to zero
for i in names:
    gdf['code_'+i] = 0
    
print('shape:', gdf.shape)
shape: (500, 77)

Statistics Summary of CDL Raster

In [17]:
print("There are {}*{} = {} pixels in the raster."\
      .format(array.shape[1], array.shape[2], array.shape[1]*array.shape[2]))  

# code 75 represents almonds
print("{} pixels are coded as almonds (land cover code = 75)."\
      .format((array[0] == 75).sum()))
There are 1506*2918 = 4394508 pixels in the raster.
692941 pixels are coded as almonds (land cover code = 75).

Use rasterio.mask.mask to find the CDL pixels that cover a parcel shape.

In [19]:
from rasterio.mask import mask

with rasterio.open('CDL_2017_clip_20180808132906_1114279202.tif') as src:
    for i in range(gdf.shape[0]):
        #  find CDL pixels that cover each parcel shape
        out_image, out_transform = mask(src, 
                                    geopandas.geoseries.GeoSeries(gdf['geometry'][i]),
                                    crop = True) # pad=True
        # print(out_image) # a matrix
    
        keys, values = np.unique(out_image, return_counts=True) # feature as key, count as values
        
        # update values in the dataset
        for j in range(len(keys)):
            gdf.loc[i, 'code_'+str(keys[j])] = values[j]
In [20]:
gdf.head()
Out[20]:
_id area water ponding organic slope gdd road tot_price geometry ... code_220 code_221 code_224 code_225 code_226 code_227 code_236 code_237 code_238 code_242
0 0 212202.777883 0.470132 0.716075 0.287991 0.383462 0.749170 0.878452 2.096351e+07 POLYGON ((-2126055.914236194 1782007.052652219... ... 0 0 0 1 0 0 0 0 1 0
1 1 115054.634691 0.311945 0.398221 0.209844 0.186193 0.944372 0.739551 6.928590e+06 (POLYGON ((-2125540.672597182 1742579.32796026... ... 0 0 0 0 0 0 0 0 0 0
2 2 308655.040738 0.555649 0.009240 0.833038 0.984329 0.703495 0.181631 3.277608e+07 POLYGON ((-2091681.964114653 1748159.626105954... ... 0 0 0 1 0 0 0 0 0 0
3 3 204126.947941 0.277596 0.128861 0.392676 0.956406 0.187131 0.903984 2.003098e+07 POLYGON ((-2089203.134130352 1764248.622634042... ... 0 0 0 0 0 0 0 0 0 0
4 4 337687.699386 0.374245 0.324405 0.680116 0.795535 0.503934 0.296242 2.981107e+07 POLYGON ((-2087013.985739365 1758534.117891314... ... 0 0 0 1 0 0 0 0 0 0

5 rows × 77 columns

In [21]:
print("{} pixels are coded as almonds across these 500 parcels."\
    .format(gdf['code_75'].sum()))
19143 pixels are coded as almonds across these 500 parcels.

Rebuild Model with Land Cover Code Features

In [22]:
gdf.columns
Out[22]:
Index(['_id', 'area', 'water', 'ponding', 'organic', 'slope', 'gdd', 'road',
       'tot_price', 'geometry', 'price_per_area', 'convexness', 'code_0',
       'code_1', 'code_2', 'code_3', 'code_4', 'code_21', 'code_23', 'code_24',
       'code_27', 'code_28', 'code_33', 'code_36', 'code_37', 'code_42',
       'code_44', 'code_48', 'code_49', 'code_54', 'code_57', 'code_59',
       'code_61', 'code_66', 'code_67', 'code_69', 'code_71', 'code_72',
       'code_74', 'code_75', 'code_76', 'code_77', 'code_111', 'code_121',
       'code_122', 'code_123', 'code_124', 'code_131', 'code_141', 'code_142',
       'code_143', 'code_152', 'code_176', 'code_190', 'code_195', 'code_204',
       'code_205', 'code_206', 'code_208', 'code_209', 'code_211', 'code_212',
       'code_213', 'code_214', 'code_216', 'code_217', 'code_218', 'code_220',
       'code_221', 'code_224', 'code_225', 'code_226', 'code_227', 'code_236',
       'code_237', 'code_238', 'code_242'],
      dtype='object')
In [23]:
# select features and target for modeling
X = gdf[['gdd', 'slope', 'water', 'ponding', 'organic', 'road', 'convexness', 'code_0',
       'code_1', 'code_2', 'code_3', 'code_4', 'code_21', 'code_23', 'code_24',
       'code_27', 'code_28', 'code_33', 'code_36', 'code_37', 'code_42',
       'code_44', 'code_48', 'code_49', 'code_54', 'code_57', 'code_59',
       'code_61', 'code_66', 'code_67', 'code_69', 'code_71', 'code_72',
       'code_74', 'code_75', 'code_76', 'code_77', 'code_111', 'code_121',
       'code_122', 'code_123', 'code_124', 'code_131', 'code_141', 'code_142',
       'code_143', 'code_152', 'code_176', 'code_190', 'code_195', 'code_204',
       'code_205', 'code_206', 'code_208', 'code_209', 'code_211', 'code_212',
       'code_213', 'code_214', 'code_216', 'code_217', 'code_218', 'code_220',
       'code_221', 'code_224', 'code_225', 'code_226', 'code_227', 'code_236',
       'code_237', 'code_238', 'code_242']].values
y = gdf['price_per_area'].values
# print(X.shape, y.shape)

# split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# regression tree model -- simple baseline model
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=3)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)

# Predict
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Regressor 1:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_1.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_1.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_1.predict(X_test)))

print('\nRegressor 2:')
print('RMSE (train):', np.sqrt(mean_squared_error(y_train, regr_2.predict(X_train))))
print('RMSE (test):', np.sqrt(mean_squared_error(y_test, regr_2.predict(X_test))))
print('MAE (test):', mean_absolute_error(y_test, regr_2.predict(X_test)))
Regressor 1:
RMSE (train): 13.931265966263624
RMSE (test): 14.93993752104971
MAE (test): 11.58687418331668

Regressor 2:
RMSE (train): 12.541346586003863
RMSE (test): 15.973588850079473
MAE (test): 12.74756328153288

The results of "Regressor 1" are better than those of "Regressor 2".
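With 71 features, a depth-3 tree can only use a handful of them. As a hedged sketch of how one might inspect which features drive the splits, here is `feature_importances_` on a fitted tree; `X_demo` and `y_demo` are synthetic stand-ins for the notebook's data:

```python
# Inspect which features a fitted regression tree actually uses.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = 50 * X_demo[:, 2] + rng.normal(0, 1, size=300)  # feature 2 dominates

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
print(tree.feature_importances_)  # feature 2 should carry nearly all importance
```

Applied to the real model, this would show whether convexness and the land cover counts are actually being selected over the original scalar attributes.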

Summary

1. The best results of the baseline models using gdd, slope, and the other attributes loaded from the shapefile:

RMSE (test): 15.82819473859594
MAE (test): 12.868690747768435

2. To improve the results, we added a new feature, convexness, defined as the parcel's area divided by the area of its convex hull. Convexness ranges from 0 to 1: the closer the value is to 1, the more convex the parcel. The best results of the models with the convexness feature:

RMSE (test): 15.65214715737918
MAE (test): 12.163080177963693


3. Next, we added 65 land cover features to the models. Each land cover feature represents one land cover type, and its value is the count of pixels of that type within the parcel. The best results of the models with the land cover features:

RMSE (test): 14.93993752104971
MAE (test): 11.58687418331668

Conclusions: Adding the convexness feature and the land cover features improves the results of the regression tree model.