Zillow is an online real estate database company founded in 2006. Its business model is to generate revenue by selling advertising on its website. Like its competitors Trulia and Realtor, Zillow offers a home value estimate that home buyers and sellers can find useful. The service is considered a great tool but is not perfect, which is why Zillow launched a competition on Kaggle.
The main objective of this project is to propose an algorithm that predicts the log error between the price estimated by Zillow (the Zestimate) and the actual home sale price. The formula established by Zillow is
\[ \text{logerror} = \log(\text{Zestimate}) - \log(\text{sale price}) \]
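To make the target concrete, here is the same relationship expressed in R; the zestimate and sale_price values below are made-up examples, since the competition data already provides the logerror column:
# Made-up example values; the Kaggle data already contains logerror directly.
zestimate  <- c(305000, 412500)
sale_price <- c(298000, 430000)
logerror   <- log(zestimate) - log(sale_price)
logerror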
For all stages of the model development, I will use R and SAS Enterprise Miner. The initial stages are to get a better understanding of the data as well as to prepare it for the modeling process. These initial stages are data collection, data exploration, and data preparation.
During the completion of this project, we will use R to convey higher-quality data to SAS Enterprise Miner, which can handle more than one machine learning algorithm. SAS will also allow us to use the Model Comparison node to compare the performance of several models, and the setup can be done visually. So, instead of using just one tool, we will take advantage of the best of both worlds.
The datasets provided by Zillow have been made public through the Kaggle website, and even though the competition is closed to new participants, the data is still available. The datasets contain 2016 housing data for three counties in California: Los Angeles County, Ventura County, and Orange County.
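The R code that follows assumes the packages below are installed and loaded; the list is inferred from the functions used throughout this write-up:
# Packages used throughout the analysis (inferred from the function calls below)
library(dplyr)      # rename, mutate, select, group_by, summarize, left_join
library(tidyr)      # gather
library(ggplot2)    # histograms, bar charts, line charts
library(lubridate)  # make_date, year, month
library(stringr)    # str_detect
library(DT)         # datatable
library(corrplot)   # correlation plots
library(imputeTS)   # na.interpolation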
properties <- read.csv('C:/Saul/Portfolio/data/properties_2016.csv')
transactions <- read.csv('C:/Saul/Portfolio/data/train_2016_v2.csv')
sample_submission <- read.csv('C:/Saul/Portfolio/data/sample_submission.csv')
names(properties)
[1] "parcelid" "airconditioningtypeid"
[3] "architecturalstyletypeid" "basementsqft"
[5] "bathroomcnt" "bedroomcnt"
[7] "buildingclasstypeid" "buildingqualitytypeid"
[9] "calculatedbathnbr" "decktypeid"
[11] "finishedfloor1squarefeet" "calculatedfinishedsquarefeet"
[13] "finishedsquarefeet12" "finishedsquarefeet13"
[15] "finishedsquarefeet15" "finishedsquarefeet50"
[17] "finishedsquarefeet6" "fips"
[19] "fireplacecnt" "fullbathcnt"
[21] "garagecarcnt" "garagetotalsqft"
[23] "hashottuborspa" "heatingorsystemtypeid"
[25] "latitude" "longitude"
[27] "lotsizesquarefeet" "poolcnt"
[29] "poolsizesum" "pooltypeid10"
[31] "pooltypeid2" "pooltypeid7"
[33] "propertycountylandusecode" "propertylandusetypeid"
[35] "propertyzoningdesc" "rawcensustractandblock"
[37] "regionidcity" "regionidcounty"
[39] "regionidneighborhood" "regionidzip"
[41] "roomcnt" "storytypeid"
[43] "threequarterbathnbr" "typeconstructiontypeid"
[45] "unitcnt" "yardbuildingsqft17"
[47] "yardbuildingsqft26" "yearbuilt"
[49] "numberofstories" "fireplaceflag"
[51] "structuretaxvaluedollarcnt" "taxvaluedollarcnt"
[53] "assessmentyear" "landtaxvaluedollarcnt"
[55] "taxamount" "taxdelinquencyflag"
[57] "taxdelinquencyyear" "censustractandblock"
names(transactions)
[1] "parcelid" "logerror" "transactiondate"
names(sample_submission)
[1] "ParcelId" "X201610" "X201611" "X201612" "X201710" "X201711"
[7] "X201712"
Let’s review a small sample of the data.
datatable(properties[(1:20),], filter = 'top', options = list(
pageLength = 5, scrollX = TRUE, scrollY = "200px", autoWidth = TRUE), caption = 'Table 1: Properties.')
Reviewing some of the data, the build year feature has an approximately normal distribution, and an important proportion of the houses were built several decades ago. Most of the remaining features, such as the number of bedrooms, number of bathrooms, number and capacity of garages, tax delinquency, and property taxes, are reasonably well distributed too. Furthermore, the distribution of the critical attribute, the log error, which is the variable to predict, is approximately normal as well.
df <- data.frame(properties)
ggplot(df, aes(x = yearbuilt)) +
geom_histogram(bins = 20, fill="darkgreen", alpha=1) +
geom_vline(aes(xintercept=mean(yearbuilt, na.rm=TRUE)), ## solid line for the mean
colour = "#ADFF2F", size=1.5, alpha=0.5) +
geom_vline(aes(xintercept=median(yearbuilt, na.rm=TRUE)), ## dashed line for the median
colour = "#ADFF2F", linetype="dashed", size=1.5, alpha=0.5) +
labs(x = "Years", y = "Frequency") +
ggtitle("Histogram Home Aging")
# Rename the variables to more descriptive names
# functionX(dataA) is the same as dataA %>% functionX
properties <- properties %>% rename(
id_parcel = parcelid,
build_year = yearbuilt,
area_basement = basementsqft,
area_patio = yardbuildingsqft17,
area_shed = yardbuildingsqft26,
area_pool = poolsizesum,
area_lot = lotsizesquarefeet,
area_garage = garagetotalsqft,
area_firstfloor_finished = finishedfloor1squarefeet,
area_total_calc = calculatedfinishedsquarefeet,
area_base = finishedsquarefeet6,
area_live_finished = finishedsquarefeet12,
area_liveperi_finished = finishedsquarefeet13,
area_total_finished = finishedsquarefeet15,
area_unknown = finishedsquarefeet50,
num_unit = unitcnt,
num_story = numberofstories,
num_room = roomcnt,
num_bathroom = bathroomcnt,
num_bedroom = bedroomcnt,
num_bathroom_calc = calculatedbathnbr,
num_bath = fullbathcnt,
num_75_bath = threequarterbathnbr,
num_fireplace = fireplacecnt,
num_pool = poolcnt,
num_garage = garagecarcnt,
region_county = regionidcounty,
region_city = regionidcity,
region_zip = regionidzip,
region_neighbor = regionidneighborhood,
tax_total = taxvaluedollarcnt,
tax_building = structuretaxvaluedollarcnt,
tax_land = landtaxvaluedollarcnt,
tax_property = taxamount,
tax_year = assessmentyear,
tax_delinquency = taxdelinquencyflag,
tax_delinquency_year = taxdelinquencyyear,
zoning_property = propertyzoningdesc,
zoning_landuse = propertylandusetypeid,
zoning_landuse_county = propertycountylandusecode,
flag_fireplace = fireplaceflag,
flag_tub = hashottuborspa,
quality = buildingqualitytypeid,
framing = buildingclasstypeid,
material = typeconstructiontypeid,
deck = decktypeid,
story = storytypeid,
heating = heatingorsystemtypeid,
aircon = airconditioningtypeid,
architectural_style= architecturalstyletypeid
)
transactions <- transactions %>% rename(
id_parcel = parcelid,
date = transactiondate
)
# Convert dummy variables from (Y and N) to (1 and 0)
properties <- properties %>%
mutate(tax_delinquency = ifelse(tax_delinquency=="Y",1,0),
flag_fireplace = ifelse(flag_fireplace=="Y",1,0),
flag_tub = ifelse(flag_tub=="Y",1,0))
# Reorder the columns so related features are grouped together, then take a look at the data
properties <- properties %>% select(id_parcel, build_year, starts_with("area_"),
starts_with("num_"), starts_with("flag_"), starts_with("region_"), everything())
#datatable(head(properties,100), style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
#datatable(head(transactions,100), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))
Let’s take a look at the transaction data.
tmp <- transactions %>% mutate(year_month = make_date(year=year(date),month=month(date)))
tmp %>%
group_by(year_month) %>% count() %>%
ggplot(aes(x=year_month,y=n)) +
geom_bar(stat="identity", fill="darkgreen")+
geom_vline(aes(xintercept=as.numeric(as.Date("2016-10-01"))),size=2)
Let’s review the distribution of the Zestimate’s forecast errors (log error).
Also, let’s check what the distribution looks like when using the absolute log error.
Then, let’s see how the log error changes over time.
transactions %>%
ggplot(aes(x=logerror)) +
geom_histogram(bins=400, fill="darkgreen")+
theme_bw()+theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
ylab("Frequency")+coord_cartesian(x=c(-0.5,0.5)) +
ggtitle("Histogram Log error")
# Absolute logerror
transactions <- transactions %>% mutate(abs_logerror = abs(logerror))
transactions %>%
ggplot(aes(x=abs_logerror)) +
geom_histogram(bins=400, fill="#0da107")+
theme_bw()+theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
ylab("Frequency")+coord_cartesian(x=c(0,0.5))
# How does log error change with time
transactions %>%
mutate(year_month = make_date(year=year(date),month=month(date)) ) %>%
group_by(year_month) %>% summarize(mean_logerror = mean(logerror)) %>%
ggplot(aes(x=year_month,y=mean_logerror)) +
geom_line(size=1.5, color="darkgreen")+geom_point(size=5, color="lightgreen")+theme_bw()
Let’s plot a missingness chart; it shows that several features have a large proportion of missing values.
missing_values <- properties %>% summarize_all(~ sum(is.na(.))/n())
missing_values <- gather(missing_values, key="feature", value="missing_pct")
missing_values %>%
ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
geom_bar(stat="identity",fill="#0da107")+
coord_flip()+theme_bw()+
labs(y = "Missingness", x = "Features") +
ggtitle("Missing data")
Based on the chart above, we will keep only the features whose proportion of missing data is no more than 25%. There are 29 such attributes, as shown in the table below.
good_features <- filter(missing_values, missing_pct<0.25)
good_features
feature | missing_pct |
---|---|
id_parcel | 0.0000000 |
build_year | 0.0200749 |
area_total_calc | 0.0186134 |
area_live_finished | 0.0924666 |
area_lot | 0.0924888 |
num_bathroom | 0.0038396 |
num_bedroom | 0.0038356 |
num_bathroom_calc | 0.0431835 |
num_bath | 0.0431835 |
num_room | 0.0038439 |
flag_tub | 0.0000000 |
flag_fireplace | 0.0000000 |
region_city | 0.0210521 |
region_county | 0.0038312 |
region_zip | 0.0046831 |
fips | 0.0038312 |
latitude | 0.0038312 |
longitude | 0.0038312 |
zoning_landuse_county | 0.0000000 |
zoning_landuse | 0.0038312 |
zoning_property | 0.0000000 |
rawcensustractandblock | 0.0038312 |
tax_building | 0.0184181 |
tax_total | 0.0142536 |
tax_year | 0.0038319 |
tax_land | 0.0226895 |
tax_property | 0.0104683 |
tax_delinquency | 0.0000000 |
censustractandblock | 0.0251660 |
The next step is to review the correlation among the features. Looking at sets of attributes, we can see that some of them are highly correlated, such as the calculated number of bathrooms and the actual number of bathrooms; similarly, the total tax is correlated with the property tax. These correlated features had to be removed.
Let’s see correlation among room features.
col3 <- colorRampPalette(c("blue", "white", "#4bb048"))
vars <- good_features$feature[str_detect(good_features$feature,'num_')]
cor_tmp <- transactions %>% left_join(properties, by="id_parcel")
tmp <- cor_tmp %>% select(one_of(c(vars,"logerror")))
corrplot(cor(tmp, use="complete.obs"),type="lower", col = col3(20))
Let’s see correlation among tax amount features.
vars <- setdiff(good_features$feature[str_detect(good_features$feature,'tax_')],c("tax_delinquency","tax_year"))
tmp <- cor_tmp %>% select(one_of(c(vars,"logerror")))
corrplot(cor(tmp, use="complete.obs"), type="lower", col = col3(20))
After removing some attributes in the data cleaning stage, in this transfer step we made sure to change the format of the selected attributes where needed, to guarantee that they can be conveyed to and processed by SAS Enterprise Miner without issues.
Additionally, to fill in the missing values we will use the interpolation function (na.interpolation) available in the imputeTS R package.
After that, the final features to be used in the modeling are property tax, finished living area, number of garages, number of bedrooms, number of bathrooms, build year, and tax delinquency.
cor_tmp <- na.interpolation(cor_tmp)
data.to.sas <- subset (cor_tmp, select = c(logerror , tax_property , area_live_finished , num_garage , num_bedroom , num_bathroom , build_year , tax_delinquency))
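To hand the prepared table over to SAS Enterprise Miner, one simple option is to export it as a CSV file; this is a minimal sketch, and the output path below is only an example:
# Export the prepared data for import into SAS Enterprise Miner.
# The file path is only an example; adjust it to your environment.
write.csv(data.to.sas, 'C:/Saul/Portfolio/data/data_to_sas.csv', row.names = FALSE)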
For this stage, the first thing that needed to be done was to create a project in SAS Enterprise Miner and then to create a diagram within the project. The next step is to import the dataset file generated with R during the data preparation activities.
Initially, I utilized two models for this analytics project: first, linear regression, which is easy to understand and explain and can be managed to avoid overfitting; second, decision trees, which can handle non-linear relationships and are fairly robust to outliers.
Then, in order to refine the predictive model, let’s select a third model, Gradient Boosting, which builds a series of decision trees that together form a single predictive model. Gradient Boosting is less susceptible to overfitting the data than a single decision tree.
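The actual gradient boosting model is built in SAS Enterprise Miner, but as an illustration, an equivalent fit on the prepared features could be sketched in R with the gbm package; the package choice and the parameter values below are assumptions, not part of the project workflow:
# Illustrative gradient boosting fit in R (gbm is an assumption; the project's
# model is built with SAS Enterprise Miner).
library(gbm)
set.seed(42)
gbm_fit <- gbm(logerror ~ tax_property + area_live_finished + num_garage +
                 num_bedroom + num_bathroom + build_year + tax_delinquency,
               data = data.to.sas,
               distribution = "laplace",   # absolute-error loss, in line with the MAE metric
               n.trees = 500,
               interaction.depth = 3,
               shrinkage = 0.05)
# Predictions on the same data, for illustration only
pred <- predict(gbm_fit, data.to.sas, n.trees = 500)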
For the dataset conveyed from R, we will use the Data Partition node in SAS Enterprise Miner to split it into a training dataset and a validation dataset. Performance will be measured with the Mean Absolute Error (MAE), as stipulated by Zillow’s competition rules.
Since SAS does not automatically calculate the mean absolute error (MAE), the metric requested by Zillow, we calculate it via a SAS Code node. The resulting MAE is 0.067921, a fairly decent value considering that the top Kaggle scores are around 0.064.
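For reference, the MAE is simply the average of the absolute differences between predicted and observed log errors. A minimal R version of the same calculation, using placeholder vectors:
# Mean Absolute Error: average absolute difference between actual and predicted values.
mae <- function(actual, predicted) mean(abs(actual - predicted), na.rm = TRUE)
# Placeholder vectors for illustration only
actual    <- c(0.025, -0.010, 0.080)
predicted <- c(0.018,  0.004, 0.065)
mae(actual, predicted)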
Data preparation is done much more efficiently with a tool other than SAS Enterprise Miner. In this case we used R, and a lot of time was saved. SAS does offer an integration that allows the file to be imported directly.
The model comparison feature provided by SAS is really impressive. Although similar tasks can be performed in R, R lacks the control points that SAS provides, which is a very nice feature.
Modeling graphically provides a wider vision and encourages thinking outside the box. That is an advantage of SAS Enterprise Miner.
Next steps might be to improve the model using other algorithms, for instance neural networks, as well as other techniques such as Principal Component Analysis (PCA).