Introduction

Zillow is an online real estate database company that was founded in 2006. Its business model is to generate revenue by selling advertising on its web site. Like its competitors Trulia and Realtor, Zillow offers a home value estimate that home buyers and sellers can find useful. This service is considered a great tool but it is not perfect, which is why Zillow launched a competition on Kaggle.

Business Goal

The main objective of this project is to propose an algorithm that predicts the log error between the price estimated by Zillow (the Zestimate) and the actual home sale price. The formula established by Zillow is

\[ \text{logerror} = \log(\text{Zestimate}) - \log(\text{SalePrice}) \]
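As a quick illustration with made-up numbers: if Zillow estimates a home at $310,000 and it actually sells for $300,000, the log error is positive because the Zestimate is too high.

# Hypothetical example of the log error calculation
zestimate  <- 310000
sale_price <- 300000
log(zestimate) - log(sale_price)  # about 0.033 (overestimation gives a positive error)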

Predictive Model Cycle

For all stages of model development, I will use R and SAS Enterprise Miner. The initial stages aim to build a better understanding of the data and to prepare it for modeling; they are data collection, data exploration, and data preparation.

Technologies

For this project, we will use R to prepare and convey higher-quality data to SAS Enterprise Miner, which can run more than one machine learning algorithm. SAS also provides a Model Comparison node to compare the performance of several models, and the whole setup can be done visually. So, instead of using just one tool, we will take advantage of the best of both worlds.

Data Collection

The datasets provided by Zillow have been made public through the Kaggle website, and even though the competition is closed to new participants, the data is still available. The datasets contain 2016 housing data for three counties in California: Los Angeles County, Ventura County, and Orange County.

# Packages used throughout this analysis
library(dplyr); library(tidyr); library(stringr); library(lubridate)
library(ggplot2); library(DT); library(corrplot); library(imputeTS)
properties <- read.csv('C:/Saul/Portfolio/data/properties_2016.csv')
transactions <- read.csv('C:/Saul/Portfolio/data/train_2016_v2.csv')
sample_submission <- read.csv('C:/Saul/Portfolio/data/sample_submission.csv')


names(properties)
 [1] "parcelid"                     "airconditioningtypeid"       
 [3] "architecturalstyletypeid"     "basementsqft"                
 [5] "bathroomcnt"                  "bedroomcnt"                  
 [7] "buildingclasstypeid"          "buildingqualitytypeid"       
 [9] "calculatedbathnbr"            "decktypeid"                  
[11] "finishedfloor1squarefeet"     "calculatedfinishedsquarefeet"
[13] "finishedsquarefeet12"         "finishedsquarefeet13"        
[15] "finishedsquarefeet15"         "finishedsquarefeet50"        
[17] "finishedsquarefeet6"          "fips"                        
[19] "fireplacecnt"                 "fullbathcnt"                 
[21] "garagecarcnt"                 "garagetotalsqft"             
[23] "hashottuborspa"               "heatingorsystemtypeid"       
[25] "latitude"                     "longitude"                   
[27] "lotsizesquarefeet"            "poolcnt"                     
[29] "poolsizesum"                  "pooltypeid10"                
[31] "pooltypeid2"                  "pooltypeid7"                 
[33] "propertycountylandusecode"    "propertylandusetypeid"       
[35] "propertyzoningdesc"           "rawcensustractandblock"      
[37] "regionidcity"                 "regionidcounty"              
[39] "regionidneighborhood"         "regionidzip"                 
[41] "roomcnt"                      "storytypeid"                 
[43] "threequarterbathnbr"          "typeconstructiontypeid"      
[45] "unitcnt"                      "yardbuildingsqft17"          
[47] "yardbuildingsqft26"           "yearbuilt"                   
[49] "numberofstories"              "fireplaceflag"               
[51] "structuretaxvaluedollarcnt"   "taxvaluedollarcnt"           
[53] "assessmentyear"               "landtaxvaluedollarcnt"       
[55] "taxamount"                    "taxdelinquencyflag"          
[57] "taxdelinquencyyear"           "censustractandblock"         
names(transactions)
[1] "parcelid"        "logerror"        "transactiondate"
names(sample_submission)
[1] "ParcelId" "X201610"  "X201611"  "X201612"  "X201710"  "X201711" 
[7] "X201712" 

Let’s review a small sample of the data.

datatable(properties[(1:20),], filter = 'top', options = list(
  pageLength = 5, scrollX = TRUE, scrollY = "200px", autoWidth = TRUE), caption = 'Table 1: Properties.')
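
Before exploring further, a quick sanity check (not part of the original output) confirms the size of each table and that every transaction can be matched to a parcel:

# Dataset dimensions and join-key coverage
dim(properties)    # parcels x property attributes
dim(transactions)  # 2016 transactions: parcelid, logerror, transactiondate
all(transactions$parcelid %in% properties$parcelid)  # TRUE if every sale maps to a parcel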

Data Exploration

Histogram graphics

Reviewing some of the data, the build year feature is roughly normally distributed, and a large proportion of the houses were built several decades ago. Most of the remaining features, such as the number of bedrooms, number of bathrooms, number and capacity of garages, tax delinquency, and property taxes, are also roughly normally distributed. The same holds for the critical attribute, the log error, which is the variable to predict.

df <- data.frame(properties)
ggplot(df, aes(x = yearbuilt)) +
  geom_histogram(bins = 20, fill="darkgreen", alpha=1) +
  geom_vline(aes(xintercept=mean(yearbuilt, na.rm=TRUE)),   ## solid line for the mean
             colour = "#ADFF2F", size=1.5, alpha=0.5) + 
  geom_vline(aes(xintercept=median(yearbuilt, na.rm=TRUE)), ## dashed line for the median
             colour = "#ADFF2F", linetype="dashed", size=1.5, alpha=0.5) +
  labs(x = "Years", y = "Frequency") +
  ggtitle("Histogram Home Aging") 

# Rename the variable names
# FunctionX(dataA) is the same as dataA %>% functionX

properties <- properties %>% rename(
  id_parcel = parcelid,
  build_year = yearbuilt,
  area_basement = basementsqft,
  area_patio = yardbuildingsqft17,
  area_shed = yardbuildingsqft26, 
  area_pool = poolsizesum,  
  area_lot = lotsizesquarefeet, 
  area_garage = garagetotalsqft,
  area_firstfloor_finished = finishedfloor1squarefeet,
  area_total_calc = calculatedfinishedsquarefeet,
  area_base = finishedsquarefeet6,
  area_live_finished = finishedsquarefeet12,
  area_liveperi_finished = finishedsquarefeet13,
  area_total_finished = finishedsquarefeet15,  
  area_unknown = finishedsquarefeet50,
  num_unit = unitcnt, 
  num_story = numberofstories,  
  num_room = roomcnt,
  num_bathroom = bathroomcnt,
  num_bedroom = bedroomcnt,
  num_bathroom_calc = calculatedbathnbr,
  num_bath = fullbathcnt,  
  num_75_bath = threequarterbathnbr, 
  num_fireplace = fireplacecnt,
  num_pool = poolcnt,  
  num_garage = garagecarcnt,  
  region_county = regionidcounty,
  region_city = regionidcity,
  region_zip = regionidzip,
  region_neighbor = regionidneighborhood,  
  tax_total = taxvaluedollarcnt,
  tax_building = structuretaxvaluedollarcnt,
  tax_land = landtaxvaluedollarcnt,
  tax_property = taxamount,
  tax_year = assessmentyear,
  tax_delinquency = taxdelinquencyflag,
  tax_delinquency_year = taxdelinquencyyear,
  zoning_property = propertyzoningdesc,
  zoning_landuse = propertylandusetypeid,
  zoning_landuse_county = propertycountylandusecode,
  flag_fireplace = fireplaceflag, 
  flag_tub = hashottuborspa,
  quality = buildingqualitytypeid,
  framing = buildingclasstypeid,
  material = typeconstructiontypeid,
  deck = decktypeid,
  story = storytypeid,
  heating = heatingorsystemtypeid,
  aircon = airconditioningtypeid,
  architectural_style= architecturalstyletypeid
)

transactions <- transactions %>% rename(
  id_parcel = parcelid,
  date = transactiondate
)

# Convert dummy variables (Y and N) to (1 and 0)
properties <- properties %>% 
  mutate(tax_delinquency = ifelse(tax_delinquency=="Y",1,0),
         flag_fireplace = ifelse(flag_fireplace=="Y",1,0),
         flag_tub = ifelse(flag_tub=="Y",1,0))

# Reorder the columns so related features are grouped together
properties <- properties %>% select(id_parcel, build_year, starts_with("area_"), 
                                    starts_with("num_"), starts_with("flag_"), starts_with("region_"), everything())


#datatable(head(properties,100), style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
#datatable(head(transactions,100), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))

Let’s take a look at the transaction data.

tmp <- transactions %>% mutate(year_month = make_date(year=year(date),month=month(date)))
tmp %>% 
  group_by(year_month) %>% count() %>% 
  ggplot(aes(x=year_month,y=n)) +
  geom_bar(stat="identity", fill="darkgreen")+
  geom_vline(aes(xintercept=as.numeric(as.Date("2016-10-01"))),size=2)  # mark October 2016, after which far fewer transactions are recorded

Log data plots

Let’s review the distribution of the Zestimate’s forecast errors (the log error).

Also, let’s check what the distribution looks like when using the absolute log error.

Then, let’s see how the log error changes over time.

Log Error Histogram

transactions %>% 
  ggplot(aes(x=logerror)) + 
  geom_histogram(bins=400, fill="darkgreen")+
  theme_bw()+theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  ylab("Frequency")+coord_cartesian(x=c(-0.5,0.5)) +
  ggtitle("Histogram Log error") 

Absolute Log Error Histogram

# Absolute logerror
transactions <- transactions %>% mutate(abs_logerror = abs(logerror))
transactions %>% 
  ggplot(aes(x=abs_logerror)) + 
  geom_histogram(bins=400, fill="#0da107")+
  theme_bw()+theme(axis.title = element_text(size=16),axis.text = element_text(size=14))+
  ylab("Frequency")+coord_cartesian(x=c(0,0.5))

Log Error over Time

# How does log error change with time
transactions %>% 
  mutate(year_month = make_date(year=year(date),month=month(date)) ) %>% 
  group_by(year_month) %>% summarize(mean_logerror = mean(logerror)) %>% 
  ggplot(aes(x=year_month,y=mean_logerror)) + 
  geom_line(size=1.5, color="darkgreen")+geom_point(size=5, color="lightgreen")+theme_bw()

Data Preparation

Data Cleaning

Let’s plot a missingness chart to see which features have a large share of missing values.

missing_values <- properties %>% summarize_all(funs(sum(is.na(.))/n()))

missing_values <- gather(missing_values, key="feature", value="missing_pct")
missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
  geom_bar(stat="identity",fill="#0da107")+
  coord_flip()+theme_bw()+
  labs(y = "Missingness", x = "Features") +
  ggtitle("Missing data") 

Based on the chart above, we will keep only the features with no more than 25% missing data. There are 29 such attributes, as shown in the table below.

good_features <- filter(missing_values, missing_pct<0.25)
good_features
feature missing_pct
id_parcel 0.0000000
build_year 0.0200749
area_total_calc 0.0186134
area_live_finished 0.0924666
area_lot 0.0924888
num_bathroom 0.0038396
num_bedroom 0.0038356
num_bathroom_calc 0.0431835
num_bath 0.0431835
num_room 0.0038439
flag_tub 0.0000000
flag_fireplace 0.0000000
region_city 0.0210521
region_county 0.0038312
region_zip 0.0046831
fips 0.0038312
latitude 0.0038312
longitude 0.0038312
zoning_landuse_county 0.0000000
zoning_landuse 0.0038312
zoning_property 0.0000000
rawcensustractandblock 0.0038312
tax_building 0.0184181
tax_total 0.0142536
tax_year 0.0038319
tax_land 0.0226895
tax_property 0.0104683
tax_delinquency 0.0000000
censustractandblock 0.0251660

Correlation Analysis

The next step is to review the correlation among the features. Plotting correlation matrices for groups of related attributes shows that some of them are highly correlated, such as the calculated number of bathrooms and the actual number of bathrooms; similarly, the total tax is correlated with the property tax. Highly correlated features will be removed.

Let’s see correlation among room features.

col3 <- colorRampPalette(c("blue", "white", "#4bb048")) 

vars <- good_features$feature[str_detect(good_features$feature,'num_')]
cor_tmp <- transactions %>% left_join(properties, by="id_parcel") 
tmp <- cor_tmp %>% select(one_of(c(vars,"logerror")))
corrplot(cor(tmp, use="complete.obs"),type="lower", col = col3(20))

Let’s see correlation among tax amount features.

vars <- setdiff(good_features$feature[str_detect(good_features$feature,'tax_')],c("tax_delinquency","tax_year"))
tmp <- cor_tmp %>%  select(one_of(c(vars,"logerror")))
corrplot(cor(tmp, use="complete.obs"), type="lower", col = col3(20))

Data Imputation

After removing some attributes in the data cleaning stage, in this step we convert the formats of the selected attributes where needed, so that they can be conveyed to and processed by SAS Enterprise Miner without issues.

Additionally, to fill in the remaining missing values, we will use the interpolation routine provided by the imputeTS package in R.

After that, the final features used for modeling are property tax, finished living area, number of garages, number of bedrooms, number of bathrooms, build year, and tax delinquency.

cor_tmp <- na.interpolation(cor_tmp)

data.to.sas <- subset(cor_tmp, select = c(logerror, tax_property, area_live_finished, num_garage, num_bedroom, num_bathroom, build_year, tax_delinquency))
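
As a minimal sketch of the hand-off (the exact path and file name are hypothetical), the prepared table can be written to a CSV file and then imported into SAS Enterprise Miner, for example through its File Import node:

# Export the prepared dataset for SAS Enterprise Miner (hypothetical output path)
write.csv(data.to.sas, 'C:/Saul/Portfolio/data/data_to_sas.csv', row.names = FALSE)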



Modeling

Diagramming

For this stage, the first step is to create a project in SAS Enterprise Miner and then a diagram within the project. The next step is to import the dataset generated in R after the data preparation activities.

Initially, I used two models for this analytics project: linear regression, which is easy to understand and explain and can be managed to avoid overfitting, and decision trees, which can handle non-linear relationships and are fairly robust to outliers.
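
As a reference point outside the SAS workflow, the same two model families could be sketched in R on the prepared table (this is an illustration, not the SAS setup):

# Illustration only: the two initial model families fitted in R on the prepared data
library(rpart)
lm_fit   <- lm(logerror ~ ., data = data.to.sas)                       # linear regression
tree_fit <- rpart(logerror ~ ., data = data.to.sas, method = "anova")  # regression tree
summary(lm_fit)$r.squared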

Then, to refine the predictive model, let’s add a third model, gradient boosting, which builds a series of decision trees that together form a single predictive model. Gradient boosting is less susceptible to overfitting than a single decision tree.
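
A rough R equivalent of this third model, using the gbm package on the same features selected for SAS (again an illustration, not the SAS implementation; the laplace loss targets absolute error):

# Sketch of gradient boosting in R on the prepared data (not the SAS model)
library(gbm)
set.seed(123)
fit <- gbm(logerror ~ tax_property + area_live_finished + num_garage +
             num_bedroom + num_bathroom + build_year + tax_delinquency,
           data = data.to.sas, distribution = "laplace",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
pred <- predict(fit, data.to.sas, n.trees = 500)
mean(abs(data.to.sas$logerror - pred))  # in-sample MAE (optimistic)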

Validation

Using the dataset conveyed from R, we will use the Data Partition node in SAS Enterprise Miner to split it into a training dataset and a validation dataset. The performance measure will be the Mean Absolute Error (MAE), as stipulated by the rules of Zillow’s competition.
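
Although the scoring happens inside SAS, the MAE itself is just the average absolute difference between predicted and observed log errors. A minimal R sketch with placeholder vectors (hypothetical values, not model output):

# Mean Absolute Error: average absolute difference between predictions and actuals
mae <- function(actual, predicted) mean(abs(actual - predicted))
mae(c(0.02, -0.05, 0.10), c(0.01, -0.02, 0.08))  # returns 0.02 for these made-up values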

Results

Since SAS does not automatically calculate the mean absolute error (MAE), the metric requested by Zillow, we calculate it in a SAS Code node. The resulting MAE is 0.067921, a decent value considering that the top scores in the Kaggle competition are around 0.064.

Conclusions

  • Data preparation is done much more efficiently with a tool other than SAS Enterprise Miner. In this case we used R, which saved a lot of time, and SAS offers an integration that allows importing the resulting file directly.

  • The model comparison feature provided by SAS is really impressive. Although similar tasks can be performed in R, it does not provide the control points that SAS offers, which is a very nice feature.

  • Modeling graphically provides a broader view and encourages thinking outside the box. That is an advantage of SAS Enterprise Miner.

  • Next steps might be to improve the model using other algorithms, for instance neural networks, as well as other techniques such as Principal Component Analysis (PCA).