1 Introduction and learning objectives

2 Load data

There are two data sets: i) training data, which has the actual sale prices, and ii) out-of-sample data, which has the asking prices. Load both data sets.

First, we have to understand what information each column contains. Not all of the information provided will necessarily be useful in predicting house prices, but I will not make any assumptions before deciding which variables to use in my prediction algorithms.

#read in the data

london_house_prices_2019_training<-read.csv("training_data_assignment_with_prices.csv")
london_house_prices_2019_out_of_sample<-read.csv("test_data_assignment.csv")



#fix data types in both data sets

#fix dates
#NOTE: as.Date() without a format string has to guess the format; the glimpse/skim output
#below (years like 0011, thousands of NA dates) suggests it guessed wrong here -
#see the sketch after this code chunk
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate(date=as.Date(date))
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% mutate(date=as.Date(date))
#change characters to factors
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate_if(is.character,as.factor)
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% mutate_if(is.character,as.factor)


#make sure the out-of-sample data and the training data have the same factor levels
a<-union(levels(london_house_prices_2019_training$postcode_short),levels(london_house_prices_2019_out_of_sample$postcode_short))
london_house_prices_2019_out_of_sample$postcode_short <- factor(london_house_prices_2019_out_of_sample$postcode_short, levels = a)
london_house_prices_2019_training$postcode_short <- factor(london_house_prices_2019_training$postcode_short, levels = a)

# #take a quick look at what's in the data
# str(london_house_prices_2019_training)
# str(london_house_prices_2019_out_of_sample)
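The glimpse/skim output in the next section shows years like 0011 and thousands of NA dates, which suggests as.Date() guessed the wrong format. A minimal sketch of a more explicit parse, assuming the raw strings are day-month-year such as "11-01-19" (it would replace the two as.Date(date) calls above):

#sketch: parse dates with an explicit format instead of letting as.Date() guess
#(assumes the raw column holds "%d-%m-%y" strings, e.g. "11-01-19" for 11 January 2019)
london_house_prices_2019_training <- london_house_prices_2019_training %>%
  mutate(date = as.Date(as.character(date), format = "%d-%m-%y"))
london_house_prices_2019_out_of_sample <- london_house_prices_2019_out_of_sample %>%
  mutate(date = as.Date(as.character(date), format = "%d-%m-%y"))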

3 Visualize and examine data

3.1 Glimpse and skim

#Let's have a look at the dataset
glimpse(london_house_prices_2019_training)
## Rows: 13,998
## Columns: 37
## $ ID                           <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ date                         <date> 0011-01-19, 0008-08-19, NA, 0007-05-19,…
## $ postcode                     <fct> SW8 1UY, TW10 5DF, TW14 9QY, E12 5AX, TW…
## $ property_type                <fct> F, F, S, F, S, F, D, T, T, F, F, T, T, F…
## $ whether_old_or_new           <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N…
## $ freehold_or_leasehold        <fct> L, L, F, L, F, L, F, F, F, L, L, F, F, L…
## $ address1                     <fct> STAMFORD BUILDINGS, 73, 199, 72A, 40, 19…
## $ address2                     <fct> FLAT 8, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ address3                     <fct> SOUTH LAMBETH ROAD, SHEEN COURT, HATTON …
## $ town                         <fct> NA, NA, NA, MANOR PARK, NA, NA, NA, NA, …
## $ local_aut                    <fct> LONDON, RICHMOND, FELTHAM, LONDON, FELTH…
## $ county                       <fct> LAMBETH, RICHMOND UPON THAMES, HOUNSLOW,…
## $ postcode_short               <fct> SW8, TW10, TW14, E12, TW14, TW10, BR4, S…
## $ current_energy_rating        <fct> E, D, D, E, D, C, E, D, E, C, E, D, D, C…
## $ total_floor_area             <dbl> 30.00, 50.00, 100.00, 39.00, 88.00, 101.…
## $ number_habitable_rooms       <int> 2, 2, 5, 2, 4, 4, 6, 6, 6, 3, 3, 7, 4, 3…
## $ co2_emissions_current        <dbl> 2.3, 3.0, 3.7, 2.8, 3.9, 3.1, 8.1, 5.6, …
## $ co2_emissions_potential      <dbl> 1.7, 1.7, 1.5, 1.1, 1.4, 1.4, 4.1, 2.0, …
## $ energy_consumption_current   <int> 463, 313, 212, 374, 251, 175, 339, 216, …
## $ energy_consumption_potential <int> 344, 175, 82, 144, 90, 77, 168, 75, 186,…
## $ windows_energy_eff           <fct> Average, Average, Average, Very Poor, Av…
## $ tenure                       <fct> owner-occupied, rental (private), owner-…
## $ latitude                     <dbl> 51.47925, 51.46515, 51.45945, 51.55140, …
## $ longitude                    <dbl> -0.122944, -0.282765, -0.431547, 0.04225…
## $ population                   <int> 34, 75, 83, 211, 73, 51, 25, 91, 60, 97,…
## $ altitude                     <int> 8, 9, 25, 11, 21, 11, 95, 7, 7, 106, 13,…
## $ london_zone                  <int> 1, 3, 5, 3, 6, 6, 3, 2, 2, 3, 6, 4, 6, 2…
## $ nearest_station              <fct> stockwell, north sheen, hatton cross, ma…
## $ water_company                <fct> Thames Water, Thames Water, Affinity Wat…
## $ average_income               <int> 57200, 61900, 50600, 45400, 49000, 56200…
## $ district                     <fct> Lambeth, Richmond upon Thames, Hounslow,…
## $ price                        <dbl> 360000, 408500, 499950, 259999, 395000, …
## $ type_of_closest_station      <fct> tube, rail, tube, light_rail, tube, rail…
## $ num_tube_lines               <int> 1, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 2…
## $ num_rail_lines               <int> 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0…
## $ num_light_rail_lines         <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ distance_to_station          <dbl> 0.5275686, 0.7699146, 0.8531822, 0.29019…
#I skim through the data. Some variables have missing values: address2 and town (very incomplete), population (only 69 missing values), and date (8777 missing, consistent with the parsing issue noted above)
skim(london_house_prices_2019_training) 
Data summary
Name london_house_prices_2019_…
Number of rows 13998
Number of columns 37
_______________________
Column type frequency:
Date 1
factor 18
numeric 18
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 8777 0.37 0001-02-19 0012-12-19 0007-12-19 106

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
postcode 0 1.00 FALSE 12635 E11: 5, BR1: 4, CR0: 4, E17: 4
property_type 0 1.00 FALSE 4 F: 5402, T: 4999, S: 2780, D: 817
whether_old_or_new 0 1.00 FALSE 2 N: 13990, Y: 8
freehold_or_leasehold 0 1.00 FALSE 2 F: 8457, L: 5541
address1 0 1.00 FALSE 2825 3: 220, 7: 212, 4: 204, 12: 197
address2 10812 0.23 FALSE 434 FLA: 218, FLA: 214, FLA: 213, FLA: 160
address3 0 1.00 FALSE 8543 LON: 25, GRE: 24, THE: 23, HIG: 21
town 13384 0.04 FALSE 133 WAL: 35, CHE: 33, STR: 27, CHI: 24
local_aut 0 1.00 FALSE 69 LON: 7512, ROM: 396, BRO: 277, CRO: 243
county 0 1.00 FALSE 33 BRO: 861, CRO: 729, WAN: 702, HAV: 672
postcode_short 0 1.00 FALSE 247 CR0: 243, SW1: 204, E17: 193, SW1: 187
current_energy_rating 0 1.00 FALSE 6 D: 7075, C: 3488, E: 2646, B: 361
windows_energy_eff 0 1.00 FALSE 5 Ave: 7819, Goo: 3244, Ver: 1698, Poo: 1231
tenure 0 1.00 FALSE 3 own: 11236, ren: 2503, ren: 259
nearest_station 0 1.00 FALSE 592 rom: 201, cha: 103, bex: 99, har: 96
water_company 0 1.00 FALSE 5 Tha: 10471, Aff: 1577, Ess: 1154, SES: 792
district 0 1.00 FALSE 33 Cro: 925, Bro: 843, Hav: 681, Bex: 603
type_of_closest_station 0 1.00 FALSE 3 rai: 6512, tub: 4711, lig: 2775

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 8003.94 4622.57 2.00 4004.50 8015.50 12018.75 1.5996e+04 ▇▇▇▇▇
total_floor_area 0 1 92.55 45.08 21.00 64.00 83.00 108.00 4.8000e+02 ▇▂▁▁▁
number_habitable_rooms 0 1 4.31 1.66 1.00 3.00 4.00 5.00 1.4000e+01 ▆▇▁▁▁
co2_emissions_current 0 1 4.25 2.39 0.10 2.70 3.80 5.20 4.4000e+01 ▇▁▁▁▁
co2_emissions_potential 0 1 2.22 1.44 -0.50 1.30 1.80 2.70 2.0500e+01 ▇▁▁▁▁
energy_consumption_current 0 1 262.98 93.42 12.00 202.00 249.00 308.00 1.2960e+03 ▇▅▁▁▁
energy_consumption_potential 0 1 141.38 78.06 -49.00 89.00 122.00 167.00 1.0230e+03 ▇▂▁▁▁
latitude 0 1 51.49 0.08 51.30 51.43 51.49 51.56 5.1680e+01 ▂▇▇▇▂
longitude 0 1 -0.11 0.16 -0.49 -0.21 -0.11 0.00 2.9000e-01 ▂▅▇▅▂
population 69 1 83.74 43.71 1.00 52.00 79.00 109.00 5.1000e+02 ▇▃▁▁▁
altitude 0 1 36.57 26.01 0.00 16.00 32.00 51.00 2.3900e+02 ▇▃▁▁▁
london_zone 0 1 3.76 1.43 1.00 3.00 4.00 5.00 7.0000e+00 ▇▇▇▆▅
average_income 0 1 55334.70 8454.23 36000.00 49400.00 54600.00 60600.00 8.5200e+04 ▃▇▆▂▁
price 0 1 593790.93 519158.10 77000.00 351000.00 460000.00 650000.00 1.0800e+07 ▇▁▁▁▁
num_tube_lines 0 1 0.44 0.74 0.00 0.00 0.00 1.00 6.0000e+00 ▇▁▁▁▁
num_rail_lines 0 1 0.58 0.51 0.00 0.00 1.00 1.00 2.0000e+00 ▆▁▇▁▁
num_light_rail_lines 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.0000e+00 ▇▁▁▁▂
distance_to_station 0 1 0.65 0.40 0.00 0.37 0.57 0.84 5.6100e+00 ▇▁▁▁▁
#I look at the data in more detail with the describe function
#describe(london_house_prices_2019_training)

3.2 Distribution of price

#We study the distribution of the most important variable: price. The distribution is right-skewed.
london_house_prices_2019_training %>%
  ggplot(aes(x= price)) + geom_histogram() +
  labs(title = "The distribution of price is right-skewed",subtitle ="", x= "Price", y = "Count") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Using log prices
london_house_prices_2019_training %>%
  ggplot(aes(x= log(price))) + geom_histogram() +
  labs(title = "Log price is normally distributed",subtitle ="", x= "Log price", y = "Count") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of price in our dataset is right-skewed. A log transformation makes the data look closer to a normal distribution. However, this is not necessarily helpful for linear regression: what matters is the normality of the residuals, not of the outcome itself.
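As a quick check of that point, we can fit a simple model and inspect its residuals directly; a minimal sketch, where the specification is purely illustrative:

#sketch: check normality of the residuals rather than of price itself
#(price ~ total_floor_area + district is an illustrative specification, not the final model)
check_fit <- lm(price ~ total_floor_area + district, data = london_house_prices_2019_training)
qqnorm(residuals(check_fit))
qqline(residuals(check_fit))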

3.3 Price and date

#I want to see the prices by date. Is there some kind of seasonality or a pattern? I see some regular surges; maybe more houses are sold on weekends. (Note that the 8777 rows with missing dates are dropped from this plot.)
london_house_prices_2019_training %>%
  ggplot(aes(x=date,y=price)) + geom_col() + labs(title = "Regular surges appear over time",subtitle ="", x= "Date", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())
## Warning: Removed 8777 rows containing missing values (position_stack).

3.4 Price and number of habitable rooms

#Let's see the distribution of price according to the number of rooms. The median price does not seem to increase after 10 rooms.
london_house_prices_2019_training %>%
  ggplot(aes(x=as.factor(number_habitable_rooms),price)) + geom_boxplot() + labs(title = "Median price does not increase after 10 rooms",subtitle ="", x= "Number of habitable rooms", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

3.5 Number of habitable rooms and total floor area

#Let's see the number of rooms vs the total floor area. I would obviously expect them to be positively correlated, but I want to make sure that is the case. I prefer not to make too many assumptions about the data.

london_house_prices_2019_training %>%
ggplot(aes(x=as.factor(number_habitable_rooms),total_floor_area)) + geom_boxplot() + labs(title = "Total floor area increases along with number of habitable rooms",subtitle ="", x= "Number of habitable rooms", y = "Total floor area") + theme_minimal()  + theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

3.6 Size and price of houses

#How big are the biggest houses in the dataset?
london_house_prices_2019_training %>% 
  arrange(desc(total_floor_area)) %>%
  head(10) %>%
  select(district,total_floor_area,price)
##                  district total_floor_area    price
## 1                  Ealing              480  3500000
## 2  Kensington and Chelsea              470 10800000
## 3               Redbridge              470  2100000
## 4  Kensington and Chelsea              460  9000000
## 5                  Camden              453  6180000
## 6                 Enfield              436  1750000
## 7                  Barnet              436  3500000
## 8               Greenwich              432  1975000
## 9                 Lambeth              419  4861875
## 10   Richmond upon Thames              416  2100000
#How small are the smallest houses in the dataset?
london_house_prices_2019_training %>% 
  arrange(total_floor_area) %>%
  head(10) %>%
  select(district,total_floor_area,price)
##                  district total_floor_area  price
## 1  Kensington and Chelsea            21.00 301000
## 2                 Enfield            21.00 171500
## 3             Westminster            21.00 270000
## 4  Kensington and Chelsea            21.00 265000
## 5  Kensington and Chelsea            21.10 385000
## 6  Kensington and Chelsea            21.41 275000
## 7    Richmond upon Thames            22.00 190000
## 8               Greenwich            22.00 315000
## 9  Kensington and Chelsea            22.76 850000
## 10 Hammersmith and Fulham            23.00 275000
#What are the most expensive houses in the dataset?
london_house_prices_2019_training %>% 
  arrange(desc(price)) %>%
  head(10) %>%
  select(district,total_floor_area,price)
##                  district total_floor_area    price
## 1  Kensington and Chelsea           470.00 10800000
## 2  Kensington and Chelsea           357.00 10500000
## 3  Kensington and Chelsea           460.00  9000000
## 4  Kensington and Chelsea           293.00  8750000
## 5  Kensington and Chelsea           365.00  8225000
## 6  Kensington and Chelsea           318.20  8175000
## 7             Westminster           257.20  8150000
## 8  Kensington and Chelsea           309.00  8050000
## 9             Westminster           412.00  8000000
## 10            Westminster           168.89  7250000
#What are the least expensive houses in the dataset?
london_house_prices_2019_training %>% 
  arrange(price) %>%
  head(10)  %>%
  select(district,total_floor_area,price)
##      district total_floor_area price
## 1    Havering               41 77000
## 2     Bromley               31 80000
## 3      Ealing               55 85000
## 4  Hillingdon               68 87500
## 5      Harrow               61 87500
## 6      Bexley               77 88000
## 7   Redbridge               44 89000
## 8       Brent               55 90000
## 9  Hillingdon               65 90000
## 10      Brent               36 91000

3.7 Price and distance to nearest station

#Is distance to station strongly correlated with price? Pearson correlation coefficient = -0.129. I would have expected a stronger negative correlation.
london_house_prices_2019_training %>%
ggplot(aes(x=distance_to_station,y=price)) + geom_smooth(method="lm") + labs(title = "Distance to station seems very negatively correlated with price..",subtitle ="But the correlation coefficient has a low magnitude!", x= "Distance to station", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank()) + geom_text(
    data = data.frame(x = 2.3, y = 0, label = "Pearson correlation coefficient = -0.129"),
    aes(x = x, y = y, label = label),
    colour="black",
    family="Oswald",
    size= 4.5,
    hjust = 0.5,
    lineheight = .8,
    inherit.aes = FALSE)
## `geom_smooth()` using formula 'y ~ x'

#Correlation coefficient between distance to station and price.
cor(london_house_prices_2019_training$price,london_house_prices_2019_training$distance_to_station)
## [1] -0.1291175

3.8 Price and tenure

#Let's see if price depends on tenure. Rental and social properties seem to have lower prices than owner-occupied properties.
london_house_prices_2019_training %>%
  ggplot(aes(x=as.factor(tenure),price)) + geom_boxplot() + labs(title = "Owner-occupied properties are more expensive than the others",subtitle ="", x= "Tenure", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

3.9 Price and property types

#Do different property types have different prices?
london_house_prices_2019_training %>%
  ggplot(aes(x=property_type,price)) + geom_boxplot()  + labs(title = "Detached properties are more expensive than the others",subtitle ="", x= "Property type", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

3.10 Price and property age

#Do new houses really have a higher price?
london_house_prices_2019_training %>%
  ggplot(aes(x=whether_old_or_new,price)) + geom_boxplot() + labs(title = "Are new houses more expensive than old ones?",subtitle ="", x= "Whether old or New", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

#The boxplot does not really answer this question, so let's look at the averages
london_house_prices_2019_training %>% group_by(whether_old_or_new) %>%
  summarise(mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   whether_old_or_new `mean(price)`
##   <fct>                      <dbl>
## 1 N                        593884.
## 2 Y                        431875
#New houses have a lower average price than old ones. However, this alone does not mean much: the old houses in the dataset could simply be bigger. Grouping by number of habitable rooms as well gives a fairer comparison: for the same number of rooms, new houses are on average more expensive.
london_house_prices_2019_training %>% group_by(number_habitable_rooms,whether_old_or_new) %>%
  summarise(mean(price))
## `summarise()` regrouping output by 'number_habitable_rooms' (override with `.groups` argument)
## # A tibble: 16 x 3
## # Groups:   number_habitable_rooms [14]
##    number_habitable_rooms whether_old_or_new `mean(price)`
##                     <int> <fct>                      <dbl>
##  1                      1 N                        282129.
##  2                      2 N                        328988.
##  3                      2 Y                        400000 
##  4                      3 N                        442166.
##  5                      3 Y                        451000 
##  6                      4 N                        523025.
##  7                      5 N                        593832.
##  8                      6 N                        786567.
##  9                      7 N                       1010177.
## 10                      8 N                       1272085.
## 11                      9 N                       1771558.
## 12                     10 N                       2264689.
## 13                     11 N                       2218333.
## 14                     12 N                       2829125 
## 15                     13 N                       1687500 
## 16                     14 N                       1560000
#We see that there are only 8 new houses in the dataset! I can't really draw conclusions from that.
london_house_prices_2019_training %>% group_by(whether_old_or_new) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   whether_old_or_new [2]
##   whether_old_or_new     n
##   <fct>              <int>
## 1 N                  13990
## 2 Y                      8

3.11 Price and districts

#Are all districts equal in terms of price?
london_house_prices_2019_training %>%
  ggplot(aes(x=district,price)) + geom_boxplot() + labs(title = "Districts have radically different median prices",subtitle ="", x= "District", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())

3.12 Price and total floor area

#I expect total floor area to be strongly correlated with price. Let's check if this is the case. The two variables are strongly positively correlated (0.69).
london_house_prices_2019_training %>%
ggplot(aes(x=total_floor_area,y=price)) + geom_smooth() + labs(title = "Price increases along with total floor area ",subtitle ="", x= "Total floor area", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank()) + geom_text(
    data = data.frame(x = 300, y = 1000000, label = "Pearson correlation coefficient = 0.69"),
    aes(x = x, y = y, label = label),
    colour="black",
    family="Oswald",
    size= 4.5,
    hjust = 0.5,
    lineheight = .8,
    inherit.aes = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

cor(london_house_prices_2019_training$total_floor_area,london_house_prices_2019_training$price)
## [1] 0.6937882

3.13 Freehold and leasehold properties

#Let's see how freehold and leasehold properties are distributed. Let's also check if they have any impact on price. I expect freehold properties to be more expensive on average than leasehold properties.
london_house_prices_2019_training %>% group_by(freehold_or_leasehold) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   freehold_or_leasehold [2]
##   freehold_or_leasehold     n
##   <fct>                 <int>
## 1 F                      8457
## 2 L                      5541
london_house_prices_2019_training %>% group_by(freehold_or_leasehold) %>%
  summarise(mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   freehold_or_leasehold `mean(price)`
##   <fct>                         <dbl>
## 1 F                           675909.
## 2 L                           468457.

3.14 Interaction variables

#It makes sense to think that total floor area has a different effect on price depending on the postcode, or at least the district. I will verify this and, if confirmed, use interaction variables in my linear regression. I check this with two districts so the visualization stays readable.
#use %in%, not ==, so that all rows of both districts are kept
london_house_prices_2019_training %>% filter(district %in% c("Barking and Dagenham", "Kensington and Chelsea")) %>% ggplot(aes(x=total_floor_area,y=price,colour=district)) + geom_smooth(method="lm")  + labs(title = "Price per square meter differs among districts",subtitle ="", x= "Total floor area", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())
## `geom_smooth()` using formula 'y ~ x'

#I repeat the process with postcodes
london_house_prices_2019_training %>% filter(postcode_short %in% c("SW8", "NW1")) %>% ggplot(aes(x=total_floor_area,y=price,colour=postcode_short)) + geom_smooth(method="lm")  + labs(title = "Price per square meter differs among postcodes",subtitle ="", x= "Total floor area", y = "Price") +
  theme_minimal()  +
  theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
      legend.title = element_text(family = "Oswald"),
      legend.text = element_text(family = "Oswald",size=10),
      plot.subtitle=element_text(size=17, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.text.x = element_text(size= 11, family='Oswald',color="black"),
      axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
      axis.title.x = element_text(size= 11, family='Oswald',color="black"),
      axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
       plot.caption.position = "plot",
      panel.grid.minor=element_blank())
## `geom_smooth()` using formula 'y ~ x'

It is intuitive to think that the price per square meter is not the same in all districts or postcodes, and these visualizations support that intuition. The slope of the fitted line changes with the district/postcode, indicating that interaction variables may be useful in our algorithms.
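In a regression, this intuition translates into an interaction term, which lets the floor-area slope vary by district; a minimal sketch:

#sketch: total_floor_area * district expands to the main effects plus a
#district-specific slope adjustment (total_floor_area:districtX) for each district
interaction_fit <- lm(price ~ total_floor_area * district, data = london_house_prices_2019_training)
#summary(interaction_fit) #the interaction coefficients are the per-district slope differences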

3.15 Correlation matrix

We study the correlations between variables in more detail.

# produce a correlation table using GGally::ggcor()
# this takes a while to plot

library("GGally")
london_house_prices_2019_training %>% 
  select(-ID) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), layout.exp = 2,label_round=2, label = TRUE,label_size = 2,hjust = 1,nbreaks = 5,size = 2,angle = -20)

Multicollinearity will not be a major problem for our linear regression, since we use it as a predictive model. It would be a problem if our linear regression were explanatory, since collinearity inflates the variance of the coefficient estimates.
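If we did want to quantify the collinearity (as we would in an explanatory setting), variance inflation factors are the standard diagnostic. A sketch with an illustrative set of predictors, assuming the car package is installed:

#sketch: VIF measures how much a coefficient's variance is inflated by collinearity;
#values above 5-10 are commonly considered problematic
library(car)
vif(lm(price ~ total_floor_area + number_habitable_rooms + co2_emissions_current +
         average_income + distance_to_station, data = london_house_prices_2019_training))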

We will focus on the variables that are very correlated with price in our algorithms.

#let's do the initial split
library(rsample)
train_test_split <- initial_split(london_house_prices_2019_training, prop = 0.75) #training set contains 75% of the data
# Create the training dataset
train_data <- training(train_test_split)
test_data <- testing(train_test_split)

4 Algorithms

4.1 Linear regression model

To help you get started, I build a linear regression model below, using a somewhat arbitrary subset of the features. You can (and should) add more variables and/or use variable selection methods if you want.

#number of folds in cross validation
CVfolds <- 5

#Define folds
set.seed(1)
  #create five folds with no repeats
indexPreds <- createMultiFolds(train_data$price, CVfolds,times = 1) 
#Define traincontrol using folds
ctrl <- trainControl(method = "cv",  number = CVfolds, returnResamp = "final", savePredictions = "final", index = indexPreds,sampling = NULL)

4.1.1 Base model

base_model<-train(
    price ~ distance_to_station +water_company+property_type+whether_old_or_new+freehold_or_leasehold+latitude+ longitude+postcode_short,
    train_data,
   method = "lm",
    trControl = ctrl
   )

# summary of the results
summary(base_model)

#Use the model to make predictions out of sample in the testing set
pred_base <-predict(base_model,test_data)

test_results_base <-data.frame(RMSE = RMSE(pred_base, test_data$price), 
                            Rsquared = R2(pred_base, test_data$price))

test_results_base

4.1.2 Best model

#we are going to train the model and report the results using k-fold cross validation
set.seed(1)
model1_lm<-train(
    price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income ,
    train_data,
   method = "lm",
    trControl = ctrl
   )

# summary of the results
summary(model1_lm)

#Use the model to make predictions out of sample in the testing set
pred<-predict(model1_lm,test_data)

test_results<-data.frame(  RMSE = RMSE(pred, test_data$price), 
                            Rsquared = R2(pred, test_data$price))

test_results

# we can check variable importance as well
importance <- varImp(model1_lm, scale=TRUE)
plot(importance)

4.1.3 Best model with log transformation

set.seed(1)
#Let's try our model with log price.

model1_lm_log<-train(
    log(price) ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income ,
    train_data,
   method = "lm",
    trControl = ctrl
   )

# summary of the results
summary(model1_lm_log)

test_data_log <- test_data %>%
  mutate(price=log(price))

#Use the model to make predictions out of sample in the testing set
pred_log <-predict(model1_lm_log,test_data)

test_results_log <-data.frame(  RMSE = RMSE(pred_log, test_data_log$price), 
                            Rsquared = R2(pred_log, test_data_log$price))

test_results_log

Note that this RMSE is measured in log units, so it cannot be compared directly with the RMSE of the untransformed model; a fair comparison requires back-transforming the log predictions to the price scale first. On that basis the log model did not improve out-of-sample performance, so we keep the model without the log transformation.
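A sketch of that comparison, back-transforming the log predictions with exp():

#sketch: put the log model's predictions back on the price scale before comparing RMSEs
#(a plain exp() back-transform targets the median rather than the mean price, which is
#good enough for a rough comparison)
pred_log_backtransformed <- exp(predict(model1_lm_log, test_data))
RMSE(pred_log_backtransformed, test_data$price)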

4.2 LASSO regression

#Using LASSO

set.seed(1)
#we will look for the optimal lambda in this sequence
lambda_seq <- seq(0, 1000, length = 1000)

#LASSO regression, using 5-fold cross-validation to select the best lambda among those specified in "lambda_seq".

lasso <- train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,
data = train_data,
method = "glmnet",
preProc = c("center", "scale"), 
#This option standardizes the data before running the LASSO regression
trControl = ctrl,
 tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 specifies to run a LASSO regression. If alpha=0 the model would run ridge regression.
  )
plot(lasso)

coef(lasso$finalModel, lasso$bestTune$lambda)
lasso$bestTune$lambda

# Count of how many coefficients are greater than zero and how many are equal to zero
sum(coef(lasso$finalModel, lasso$bestTune$lambda)!=0)
sum(coef(lasso$finalModel, lasso$bestTune$lambda)==0)

predictions_lasso <- predict(lasso,test_data)

# Model prediction performance
LASSO_results<-data.frame(  RMSE = RMSE(predictions_lasso, test_data$price), 
                            Rsquare = R2(predictions_lasso, test_data$price)
)
LASSO_results

#Lasso gives very similar results to OLS.
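#To see how similar, we can compare the two models' test-set predictions directly
#(a sketch, reusing 'pred' (OLS) and 'predictions_lasso' from above):
cor(pred, predictions_lasso) #close to 1 if the two models largely agree
head(data.frame(ols = pred, lasso = predictions_lasso))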



#Stepwise regression

#BackFit <- train(price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,train_data,
   # method = "leapBackward", #can chance method to "leapSeq", "leapForward"
   # tuneGrid = data.frame(nvmax = 10:30), #Will find the best model with 10:16 variables. 
   # trControl = control
#)

##show the results of all models
#BackFit$results

4.3 Regression trees

Next I fit a tree model using the same subset of features.

model2_tree <- train(
  price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income, train_data,
  method = "rpart",
  trControl = ctrl,
  tuneLength=10
    )

plot(model2_tree)

#You can view how the tree performs
model2_tree$results

#You can view the final tree
#rpart.plot(model2_tree$finalModel)

#you can also visualize the variable importance
#importance <- varImp(model2_tree, scale=TRUE)
#plot(importance)


#Tuning the complexity parameter

#I choose cp values that seem to result in low error based on the plot above
Grid <- expand.grid(cp = seq( 0.0000, 0.0020,0.0001))

dtree_fit <- train(price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income ,data = train_data,
                   method = "rpart",
                   metric="RMSE",
                   trControl=ctrl,
                   tuneGrid=Grid) 
# Plot the best tree model found
#rpart.plot(dtree_fit$finalModel)

# Print the search results of 'train' function
 plot(dtree_fit) 
 
dtree_fit$results


#Use the model to make predictions out of sample in the testing set
predict_trees <- predict(dtree_fit,test_data)


test_results_trees<-data.frame(  RMSE = RMSE(predict_trees, test_data$price), 
                            Rsquared = R2(predict_trees, test_data$price))

test_results_trees

# we can check variable importance as well
importance_tree <- varImp(dtree_fit, scale=TRUE)
importance_tree
plot(importance_tree)

#Most postcodes are not important so I removed this variable.
 
# #I set the number of samples in the bagging to iterations. You should experiment with different numbers
# iterations<-100
# 
# #Next I use the foreach loop and I will combine the estimates in each sample in predictions. It works as a for loop but using `.combine` option I am also merging the results
# predictions_trees_bagging <-foreach(m=1:iterations,.combine=cbind) %do% {
#   #Change seed to get different samples
#   set.seed(m*100)
#   #Get a new sample
#   training_positions <- sample(nrow(train_data), size=nrow(train_data),replace=TRUE)
#   #fit a large tree model in this sample
#   tree_bag <- rpart(formula= price~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income,data = train_data[training_positions,],  #fit on the bootstrap sample, not the full training set
#                      control =  rpart.control(cp = 0, maxdepth = 30,minbucket=2,minsplit=2))
#   
#   #foreach loop adds them to the predictions automatically
# predictions_trees_bagging <- predict(tree_bag,test_data)
# }
# 
# #Find the average estimate
# predictions_trees_bagging <-rowMeans(predictions_trees_bagging)
# 
# test_results_bagging<-data.frame(  RMSE = RMSE(predictions_trees_bagging, test_data$price), 
#                             Rsquared = R2(predictions_trees_bagging, test_data$price))
# 
# test_results_bagging

I will instead use a random forest with mtry equal to the number of predictors, which is equivalent to a bagged tree.

4.3.1 Bagged tree

# The following function gives the list of tunable parameters
modelLookup("ranger")

set.seed(1)
  
# Define the tuning grid: tuneGrid
# Fix 'mtry' at 12 (all predictors, i.e. bagging) and search over 'min.node.size'
gridRF <- data.frame(
  .mtry = 12 ,
  .splitrule = "variance",
  .min.node.size = seq(5,10,1)
)
# Fit random forest: model= ranger using caret library and train function
bagged_tree <- train(
  price  ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
  method = "ranger",
  metric="RMSE",
  trControl = ctrl,
  tuneGrid = gridRF,
  importance = 'permutation',
  num.trees= 500
  #This is the method used to determine variable importance:
  #permutation importance shuffles one variable's values and measures the resulting loss of accuracy
)


predictions_bagged <-predict(bagged_tree,test_data)

# Model prediction performance
rf_bagged<-data.frame(  RMSE = RMSE(predictions_bagged, test_data$price), 
                            Rsquare = R2(predictions_bagged, test_data$price))
rf_bagged

As expected, the bagged tree beats the single tree, with a lower RMSE. I will keep this one for my stacking.

4.4 Random forest

#MODEL WITH POSTCODE_SHORT. PERFORMS POORLY, SO IT IS KEPT COMMENTED OUT.

# # Define the tuning grid: tuneGrid
# # Let's do a search on 'mtry'; number of variables to use in each split
# gridRF <- data.frame(
#   .mtry = 5,
#   .splitrule = "variance",
#   .min.node.size = 5
# )
# # Fit random forest: model= ranger using caret library and train function
# rf_fit <- train(
#   price  ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income + postcode_short, data= train_data,
#   method = "ranger",
#   metric="RMSE",
#   trControl = ctrl,
#   tuneGrid = gridRF,
#   importance = 'permutation',
#   num.trees= 500
#   #This is the method used to determine variable importance:
#   #permutation importance shuffles one variable's values and measures the resulting loss of accuracy
# )
# 
# varImp(rf_fit)
# 
# plot(varImp(rf_fit))
# 
# summary(rf_fit)
# 
# print(rf_fit)
# 
# predictions_rf <-predict(rf_fit,test_data)
# 
# # Model prediction performance
# rf_results<-data.frame(  RMSE = RMSE(predictions_rf, test_data$price), 
#                             Rsquare = R2(predictions_rf, test_data$price))
# rf_results 
# The following function gives the list of tunable parameters
modelLookup("ranger")

set.seed(1)  
# Define the tuning grid: tuneGrid
# Fix 'mtry' at 4 and search over 'min.node.size'
gridRF <- data.frame(
  .mtry = 4,
  .splitrule = "variance",
  .min.node.size = seq(5,10,1)
)
# Fit random forest: model= ranger using caret library and train function
rf_fit <- train(
  price  ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
  method = "ranger",
  metric="RMSE",
  trControl = ctrl,
  tuneGrid = gridRF,
  importance = 'permutation',
  num.trees= 500
  #This is the method used to determine variable importance:
  #permutation importance shuffles one variable's values and measures the resulting loss of accuracy
)

varImp(rf_fit)

plot(varImp(rf_fit))

summary(rf_fit)

print(rf_fit)

predictions_rf <-predict(rf_fit,test_data)

# Model prediction performance
rf_results<-data.frame(  RMSE = RMSE(predictions_rf, test_data$price), 
                            Rsquare = R2(predictions_rf, test_data$price))
rf_results 

4.5 KNN

set.seed(1) #I will use cross validation. To be able to replicate the results I set the seed to a fixed number

# Below I use 'train' function from caret library. 
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# the default 'metric' for regression is RMSE

knn_fit <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district + postcode_short , data=train_data,
                 method = "knn",
                 trControl = ctrl, #use cross validation with 5 folds
                 tuneLength = 10, #number of parameter values train function will try
                 preProcess = c("center", "scale"))  #center and scale the data in k-nn this is pretty important

knn_fit

plot(knn_fit) #we can plot the results

4.5.1 Best KNN

# I will store the values of k I want to experiment with in knnGrid

knnGrid <-  expand.grid(k= seq(1,10 , by = 1)) 

# By fixing the seed I can re-generate the results when needed
set.seed(1)
# Below I use 'train' function from caret library. 
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# 'metric' is RMSE
# I already defined the 'trControl' and 'tuneGrid' options above
fit_KNN <- train(price~ total_floor_area + latitude + longitude + average_income + district, data=train_data,
                 preProcess = c("center", "scale"), 
                 method="knn", 
                 metric="RMSE", 
                 trControl=ctrl,
                 tuneGrid = knnGrid)
# display results
print(fit_KNN)

# plot results
plot(fit_KNN)

pred_knn <-predict(fit_KNN,test_data)


test_results_knn<-data.frame(  RMSE = RMSE(pred_knn, test_data$price), 
                            Rsquared = R2(pred_knn, test_data$price))

test_results_knn

importance_knn <- varImp(fit_KNN, scale=TRUE)
importance_knn
plot(importance_knn)

#I removed number_habitable_rooms, co2_emissions_potential and co2_emissions_current because they are highly correlated with total_floor_area. Based on the importance plot I also concluded that property_type and num_light_rail_lines were not necessary. R-squared increased.
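#The correlations behind this pruning can be verified directly (a sketch):
train_data %>%
  summarise(cor_rooms   = cor(total_floor_area, number_habitable_rooms),
            cor_co2_cur = cor(total_floor_area, co2_emissions_current),
            cor_co2_pot = cor(total_floor_area, co2_emissions_potential))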

4.6 Gradient boosting machine

modelLookup("gbm")


#Expand the search grid (see above for definitions)
grid<-expand.grid(interaction.depth = seq(6,10,1),n.trees = seq(300,500,50),shrinkage = seq(0.05,0.08,0.005), n.minobsinnode = 10)
set.seed(100)
#Train for gbm
fit_gbm <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district, data=train_data,
                 method = "gbm", 
                 trControl = ctrl,
                 tuneGrid =grid,
                   metric = "RMSE",
                 verbose = FALSE)

#I had to remove postcode_short because it was too computationally expensive
print(fit_gbm)

#performance, predict 

predictions_gbm <-predict(fit_gbm, test_data)

gbm_results<-data.frame(  RMSE = RMSE(predictions_gbm, test_data$price), 
                            Rsquared = R2(predictions_gbm, test_data$price))
gbm_results 
#modelLookup("gbm")

#Expand the search grid (see above for definitions)
#grid<-expand.grid(interaction.depth = 8,n.trees = 500,shrinkage =0.075, n.minobsinnode = 10)
#set.seed(100)
#Train for gbm
#fit_gbm <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + number_habitable_rooms + district + postcode_short , data=train_data,
  #               method = "gbm", 
   #              trControl = control,
    #             tuneGrid =grid,
     #              metric = "RMSE",
      #           verbose = FALSE
       #          )

#print(fit_gbm)

#performance, predict 

#predictions_gbm <-predict(fit_gbm, test_data)

#gbm_results<-data.frame(  RMSE = RMSE(predictions_gbm, test_data$price), 
                           # Rsquared = R2(predictions_gbm, test_data$price))
#gbm_results 

5 Stacking

Use stacking to ensemble your algorithms.

#training models
#Define folds
set.seed(1)
  #create five folds with no repeats
indexPreds <- createMultiFolds(train_data$price, CVfolds,times = 1) 
#Define traincontrol using folds
ctrl <- trainControl(method = "cv",  number = CVfolds, returnResamp = "final", savePredictions = "final", index = indexPreds,sampling = NULL)


#train linear regression
linear_model <-train(
    price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income,
    train_data,
   method = "lm",
    trControl = ctrl
   )

#train the bagged tree (with different independent variables), using the same trainControl
# Define the tuning grid: tuneGrid, fixed at the values selected earlier (mtry = 12, i.e. bagging)
gridRF_bagged <- data.frame(
  .mtry = 12,
  .splitrule = "variance",
  .min.node.size = 5
)
# Fit random forest: model= ranger using caret library and train function
bagged_tree_model <- train(
  price  ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
  method = "ranger",
  metric="RMSE",
  trControl = ctrl,
  tuneGrid = gridRF_bagged,
  importance = 'permutation',
  num.trees= 500
  #This is the method used to determine variable importance:
  #permutation importance shuffles one variable's values and measures the resulting loss of accuracy
)



#LASSO. From the tuning above I know the best lambda is around 102, but I keep the full lambda_seq so the final model uses the exact tuned value.
lambda_seq <- seq(0, 1000, length = 1000)
lasso_model <- train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,
data = train_data,
method = "glmnet",
preProc = c("center", "scale"), 
#This option standardizes the data before running the LASSO regression
trControl = ctrl,
 tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 specifies to run a LASSO regression. If alpha=0 the model would run ridge regression.
  )



# Define the tuning grid: tuneGrid, fixed at the values selected earlier (mtry = 4)
gridRF2 <- data.frame(
  .mtry = 4,
  .splitrule = "variance",
  .min.node.size = 5
)
# Fit random forest: model= ranger using caret library and train function
rf_model <- train(
  price  ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
  method = "ranger",
  metric="RMSE",
  trControl = ctrl,
  tuneGrid = gridRF2,
  importance = 'permutation',
  num.trees= 500
  #This is the method used to determine variable importance:
  #permutation importance shuffles one variable's values and measures the resulting loss of accuracy
)

#Gradient boosting machine

grid<-expand.grid(interaction.depth = 10,n.trees = 500,shrinkage =0.07, n.minobsinnode = 10)
#Train for gbm
gbm_model <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district, data=train_data,
                 method = "gbm", 
                 trControl = ctrl,
                 tuneGrid =grid,
                   metric = "RMSE",
                 verbose = FALSE
                 )



#KNN. I already know that the optimal number of neighbors is 7.
# Below I use 'train' function from caret library. 
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# I already defined the 'trControl' and 'tuneGrid' options above
knn_model <- train(price~ total_floor_area + latitude + longitude + average_income + district, data=train_data,
                 preProcess = c("center", "scale"), 
                 method="knn", 
                 metric="RMSE", 
                 trControl=ctrl,
                 tuneGrid = data.frame(k=7))
#combine the results into a caretList object
#the list names are just labels for the stack (note that 'rpart' actually holds the bagged ranger model)
multimodel <- list(lm = linear_model, rpart =  bagged_tree_model, lasso = lasso_model, knn= knn_model, gbm = gbm_model, random_forest = rf_model )
class(multimodel) <- "caretList"
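Setting the class manually works here because every model was trained with the same resampling folds (the shared indexPreds). For reference, caretEnsemble can also build such an object directly; a sketch with default tuning and an illustrative formula, left commented out to avoid re-training:

# library(caretEnsemble)
# multimodel <- caretList(
#   price ~ total_floor_area + district + london_zone + average_income, #illustrative formula
#   data = train_data,
#   trControl = ctrl,
#   methodList = c("lm", "glmnet", "ranger", "knn", "gbm")
# )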

5.1 Differences in performance of each algorithm

#we can visualize the differences in performance of each algorithm for each fold 
  modelCor(resamples(multimodel))
##                      lm     rpart     lasso        knn         gbm
## lm            1.0000000 0.6677961 0.9941921  0.5755736  0.61750682
## rpart         0.6677961 1.0000000 0.6333412  0.8757454  0.15939515
## lasso         0.9941921 0.6333412 1.0000000  0.5016088  0.69842837
## knn           0.5755736 0.8757454 0.5016088  1.0000000 -0.18195883
## gbm           0.6175068 0.1593952 0.6984284 -0.1819588  1.00000000
## random_forest 0.6794163 0.9855954 0.6304373  0.9427806  0.06414141
##               random_forest
## lm               0.67941627
## rpart            0.98559543
## lasso            0.63043732
## knn              0.94278062
## gbm              0.06414141
## random_forest    1.00000000
  dotplot(resamples(multimodel), metric = "Rsquared") #you can set metric=MAE, RMSE, or Rsquared 

   xyplot(resamples(multimodel), metric = "Rsquared")

    splom(resamples(multimodel), metric = "Rsquared")

5.2 Performance of the stacked algorithm

set.seed(1)
library(caret)
library(caretEnsemble)
model_list<- caretStack(multimodel, #creating a model that stacks both models together
                        trControl=ctrl,
                        method="lm",
                        metric="RMSE")


summary(model_list)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2604842   -54082     4049    52678  3672126 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.223e+03  4.635e+03  -0.264 0.791848    
## lm             3.065e-01  1.014e-01   3.022 0.002518 ** 
## rpart          4.042e-01  3.695e-02  10.938  < 2e-16 ***
## lasso          3.255e-01  9.996e-02   3.257 0.001131 ** 
## knn           -5.430e-02  1.591e-02  -3.413 0.000646 ***
## gbm            1.921e-01  1.733e-02  11.084  < 2e-16 ***
## random_forest -1.732e-01  3.229e-02  -5.364  8.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 187400 on 10492 degrees of freedom
## Multiple R-squared:  0.8714, Adjusted R-squared:  0.8713 
## F-statistic: 1.185e+04 on 6 and 10492 DF,  p-value: < 2.2e-16
pred_stacked <- predict(model_list,test_data)
test_results_stacked<-data.frame(  RMSE = RMSE(pred_stacked, test_data$price), 
                            Rsquared = R2(pred_stacked, test_data$price))

test_results_stacked
##       RMSE  Rsquared
## 1 173263.8 0.8863724

6 Pick investments

In this section you should use the best algorithm you identified to choose 200 properties from the out of sample data.

numchoose=200

oos<-london_house_prices_2019_out_of_sample


#predict the value of houses
oos$predict <- predict(model_list,oos)
#Choose the ones you want to invest here
#Make sure you choose exactly 200 of them

oos <-oos %>% 
  mutate(profitMargin=(predict-asking_price)/(asking_price)) %>% 
  arrange(-profitMargin)
#flag the 200 properties with the highest predicted margin
oos$buy=0
oos[1:numchoose,]$buy=1
oos<-oos %>% mutate( actualProfit=buy*profitMargin)
#average predicted margin if we invested in every property
mean(oos$profitMargin)
## [1] 0.03149146
#average predicted margin across the 200 chosen properties (prices here are predictions, not realized sales)
sum(oos$actualProfit)/numchoose
## [1] 0.6684041
#output your choices. Change the name of the file to your "lastname_firstname.csv"
write.csv(oos,"bensouda_othman.csv")
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
#non-chosen properties end up with coordinates (0,0), outside the map window (see the sketch after this chunk)
oos_map <- oos %>% mutate(longitude2=buy*longitude,latitude2=buy*latitude)
  

#I define the coordinates of London
london <- c(left = -0.6, bottom = 51.2, right = 0.35, top = 51.8) 
#let's get the map for London
map_london <-ggmap(get_stamenmap(london, zoom = 10))
## Source : http://tile.stamen.com/terrain/10/510/339.png
## Source : http://tile.stamen.com/terrain/10/511/339.png
## Source : http://tile.stamen.com/terrain/10/512/339.png
## Source : http://tile.stamen.com/terrain/10/510/340.png
## Source : http://tile.stamen.com/terrain/10/511/340.png
## Source : http://tile.stamen.com/terrain/10/512/340.png
## Source : http://tile.stamen.com/terrain/10/510/341.png
## Source : http://tile.stamen.com/terrain/10/511/341.png
## Source : http://tile.stamen.com/terrain/10/512/341.png
#Now I add the locations of the 200 chosen properties
map_london +
geom_point(aes(x =longitude2, y = latitude2),data= oos_map,col="red", alpha=1, size= 0.6) +
labs(title= "Map of the 200 best investments", x = "Longitude", y= "Latitude") + 
theme(plot.title = element_text(face = "bold", size = 16, family = "Oswald"),
      plot.subtitle=element_text(size=15, family = "Oswald"),
      plot.caption = element_text(size=9 , family = "Oswald"),
      axis.title.x = element_text(size=12,family="Oswald"),
      axis.title.y = element_text(size=12,family="Oswald")
      
      )
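Multiplying the coordinates by buy places every non-chosen property at (0, 0), which only works because that point falls outside the map window. Filtering to the chosen rows is cleaner; a sketch:

#sketch: plot only the 200 chosen properties instead of zeroing out coordinates
map_london +
  geom_point(aes(x = longitude, y = latitude),
             data = filter(oos, buy == 1),
             col = "red", alpha = 1, size = 0.6)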

7 What is your return?

Let's test how much money your algorithm can make. I will take the in-sample testing data (whose prices we know) and generate a random asking price; for the sake of simplicity, assume that the asking price is within 20% of the actual price. Then I will run my algorithm to see how it does on this data.

numchoose=200
set.seed(2)
random_mult<-1/(1+runif(nrow(test_data), min = -0.2, max = 0.2))
test_data$asking_price<- test_data$price*random_mult


#predict the value of houses
test_data$predict <- predict(model_list,test_data)


#Choose the ones you want to invest here

#Let's find the profit margin given our predicted price and asking price
test_data<- test_data%>%mutate(profitMargin=(predict-asking_price)/asking_price)%>%arrange(-profitMargin)
#Make sure you choose exactly 200 of them
test_data$invest=0
test_data[1:numchoose,]$invest=1


#let's find the actual profit

test_data<-test_data%>%mutate(profit=(price-asking_price)/asking_price)%>%mutate(actualProfit=invest*profit)

#if we invest in everything
mean(test_data$profit)
## [1] 0.0005645001
#just invest in those we chose
sum(test_data$actualProfit)/numchoose
## [1] 0.05930893