There are two data sets: i) training data, which includes the actual sale prices, and ii) out-of-sample data, which includes the asking prices. Load both data sets.
First, we have to understand what information each column contains. Not all of it will necessarily be useful for predicting house prices, but I will not make any assumptions before deciding what goes into my prediction algorithms.
#read in the data
london_house_prices_2019_training<-read.csv("training_data_assignment_with_prices.csv")
london_house_prices_2019_out_of_sample<-read.csv("test_data_assignment.csv")
#fix data types in both data sets
#fix dates
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate(date=as.Date(date))
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% mutate(date=as.Date(date))
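#NOTE: the glimpse/skim output below shows dates parsed into year 0011, which means
#as.Date() guessed the wrong format. If the raw strings are day-month-two-digit-year
#(an assumption -- check the CSV), parsing with an explicit format on a fresh read fixes it:
#london_house_prices_2019_training <- read.csv("training_data_assignment_with_prices.csv") %>% mutate(date=as.Date(date,format="%d-%m-%y"))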
#change characters to factors
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate_if(is.character,as.factor)
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% mutate_if(is.character,as.factor)
#make sure the out-of-sample data and the training data have the same levels for factors
a<-union(levels(london_house_prices_2019_training$postcode_short),levels(london_house_prices_2019_out_of_sample$postcode_short))
london_house_prices_2019_out_of_sample$postcode_short <- factor(london_house_prices_2019_out_of_sample$postcode_short, levels = a)
london_house_prices_2019_training$postcode_short <- factor(london_house_prices_2019_training$postcode_short, levels = a)
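#sanity check (illustrative): both factors should now share the same level set, so this should return character(0)
setdiff(levels(london_house_prices_2019_out_of_sample$postcode_short),levels(london_house_prices_2019_training$postcode_short))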
#take a quick look at what's in the data
glimpse(london_house_prices_2019_training)
# str(london_house_prices_2019_out_of_sample)
## Rows: 13,998
## Columns: 37
## $ ID <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ date <date> 0011-01-19, 0008-08-19, NA, 0007-05-19,…
## $ postcode <fct> SW8 1UY, TW10 5DF, TW14 9QY, E12 5AX, TW…
## $ property_type <fct> F, F, S, F, S, F, D, T, T, F, F, T, T, F…
## $ whether_old_or_new <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N…
## $ freehold_or_leasehold <fct> L, L, F, L, F, L, F, F, F, L, L, F, F, L…
## $ address1 <fct> STAMFORD BUILDINGS, 73, 199, 72A, 40, 19…
## $ address2 <fct> FLAT 8, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ address3 <fct> SOUTH LAMBETH ROAD, SHEEN COURT, HATTON …
## $ town <fct> NA, NA, NA, MANOR PARK, NA, NA, NA, NA, …
## $ local_aut <fct> LONDON, RICHMOND, FELTHAM, LONDON, FELTH…
## $ county <fct> LAMBETH, RICHMOND UPON THAMES, HOUNSLOW,…
## $ postcode_short <fct> SW8, TW10, TW14, E12, TW14, TW10, BR4, S…
## $ current_energy_rating <fct> E, D, D, E, D, C, E, D, E, C, E, D, D, C…
## $ total_floor_area <dbl> 30.00, 50.00, 100.00, 39.00, 88.00, 101.…
## $ number_habitable_rooms <int> 2, 2, 5, 2, 4, 4, 6, 6, 6, 3, 3, 7, 4, 3…
## $ co2_emissions_current <dbl> 2.3, 3.0, 3.7, 2.8, 3.9, 3.1, 8.1, 5.6, …
## $ co2_emissions_potential <dbl> 1.7, 1.7, 1.5, 1.1, 1.4, 1.4, 4.1, 2.0, …
## $ energy_consumption_current <int> 463, 313, 212, 374, 251, 175, 339, 216, …
## $ energy_consumption_potential <int> 344, 175, 82, 144, 90, 77, 168, 75, 186,…
## $ windows_energy_eff <fct> Average, Average, Average, Very Poor, Av…
## $ tenure <fct> owner-occupied, rental (private), owner-…
## $ latitude <dbl> 51.47925, 51.46515, 51.45945, 51.55140, …
## $ longitude <dbl> -0.122944, -0.282765, -0.431547, 0.04225…
## $ population <int> 34, 75, 83, 211, 73, 51, 25, 91, 60, 97,…
## $ altitude <int> 8, 9, 25, 11, 21, 11, 95, 7, 7, 106, 13,…
## $ london_zone <int> 1, 3, 5, 3, 6, 6, 3, 2, 2, 3, 6, 4, 6, 2…
## $ nearest_station <fct> stockwell, north sheen, hatton cross, ma…
## $ water_company <fct> Thames Water, Thames Water, Affinity Wat…
## $ average_income <int> 57200, 61900, 50600, 45400, 49000, 56200…
## $ district <fct> Lambeth, Richmond upon Thames, Hounslow,…
## $ price <dbl> 360000, 408500, 499950, 259999, 395000, …
## $ type_of_closest_station <fct> tube, rail, tube, light_rail, tube, rail…
## $ num_tube_lines <int> 1, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 2…
## $ num_rail_lines <int> 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0…
## $ num_light_rail_lines <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ distance_to_station <dbl> 0.5275686, 0.7699146, 0.8531822, 0.29019…
#I skim through the data. Some variables have missing values: date (8,777 missing), address2 and town (both very incomplete), and population (only 69 missing values)
skim(london_house_prices_2019_training)
Name | london_house_prices_2019_… |
Number of rows | 13998 |
Number of columns | 37 |
_______________________ | |
Column type frequency: | |
Date | 1 |
factor | 18 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 8777 | 0.37 | 0001-02-19 | 0012-12-19 | 0007-12-19 | 106 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
postcode | 0 | 1.00 | FALSE | 12635 | E11: 5, BR1: 4, CR0: 4, E17: 4 |
property_type | 0 | 1.00 | FALSE | 4 | F: 5402, T: 4999, S: 2780, D: 817 |
whether_old_or_new | 0 | 1.00 | FALSE | 2 | N: 13990, Y: 8 |
freehold_or_leasehold | 0 | 1.00 | FALSE | 2 | F: 8457, L: 5541 |
address1 | 0 | 1.00 | FALSE | 2825 | 3: 220, 7: 212, 4: 204, 12: 197 |
address2 | 10812 | 0.23 | FALSE | 434 | FLA: 218, FLA: 214, FLA: 213, FLA: 160 |
address3 | 0 | 1.00 | FALSE | 8543 | LON: 25, GRE: 24, THE: 23, HIG: 21 |
town | 13384 | 0.04 | FALSE | 133 | WAL: 35, CHE: 33, STR: 27, CHI: 24 |
local_aut | 0 | 1.00 | FALSE | 69 | LON: 7512, ROM: 396, BRO: 277, CRO: 243 |
county | 0 | 1.00 | FALSE | 33 | BRO: 861, CRO: 729, WAN: 702, HAV: 672 |
postcode_short | 0 | 1.00 | FALSE | 247 | CR0: 243, SW1: 204, E17: 193, SW1: 187 |
current_energy_rating | 0 | 1.00 | FALSE | 6 | D: 7075, C: 3488, E: 2646, B: 361 |
windows_energy_eff | 0 | 1.00 | FALSE | 5 | Ave: 7819, Goo: 3244, Ver: 1698, Poo: 1231 |
tenure | 0 | 1.00 | FALSE | 3 | own: 11236, ren: 2503, ren: 259 |
nearest_station | 0 | 1.00 | FALSE | 592 | rom: 201, cha: 103, bex: 99, har: 96 |
water_company | 0 | 1.00 | FALSE | 5 | Tha: 10471, Aff: 1577, Ess: 1154, SES: 792 |
district | 0 | 1.00 | FALSE | 33 | Cro: 925, Bro: 843, Hav: 681, Bex: 603 |
type_of_closest_station | 0 | 1.00 | FALSE | 3 | rai: 6512, tub: 4711, lig: 2775 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ID | 0 | 1 | 8003.94 | 4622.57 | 2.00 | 4004.50 | 8015.50 | 12018.75 | 1.5996e+04 | ▇▇▇▇▇ |
total_floor_area | 0 | 1 | 92.55 | 45.08 | 21.00 | 64.00 | 83.00 | 108.00 | 4.8000e+02 | ▇▂▁▁▁ |
number_habitable_rooms | 0 | 1 | 4.31 | 1.66 | 1.00 | 3.00 | 4.00 | 5.00 | 1.4000e+01 | ▆▇▁▁▁ |
co2_emissions_current | 0 | 1 | 4.25 | 2.39 | 0.10 | 2.70 | 3.80 | 5.20 | 4.4000e+01 | ▇▁▁▁▁ |
co2_emissions_potential | 0 | 1 | 2.22 | 1.44 | -0.50 | 1.30 | 1.80 | 2.70 | 2.0500e+01 | ▇▁▁▁▁ |
energy_consumption_current | 0 | 1 | 262.98 | 93.42 | 12.00 | 202.00 | 249.00 | 308.00 | 1.2960e+03 | ▇▅▁▁▁ |
energy_consumption_potential | 0 | 1 | 141.38 | 78.06 | -49.00 | 89.00 | 122.00 | 167.00 | 1.0230e+03 | ▇▂▁▁▁ |
latitude | 0 | 1 | 51.49 | 0.08 | 51.30 | 51.43 | 51.49 | 51.56 | 5.1680e+01 | ▂▇▇▇▂ |
longitude | 0 | 1 | -0.11 | 0.16 | -0.49 | -0.21 | -0.11 | 0.00 | 2.9000e-01 | ▂▅▇▅▂ |
population | 69 | 1 | 83.74 | 43.71 | 1.00 | 52.00 | 79.00 | 109.00 | 5.1000e+02 | ▇▃▁▁▁ |
altitude | 0 | 1 | 36.57 | 26.01 | 0.00 | 16.00 | 32.00 | 51.00 | 2.3900e+02 | ▇▃▁▁▁ |
london_zone | 0 | 1 | 3.76 | 1.43 | 1.00 | 3.00 | 4.00 | 5.00 | 7.0000e+00 | ▇▇▇▆▅ |
average_income | 0 | 1 | 55334.70 | 8454.23 | 36000.00 | 49400.00 | 54600.00 | 60600.00 | 8.5200e+04 | ▃▇▆▂▁ |
price | 0 | 1 | 593790.93 | 519158.10 | 77000.00 | 351000.00 | 460000.00 | 650000.00 | 1.0800e+07 | ▇▁▁▁▁ |
num_tube_lines | 0 | 1 | 0.44 | 0.74 | 0.00 | 0.00 | 0.00 | 1.00 | 6.0000e+00 | ▇▁▁▁▁ |
num_rail_lines | 0 | 1 | 0.58 | 0.51 | 0.00 | 0.00 | 1.00 | 1.00 | 2.0000e+00 | ▆▁▇▁▁ |
num_light_rail_lines | 0 | 1 | 0.24 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0000e+00 | ▇▁▁▁▂ |
distance_to_station | 0 | 1 | 0.65 | 0.40 | 0.00 | 0.37 | 0.57 | 0.84 | 5.6100e+00 | ▇▁▁▁▁ |
#We study the distribution of the most important variable: price. The distribution is right-skewed.
london_house_prices_2019_training %>%
ggplot(aes(x= price)) + geom_histogram() +
labs(title = "The distribution of price is right-skewed",subtitle ="", x= "Price", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
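The same long theme() block is repeated for every chart below. Factoring it into a reusable object would keep later chunks short; a sketch with identical settings:
#reusable theme object with the same Oswald settings used throughout
theme_oswald <- theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),
plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#later plots can then end with `+ theme_oswald` instead of repeating the block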
#Using log prices
london_house_prices_2019_training %>%
ggplot(aes(x= log(price))) + geom_histogram() +
labs(title = "Log price is normally distributed",subtitle ="", x= "Log price", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of price in our dataset is right-skewed, and a log transformation makes it look much closer to a normal distribution. This is not necessarily required for linear regression, however: what matters most is the normality of the residuals, not of the outcome itself.
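Since residual normality is what matters, here is a quick check on an illustrative fit (the model is just a placeholder):
#QQ-plot of the residuals from a simple placeholder regression
m <- lm(price ~ total_floor_area + district, data = london_house_prices_2019_training)
qqnorm(resid(m))
qqline(resid(m))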
#I want to see the prices by date. Is there some kind of seasonality or another pattern? I see regular surges; since the bars sum prices per day, these are likely days with more completed sales (perhaps weekends).
london_house_prices_2019_training %>%
ggplot(aes(x=date,y=price)) + geom_col() + labs(title = "Regular surges appear over time",subtitle ="", x= "Date", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
## Warning: Removed 8777 rows containing missing values (position_stack).
#Let's see the distribution of price according to the number of rooms. The median price does not seem to increase after 10 rooms.
london_house_prices_2019_training %>%
ggplot(aes(x=as.factor(number_habitable_rooms),price)) + geom_boxplot() + labs(title = "Median price does not increase after 10 rooms",subtitle ="", x= "Number of habitable rooms", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#Let's plot the number of rooms against the total floor area. I would obviously expect them to be positively correlated, but I want to make sure that is the case; I prefer not to make too many assumptions about the data.
london_house_prices_2019_training %>%
ggplot(aes(x=as.factor(number_habitable_rooms),total_floor_area)) + geom_boxplot() + labs(title = "Total floor area increases along with number of habitable rooms",subtitle ="", x= "Number of habitable rooms", y = "Total floor area") + theme_minimal() + theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#How big are the biggest houses in the dataset?
london_house_prices_2019_training %>%
arrange(desc(total_floor_area)) %>%
head(10) %>%
select(district,total_floor_area,price)
## district total_floor_area price
## 1 Ealing 480 3500000
## 2 Kensington and Chelsea 470 10800000
## 3 Redbridge 470 2100000
## 4 Kensington and Chelsea 460 9000000
## 5 Camden 453 6180000
## 6 Enfield 436 1750000
## 7 Barnet 436 3500000
## 8 Greenwich 432 1975000
## 9 Lambeth 419 4861875
## 10 Richmond upon Thames 416 2100000
#How small are the smallest houses in the dataset?
london_house_prices_2019_training %>%
arrange(total_floor_area) %>%
head(10) %>%
select(district,total_floor_area,price)
## district total_floor_area price
## 1 Kensington and Chelsea 21.00 301000
## 2 Enfield 21.00 171500
## 3 Westminster 21.00 270000
## 4 Kensington and Chelsea 21.00 265000
## 5 Kensington and Chelsea 21.10 385000
## 6 Kensington and Chelsea 21.41 275000
## 7 Richmond upon Thames 22.00 190000
## 8 Greenwich 22.00 315000
## 9 Kensington and Chelsea 22.76 850000
## 10 Hammersmith and Fulham 23.00 275000
#What are the most expensive houses in the dataset?
london_house_prices_2019_training %>%
arrange(desc(price)) %>%
head(10) %>%
select(district,total_floor_area,price)
## district total_floor_area price
## 1 Kensington and Chelsea 470.00 10800000
## 2 Kensington and Chelsea 357.00 10500000
## 3 Kensington and Chelsea 460.00 9000000
## 4 Kensington and Chelsea 293.00 8750000
## 5 Kensington and Chelsea 365.00 8225000
## 6 Kensington and Chelsea 318.20 8175000
## 7 Westminster 257.20 8150000
## 8 Kensington and Chelsea 309.00 8050000
## 9 Westminster 412.00 8000000
## 10 Westminster 168.89 7250000
#What are the least expensive houses in the dataset?
london_house_prices_2019_training %>%
arrange(price) %>%
head(10) %>%
select(district,total_floor_area,price)
## district total_floor_area price
## 1 Havering 41 77000
## 2 Bromley 31 80000
## 3 Ealing 55 85000
## 4 Hillingdon 68 87500
## 5 Harrow 61 87500
## 6 Bexley 77 88000
## 7 Redbridge 44 89000
## 8 Brent 55 90000
## 9 Hillingdon 65 90000
## 10 Brent 36 91000
#Is distance to station strongly correlated with price? The Pearson correlation coefficient is only -0.129; I would have expected a larger negative correlation.
london_house_prices_2019_training %>%
ggplot(aes(x=distance_to_station,y=price)) + geom_smooth(method="lm") + labs(title = "Distance to station seems strongly negatively correlated with price...",subtitle ="But the correlation coefficient has a low magnitude!", x= "Distance to station", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank()) + geom_text(
data = data.frame(x = 2.3, y = 0, label = "Pearson correlation coefficient = -0.129"),
aes(x = x, y = y, label = label),
colour="black",
family="Oswald",
size= 4.5,
hjust = 0.5,
lineheight = .8,
inherit.aes = FALSE)
## `geom_smooth()` using formula 'y ~ x'
#Correlation coefficient between distance to station and price.
cor(london_house_prices_2019_training$price,london_house_prices_2019_training$distance_to_station)
## [1] -0.1291175
#Let's see if price depends on tenure. Rental and social properties seem to have lower prices than owner-occupied properties.
london_house_prices_2019_training %>%
ggplot(aes(x=as.factor(tenure),price)) + geom_boxplot() + labs(title = "Owner-occupied properties are more expensive than the others",subtitle ="", x= "Tenure", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#Do different property types have different prices?
london_house_prices_2019_training %>%
ggplot(aes(x=property_type,price)) + geom_boxplot() + labs(title = "Detached properties are more expensive than the others",subtitle ="", x= "Property type", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#Do new houses really have a higher price?
london_house_prices_2019_training %>%
ggplot(aes(x=whether_old_or_new,price)) + geom_boxplot() + labs(title = "Are new houses more expensive than old ones?",subtitle ="", x= "Whether old or New", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#The boxplot does not really answer this question; let's look at the averages instead
london_house_prices_2019_training %>% group_by(whether_old_or_new) %>%
summarise(mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## whether_old_or_new `mean(price)`
## <fct> <dbl>
## 1 N 593884.
## 2 Y 431875
#New houses have a lower average price than old houses. However, this alone does not mean much, since old houses in the dataset could simply be bigger. Let's also group by the number of habitable rooms. Now the comparison makes sense: for the same number of rooms, new houses are on average more expensive.
london_house_prices_2019_training %>% group_by(number_habitable_rooms,whether_old_or_new) %>%
summarise(mean(price))
## `summarise()` regrouping output by 'number_habitable_rooms' (override with `.groups` argument)
## # A tibble: 16 x 3
## # Groups: number_habitable_rooms [14]
## number_habitable_rooms whether_old_or_new `mean(price)`
## <int> <fct> <dbl>
## 1 1 N 282129.
## 2 2 N 328988.
## 3 2 Y 400000
## 4 3 N 442166.
## 5 3 Y 451000
## 6 4 N 523025.
## 7 5 N 593832.
## 8 6 N 786567.
## 9 7 N 1010177.
## 10 8 N 1272085.
## 11 9 N 1771558.
## 12 10 N 2264689.
## 13 11 N 2218333.
## 14 12 N 2829125
## 15 13 N 1687500
## 16 14 N 1560000
#There are only 8 new houses in the dataset, so I can't draw reliable conclusions from this comparison.
london_house_prices_2019_training %>% group_by(whether_old_or_new) %>%
count()
## # A tibble: 2 x 2
## # Groups: whether_old_or_new [2]
## whether_old_or_new n
## <fct> <int>
## 1 N 13990
## 2 Y 8
#Are all districts equal in terms of price?
london_house_prices_2019_training %>%
ggplot(aes(x=district,price)) + geom_boxplot() + labs(title = "Districts have radically different median prices",subtitle ="", x= "District", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
#I expect total floor area to be strongly correlated with price. Let's check whether this is the case. The two variables are indeed strongly positively correlated (r = 0.69).
london_house_prices_2019_training %>%
ggplot(aes(x=total_floor_area,y=price)) + geom_smooth() + labs(title = "Price increases along with total floor area ",subtitle ="", x= "Total floor area", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank()) + geom_text(
data = data.frame(x = 300, y = 1000000, label = "Pearson correlation coefficient = 0.69"),
aes(x = x, y = y, label = label),
colour="black",
family="Oswald",
size= 4.5,
hjust = 0.5,
lineheight = .8,
inherit.aes = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#Correlation coefficient between total floor area and price
cor(london_house_prices_2019_training$price,london_house_prices_2019_training$total_floor_area)
## [1] 0.6937882
#Let's see how freehold and leasehold properties are distributed, and check whether the distinction has any impact on price. I expect freehold properties to be more expensive on average than leasehold properties.
london_house_prices_2019_training %>% group_by(freehold_or_leasehold) %>%
count()
## # A tibble: 2 x 2
## # Groups: freehold_or_leasehold [2]
## freehold_or_leasehold n
## <fct> <int>
## 1 F 8457
## 2 L 5541
london_house_prices_2019_training %>% group_by(freehold_or_leasehold) %>%
summarise(mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## freehold_or_leasehold `mean(price)`
## <fct> <dbl>
## 1 F 675909.
## 2 L 468457.
#It makes sense to think that total floor area has a different effect on price depending on the postcode, or at least the district. I will verify this and, if warranted, use an interaction variable in my linear regression. I check two districts in order to visualize the data clearly.
#use %in%, not ==: comparing against a length-2 vector recycles and silently drops rows
london_house_prices_2019_training %>% filter(district %in% c("Barking and Dagenham", "Kensington and Chelsea")) %>% ggplot(aes(x=total_floor_area,y=price,colour=district)) + geom_smooth(method="lm") + labs(title = "Price per square meter differs among districts",subtitle ="", x= "Total floor area", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
## `geom_smooth()` using formula 'y ~ x'
#I repeat the process with postcodes
london_house_prices_2019_training %>% filter(postcode_short %in% c("SW8", "NW1")) %>% ggplot(aes(x=total_floor_area,y=price,colour=postcode_short)) + geom_smooth(method="lm") + labs(title = "Price per square meter differs among postcodes",subtitle ="", x= "Total floor area", y = "Price") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 19, family = "Oswald"),
legend.title = element_text(family = "Oswald"),
legend.text = element_text(family = "Oswald",size=10),
plot.subtitle=element_text(size=17, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.text.x = element_text(size= 11, family='Oswald',color="black"),
axis.text.y = element_text(size= 11,family = "Oswald",color="black"),
axis.title.x = element_text(size= 11, family='Oswald',color="black"),
axis.title.y = element_text(size= 11,family = "Oswald",color="black"),plot.title.position = "plot",
plot.caption.position = "plot",
panel.grid.minor=element_blank())
## `geom_smooth()` using formula 'y ~ x'
It is intuitive to think that the price per square meter is not the same in all districts or postcodes, and these visualizations support that intuition: the slope of the fitted line changes with the district/postcode, indicating that interaction variables may be necessary in our algorithms.
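In a regression formula, such a varying slope is expressed as an interaction term, for example:
#a separate floor-area slope per district; the same form appears in the linear models below
summary(lm(price ~ total_floor_area:district, data = london_house_prices_2019_training))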
We now study the correlations between variables in more detail.
# produce a correlation table using GGally::ggcor()
# this takes a while to plot
library("GGally")
london_house_prices_2019_training %>%
select(-ID) %>% #keep Y variable last
ggcorr(method = c("pairwise", "pearson"), layout.exp = 2,label_round=2, label = TRUE,label_size = 2,hjust = 1,nbreaks = 5,size = 2,angle = -20)
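If a numeric view is preferred, the same pairwise correlations can be computed directly (the column subset here is my own choice):
#pairwise Pearson correlations of a few numeric predictors with price
london_house_prices_2019_training %>%
select(price, total_floor_area, number_habitable_rooms, average_income, distance_to_station, london_zone, altitude) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)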
Multicollinearity will not be a problem for our linear regression because we use it as a purely predictive model. It would have been a problem in an explanatory setting, since collinearity inflates the variance of the coefficient estimates.
In our algorithms we will focus on the variables that are most correlated with price.
To help you get started I build a linear regression model below. I chose a subset of the features with no particular goal. You can (and should) add more variables and/or apply variable selection methods.
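The chunks below use train_data and test_data, whose construction is not shown. A minimal sketch, assuming a 75/25 split (consistent with the roughly 10,500 training rows implied by the stacking summary further down):
#split the priced data into training and testing sets (the 75/25 proportion is an assumption)
library(caret)
set.seed(1)
train_index <- createDataPartition(london_house_prices_2019_training$price, p = 0.75, list = FALSE)
train_data <- london_house_prices_2019_training[train_index, ]
test_data <- london_house_prices_2019_training[-train_index, ]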
#number of folds in cross validation
CVfolds <- 5
#Define folds
set.seed(1)
#create five folds with no repeats
indexPreds <- createMultiFolds(train_data$price, CVfolds,times = 1)
#Define traincontrol using folds
ctrl <- trainControl(method = "cv", number = CVfolds, returnResamp = "final", savePredictions = "final", index = indexPreds,sampling = NULL)
base_model<-train(
price ~ distance_to_station +water_company+property_type+whether_old_or_new+freehold_or_leasehold+latitude+ longitude+postcode_short,
train_data,
method = "lm",
trControl = ctrl
)
# summary of the results
summary(base_model)
#Use the model to make predictions out of sample in the testing set
pred_base <-predict(base_model,test_data)
test_results_base <-data.frame(RMSE = RMSE(pred_base, test_data$price),
Rsquared = R2(pred_base, test_data$price))
test_results_base
#we are going to train the model and report the results using k-fold cross validation
set.seed(1)
model1_lm<-train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income ,
train_data,
method = "lm",
trControl = ctrl
)
# summary of the results
summary(model1_lm)
#Use the model to make predictions out of sample in the testing set
pred<-predict(model1_lm,test_data)
test_results<-data.frame( RMSE = RMSE(pred, test_data$price),
Rsquared = R2(pred, test_data$price))
test_results
# we can check variable importance as well
importance <- varImp(model1_lm, scale=TRUE)
plot(importance)
set.seed(1)
#Let's try our model with log price.
model1_lm_log<-train(
log(price) ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income ,
train_data,
method = "lm",
trControl = ctrl
)
# summary of the results
summary(model1_lm_log)
#Use the model to make predictions out of sample in the testing set.
#Predictions from this model are on the log scale, so back-transform them with exp()
#before computing the RMSE in price levels; otherwise the two models are not comparable.
pred_log <-exp(predict(model1_lm_log,test_data))
test_results_log <-data.frame( RMSE = RMSE(pred_log, test_data$price),
Rsquared = R2(pred_log, test_data$price))
test_results_log
The RMSE (compared in price levels after back-transforming) is considerably higher, so we keep the model without the log transformation.
#Using LASSO
set.seed(1)
#we will look for the optimal lambda in this sequence
lambda_seq <- seq(0, 1000, length = 1000)
#LASSO regression with using 5-fold cross validation to select the best lambda among the lambdas specified in "lambda_seq".
lasso <- train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,
data = train_data,
method = "glmnet",
preProc = c("center", "scale"),
#This option standardizes the data before running the LASSO regression
trControl = ctrl,
tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 specifies to run a LASSO regression. If alpha=0 the model would run ridge regression.
)
plot(lasso)
coef(lasso$finalModel, lasso$bestTune$lambda)
lasso$bestTune$lambda
# Count of how many coefficients are greater than zero and how many are equal to zero
sum(coef(lasso$finalModel, lasso$bestTune$lambda)!=0)
sum(coef(lasso$finalModel, lasso$bestTune$lambda)==0)
predictions_lasso <- predict(lasso,test_data)
# Model prediction performance
LASSO_results<-data.frame( RMSE = RMSE(predictions_lasso, test_data$price),
Rsquare = R2(predictions_lasso, test_data$price)
)
LASSO_results
#Lasso gives very similar results to OLS.
#Stepwise regression
#BackFit <- train(price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,train_data,
# method = "leapBackward", #can change method to "leapSeq", "leapForward"
# tuneGrid = data.frame(nvmax = 10:30), #Will find the best model with 10:30 variables.
# trControl = ctrl
#)
##show the results of all models
#BackFit$results
Next I fit a tree model, using a similar subset of features.
model2_tree <- train(
price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income, train_data,
method = "rpart",
trControl = ctrl,
tuneLength=10
)
plot(model2_tree)
#You can view how the tree performs
model2_tree$results
#You can view the final tree
#rpart.plot(model2_tree$finalModel)
#you can also visualize the variable importance
#importance <- varImp(model2_tree, scale=TRUE)
#plot(importance)
#Tuning the complexity parameter
#I choose cp values that seem to result in low error based on the plot above
Grid <- expand.grid(cp = seq( 0.0000, 0.0020,0.0001))
dtree_fit <- train(price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income ,data = train_data,
method = "rpart",
metric="RMSE",
trControl=ctrl,
tuneGrid=Grid)
# Plot the best tree model found
#rpart.plot(dtree_fit$finalModel)
# Print the search results of 'train' function
plot(dtree_fit)
dtree_fit$results
#Use the model to make predictions out of sample in the testing set
predict_trees <- predict(dtree_fit,test_data)
test_results_trees<-data.frame( RMSE = RMSE(predict_trees, test_data$price),
Rsquared = R2(predict_trees, test_data$price))
test_results_trees
# we can check variable importance as well
importance_tree <- varImp(dtree_fit, scale=TRUE)
importance_tree
plot(importance_tree)
#Most postcodes are not important so I removed this variable.
# #I set the number of bootstrap samples via `iterations`. You should experiment with different numbers
# iterations<-100
#
# #Next I use a foreach loop and combine the estimates from each sample in `predictions`. It works like a for loop, but the `.combine` option also merges the results
# predictions_trees_bagging <-foreach(m=1:iterations,.combine=cbind) %do% {
# #Change seed to get different samples
# set.seed(m*100)
# #Get a new sample
# training_positions <- sample(nrow(train_data), size=nrow(train_data),replace=TRUE)
# #fit a large tree model on this bootstrap sample
# tree_bag <- rpart(formula= price~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income,data = train_data[training_positions,],
# control = rpart.control(cp = 0, maxdepth = 30,minbucket=2,minsplit=2))
#
# #foreach loop adds them to the predictions automatically
# predictions_trees_bagging <- predict(tree_bag,test_data)
# }
#
# #Find the average estimate
# predictions_trees_bagging <-rowMeans(predictions_trees_bagging)
#
# test_results_bagging<-data.frame( RMSE = RMSE(predictions_trees_bagging, test_data$price),
# Rsquared = R2(predictions_trees_bagging, test_data$price))
#
# test_results_bagging
I will use a random forest with mtry equal to the number of predictors, so every split considers all variables; this is equivalent to a bagged tree.
# The following function gives the list of tunable parameters
modelLookup("ranger")
set.seed(1)
# Define the tuning grid: tuneGrid
# Let's do a search on 'mtry'; number of variables to use in each split
gridRF <- data.frame(
.mtry = 12 ,
.splitrule = "variance",
.min.node.size = seq(5,10,1)
)
# Fit random forest: model= ranger using caret library and train function
bagged_tree <- train(
price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
method = "ranger",
metric="RMSE",
trControl = ctrl,
tuneGrid = gridRF,
importance = 'permutation',
num.trees= 500
#Importance is computed by permutation:
#shuffle one variable's values and measure the resulting loss in accuracy
)
predictions_bagged <-predict(bagged_tree,test_data)
# Model prediction performance
rf_bagged<-data.frame( RMSE = RMSE(predictions_bagged, test_data$price),
Rsquare = R2(predictions_bagged, test_data$price))
rf_bagged
As expected, the bagged tree is better than the single tree, since it has a lower RMSE. I will keep this one for my stacking.
#MODEL WITH POSTCODE_SHORT. PERFORMS POORLY.
# # Define the tuning grid: tuneGrid
# # Let's do a search on 'mtry'; number of variables to use in each split
# gridRF <- data.frame(
# .mtry = 5,
# .splitrule = "variance",
# .min.node.size = 5
# )
# # Fit random forest: model= ranger using caret library and train function
# rf_fit <- train(
# price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income + postcode_short, data= train_data,
# method = "ranger",
# metric="RMSE",
# trControl = ctrl,
# tuneGrid = gridRF,
# importance = 'permutation',
# num.trees= 500
# #Importance is computed by permutation:
# #shuffle one variable's values and measure the resulting loss in accuracy
# )
#
# varImp(rf_fit)
#
# plot(varImp(rf_fit))
#
# summary(rf_fit)
#
# print(rf_fit)
#
# predictions_rf <-predict(rf_fit,test_data)
#
# # Model prediction performance
# rf_results<-data.frame( RMSE = RMSE(predictions_rf, test_data$price),
# Rsquare = R2(predictions_rf, test_data$price))
# rf_results
# The following function gives the list of tunable parameters
modelLookup("ranger")
set.seed(1)
# Define the tuning grid: tuneGrid
# Let's do a search on 'mtry'; number of variables to use in each split
gridRF <- data.frame(
.mtry = 4,
.splitrule = "variance",
.min.node.size = seq(5,10,1)
)
# Fit random forest: model= ranger using caret library and train function
rf_fit <- train(
price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
method = "ranger",
metric="RMSE",
trControl = ctrl,
tuneGrid = gridRF,
importance = 'permutation',
num.trees= 500
#Importance is computed by permutation:
#shuffle one variable's values and measure the resulting loss in accuracy
)
varImp(rf_fit)
plot(varImp(rf_fit))
summary(rf_fit)
print(rf_fit)
predictions_rf <-predict(rf_fit,test_data)
# Model prediction performance
rf_results<-data.frame( RMSE = RMSE(predictions_rf, test_data$price),
Rsquare = R2(predictions_rf, test_data$price))
rf_results
set.seed(1) #I will use cross validation. To be able to replicate the results I set the seed to a fixed number
# Below I use 'train' function from caret library.
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# default 'metric' for regression is RMSE
knn_fit <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district + postcode_short , data=train_data,
method = "knn",
trControl = ctrl, #use cross validation with 5 folds
tuneLength = 10, #number of parameter values train function will try
preProcess = c("center", "scale")) #center and scale the data in k-nn this is pretty important
knn_fit
plot(knn_fit) #we can plot the results
# I will store the values of k I want to experiment with in knnGrid
knnGrid <- expand.grid(k= seq(1,10 , by = 1))
# By fixing the seed I can re-generate the results when needed
set.seed(1)
# Below I use 'train' function from caret library.
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# 'metric' is RMSE for this regression
# I already defined the 'trControl' and 'tuneGrid' options above
fit_KNN <- train(price~ total_floor_area + latitude + longitude + average_income + district, data=train_data,
preProcess = c("center", "scale"),
method="knn",
metric="RMSE",
trControl=ctrl,
tuneGrid = knnGrid)
# display results
print(fit_KNN)
# plot results
plot(fit_KNN)
pred_knn <-predict(fit_KNN,test_data)
test_results_knn<-data.frame( RMSE = RMSE(pred_knn, test_data$price),
Rsquared = R2(pred_knn, test_data$price))
test_results_knn
importance_knn <- varImp(fit_KNN, scale=TRUE)
importance_knn
plot(importance_knn)
#I removed number_habitable_rooms, co2_emissions_potential and co2_emissions_current because they are highly correlated with total_floor_area. Based on the importance plot I also concluded that property_type and num_light_rail_lines were not necessary. R-squared increased.
modelLookup("gbm")
#Expand the search grid (see above for definitions)
grid<-expand.grid(interaction.depth = seq(6,10,1),n.trees = seq(300,500,50),shrinkage = seq(0.05,0.08,0.005), n.minobsinnode = 10)
set.seed(100)
#Train for gbm
fit_gbm <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district, data=train_data,
method = "gbm",
trControl = ctrl,
tuneGrid =grid,
metric = "RMSE",
verbose = FALSE)
#I had to remove postcode short because it was too computationally expensive
print(fit_gbm)
#performance, predict
predictions_gbm <-predict(fit_gbm, test_data)
gbm_results<-data.frame( RMSE = RMSE(predictions_gbm, test_data$price),
Rsquared = R2(predictions_gbm, test_data$price))
gbm_results
#modelLookup("gbm")
#Expand the search grid (see above for definitions)
#grid<-expand.grid(interaction.depth = 8,n.trees = 500,shrinkage =0.075, n.minobsinnode = 10)
#set.seed(100)
#Train for gbm
#fit_gbm <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district + postcode_short , data=train_data,
# method = "gbm",
# trControl = ctrl,
# tuneGrid =grid,
# metric = "RMSE",
# verbose = FALSE
# )
#print(fit_gbm)
#performance, predict
#predictions_gbm <-predict(fit_gbm, test_data)
#gbm_results<-data.frame( RMSE = RMSE(predictions_gbm, test_data$price),
# Rsquared = R2(predictions_gbm, test_data$price))
#gbm_results
Use stacking to ensemble your algorithms.
#training models
#Define folds
set.seed(1)
#create five folds with no repeats
indexPreds <- createMultiFolds(train_data$price, CVfolds,times = 1)
#Define traincontrol using folds
ctrl <- trainControl(method = "cv", number = CVfolds, returnResamp = "final", savePredictions = "final", index = indexPreds,sampling = NULL)
#train linear regression
linear_model <-train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area+ average_income,
train_data,
method = "lm",
trControl = ctrl
)
#train a tree (with different independent variables) use same trcontrol
# Define the tuning grid: tuneGrid
# Let's do a search on 'mtry'; number of variables to use in each split
gridRF_bagged <- data.frame(
.mtry = 12,
.splitrule = "variance",
.min.node.size = 5
)
# Fit random forest: model= ranger using caret library and train function
bagged_tree_model <- train(
price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
method = "ranger",
metric="RMSE",
trControl = ctrl,
tuneGrid = gridRF_bagged,
importance = 'permutation',
num.trees= 500
#Importance is computed by permutation:
#shuffle one variable's values and measure the resulting loss in accuracy
)
#LASSO. Cross-validation above found the best lambda to be about 102, but I keep the full lambda sequence so the final model uses the exact optimum.
lambda_seq <- seq(0, 1000, length = 1000)
lasso_model <- train(
price ~ distance_to_station +property_type+ freehold_or_leasehold+ london_zone + total_floor_area:district + co2_emissions_potential + co2_emissions_current + postcode_short:total_floor_area,
data = train_data,
method = "glmnet",
preProc = c("center", "scale"),
#This option standardizes the data before running the LASSO regression
trControl = ctrl,
tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 specifies to run a LASSO regression. If alpha=0 the model would run ridge regression.
)
# Define the tuning grid: tuneGrid
# Let's do a search on 'mtry'; number of variables to use in each split
gridRF2 <- data.frame(
.mtry = 4,
.splitrule = "variance",
.min.node.size = 5
)
# Fit random forest: model= ranger using caret library and train function
rf_model <- train(
price ~ distance_to_station +property_type+whether_old_or_new+freehold_or_leasehold+ london_zone + total_floor_area + co2_emissions_potential + co2_emissions_current + district + latitude + longitude + average_income , data= train_data,
method = "ranger",
metric="RMSE",
trControl = ctrl,
tuneGrid = gridRF2,
importance = 'permutation',
num.trees= 500
#Importance is computed by permutation:
#shuffle one variable's values and measure the resulting loss in accuracy
)
#Gradient boosting machine
grid<-expand.grid(interaction.depth = 10,n.trees = 500,shrinkage =0.07, n.minobsinnode = 10)
#Train for gbm
gbm_model <- train(price~ total_floor_area + latitude + longitude + number_habitable_rooms + num_tube_lines + num_rail_lines + num_light_rail_lines + distance_to_station + average_income + property_type + district, data=train_data,
method = "gbm",
trControl = ctrl,
tuneGrid =grid,
metric = "RMSE",
verbose = FALSE
)
#KNN. I already know that the optimal number of neighbors is 7.
# Below I use 'train' function from caret library.
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# I already defined the 'trControl' and 'tuneGrid' options above
knn_model <- train(price~ total_floor_area + latitude + longitude + average_income + district, data=train_data,
preProcess = c("center", "scale"),
method="knn",
metric="RMSE",
trControl=ctrl,
tuneGrid = data.frame(k=7))
#combine the results
#make sure to use the method names from above
multimodel <- list(lm = linear_model, rpart = bagged_tree_model, lasso = lasso_model, knn= knn_model, gbm = gbm_model, random_forest = rf_model ) #note: the element named 'rpart' actually holds the bagged ranger model
class(multimodel) <- "caretList"
#we can check how correlated the models' resample performance is across folds; the less correlated the models, the more stacking can add
modelCor(resamples(multimodel))
## lm rpart lasso knn gbm
## lm 1.0000000 0.6677961 0.9941921 0.5755736 0.61750682
## rpart 0.6677961 1.0000000 0.6333412 0.8757454 0.15939515
## lasso 0.9941921 0.6333412 1.0000000 0.5016088 0.69842837
## knn 0.5755736 0.8757454 0.5016088 1.0000000 -0.18195883
## gbm 0.6175068 0.1593952 0.6984284 -0.1819588 1.00000000
## random_forest 0.6794163 0.9855954 0.6304373 0.9427806 0.06414141
## random_forest
## lm 0.67941627
## rpart 0.98559543
## lasso 0.63043732
## knn 0.94278062
## gbm 0.06414141
## random_forest 1.00000000
set.seed(1)
library(caret)
library(caretEnsemble)
model_list<- caretStack(multimodel, #stack all six models with a linear meta-model
trControl=ctrl,
method="lm",
metric="RMSE")
summary(model_list)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2604842 -54082 4049 52678 3672126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.223e+03 4.635e+03 -0.264 0.791848
## lm 3.065e-01 1.014e-01 3.022 0.002518 **
## rpart 4.042e-01 3.695e-02 10.938 < 2e-16 ***
## lasso 3.255e-01 9.996e-02 3.257 0.001131 **
## knn -5.430e-02 1.591e-02 -3.413 0.000646 ***
## gbm 1.921e-01 1.733e-02 11.084 < 2e-16 ***
## random_forest -1.732e-01 3.229e-02 -5.364 8.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 187400 on 10492 degrees of freedom
## Multiple R-squared: 0.8714, Adjusted R-squared: 0.8713
## F-statistic: 1.185e+04 on 6 and 10492 DF, p-value: < 2.2e-16
pred_stacked <- predict(model_list,test_data)
test_results_stacked<-data.frame( RMSE = RMSE(pred_stacked, test_data$price),
Rsquared = R2(pred_stacked, test_data$price))
test_results_stacked
## RMSE Rsquared
## 1 173263.8 0.8863724
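For reference, the hold-out results computed above can be gathered into a single table (the names refer to the result objects created earlier):
#side-by-side comparison of out-of-sample RMSEs
data.frame(
model = c("base lm", "lm", "lasso", "tree", "bagged tree", "random forest", "gbm", "knn", "stack"),
RMSE = c(test_results_base$RMSE, test_results$RMSE, LASSO_results$RMSE, test_results_trees$RMSE, rf_bagged$RMSE, rf_results$RMSE, gbm_results$RMSE, test_results_knn$RMSE, test_results_stacked$RMSE)
) %>% arrange(RMSE)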
In this section you should use the best algorithm you identified to choose 200 properties from the out of sample data.
numchoose=200
oos<-london_house_prices_2019_out_of_sample
#predict the value of houses
oos$predict <- predict(model_list,oos)
#Choose the ones you want to invest here
#Make sure you choose exactly 200 of them
oos <-oos %>%
mutate(profitMargin=(predict-asking_price)/(asking_price)) %>%
arrange(-profitMargin)
#Flag the chosen 200 properties with buy=1
oos$buy=0
oos[1:numchoose,]$buy=1
oos<-oos %>% mutate( actualProfit=buy*profitMargin)
#let's look at the expected profit margins
#if we invested in every property:
mean(oos$profitMargin)
## [1] 0.03149146
#average predicted margin of the 200 properties we chose (this call is a reconstruction; it was not shown in the original)
mean(oos[oos$buy==1,]$profitMargin)
## [1] 0.6684041
#output your choices. Change the name of the file to your "lastname_firstname.csv"
write.csv(oos,"bensouda_othman.csv")
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
#keep only the 200 chosen properties (multiplying coordinates by `buy` would plot the rest at (0,0))
oos_map <- oos %>% filter(buy==1) %>% mutate(longitude2=longitude,latitude2=latitude)
#I define the coordinates of London
london <- c(left = -0.6, bottom = 51.2, right = 0.35, top = 51.8)
#let's get the map for London
map_london <-ggmap(get_stamenmap(london, zoom = 10))
#Now I add the points for the 200 chosen properties
map_london +
geom_point(aes(x =longitude2, y = latitude2),data= oos_map,col="red", alpha=1, size= 0.6) +
labs(title= "Map of the 200 best investments", x = "Longitude", y= "Latitude") +
theme(plot.title = element_text(face = "bold", size = 16, family = "Oswald"),
plot.subtitle=element_text(size=15, family = "Oswald"),
plot.caption = element_text(size=9 , family = "Oswald"),
axis.title.x = element_text(size=12,family="Oswald"),
axis.title.y = element_text(size=12,family="Oswald")
)
Let's test how much money your algorithm can make. I will take the in-sample testing data (whose prices we know) and generate a random asking price. For the sake of simplicity, assume the asking price is within 20% of the actual price. Then I will run my algorithm to see how it does on this data.
numchoose=200
set.seed(2)
random_mult<-1/(1+runif(nrow(test_data), min = -0.2, max = 0.2))
test_data$asking_price<- test_data$price*random_mult
#predict the value of houses
test_data$predict <- predict(model_list,test_data)
#Choose the ones you want to invest here
#Let's find the profit margin given our predicted price and asking price
test_data<- test_data%>%mutate(profitMargin=(predict-asking_price)/asking_price)%>%arrange(-profitMargin)
#Make sure you choose exactly 200 of them
test_data$invest=0
test_data[1:numchoose,]$invest=1
#let's find the actual profit
test_data<-test_data%>%mutate(profit=(price-asking_price)/asking_price)%>%mutate(actualProfit=invest*profit)
#if we invest in everything
mean(test_data$profit)
## [1] 0.0005645001
#average realized profit of the 200 chosen properties (this call is a reconstruction; it was not shown in the original)
sum(test_data$actualProfit)/numchoose
## [1] 0.05930893