The Ames housing dataset (available here) was the basis for the Kaggle house prices competition. The object of the competition was to predict the sale price of a house based on a set of features such as the number of bedrooms, the neighbourhood within Ames, etc. It is worth looking into it with Tableau to do some initial exploratory data analysis.
As a fist step let us look at the distribution of sale prices (click to enlarge) in the Ames housing dataset.
Everything looks good here, with no missing values. A tail extends to high prices. Later on I will prefer the median over the average to limit the effect of the few very expensive houses. The typical house price in Ames is roughly 150k.
Lot area (the lot size in square feet) is a very important variable. Obviously, larger estates are worth more. Lot area varies wildly over the Ames housing dataset, by more than a factor 100. Therefore I chose to use the logarithm of lot area (in base 10), and I divided it into eight bins. The median price of houses (central line in the box-plots, between the two different shades of grey) seems to increase roughly linearly with log lot area.
The number of bedrooms (above ground) is also another obvious feature. It does drive the price up even within bins in lot area, though its effect is clearer in the intermediate-sized lots. Here I decided to group together the houses with four or more bedrooms and those with one or less (there are few houses with zero bedrooms: for now I don’t want to look into why). This way the number of bins is manageable.
Before moving on to other variables, I was curious about how the number of bathrooms interacts with the number of bedrooms to affect price.
This is a concise way to show the effect of two ordinal or categoric variables. The number in the table is the median price of houses with the given number of bathrooms and bedrooms, and the color shade also becomes darker with higher median prices. So we see that even for a given number of bedrooms, e.g. three, the number of baths commands almost a factor two price increase.
I defined the age of the building as the year of sale minus the year of construction (or last remodelling). New houses have a price premium that declines as they age even by one or two years. After a while the effect of age plateaus off, only to come back for very old houses.
The neighbourhood where the house is located most likely affects the sale price. But there are so many neighbourhoods in the Ames housing dataset that displaying directly the price as a function of neighbourhood may lead to a cluttered plot. So I first looked at how many sales took place by neighbourhood (the color codes instead the median sale price, the darker the higher). It’s clear that most sales are concentrated in few neighbourhoods, so it’s worth regrouping all the neighbourhoods with comparatively few sales together. I wouldn’t like to lose the information on NridgHt though: it looks like the most lucrative sales took place there! So let’s group neighbourhoods from Sawyer down.
Here we can see how the neighbourhood affects the sale price even after controlling for the number of bedrooms in the house. Some neighbours are more expensive not just because the houses are larger there, but for some other reason (e.g. distance from the center, crime, age of the buildings, etc.).
Two categoric variables in the Ames housing dataset score the overall condition and overall quality of the houses from one to ten (I assume; even though the condition is never ten in the dataset). It turns out they are great predictors of sale price, with higher quality and/or better condition commanding higher prices (darker shades of blue in the table). But people are ready to pay for a high quality estate even if it is in poor condition (median sale price is highest in the bottom left corner of the table).
I don’t see a strong trend of price with the month of sale. I am looking into this because the benchmark model of the competition uses month of sale (alongside year of sale) as a predictor in a simple linear regression. If no trend emerges with the month of sale, what about seasons? I grouped months together (March, April and May into Spring, and so on) and I did find some interesting interactions:
Heating quality scores the heating system condition from Excellent (Ex) to Poor (Po). Good (Gd) and Fair (Fa) are in-between. There was only one sale with a poor heating system (in winter, by the way!), so I filtered it out. Median sale prices seem to drop with heating system quality, as this is probably an indicator of the overall quality of the structure.
However the pattern is not the same in spring/summer/fall as it is in winter! In summer the spread between the median price of Excellent heating and Fair heating is about 50k. In winter it goes up to about 150k! Unsurprisingly people value heating quality more when it’s cold, and don’t plan long term.