King County Home Price Analysis
I recently completed an analysis aimed at building a linear regression model that identifies, as accurately as possible, which home improvement projects raise housing prices. The results can show homeowners who are interested in selling which improvements are likely to add the most value to their home, and thus increase its sale price.
Business Understanding
In this project I used the King County Housing Data Set, which contains information about the size, location, condition, and other features of houses in King County. The model and its evaluation focus on single family homes and on some of the many factors that contribute to real estate sale prices.
Data Preparation
I merged and cleaned the provided datasets, keeping the variables I concluded would be good features for the model. I dropped a number of columns: some contained outliers, while others added no value (such as an 'Address' column, which contained only the house number and street name). To narrow the data down to single family homes, I filtered it to residential properties under 4,500 sqft that were sold from 2018 through 2020.
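As a rough illustration of this filtering step, here is a minimal pandas sketch. The column names (sale_price, sqft_living, sale_date, address) and the sample rows are made up for the example; they are not the actual schema of the King County files.

```python
import pandas as pd

# Hypothetical columns and rows, used only to illustrate the filtering logic.
sales = pd.DataFrame({
    "sale_price": [450_000, 1_200_000, 380_000, 610_000],
    "sqft_living": [1800, 5200, 1400, 2600],
    "sale_date": pd.to_datetime(
        ["2018-03-14", "2019-07-01", "2017-11-30", "2020-09-22"]
    ),
    "address": ["12 Oak St", "9 Lake Rd", "77 Pine Ave", "3 Hill Ct"],
})

# Drop a column that carries no predictive value for the model.
sales = sales.drop(columns=["address"])

# Keep homes under 4,500 sqft that sold in 2018, 2019, or 2020.
in_size_range = sales["sqft_living"] < 4500
in_date_range = sales["sale_date"].dt.year.between(2018, 2020)
filtered = sales[in_size_range & in_date_range].reset_index(drop=True)

print(filtered)
```

Building the two conditions as separate boolean masks keeps each rule readable and makes it easy to add or loosen a filter later.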
Modeling
The features of my final model include total square footage of living space, bathroom counts, deck square footage, building grade and township.
Summary of Model:
Target: Sales Price
After filtering the data to be relevant to first-time home buyers (narrowing down the size and price range), I visualized the distribution of sale prices in King County over the last 3 years and checked homoscedasticity and normality (1):
And the results:
The features I chose for my final model were:
- Total living square footage
- Bathroom count
- Enclosed porch square footage
- Deck square footage
- Building grades
- Townships
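To give a sense of what fitting a model on these features might look like, here is a minimal sketch using the statsmodels formula interface. This is not my actual notebook code: the data is randomly generated, the column names are hypothetical, and treating grade and township as categorical variables is an assumption made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; every column name here is hypothetical.
homes = pd.DataFrame({
    "sqft_living": rng.uniform(800, 4500, n),
    "bathrooms": rng.integers(1, 5, n),
    "porch_sqft": rng.uniform(0, 200, n),
    "deck_sqft": rng.uniform(0, 400, n),
    "grade": rng.integers(5, 11, n),
    "township": rng.choice(["A", "B", "C"], n),
})
homes["sale_price"] = (
    150 * homes["sqft_living"]
    + 20_000 * homes["bathrooms"]
    + 40_000 * homes["grade"]
    + rng.normal(0, 50_000, n)
)

# C(...) asks the formula interface to dummy-encode a categorical feature.
model = smf.ols(
    "sale_price ~ sqft_living + bathrooms + porch_sqft + deck_sqft"
    " + C(grade) + C(township)",
    data=homes,
).fit()

print(model.rsquared)
print(model.params["sqft_living"])
```

Calling model.summary() on a fit like this prints the coefficient table, including the R-squared value and a p-value for each feature.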
Methods
Modeling
I can make the following determinations based on this list:
- The square footage of decks is slightly correlated with sale price
- The square footage of a basement (finished or not) has some relevance
- The number of bathrooms has more relevance than the number of bedrooms
- The township is important (location, location, location!)
- The quality of the home (the grade) is most important.
Assumptions
My final model produced an R-squared score (4) of 0.685. While this is somewhat high, the model only mildly met the homoscedasticity assumption and did not meet the normality assumption, as can be seen from the following visualizations:
Validation
The final piece in evaluating the quality of the model is validation on held-out data, which gives an idea of how the model would perform with new data for the same variables. Using sklearn's train_test_split function, I split the data into two subsets: one that the model is trained on, and another that it is tested on. By default, the function takes 75% of the data as the training subset and the other 25% as the test subset.
I created train and test data for the x and y variables, used the x subsets to predict new y values, then calculated the distance between these and the actual y-values. Lastly, I used the mean_squared_error function to calculate the MSE (mean squared error) (5) for both subsets.
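The split-and-score steps above can be sketched as follows. The data is synthetic, with a single made-up feature standing in for living area, and the random_state values are only there to make the example repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: one feature (living area) and a noisy price.
X = rng.uniform(800, 4500, size=(400, 1))
y = 150 * X[:, 0] + rng.normal(0, 50_000, size=400)

# With no test_size given, 75% goes to training and 25% to testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training subset only.
model = LinearRegression().fit(X_train, y_train)

# Compare predictions with actual values on both subsets.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3e}, test MSE: {test_mse:.3e}")
```

If the test MSE is much larger than the train MSE, the model is likely overfitting; similar values suggest it generalizes reasonably well to unseen data.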
Finally, after training the model, I determined the final p-value using het_goldfeldquandt (6), the Goldfeld-Quandt test for homoscedasticity.
Evaluation
The model does have some limitations: given that some of the variables needed to be log and sqrt-transformed to satisfy regression assumptions, any new data used with this model would need to undergo similar processing. In addition, considering regional differences in housing prices, this model's applicability to data from other counties may be limited. And lastly, since outliers were removed, the model may not accurately predict extreme values.
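The first limitation can be illustrated with a short numpy sketch. Which variables were log-transformed and which were sqrt-transformed is not spelled out here, so the choice below (log for price, sqrt for deck square footage) is only an assumption for the example.

```python
import numpy as np

# Made-up values standing in for new, unseen data.
prices = np.array([350_000.0, 480_000.0, 725_000.0])
deck_sqft = np.array([0.0, 144.0, 400.0])

# New data must receive the same transforms the training data received.
log_prices = np.log(prices)
sqrt_deck = np.sqrt(deck_sqft)

# A model trained on log-prices predicts log-prices, so predictions
# have to be exponentiated to get back to dollars.
recovered = np.exp(log_prices)

print(sqrt_deck)
print(recovered)
```

Keeping the transform and its inverse side by side like this helps avoid reporting predictions on the log scale by mistake.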
Analysis Conclusions and Recommendations:
Conclusions
- Total living square footage is a major factor in the price of a home. The larger the home, the pricier. On this note, should a homeowner wish to increase the price of their home before selling by adding to its living space, converting the garage to an additional bedroom is a highly profitable choice.
- The grade of a house has the greatest effect on its value. A surefire way to instantly add value to a home is to upgrade the construction using the highest quality materials.
- Location, location, and again, location. The location of a home, be it determined by zip code or township, is a great predictor of the price of a home. If a well-built, high-quality home is constructed in a bad area, the homeowner will most likely have a difficult time selling or will have to compromise on its sale price. On the flip side, a run-down home in a highly sought-after area usually doesn't lose too much value, since many people are willing to make the purchase with the intent of fixing it up.
- The number of bathrooms factors into the high sale price of the house. The more bathrooms, the more valuable.
- Decks add to the value of a home. Home owners appreciate having the ability to enjoy the outdoors while maintaining a slight sense of seclusion, thus making these homes desirable and more expensive.
Recommendations
New home buyers should consider the location of the home first, then size (square footage), and then grade when deciding on a purchase budget. I would also recommend that first-time home buyers consider purchasing a home that is smaller or of lower quality than they ultimately want, and then later upgrading and expanding its living space to increase value.
For DIY and fix-it-up home owners who are looking to get the most value out of one of the largest investments they've made, my recommendations are to:
- Add an extra bedroom (perhaps by converting the garage),
- Add an additional bathroom,
- And to add or update outdoor living space.
To view the full GitHub repository, click here.
References
(1) Homoscedasticity: The residuals have constant variance at every level of x.
Normality: The residuals of the model are normally distributed. Source: The Four Assumptions of Linear Regression - Statology
(2) Log Transformation: Source: How to use Square Root, log, & Box-Cox Transformation in Python (marsja.se)
(3) Sqrt Transformation: Source: Square Root Transformation: A Beginner’s Guide – Quantifying Health
(4) R-squared: A statistical measure of the goodness of fit of a regression model. The ideal value is 1; the closer R-squared is to 1, the better the model fits. Source: ML | R-squared in Regression Analysis - GeeksforGeeks
(5) MSE (Mean Squared Error): Source: sklearn.metrics.mean_squared_error — scikit-learn 0.24.2 documentation, Mean Squared Error: Definition, Applications and Examples (mygreatlearning.com)