Using Linear Regression to Predict The Best Way To Increase the Price of Your Home
In my analysis I used data from King County, Washington, which included information about the size, location, condition, and other features of homes in the area. My model and evaluation focus on single-family homes and on some of the many factors that contribute to real estate sale prices.
Modeling
The features of my final model include total square footage of living space, bathroom count, deck square footage, building grade, and township.
After filtering the data to be relevant to first-time home buyers (narrowed down by size and price range), I plotted the distribution of sales prices in King County over the last three years and checked it for homoscedasticity and normality (1):
The features I chose for my final model were:
- Total living square footage
- Bathroom count
- Enclosed porch square footage
- Deck square footage
- Building grades
- Townships
Methods
I can make the following determinations based on this list:
- The square footage of a deck is slightly correlated with sale price
- The square footage of a basement (finished or not) has some relevance
- The number of bathrooms has more relevance than the number of bedrooms
- The township is important (location, location, location!)
- The quality of the home (the grade) is most important.
Assumptions
My final model produced an R-squared score (4) of 0.685. While this is reasonably high, the model only mildly met the homoscedasticity assumption and did not meet the normality assumption (1), as can be seen from the following visualizations:

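Diagnostics of this kind (a residuals-vs-fitted scatter for homoscedasticity and a Q-Q plot for normality) can be sketched as follows; the data here is synthetic, not the author's fitted model.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import scipy.stats as stats

# Synthetic regression standing in for the home-price model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.4, size=500)

# Ordinary least squares fit with an intercept column.
design = np.column_stack([np.ones(500), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, resid, s=8)  # homoscedasticity: look for an even spread
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
stats.probplot(resid, plot=ax2)  # normality: points should hug the line
fig.savefig("diagnostics.png")
```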
Validation
The final piece in evaluating the quality of the model is cross-validation to provide an idea of how the model would perform with new data for the same variables. Using sklearn's train_test_split function I split the data into two subsets: one that the model was to be trained on, and another that it was tested on. By default, the function takes 75% of the data as the training subset and the other 25% as its test subset.
I created train and test datasets for the X and y variables, fit the model on the training subset, used it to predict y values for both subsets, and calculated the distance between these predictions and the actual y values. Then I used the mean_squared_error function to calculate the MSE (mean squared error) (5) for both subsets.
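The split-and-score workflow described above can be sketched like this, with synthetic data standing in for the King County features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: 5 columns mimic the model's features, y the price.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = X @ np.array([0.5, 0.2, 0.1, 0.3, 0.4]) + rng.normal(scale=0.1, size=1_000)

# By default, 75% of the rows become the training set, 25% the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

Comparing the two MSE values is the point of the exercise: a test MSE far above the training MSE signals overfitting.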
Finally, after training the model, I determined the final p-value using statsmodels' het_goldfeldquandt test.
The model does have some limitations: given that some of the variables needed to be log- (2) and sqrt-transformed (3) to satisfy regression assumptions, any new data used with this model would need to undergo the same processing. In addition, considering regional differences in housing prices, this model's applicability to data from other counties may be limited. And lastly, since outliers were removed, the model may not accurately predict extreme values.
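Concretely, a feature that was log- or sqrt-transformed during training must pass through the same transform for any new observation. The values below are purely hypothetical, including the log-scale prediction mapped back to dollars with an exp:

```python
import numpy as np

# Hypothetical new listing; values are illustrative only.
sqft_living = 1_800.0
deck_sqft = 120.0

# Apply the same transforms that were used at training time.
log_sqft = np.log(sqft_living)
sqrt_deck = np.sqrt(deck_sqft)

# If the target was log-transformed, invert predictions with exp.
log_price_pred = 13.0  # hypothetical model output on the log scale
price_pred = np.exp(log_price_pred)
```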
Analysis Conclusions
- Total living square footage is a major factor in the price of a home. The larger the home, the pricier. On this note, if a homeowner wanted to increase the price of their home before selling by increasing the living space, converting the garage to an additional bedroom is a highly profitable choice.
- The grade of a house has the greatest effect on its value. A surefire way to instantly add value to a home is to upgrade the construction using the highest quality materials.
- Location, location, and again, location! The location of a home, whether determined by zip code or township, is a great predictor of its price. If a well-built, high-quality home is constructed in a bad area, the homeowner will most likely have a difficult time selling it or will have to compromise on its sale price. On the flip side, a run-down home in a highly sought-after area usually doesn't lose too much value, since many people are willing to buy it with the intent of fixing it up.
- The number of bathrooms factors into the sale price of the house. The more bathrooms, the more valuable.
- Decks add to the value of a home. Homeowners appreciate having the ability to enjoy the outdoors while maintaining a slight sense of seclusion, thus making these homes desirable and more expensive.

To view the full GitHub repository, click here.
References
(1) Homoscedasticity: the residuals have constant variance at every level of x. Normality: the residuals of the model are normally distributed. Source: The Four Assumptions of Linear Regression - Statology
(2) Log Transformation: Source: How to use Square Root, log, & Box-Cox Transformation in Python (marsja.se)
(3) Sqrt Transformation: Source: Square Root Transformation: A Beginner’s Guide – Quantifying Health
(4) R-squared: a statistical measure that represents the goodness of fit of a regression model. The ideal value is 1; the closer R-squared gets to 1, the better the model fits. Source: ML | R-squared in Regression Analysis - GeeksforGeeks
(5) MSE (Mean Squared Error): Source: sklearn.metrics.mean_squared_error — scikit-learn 0.24.2 documentation, Mean Squared Error: Definition, Applications and Examples (mygreatlearning.com)









