Using Linear Regression to Predict The Best Way To Increase the Price of Your Home

During my time at Flatiron School, I performed an analysis with the goal of building a linear regression model that predicts, as accurately as possible, which home improvement projects can increase the price of a home. The results can help homeowners interested in selling decide which improvements are the best candidates for adding value to their home!

In my analysis I used data from King County, Washington, which included information about the size, location, condition, and other features of homes in the area. My model and evaluation focus on single-family homes and on some of the many factors that contribute to real estate sale prices.

Modeling

The features of my final model include total living square footage, bathroom count, enclosed porch and deck square footage, building grade, and township.

After filtering the data to be relevant to first-time home buyers (narrowed down by size and price range), I plotted the distribution of sale prices in King County over the last three years, keeping the homoscedasticity and normality assumptions (1) in mind for modeling:
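A minimal sketch of this step, assuming the data lives in a pandas DataFrame loaded from a file like kc_house_data.csv with a saleprice column (both names are illustrative, not the project's actual identifiers):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Hypothetical file and column names
    df = pd.read_csv('kc_house_data.csv')

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(df['saleprice'], bins=50)
    ax.set_xlabel('Sale price ($)')
    ax.set_ylabel('Number of homes')
    ax.set_title('Distribution of King County sale prices')
    plt.show()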


And the results:

[Figure: distribution of King County sale prices]

The features I chose for my final model were:

  • Total living square footage
  • Bathroom count
  • Enclosed porch square footage
  • Deck square footage
  • Building grades
  • Townships

Methods

Some of the methods used included a log transformation (2), since the data was initially right-skewed, followed by a square-root transformation (3) to further balance it and prepare for modeling.
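A minimal sketch of these transformations, continuing with the hypothetical DataFrame from above (column names are assumptions):

    import numpy as np

    # Log-transform the right-skewed sale price so its distribution
    # is closer to normal
    df['log_saleprice'] = np.log(df['saleprice'])

    # Square-root-transform the living-area predictor to reduce skew
    df['sqrt_sqfttotliving'] = np.sqrt(df['sqfttotliving'])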

My initial model was saleprice ~ sqfttotliving, defined as follows:
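A minimal sketch using statsmodels' formula API; since the coefficient discussed below is named sqrt_sqfttotliving, I assume the fitted model used the transformed columns from the previous step:

    import statsmodels.formula.api as smf

    # One-predictor baseline: transformed sale price regressed on
    # transformed total living square footage
    model = smf.ols('log_saleprice ~ sqrt_sqfttotliving', data=df).fit()
    print(model.summary())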



In the resulting summary, the p-value for the sqrt_sqfttotliving coefficient was effectively zero, which is strong evidence that total living square footage does have an impact on sale price. With that established, I could confidently choose other candidate features based on their correlation with saleprice.
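Such a list can be produced by ranking every numeric feature by its correlation with the target, for example:

    # Correlation of each numeric column with sale price, strongest first
    correlations = df.corr(numeric_only=True)['saleprice'].sort_values(ascending=False)
    print(correlations)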


I can make the following determinations based on this list:

  • The square footage of decks is slightly correlated with sale price
  • The square footage of a basement (finished or not) has some relevance
  • The number of bathrooms has more relevance than the number of bedrooms
  • The township is important (location, location, location!)
  • The quality of the home (the grade) is most important.


Assumptions

My final model produced an R-squared score (4) of 0.685. While this is reasonably high, the model only mildly met the homoscedasticity assumption and did not meet the normality assumption (1), as can be seen from the following visualizations:

[Figures: homoscedasticity and normality checks]
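A sketch of how these two checks can be visualized for a fitted statsmodels model (variable names follow the earlier snippets):

    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Homoscedasticity: residuals vs. fitted values should show a
    # constant spread around zero
    ax1.scatter(model.fittedvalues, model.resid, alpha=0.3)
    ax1.axhline(0, color='red')
    ax1.set_xlabel('Fitted values')
    ax1.set_ylabel('Residuals')
    ax1.set_title('Homoscedasticity check')

    # Normality: residuals should hug the 45-degree line on a Q-Q plot
    sm.qqplot(model.resid, line='45', fit=True, ax=ax2)
    ax2.set_title('Normality check (Q-Q plot)')
    plt.show()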

Validation

The final piece in evaluating the quality of the model is cross-validation, which provides an idea of how the model would perform on new data for the same variables. Using sklearn's train_test_split function, I split the data into two subsets: one to train the model on and another to test it on. By default, the function uses 75% of the data for training and the remaining 25% for testing.

I created train and test datasets for the X and y variables, used the X subsets to predict new y values, and then calculated the distance between these predictions and the actual y values. I then used the mean_squared_error function to calculate the MSE (mean squared error) (5) for both subsets.
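A minimal sketch of this validation step; the feature list here is an assumption standing in for the full set used in the final model:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Hypothetical feature subset; the real model used all final features
    X = df[['sqrt_sqfttotliving', 'bathrooms']]
    y = df['log_saleprice']

    # Default split: 75% train / 25% test
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    linreg = LinearRegression().fit(X_train, y_train)

    for name, X_sub, y_sub in [('train', X_train, y_train),
                               ('test', X_test, y_test)]:
        preds = linreg.predict(X_sub)
        print(name, 'MSE:', mean_squared_error(y_sub, preds),
              'R-squared:', r2_score(y_sub, preds))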


The MSE and R-squared values for the train and test subsets are similar, which suggests that the model will perform similarly on new data.

Finally, after training the model, I checked the residuals for heteroscedasticity with statsmodels' Goldfeld-Quandt test (het_goldfeldquandt) to determine the final p-value.
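A sketch of how the test can be run on the fitted statsmodels model from earlier; het_goldfeldquandt returns the F statistic, the p-value, and the ordering used:

    import statsmodels.stats.api as sms

    # Null hypothesis: the residual variance is constant
    f_stat, p_value, _ = sms.het_goldfeldquandt(model.resid,
                                                model.model.exog)
    print('F statistic:', f_stat, 'p-value:', p_value)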



The model does have some limitations: because some of the variables needed to be log- and square-root-transformed to satisfy the regression assumptions, any new data used with this model would need to undergo the same preprocessing. In addition, given regional differences in housing prices, the model's applicability to data from other counties may be limited. Lastly, since outliers were removed, the model may not accurately predict extreme values.

Analysis Conclusions

After completing the analysis, I've concluded the following:

  1. Total living square footage is a major factor in the price of a home: the larger the home, the pricier. On this note, if homeowners want to increase the price of their home before selling by adding living space, converting the garage into an additional bedroom is a highly profitable choice.
  2. The grade of a house has the greatest effect on its value. A surefire way to instantly add value to a home is to upgrade the construction using the highest quality materials.
  3. Location, location, and again, location! The location of a home, whether determined by zip code or township, is a great predictor of its price. If a well-built, high-quality home is constructed in a bad area, the homeowner will most likely have a difficult time selling it or will have to compromise on the sale price. On the flip side, a run-down home in a highly sought-after area usually doesn't lose too much value, since many people are willing to buy it with the intent of fixing it up.
  4. The number of bathrooms factors into the high sale price of the house. The more bathrooms, the more valuable.
  5. Decks add to the value of a home. Homeowners appreciate being able to enjoy the outdoors while maintaining a slight sense of seclusion, which makes these homes desirable and more expensive.

To view the full GitHub repository, click here.

References

(1) Homoscedasticity: the residuals have constant variance at every level of x. Normality: the residuals of the model are normally distributed. Source: The Four Assumptions of Linear Regression - Statology

(2) Log transformation. Source: How to use Square Root, log, & Box-Cox Transformation in Python (marsja.se)

(3) Square-root transformation. Source: Square Root Transformation: A Beginner's Guide - Quantifying Health

(4) R-squared: a statistical measure that represents the goodness of fit of a regression model. The ideal value is 1: the closer R-squared is to 1, the better the model fits the data. Source: ML | R-squared in Regression Analysis - GeeksforGeeks

(5) MSE (mean squared error). Sources: sklearn.metrics.mean_squared_error — scikit-learn documentation; Mean Squared Error: Definition, Applications and Examples (mygreatlearning.com)
