EXPLORATORY DATA ANALYSIS (EDA)
Data Description:
The dataset provides enough historical values of the target and explanatory variables for further analysis. It also contains enough observations to support partitioning the data into training and test sets, which allows us to answer our business questions.
The average sale price is the dependent variable in our study. Descriptions of the dataset variables are provided in the Manufactured Housing Survey Public Use File Documentation; a snippet of these descriptions is also attached in the appendix.
Most of the variables in the dataset are encoded as integers; further exploratory analysis shows that many of them are in fact categorical variables.
The screenshot shows the variable information for the 2017 dataset; it serves as an example, since the same procedures were applied to the other years’ data. The variables that are encoded as numbers but are actually categorical are status, region, titled, sections, bedrooms, location, foundation, and secured. The motivation for changing these variables’ data types was the information in the variable description documentation. Each integer is a code that stands for a category; for instance, region takes the values 1, 2, 3, 4, and 5, representing Northeast, Midwest, South, West, and United States, respectively, and the other variables listed above are coded in the same way. Analyzing these variables as integers would give misleading information, especially with the describe function, which returns numerical aggregates (such as the mean) that are meaningless for category codes. We therefore changed these variables’ data types to object so that they are treated as categorical variables; this gives the correct information about the variables and helps in building the model.
A further call of the .info() function on the 2017 dataset confirms that the variables’ data types have been changed to object as intended.
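The type conversion described above can be sketched as follows. This is a minimal illustration on made-up rows, not the original notebook code; the frame name `df_2017`, the sample values, and the extra `region_label` column are assumptions introduced for the example, while the region codes follow the mapping given in the text.

```python
import pandas as pd

# Toy stand-in for the 2017 file; the values are invented for illustration.
df_2017 = pd.DataFrame({
    "region": [1, 2, 3, 4, 3],
    "status": [1, 1, 2, 1, 1],
    "price": [65000, 82000, 71000, 98000, 77000],
})

# Region codes as described in the text.
REGION_LABELS = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West", 5: "United States"}

# Columns that hold category codes rather than quantities (subset for brevity).
categorical_cols = ["region", "status"]
df_2017 = df_2017.astype({col: "object" for col in categorical_cols})

# Optional, hypothetical helper column: readable labels for plots and tables.
df_2017["region_label"] = df_2017["region"].map(REGION_LABELS)

# The converted columns now report dtype object, while price stays numeric.
print(df_2017.dtypes)
```

After the conversion, describe() summarizes these columns with counts and the most frequent category instead of a mean of the codes.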
The j-variables, as noted earlier, are dropped as we work on the dataset. They only indicate whether the values of other variables are reported, imputed, or non-applicable.
The column names in the 2018 – 2021 datasets are in upper case, whereas those in the 2017 dataset are in lower case. To merge the data from these years into one dataset, the column names must follow one convention. We chose lower case, so the column names of the 2018 – 2021 datasets were converted to lower case, which simplifies merging the datasets.
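A minimal sketch of harmonizing the column names, dropping the j-variables, and stacking the yearly files is given here. The toy frames, the assumption that every j-variable name starts with the letter j, and the added `year` column are illustrative choices, not details confirmed by the survey files.

```python
import pandas as pd

# Toy yearly frames: 2017 uses lower-case names, later years upper-case names.
df_2017 = pd.DataFrame({"control": [1, 2], "price": [65000, 82000], "jprice": [0, 1]})
df_2018 = pd.DataFrame({"CONTROL": [3, 4], "PRICE": [70000, 91000], "JPRICE": [0, 0]})

frames = {2017: df_2017, 2018: df_2018}
cleaned = []
for year, frame in frames.items():
    # Put every column name in lower case so the frames line up.
    frame = frame.rename(columns=str.lower)
    # Assumption: the j-variables are exactly the columns whose name starts with "j".
    j_cols = [col for col in frame.columns if col.startswith("j")]
    # Hypothetical extra column so the source year survives the merge.
    frame = frame.drop(columns=j_cols).assign(year=year)
    cleaned.append(frame)

merged = pd.concat(cleaned, ignore_index=True)
print(merged)
```

Because the names match after lower-casing, the concatenation aligns the columns without introducing empty cells.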
Missing Data:
If not carefully handled, missing values can reduce the quality of the analysis. A check for missing values shows that the datasets have no null values in any of the years.
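One way to run such a check with pandas is sketched below. The toy frame deliberately contains one missing price so that the output is not trivially zero; on the survey files the same calls would be expected to report 0 for every column.

```python
import numpy as np
import pandas as pd

# Toy frame with one planted missing value (illustrative data only).
df = pd.DataFrame({
    "price": [65000.0, np.nan, 71000.0],
    "region": [1, 3, 3],
})

# Number of null entries in each column, and in the whole frame.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```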
Duplicate data:
Duplicated values, if not carefully handled, can also reduce the quality of the analysis. A check for duplicates shows that the datasets have no duplicated rows in any of the years. Using the .duplicated() function provided by the pandas library, we can count the number of duplicated rows in the dataset; the count is 0, indicating that our data is free of duplicates. The .nunique() function can be used to support this finding; in its output, the count of unique values for each variable is as expected, confirming that there is no repeated data.
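The duplicate check can be sketched as follows. The toy frame contains one planted duplicate row so the count is visible; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame in which the last row repeats the third one.
df = pd.DataFrame({
    "control": [101, 102, 103, 103],
    "price": [65000, 82000, 71000, 71000],
})

# duplicated() flags each row that repeats an earlier row; summing counts them.
n_duplicates = int(df.duplicated().sum())

# nunique() gives the number of distinct values per column.
unique_counts = df.nunique()

print("duplicated rows:", n_duplicates)
print(unique_counts)
```

If an identifier column has as many unique values as there are rows, no record appears twice.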
Outliers:
Outliers are observations in the dataset whose values are abnormal compared to the values expected for a variable. Like missing and duplicated values, outliers can diminish the overall quality of our analysis. Outliers must also be treated thoughtfully, guided by domain knowledge of the dataset, so as not to drop relevant data that is useful in the analysis. We found that the reasons behind most of these “extreme” observations were essential to consider as possible variances in our analysis; therefore, we dropped most of them. The values marked as outliers in our analysis were those that were imputed or that had a value of 9, which represents non-applicable cases.
A closer look at the price and square feet columns reveals rows with a value of 9, which is implausible compared to the other values in these columns. According to the description of these variables, 9 represents non-applicable cases or values withheld for disclosure purposes; because the dataset contains personal information, some data was withheld from the public. This value carries no information about the actual price and would lead to wrong predictions, as price is our target variable. A value of 9 in the squarefeet variable is likewise not informative. We therefore treat these values as outliers and dropped the affected rows, as shown in the code snippet below.
We find that the rows with outliers in the price column were the same as those with outliers in the square feet column: once we dropped the outlier rows in the price column, the square feet column was free of outliers. This removed 4876 rows from the 2017 dataset.
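Dropping the rows that carry the placeholder value 9 in the price or square feet columns can be sketched as follows. The column names `price` and `sqft` and the sample rows are assumptions made for the example; the sentinel value 9 is taken from the text.

```python
import pandas as pd

SENTINEL = 9  # code for non-applicable or withheld values, per the text

# Toy frame; in rows 1 and 3 both columns carry the sentinel.
df = pd.DataFrame({
    "price": [65000, 9, 71000, 9],
    "sqft": [1200, 9, 1500, 9],
})

# Flag rows where either column holds the sentinel, then keep the rest.
is_sentinel = (df["price"] == SENTINEL) | (df["sqft"] == SENTINEL)
rows_dropped = int(is_sentinel.sum())
clean = df.loc[~is_sentinel].reset_index(drop=True)

print("rows dropped:", rows_dropped)
print(clean)
```

Counting the flagged rows before filtering gives the number of removed rows directly, which is how a figure such as the 4876 rows reported for 2017 can be obtained.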
We continue to check for outliers in the other columns and find the same kind of outliers in the titled, bedrooms, foundation, and location columns. These outliers occur in the same rows across these columns. We drop them all, which removes another 1612 rows. We removed these rows because the titled column is supposed to have the values 1, 2, and 3; bedrooms the values 1 and 3; foundation the values 1, 2, and 3; and location the values 1 and 3. The value 9 in these columns represents non-applicable cases or values withheld for disclosure purposes, and it would affect the quality of the analysis since it does not correspond to any valid category. The rows are dropped as shown in the code snippets below.
The secured variable also has 809 outliers, which are dropped because they are not representative of the values expected in the column. The rows are dropped, as shown below.
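For the coded columns, the same idea can be applied to several columns at once. The sketch below uses invented rows and assumes that 9 is the only invalid code in these columns; it is an illustration of the approach rather than the original code.

```python
import pandas as pd

SENTINEL = 9
code_cols = ["titled", "bedrooms", "foundation", "location", "secured"]

# Toy frame: row 2 has the sentinel in four columns, row 3 only in secured.
df = pd.DataFrame({
    "titled": [1, 2, 9, 1, 3],
    "bedrooms": [3, 1, 9, 3, 3],
    "foundation": [1, 2, 9, 3, 1],
    "location": [1, 3, 9, 1, 3],
    "secured": [1, 1, 1, 9, 2],
})

# How many sentinel values each column contains.
sentinel_counts = df[code_cols].eq(SENTINEL).sum()

# Drop every row that has the sentinel in at least one coded column.
has_sentinel = df[code_cols].eq(SENTINEL).any(axis=1)
clean = df.loc[~has_sentinel].reset_index(drop=True)

print(sentinel_counts)
print("rows kept:", len(clean))
```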
The datasets for the following years, 2018 – 2021, show the same data inconsistencies in the same columns, so we follow the same procedure to drop the outliers.
Visualizations:
It is said that a picture is worth a thousand words. Visualizations help us explain concepts at a glance. We carried out further exploratory data analysis using visualizations of our variables to understand them better and to inform the model building.
We find that 90% of our data is labeled as Placed/Sold/leased for residential use, while the remaining 10% is labeled as Intended for Sale/lease for residential use; this comes from the status variable, which represents the status of the home four months after shipment. The region variable shows that most of the dealers are in the South region, while the West and Midwest regions have almost the same number of dealers. There are few dealers in the Northeast region, and the fewest are in the region labeled ‘5 – United States’.
Most houses in the dataset are titled as personal property, and a smaller percentage are titled as real estate. The code 3 represents not titled, and only a few houses in the dataset fall into this category. The distribution of titling is the same across regions: personal property is the most common, followed by real estate and, lastly, not titled.
Most of the homes are double-section homes. Several homes are single-section, and very few homes have more than 3 sections. The distribution of home size is the same when compared across locations.
The majority of the houses in our dataset have 3 or more bedrooms; only a few have 2 or fewer. Most of the houses have block foundations, some have masonry/concrete foundations, and the remainder have steel or other materials. We conclude that houses with block foundations are preferred.
There is an almost equal number of houses placed Inside manufacturing home communities and Outside manufacturing home communities; however, the number of houses placed outside these communities is slightly higher. Most of the houses are secured using tie-down straps and anchors. Very few houses are secured using other methods, as seen from the plots.
We feature-engineered a variable named ship_month from the existing variable, keeping only the month irrespective of the year; this helps us visualize the number of shipments expected in each month overall. From October, the number of shipments increases gradually until February of the following year, and these months are expected to generate the most sales. The months from March to September have fewer shipments, and correspondingly fewer sales are expected during these months.
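A possible way to derive ship_month is sketched below. It assumes that the existing variable is an integer of the form YYYYMM stored in a column called `shipmonth`; both the column name and the format are assumptions, and the rows are invented.

```python
import pandas as pd

# Toy shipment dates, assumed to be encoded as YYYYMM integers.
df = pd.DataFrame({"shipmonth": [201701, 201712, 201802, 201807, 201912]})

# The last two digits are the month, so the remainder after dividing by 100 keeps it.
df["ship_month"] = df["shipmonth"] % 100

# Number of shipments per calendar month, pooled over all years.
shipments_per_month = df["ship_month"].value_counts().sort_index()

print(shipments_per_month)
```

Pooling over years in this way is what allows the seasonal pattern to be read from a single bar chart.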
Correlation:
Correlation helps us understand the relationships among the variables and, most importantly, the relationship between each variable in our dataset and the target variable; this tool is essential in building models. Supported by visualizations such as scatter plots and other charts, as well as statistical tests, we can deduce the best variables for predicting our target variable. The correlation plot shows how the variables correlate with each other and how each variable correlates with the target variable.
The following is a list of the variables and their correlation with our target variable; the variables are ordered from the most strongly to the least strongly correlated, by absolute value:
- Square Feet – 0.60
- Sections – 0.59
- Region – 0.31
- Foundation – 0.30
- Control – 0.28
- Weight – (-0.27)
- Bedrooms – 0.17
- Secured – 0.15
- Titled – 0.10
- Status – 0.09
- Wgtadj – (-0.05)
- Location – 0.02
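A ranking like the one above can be produced with a few pandas calls. The sketch below uses a small invented frame, so its numbers do not match the list; the column names are assumptions.

```python
import pandas as pd

# Toy data: sqft is an exact multiple of price, weight tends to fall as price rises.
df = pd.DataFrame({
    "price": [50000, 62000, 71000, 90000, 98000, 120000],
    "sqft": [1000, 1240, 1420, 1800, 1960, 2400],
    "sections": [1, 1, 2, 2, 2, 3],
    "weight": [5, 3, 6, 2, 4, 1],
})

# Pearson correlation of every variable with the target.
corr_with_price = df.corr()["price"].drop("price")

# Order by strength of the relationship, ignoring the sign.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)

print(ranked)
```

Sorting by absolute value is what places a negatively correlated variable such as Weight (-0.27) between Control (0.28) and Bedrooms (0.17) in the list.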
CHAPTER 5: MODELLING
Our target variable is continuous. We tried fitting the data using logistic regression, KNN, a decision tree classifier, and linear regression. The best model was linear regression, which is built for modeling continuous targets. Logistic regression and the KNN and decision tree classifiers are used to model binary and multiclass targets, which are discrete rather than continuous. These models fail to fit continuous data properly and give low accuracy scores.
The dataset has categorical variables, so the data was first dummified (dummy-encoded) to turn the categorical variables into numerical values that the models can process. The data was also scaled so that all variables are on a comparable scale while preserving the information in the original data. The data was split into 75% for training and 25% for testing and was then fed into the models.
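The preparation steps can be sketched as follows. The text does not state which scaler was used, so standardization (zero mean, unit variance) is an assumption here, as are the toy data and the random seed. The sketch also fits the scaling on the training rows only, which is a common precaution and not necessarily the order used in the study.

```python
import numpy as np
import pandas as pd

# Toy data with one categorical and one numeric predictor.
df = pd.DataFrame({
    "region": ["South", "West", "South", "Midwest", "Northeast", "South", "West", "Midwest"],
    "sqft": [1000, 1240, 1420, 1800, 1960, 2400, 1100, 1650],
    "price": [50000, 62000, 71000, 90000, 98000, 120000, 56000, 83000],
})

# Dummy-encode the categorical column: one 0/1 column per region.
X = pd.get_dummies(df.drop(columns="price"), columns=["region"], dtype=float)
y = df["price"]

# 75% / 25% split on shuffled row positions.
rng = np.random.default_rng(42)
order = rng.permutation(len(df))
n_train = int(0.75 * len(df))
train_idx, test_idx = order[:n_train], order[n_train:]
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Standardize sqft with statistics computed on the training rows only.
mean, std = X_train["sqft"].mean(), X_train["sqft"].std(ddof=0)
X_train = X_train.assign(sqft=(X_train["sqft"] - mean) / std)
X_test = X_test.assign(sqft=(X_test["sqft"] - mean) / std)

print(X_train.head())
```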
We used mean squared error (MSE) as a metric to test the models. MSE is calculated by squaring each residual and taking the mean of the squares; the smaller the MSE, the better the model. The model score was also used to gauge the models’ accuracy. The model score (R-squared) is the proportion of the total variation in the sale price that is explained by the model; it is calculated as one minus the ratio of the residual sum of squares to the total sum of squares, where the total sum of squares is the sum of squared differences between the response and the mean of the response.
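To make the two metrics concrete, the short sketch below computes them by hand on four invented values. The function names are illustrative and are not taken from any library.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return sum(r * r for r in residuals) / len(residuals)


def r_squared(y_true, y_pred):
    """One minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true, y_pred)  # residuals 1, 0, -1, 0
r2 = r_squared(y_true, y_pred)            # ss_res = 2, ss_tot = 20

print(mse, r2)
```

Note that MSE is expressed in squared units of the target, so its size depends on whether the price was scaled before modeling.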
KNN & Decision Tree Classifier, Logistic Regression:
Before fitting our data to the KNN, logistic regression, and decision tree classifier models, we had to encode the target variable as multiclass so that these models would accept it, since they do not model continuous data; this greatly affected the results, and the models performed poorly on our test data.
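The text does not say how the target was encoded as multiclass. One common option, shown here purely as an assumption, is to cut the prices into equally populated bins with pandas; the prices and class labels are invented.

```python
import pandas as pd

# Toy prices, sorted for readability.
prices = pd.Series([52000, 61000, 70000, 83000, 95000, 118000])

# Three equally populated classes (terciles) replace the continuous values.
price_class = pd.qcut(prices, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"price": prices, "price_class": price_class}))
```

After binning, homes priced at 52000 and 61000 fall into the same class, so a classifier can no longer distinguish them; this loss of information is consistent with the weak results reported for the classifiers.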
We fitted a decision tree classifier and evaluated it on mean squared error and the model score. The MSE was very high and the score was low; evidently, the decision tree classifier predicted the target variable poorly.
For the KNN and logistic regression models, the MSE was also very high and the model score very low, as seen in the screenshots below.
When scored, the KNN and logistic regression models averaged 0, showing how poorly they predicted the average sale price; this is most likely because our target variable is continuous, and converting it to multiclass distorted the prediction.
Linear Regression:
We fitted and scored a linear regression model.
The model explains 100% of the variability in the sale price of manufactured houses. The model’s mean squared error is 0.07%, which is very small and approaches zero; this indicates that our model predicts the target variable very well. We therefore chose linear regression as our best model.
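Fitting and scoring a linear regression on held-out data can be sketched with ordinary least squares in NumPy. The data below are synthetic, generated from an assumed linear relationship plus noise, so the resulting numbers say nothing about the survey data; the sketch only shows the mechanics of fitting on 75% of the rows and scoring on the remaining 25%.

```python
import numpy as np

# Synthetic data: price depends linearly on sqft and sections, plus noise.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 2400, size=n)
sections = rng.integers(1, 4, size=n)
price = 20000 + 40 * sqft + 5000 * sections + rng.normal(0, 5000, size=n)

# Design matrix with an explicit intercept column.
X = np.column_stack([np.ones(n), sqft, sections])

# 75% of the rows for fitting, 25% held out for scoring.
split = int(0.75 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = price[:split], price[split:]

# Ordinary least squares fit on the training rows.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Score on the held-out rows.
pred = X_test @ coef
mse = float(np.mean((y_test - pred) ** 2))
ss_res = float(np.sum((y_test - pred) ** 2))
ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
r2 = 1 - ss_res / ss_tot

print("coefficients:", coef)
print("test MSE:", mse, "test R-squared:", r2)
```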
CHAPTER 6: CONCLUSION
The analysis answered the research questions:
- Does region and location influence the average sale price for new manufactured homes placed/sold or intended for sale? Region is one of the variables most strongly correlated with our target variable (0.31), indicating that it influences the average sale price of the homes. The visualization also confirmed that some regions are preferred over others, which would make the average sale price of homes higher in some regions and lower in others. Location has a low correlation with our target variable (0.02). Visualizations also confirm that this variable does not significantly influence the sale price, since both locations had almost equal numbers of houses.
- Does the home’s number of bedrooms and size influence the average sale price for newly manufactured homes placed/sold or intended for sale? The number of sections is strongly correlated with our target variable (0.59), and the number of bedrooms is also positively correlated with it (0.17). Visualization confirms that these variables influence the average sale price too; for instance, double-section homes are the most common, which suggests they are highly preferred, while homes with more than three sections are rare. The same is true for bedrooms: homes with 3 or more bedrooms are the most common, and homes with fewer bedrooms are less sought after.
- What is the estimated total monthly shipment at the national level for homes sold or for sale for residential use and homes sold or for sale for non-residential or other uses? The expected total shipment is approximately 2300 per month for the months between March and September. The expected shipment for the months between October and February varies between 2800 and 4000.
- What factors build the best model/what are the best predictors for the investigation of the average sales price, and do all the variables in the dataset determine the average sales price? When the model was built excluding some variables, the score was not as good as when all the variables were included, leading to the conclusion that each variable plays a crucial role in predicting the target variable. We included each variable in building the final model.
- Does Foundation, Footings, and Piers influence the average sale price for newly manufactured homes placed/sold or intended for sale? Footings and piers are variables included in the 2021 dataset only, so they were not considered in building the final model. A closer inspection of these variables and how they influence the average sale price in 2021 showed that they are weakly correlated with the target variable.
- What are the preferred manufactured homes based on the materials used for foundation, footings, and piers? From the analysis, the preferred homes were those with concrete piers, concrete footings, and block foundations.
- Is there growth in Manufactured house sale sector? We observed the average sale price over the years 2017 – 2021: 83102, 88585, 90718, 93834, and 116920, respectively. The average sale price has trended upward since 2017, which we take as confirmation of growth in the manufactured housing industry.
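The upward trend in the average sale price can be quantified as year-over-year percentage changes. The sketch below uses only the five averages reported above; the variable names are illustrative.

```python
# Average sale price per year, as reported in the conclusion.
avg_price = {2017: 83102, 2018: 88585, 2019: 90718, 2020: 93834, 2021: 116920}

years = sorted(avg_price)

# Percentage change of each year relative to the previous year.
yoy_growth = {
    year: (avg_price[year] - avg_price[prev]) / avg_price[prev] * 100
    for prev, year in zip(years, years[1:])
}

# Overall change from the first to the last year.
total_growth = (avg_price[years[-1]] - avg_price[years[0]]) / avg_price[years[0]] * 100

for year, pct in yoy_growth.items():
    print(year, f"{pct:.1f}%")
print("2017 to 2021:", f"{total_growth:.1f}%")
```

Every yearly change is positive, and the largest single increase falls between 2020 and 2021.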
APPENDIX:
REFERENCES:
Ash, K. D., Egnoto, M. J., Strader, S. M., Ashley, W. S., Roueche, D. B., Klockow-McClain,
K. E., … & Dickerson, M. (2020). Structural forces: Perception and vulnerability factors for tornado sheltering within mobile and manufactured housing in Alabama and Mississippi. Weather, Climate, and Society, 12(3), 453–472.
Becher, D. (2020). Manufactured Insecurity: Mobile Home Parks and Americans’ Tenuous
Right to Place.
Bewley, K. (2018). A Review of Oregon’s Manufactured Housing Policies.
Dawkins, C. J., & Koebel, C. T. (2009). Overcoming barriers to placing manufactured housing
in metropolitan communities. Journal of the American Planning Association, 76(1), 73–88.
Durst, N. J., & Sullivan, E. (2019). The contribution of manufactured housing to affordable
housing in the United States: Assessing variation among manufactured housing tenures and community types. Housing Policy Debate, 29(6), 880–898.
Genz, R. (2001). Why advocates need to rethink manufactured housing. Housing Policy Debate,
12(2), 393–414.
Kaul, K., & Pang, D. (2022). The Role of Manufactured Housing in Increasing the Supply of
Affordable Housing. The Urban Institute, July 2022-07.
Mandelker, D. R. (2016). Zoning barriers to manufactured housing. Urb. Law., 48, 233.
Mazelis, J. M. (2020). Book Review: Manufactured Insecurity: Mobile Home Parks and
Americans’ Tenuous Right to Place.
More, C. F. I. T. (2018). Challenges to Obtaining Manufactured Home Financing.
Ouoba, S. T. A. (2021). Fragility Functions of Manufactured Houses under Earthquake Loads.
Pierce, G., Gabbe, C. J., & Gonzalez, S. R. (2018). Improperly zoned, spatially marginalized,
and poorly served? An analysis of mobile home parks in Los Angeles County. Land Use Policy, 76, 178-185.
Razkenari, M. A., Fenner, A. E., Hakim, H., & Kibert, C. J. (2018). Training for Manufactured
Construction (TRAMCON)–Benefits and Challenges for Workforce Development at Manufactured Housing Industry. Modular and Offsite Construction (MOC) Summit Proceedings.
Senghore, O., Hastak, M., Abdelhamid, T. S., AbuHammad, A., & Syal, M. G. (2004).
The production process for manufactured housing. Journal of Construction Engineering and Management, 130(5), 708-718.
Strader, S. M., Roueche, D. B., & Davis, B. M. (2021). Unpacking Tornado Disasters:
Illustrating Southeastern US Tornado Mobile and Manufactured Housing Problem Using
March 3, 2019, Beauregard-Smith Station, Alabama, Tornado Event. Natural Hazards Review, 22(1), 04020060.
Sullivan, E. (2017). Displaced in place: Manufactured housing, mass eviction, and the paradox
of state intervention. American Sociological Review, 82(2), 243-269.
Sullivan, E. (2022). Personal, not Real: Manufactured Housing Insecurity, Real Property, and
the Law. Annual Review of Law and Social Science, 18.
US Census Bureau. (2022, July 6th). Manufactured Housing Survey Public Use
File. Census.gov. Retrieved September 27th, 2022, from https://www.census.gov/data/datasets/2021/econ/MHS/pufihtml