Does Weather in Indiana Impact
If & When a Home Is Sold?
A team project that explores the changes in Hoosier weather and its impact on the residential real estate market
Aims
- AIM 1: Evaluate existing data from the Indiana Property Disclosure datasets
- AIM 2: Analyze data for weather, home sales, and Census data to find correlations within each dataset
- AIM 3: Design tables and utilize Open Weather Map (API), Kaggle (CSV), U.S. Census (API)
- AIM 4: Utilize Analysis tools such as: Json | Python | CSV | pgAdmin 4
- AIM 5: Provide data on if weather impacts the home sales in Indiana
Impact and Intended Results
The Hoosier State enjoys all four seasons of the year. No
matter what time of year it is, you’ll find something fun to do outdoors
thanks to our Midwestern location. In recent years, Indiana has been listed as one of the top
growth states in terms of USA population. Since Indiana experiences all four
seasons, the question to explore is how weather impacts homesales.
Our goal is to provide the impact of home sales in Indiana based on weather
utilizing the CSV from Kaggle.
We will look at dates, temperature changes, conditions, buyer location per county,
sale prices, population count, household income, and median home value.
Our goal is to provide the impact of home sales in Indiana based on weather.
We will look at dates, temperature changes, conditions, buyer location per county,
sale prices, population count, household income and median home value.
Average Temperatures
As you might guess, January is typically the coldest month and July the hottest.
Here are average temperatures for each month.
Monthly Precipitation
While kids do get to enjoy snow days in Indiana, the summer months bring the most
precipitation. Check out the monthly averages.
The Plan
Steps to Extract and Transform data from the U.S. Census
The extracted housing related data source was the U.S. Census and was acquired
using an API. The specific data set selected was from the 2018 American Community
Survey 5-Year Survey (U.S. Census ACS) and the data points we selected came from the
following table: U.S. Census Variables
The five-year ACS was selected since it has the most complete housing data. From
the table containing approximately 20,000 variables, we selected:
- Median Home Value
- Household Income
- Population
- Median Age
- Per Capita Income
- State
- County
- Transforming the U.S. Census housing related data
- Fields were converted to integer using .astype.
- The first dataframe, us_census_df, called pulled data from all 50 states. We created a second dataframe, IN_census, and verified the data contained 92 rows - one for each of Indiana's 92 counties.
- The county and state fields were pulled as FIPS codes and the "us" dependency, the FIPS codes were converted to state and county names.
- Columns were re-sorted for a more natural looking arrangement.
- The final dataframe, IN_census_3, was also set to a csv file called IN_census_file.csv.
Steps to Extract and Transform the Home Price Data:
Extract
- The first step was to read in the original CSV file. This data set came from Kaggle and was s titled: Indiana - Property Sales Disclosure. This file is 2GB.
- The file was downloaded directly from the website.
Transform
- This dataset contains 87 columns, including data on the buyer, seller, sales price, lot description, tax information, and sales date.
- The following columns were selected:
- Parcel1_Acreage
- PropStreet
- PropCity
- PropState
- PropZip
- C6_Sales_price_Assessor
- Buyer1Street
- Buyer1City
- Buyer1State
- Buyer1ZIP
- Conveyance_Date
- A .dropna() was performed on the data to drop any rows that contain an NaN.
- There were many inconsistencies in the Property City column, including misspellings and differences in the cases of the letters.
- We obtained weather data for 107 Indiana cities. Only data for those cities was included in the final home price dataset. To do this, the city list was copied over from the weather Jupyter notebook. Then the cities needed to be converted to upper case letters to match up with the city entries in the home price dataset. A for loop was setup to include only those 107 cities.
- Columns were renamed to be more descriptive. New column names: - Acreage - Property Street - Property City - Property State - Property Zip - Sale Price - Buyer Street - Buyer City - Buyer State - Buyer Zip - Conveyance Date
- We discovered 1,934 duplicates on home_prices.csv file.
- Duplicate data was dropped using: .drop_duplicates().
- Finally, the cleaned file was saved as a csv.
Steps to Extract and Transform the Weather Data:
Extract
- Our original plan was to get data for the specific location and date of each home sale using an API call from Open Weather Map. Historical data was unavailable without a paid subscription. This was true of many other weather websites as well.
- Weather data was scraped from the following website: U.S. Climate Data
- ChromeDriver was used to collect html data from the website and BeautifulSoup was used to extract the parts of the data that we needed.
- The first piece of information needed was the URL endpoint for each city to obtain weather data from the tables. This was done by scraping the endpoints.
- An empty dataframe titled average_monthly was created to hold the scraped data. A for loop was created to cycle through the URLs and the pandas function .read_html() was used to collect the data from the tables on the page. A column was added to include the name of the city. This was done by splitting the URL string using the pandas function .split(“/”), and then indexing the component containing the name of the city.
Transform
- The columns were transposed and then the files merged back together so the averaged weather data became columns.
- A County column was added so that the census data and the weather table were joined. The home price data was joined via the city name.
- Please note that in this dataframe some “NaN” values were intentionally left in. Data is present for all cities for the average high and low temperatures, but snowfall data is only available for 64 of the 107 cities.
Notes
- The final csv file for the home price data: home_price_final.csv
- The final csv file for the weather data: average_indiana_weather.csv
- All work to extract and transform the home price data is in: Home_Price_Data.ipynb
- All work to extract and transform the weather data is in: weather_data.ipynb
Load
All three CSV files were read and loaded into a Dataframe
- Source for home_price_final.csv: Kaggle: Indiana - Property Sales Disclosure
- Source for average_indiana_weather.csv: U.S. Climate Data
- Source for IN_census_file.csv: U.S. Census Data
For ERD
- CSV files had some random empty columns and gave error “Column violates ‘Not Null’ constraint” while trying to create ERD diagram and create Tables.
- CSV files after filling empty columns to replace with new address where address was null, etc.
- The home_price_final.csv was renamed to home_price_final_ERD.csv
- The average_indiana_weather.csv was renamed to average_indiana_weather_ERD.csv
- The IN_census_file.csv was renamed to IN_census_file_ERD.csv
ERD Notes
- ERD diagram is saved as a PNG file QuickDBD-export.png
- DB Table creation script is stored in HomeSales_DB.sql
For IPYNB
- Column names were made pretty by removing the space and brackets by using .str.replace(' ', '_') ('(', '') (')', '') function
- Columns with NaN were replaced with key value pair
Loading Data Into the Database
- Created connection to PostgreSQL
- Final DB named HomeSales_DB is loaded to PostgreSQL
- All three DFs were converted to SQL
- Finally Queried to check if the DB had all the tables
- Final Extract Transform Load work shown in: Final_ETL.ipynb
Potential Future Questions to Research:
- Does Indiana monthly temperature impact monthly growth?
- Where in the Unites States are most people moving to Indiana from?
- What counties in Indiana experience the most growth per year?
- Which season of the year provides a decline in yearly population growth?
Tools Used
- pgAdmin4
- PostgreSQL
- Python
- Beautiful Soup
- ChromeDriver
- Multiple APIs
- Plot.ly
- ERD Diagram
- Jupyter Notebook/Lab
- JSON
- CSV
- GitHub & GitHub Pages