Integration and Data Cleaning
Things to Address Before Integration
Data Source 1
- Price format
- Date format
Data Source 2
- Address format
- Few rows with insufficient location data

Integrated Data
- Want the target schema to be fit for machine learning
Field Mapping
PETL Library
We decided to use the PETL Python library, which evaluates ETL pipelines lazily. Our main focus was the transformation step. We quickly realized it would be more efficient to transform data source 2 to match data source 1's schema than the reverse. Once both sources were cleaned and transformed, we concatenated them into a single integrated data set.
Integration Results
We integrated the data successfully and could proceed to the machine learning portion of the project. The goal was to do most of the cleaning while preparing for integration, but some additional processing may be needed depending on how the models behave.