
Integration and Data Cleaning

Things to Address Before Integration

Data Source 1
  • Price format

  • Date format

Data Source 2
  • Address format

  • Few rows with insufficient location data

Integrated Data
  • Want target schema to be fit for machine learning
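
To give a concrete sense of what these items involve, here is a minimal Python sketch of the kind of normalization helpers they call for. The field names, currency symbol, and date format are assumptions for illustration, not the project's actual schema.

```python
from datetime import datetime

def normalize_price(value):
    # Strip a currency symbol and thousands separators, e.g. "$1,250.00" -> 1250.0.
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None

def normalize_date(value, input_format="%m/%d/%Y"):
    # Parse the source's date format and return ISO 8601 (YYYY-MM-DD).
    return datetime.strptime(value.strip(), input_format).date().isoformat()

def has_location(row):
    # Keep only rows that carry usable location data (here: a non-empty address).
    return bool(str(row.get("address", "")).strip())

print(normalize_price("$1,250.00"))   # 1250.0
print(normalize_date("07/04/2021"))   # 2021-07-04
print(has_location({"address": ""}))  # False
```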


Field Mapping

[Image: Field mapping (Mapping.JPG)]

PETL Library

We decided to use the PETL Python library for its lazy-evaluation approach to ETL pipelines, and our main focus was on the transformation step. It quickly became clear that it would be more efficient to transform data source 2 to match the structure of data source 1 rather than the other way around. Once the data was cleaned and transformed, we concatenated the two sources into one integrated data set.
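
As a rough sketch of how such a pipeline could look with petl (the file names, field names, and conversion rules below are placeholders, not the project's actual schema):

```python
import petl as etl

# Load both sources; file and field names are assumptions for illustration.
source1 = etl.fromcsv("source1.csv")   # e.g. fields: price, date, address
source2 = etl.fromcsv("source2.csv")   # e.g. fields: cost, sale_date, addr

# Rename data source 2's fields so its schema matches data source 1.
source2_aligned = etl.rename(
    source2, {"cost": "price", "sale_date": "date", "addr": "address"}
)

def clean(table):
    # Shared cleaning steps; petl builds these up lazily and only evaluates
    # them when the data is actually requested.
    table = etl.convert(table, "price",
                        lambda v: float(str(v).replace("$", "").replace(",", "")))
    table = etl.select(table, lambda rec: bool(str(rec["address"]).strip()))
    return table

# Concatenate the two cleaned tables into one integrated data set and write it out.
integrated = etl.cat(clean(source1), clean(source2_aligned))
etl.tocsv(integrated, "integrated.csv")
```

Transforming source 2 to match source 1's schema keeps the final step trivial: once the field names line up, etl.cat stacks the rows directly.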


Integration Results

We were able to integrate the data and proceed to the machine learning portion of the project. The goal was to do most of the cleaning while preparing for integration, but some additional processing may still be needed depending on how the models behave.
