## Preamble

- This stage applies to statistical and machine learning projects only.
- Work on copies of the data (keep the original dataset intact; do not overwrite it).
- Expect to iterate between this step and the next.
- Write **reusable functions** for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset.
    - So you can apply these transformations in future projects.
    - To clean and prepare the test set the same way as the training set.
    - To clean and prepare new data instances once your solution is live.
    - To make it easy to treat your preparation techniques as hyperparameters that can be tuned.

## Steps

1. Clean the data:
    - Fix or remove outliers (optional, and only if reasonable).
    - Fill in missing values (e.g., with zero, the mean, or the median) or drop the observations (sometimes even the whole variable).
2. Perform feature selection (optional):
    - Drop the variables that provide no useful information for the task.
3. Perform feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(_x_), sqrt(_x_), _x_², etc.).
    - Aggregate features into promising new features, but beware of [aggregation leakage](Aggregation%20leakage.md).
4. Perform feature scaling:
    - Standardize or normalize features.

## Pitfalls

- Lack of reproducibility when fetching more data.
- Leaking information to the test set.

Next up is [7. Explore different models](7.%20Explore%20different%20models.md)
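To illustrate the missing-value part of step 1 as a reusable function, here is a minimal sketch assuming pandas and a hypothetical numeric column named `income`. The key point: the fill value (here, the median) is learned on the training set only, then reused for the test set and for live data — and the original DataFrame is never overwritten.

```python
import pandas as pd

def fit_imputer(train: pd.DataFrame, column: str) -> float:
    """Learn the median of `column` from the training data only."""
    return train[column].median()

def apply_imputer(df: pd.DataFrame, column: str, fill_value: float) -> pd.DataFrame:
    """Return a copy with missing values in `column` filled (original untouched)."""
    out = df.copy()
    out[column] = out[column].fillna(fill_value)
    return out

train = pd.DataFrame({"income": [10.0, None, 30.0]})
test = pd.DataFrame({"income": [None, 50.0]})

median = fit_imputer(train, "income")                # learned from train only
train_clean = apply_imputer(train, "income", median)
test_clean = apply_imputer(test, "income", median)   # same fill value reused
```

Because the two halves are separate functions, the choice of fill statistic (zero, mean, median) can itself be treated as a hyperparameter to tune.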
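Discretization (step 3) can be done with nothing beyond the standard library. A sketch using `bisect`; the bin edges here are hypothetical, and in practice they would be chosen from the training data (e.g., quantiles) and then reused everywhere else.

```python
from bisect import bisect_right

def discretize(value: float, edges: list[float]) -> int:
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)

edges = [18.0, 35.0, 65.0]  # hypothetical age-group boundaries
bins = [discretize(v, edges) for v in [12.0, 40.0, 70.0]]  # bin per value
```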
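Decomposing a date/time feature (step 3) means splitting one timestamp into several simpler columns a model can use independently. A sketch with the standard `datetime` module and a hypothetical signup timestamp:

```python
from datetime import datetime

def decompose_datetime(ts: datetime) -> dict[str, int]:
    """Split a timestamp into components a model can use separately."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),  # 0 = Monday
        "hour": ts.hour,
    }

parts = decompose_datetime(datetime(2021, 3, 15, 9, 30))
```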
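The "promising transformations" bullet of step 3 can be sketched as a function that returns several candidate versions of one feature to evaluate. Assuming NumPy and non-negative inputs; `log1p` is used so that zeros do not blow up.

```python
import numpy as np

def add_transforms(x: np.ndarray) -> dict[str, np.ndarray]:
    """Return candidate transformed versions of a feature to evaluate."""
    return {
        "log1p": np.log1p(x),  # compresses heavy right tails
        "sqrt": np.sqrt(x),    # milder compression
        "square": x ** 2,      # emphasizes large values
    }

x = np.array([0.0, 1.0, 4.0])
feats = add_transforms(x)
```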
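Finally, a sketch of standardization (step 4), assuming NumPy. As with imputation, the mean and standard deviation are computed on the training set only and then reused to scale the test set — which directly addresses the "leaking information to the test set" pitfall.

```python
import numpy as np

def fit_scaler(train: np.ndarray) -> tuple[float, float]:
    """Learn mean and standard deviation from the training data only."""
    return float(train.mean()), float(train.std())

def standardize(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Scale to zero mean and unit variance under the training statistics."""
    return (x - mean) / std

train = np.array([1.0, 2.0, 3.0])
mean, std = fit_scaler(train)
scaled = standardize(train, mean, std)
# test data would be scaled with the SAME mean and std, e.g.
# standardize(test, mean, std)
```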