## Preamble
- This stage applies to statistical and machine learning projects only.
- Work on copies of the data (keep the original dataset intact, do not overwrite it).
- Expect to iterate between this step and the next.
- Write **reusable functions** for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset.
    - So you can apply these transformations in future projects.
    - To clean and prepare the test set the same way as the training set.
    - To clean and prepare new data instances once your solution is live.
    - To make it easy to treat your preparation techniques as hyperparameters that can be tuned.
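As a minimal sketch of the "reusable functions" idea in plain Python (the function name and signature are illustrative, not from the source): a single imputation function whose `strategy` argument doubles as a tunable hyperparameter and can be reapplied to fresh data, the test set, or live instances.

```python
from statistics import median

def impute_missing(values, strategy="median"):
    """Fill missing values (None) using a fixed, repeatable rule.

    Keeping this as a function means the same rule can be reapplied to
    fresh data, the test set, and live instances, and `strategy` can be
    tuned like a hyperparameter.
    """
    present = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = sum(present) / len(present)
    elif strategy == "median":
        fill = median(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]
```

For example, `impute_missing([1, None, 3])` fills the gap with the median of the observed values, while `strategy="zero"` fills it with 0.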
## Steps
1. Clean the data:
    - Fix or remove outliers (optional, and only if reasonable).
    - Fill in missing values (e.g., with zero, the mean, or the median) or drop the affected observations (or sometimes the entire variable).
2. Perform feature selection (optional):
    - Drop variables that provide no useful information for the task.
3. Perform feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(_x_), sqrt(_x_), _x_², etc.).
    - Aggregate features into promising new features, but beware of [aggregation leakage](Aggregation%20leakage.md).
4. Perform feature scaling:
    - Standardize or normalize features.
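The feature-engineering and scaling steps above can be sketched in plain Python (function names and bin edges are illustrative assumptions, not from the source): discretization via binary search over bin edges, a log(_x_) transformation, and standardization to zero mean and unit variance.

```python
import bisect
import math

def discretize(value, bin_edges):
    """Map a continuous value to the index of its bin.

    `bin_edges` must be sorted ascending; a value below the first edge
    falls in bin 0, above the last edge in bin len(bin_edges).
    """
    return bisect.bisect_right(bin_edges, value)

def add_log_feature(rows, column):
    """Append log(x) of `column` as a new feature on each row (a dict)."""
    for row in rows:
        row[f"log_{column}"] = math.log(row[column])
    return rows

def standardize(values):
    """Standardize to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```

For example, `discretize(5, [0, 10, 20])` returns bin index 1, and `standardize([1, 2, 3])` maps the middle value to 0.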
## Pitfalls
- Lack of reproducibility: one-off manual transformations cannot be rerun when you fetch more data.
- Leaking information from the test set into training (e.g., by computing imputation or scaling statistics on the full dataset).
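One common way to avoid the leakage pitfall is to split fitting from applying: learn any data-dependent parameters (means, standard deviations, medians, etc.) from the training set only, then apply those frozen parameters to every other split. A minimal sketch (function names are illustrative):

```python
def fit_scaler(train):
    """Learn standardization parameters from the TRAINING set only."""
    mean = sum(train) / len(train)
    std = (sum((v - mean) ** 2 for v in train) / len(train)) ** 0.5
    return mean, std

def apply_scaler(values, mean, std):
    """Apply previously fitted parameters to any split (train, test, live)."""
    return [(v - mean) / std for v in values]

# Fit on train, then reuse the SAME parameters on test -- never refit on
# the full dataset, which would leak test statistics into training.
mean, std = fit_scaler([0, 10])
scaled_test = apply_scaler([5, 15], mean, std)
```

Here `fit_scaler([0, 10])` yields mean 5 and standard deviation 5, so the test values `[5, 15]` scale to `[0.0, 2.0]`.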
Next up is [7. Explore different models](7.%20Explore%20different%20models.md).