## Steps
1. Document where you can get that data.
    - Check how much space it will take.
    - Review cost implications.
2. Check legal obligations, and get authorization if necessary.
3. Get access authorizations.
4. Create a workspace (with enough storage space).
5. Get the data, ideally via a script so the fetch can be repeated exactly (see the first sketch after this list).
6. **Put it under version control.**
7. Convert the data to a format you can easily manipulate (without changing or studying the data itself); a conversion sketch follows the list.
8. Ensure sensitive information is deleted or protected (e.g., PII that needs to be anonymized). But beware of the [privacy masking divergence](Privacy%20masking%20divergence.md) anti-pattern. A pseudonymization sketch follows the list.
9. Validate that the size and type of the data (time series, sample, geographical, etc.) are as expected; see the sanity-check sketch below.
10. For machine learning tasks: Sample the test set ("out-of-sample" data) and set it aside.
- Ensure the split happens in a fully randomized but **reproducible** way (e.g., filter by the modulo of the hash of each observation’s unique ID); see the split sketch after this list.
- Never look at it (no data snooping!) before you have fully tuned your model.
- For temporal data, ensure you have no [temporal leakage](Temporal%20leakage.md) and consider the [latency effect of time-series data](Timeseries%20latency.md).
- Make this split before any [oversampling](https://imbalanced-learn.org/stable/over_sampling.html) to avoid [oversampling leakage](Oversampling%20leakage.md).
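Below are minimal sketches for several of the steps above. First, step 5: assuming the data is a single file behind a URL (the URL and paths here are placeholders), scripting the download keeps the fetch repeatable:

```python
import urllib.request
from pathlib import Path

DATA_URL = "https://example.com/exports/observations.csv"  # placeholder
RAW_PATH = Path("data/raw/observations.csv")               # placeholder

# Create the workspace folder if needed, then download the raw file.
RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(DATA_URL, RAW_PATH)
```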
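For step 7, a sketch assuming the raw export is CSV and that pandas plus a Parquet engine (e.g., pyarrow) are installed; paths are again placeholders. The raw file is only read, never rewritten:

```python
import pandas as pd

# Placeholder paths: the raw file stays untouched, the converted
# copy lives in a separate folder for easy, typed reloading.
df = pd.read_csv("data/raw/observations.csv")
df.to_parquet("data/interim/observations.parquet", index=False)
```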
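For step 8, one protection technique is pseudonymization via salted hashing, shown here on a hypothetical `email` column. Records stay joinable within the dataset, but the raw PII is never stored. This is pseudonymization rather than full anonymization, so the privacy masking caveat above still applies:

```python
import hashlib

import pandas as pd

def pseudonymize(series: pd.Series, salt: str) -> pd.Series:
    # Salted SHA-256 digest per value: identical inputs map to
    # identical digests, so joins still work. Keep the salt in a
    # secret store, never in version control.
    return series.map(
        lambda v: hashlib.sha256((salt + str(v)).encode()).hexdigest()
    )

df = pd.DataFrame({"email": ["a@example.com", "b@example.com"]})
df["email"] = pseudonymize(df["email"], salt="load-from-secret-store")
```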
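For step 9, a few fail-fast assertions go a long way; the expected columns and row count below are hypothetical:

```python
import pandas as pd

df = pd.read_parquet("data/interim/observations.parquet")  # placeholder

# Fail loudly if the delivery is not what was ordered.
assert len(df) > 100_000, f"only {len(df)} rows delivered"
assert {"id", "timestamp", "value"} <= set(df.columns), "missing columns"
assert pd.api.types.is_datetime64_any_dtype(df["timestamp"]), "timestamp not parsed"
```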
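Finally, for step 10, a sketch of the modulo-of-hash split (the `id` column name is a placeholder). Because the assignment depends only on each observation's ID, existing rows stay on the same side of the split when more data is fetched later, which also addresses the first pitfall below:

```python
import zlib

import pandas as pd

def in_test_set(identifier, test_ratio: float = 0.2) -> bool:
    # The CRC-32 of the ID, taken modulo 10_000, is a stable
    # pseudo-random bucket; keeping the lowest test_ratio share of
    # buckets yields a deterministic ~20% test set.
    return zlib.crc32(str(identifier).encode()) % 10_000 < test_ratio * 10_000

def split_by_id(df: pd.DataFrame, id_col: str, test_ratio: float = 0.2):
    mask = df[id_col].map(lambda i: in_test_set(i, test_ratio))
    return df.loc[~mask], df.loc[mask]

train, test = split_by_id(pd.DataFrame({"id": range(10_000)}), "id")
```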
## Pitfalls
- Lack of reproducibility when fetching more data.
- Leaking information into the test set.
Next up is [5. Explore the data](5.%20Explore%20the%20data.md)