
# 5. Explore the data

**Pro-Tip**: Try to get insights from a domain expert on these steps.

## Steps

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Identify the response variable(s) for supervised learning tasks.
4. Study each variable and its characteristics:
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Usefulness for the task
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
5. Visualize the data (boxplots, histograms, etc.).
6. Study the correlations between explanatory variables as well as between explanatory and response variables (via coefficients, scatter plot [matrix], variance inflation factors, etc.).
7. Consider how you would solve the problem manually.
8. Identify promising transformations you may want to apply (min-max, standard scaler, log-transform, etc.).
9. Identify extra data that would be useful (and go back to [[4. Get the data]]).
10. Check that you didn't miss anything by going over the [Exploratory Data Analysis (EDA) checklist](Exploratory%20Data%20Analysis%20(EDA)%20checklist.md).
11. Document what you have learned.

## Pitfalls

- Lack of scripts for the exploration.

Next up is [6. Prepare the data](6.%20Prepare%20the%20data.md).
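## Example sketch

As a rough illustration of steps 1, 4, 6 and 8 above, and of keeping the exploration in a script rather than ad hoc, the following pandas sketch may be a useful starting point. Everything in it is an assumption made for illustration: the helper names (`make_exploration_copy`, `summarize_variables`, `iqr_outlier_fraction`, `correlations_with_target`, `make_demo_frame`), the 1.5 × IQR outlier rule, the row limit, and the synthetic columns `income`, `age`, `segment` and `spend`. Adapt it to your own data rather than treating it as a fixed recipe.

```python
import numpy as np
import pandas as pd


def make_exploration_copy(df, max_rows=10_000, seed=0):
    """Step 1: copy the data, sampling it down if it has more than max_rows rows."""
    if len(df) > max_rows:
        return df.sample(n=max_rows, random_state=seed).reset_index(drop=True)
    return df.copy()


def summarize_variables(df):
    """Step 4: one row per variable with its type, % missing, cardinality and skew."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })
    # Skewness hints at the distribution type; it stays NaN for non-numeric columns.
    summary["skew"] = df.select_dtypes(include="number").skew()
    return summary


def iqr_outlier_fraction(series, k=1.5):
    """Step 4 (noise): share of non-missing values outside the k * IQR fences."""
    values = series.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(outside.mean())


def correlations_with_target(df, target):
    """Step 6: Pearson correlation of each numeric variable with the response."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Strongest relationships first, regardless of sign.
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


def make_demo_frame(n=500, seed=0):
    """Synthetic data for illustration only; the columns are made up."""
    rng = np.random.default_rng(seed)
    income = rng.lognormal(mean=10, sigma=1, size=n)  # right-skewed on purpose
    age = rng.integers(18, 80, size=n).astype(float)
    age[::10] = np.nan                                # every 10th value missing
    return pd.DataFrame({
        "income": income,
        "age": age,
        "segment": rng.choice(["a", "b", "c"], size=n),
        "spend": 0.1 * income + rng.normal(0, 1000, size=n),  # response variable
    })


raw = make_demo_frame()
explore = make_exploration_copy(raw, max_rows=200)

print(summarize_variables(explore))
print(correlations_with_target(explore, target="spend"))

# Step 8: check whether a log-transform reduces the skew of a candidate variable.
print("skew before:", explore["income"].skew())
print("skew after: ", np.log1p(explore["income"]).skew())
```

Run inside the exploration notebook from step 2, a summary table like this gives a quick first pass over each variable before moving on to the plots in step 5.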