### Data Inspection 101
1. Have you fixed all the problems that emerged during the loading of the data?
1. Are you sure that no serious errors were made at the data collection stage (for example, leading to data loss)?
1. If there are measurement errors, are they small enough to be neglected?
1. Do the head and tail of the dataset look as expected?
1. Are the data types correct? If using Pandas DataFrames, check `info()`.
1. Can you drop some columns simply based on common sense?
1. Are there duplicate observations in your dataset?
1. Do the minimum and maximum values of any variable reveal technical issues with the data? If using Pandas DataFrames, check `describe()` (see the sketch after this list).
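A minimal inspection sketch covering several of these checks, assuming the data sits in a CSV file and is loaded into a Pandas DataFrame `df` (the file name and columns are placeholders):

```python
import pandas as pd

# Assumption: the raw data lives in a CSV file; adjust the path and loader as needed.
df = pd.read_csv("data.csv")

# Head and tail: do the first and last rows look as expected?
print(df.head())
print(df.tail())

# Data types and non-null counts per column.
df.info()

# Exact duplicate observations.
print(f"Duplicate rows: {df.duplicated().sum()}")

# Summary statistics; suspicious minima/maxima often point to sentinel values or unit problems.
print(df.describe())
```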
### Domain and Time
1. Can you embed indicators for event types and/or special domain events in your observations?
1. Should you extract timing features (hour, weekday, month, etc.) from any available date/time metadata? See the sketch after this list.
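A minimal sketch of deriving timing and event indicators from a datetime column; the column name `timestamp` and the holiday list are purely illustrative assumptions:

```python
import pandas as pd

# Illustrative data; in practice, parse the real date/time metadata.
df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2023-12-24 18:30", "2024-01-02 09:15", "2024-01-06 23:45"])})

# Derive timing features from the datetime column.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek   # 0 = Monday
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = df["day_of_week"] >= 5

# Domain-event indicator (here: a hypothetical holiday list).
holidays = pd.to_datetime(["2023-12-24", "2024-01-06"])
df["is_holiday"] = df["timestamp"].dt.normalize().isin(holidays)

print(df)
```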
### Missing Values
1. How are the missing values encoded in your data? (0, `null`, `NA`, some string, etc.)
1. Have you found any missing values?
1. If yes, how do you handle them? Do you drop missing values (`dropna()`), perform a generic imputation and replace missing values with it (`fillna()`), or predict missing values for each instance (`sklearn.impute`)? [Handling missing values in Pandas](Handling%20missing%20values%20in%20Pandas.md) explains how to do this in Python; a short sketch of the three options follows this list.
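A short sketch of the three options, assuming a small DataFrame with placeholder columns `age` and `income`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, 48_000]})

# Option 1: drop rows that contain any missing value.
dropped = df.dropna()

# Option 2: generic imputation, e.g. replace with the column median.
filled = df.fillna(df.median(numeric_only=True))

# Option 3: imputation with scikit-learn (here a simple mean imputer;
# sklearn.impute also offers KNNImputer and IterativeImputer).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped, filled, imputed, sep="\n\n")
```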
### Dataset Structure
1. Are the features in columns and observations in rows (tidy format)? Does each feature have its own dedicated column? If not, consider reshaping the data (see the sketch after this list).
1. Do you have enough observations for the training and evaluation given the candidate models?
1. If not, is it reasonable to use data augmentation or do you need to collect more data points?
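A minimal sketch of reshaping a wide table into the tidy one-observation-per-row layout with `melt`; the column names are made up for illustration:

```python
import pandas as pd

# Wide layout: one column per year instead of a dedicated 'year' feature.
wide = pd.DataFrame({
    "city": ["Berlin", "Paris"],
    "sales_2022": [100, 120],
    "sales_2023": [110, 125],
})

# Tidy layout: each row is one observation (city, year), each feature has its own column.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].str.replace("sales_", "", regex=False).astype(int)

print(tidy)
```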
### String Handling
1. Do the first and last few strings, when ordered alphabetically or by length, look plausible?
1. Are there typos or unwanted variation in any string variables? See the cleanup sketch after this list.
1. Are any string values convertible to categorical factor levels?
1. Are there any text strings that are more frequent than expected?
1. If you (still) have plain-text variables: Will you extract keywords, calculate an embedding, or standardize the texts with a language model?
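A cleanup sketch for a string column, assuming a placeholder column `color` with typical typos and case/whitespace variation:

```python
import pandas as pd

# Illustrative string column with typos and unwanted variation.
df = pd.DataFrame({"color": [" red", "Red", "red", "blu", "blue", "BLUE "]})

# Standardize case and whitespace, then inspect the value counts
# (sorted counts make typos and unexpectedly frequent values easy to spot).
df["color"] = df["color"].str.strip().str.lower()
print(df["color"].value_counts())

# Fix known typos (this mapping is an assumption based on domain knowledge).
df["color"] = df["color"].replace({"blu": "blue"})

# Convert to a categorical type once the levels are clean.
df["color"] = df["color"].astype("category")
print(df["color"].cat.categories)

# Sorting alphabetically or by length helps spot odd first/last entries.
print(df["color"].sort_values().head())
print(df["color"].astype(str).str.len().sort_values().tail())
```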
### Distribution Analysis
1. Have you checked all variables for the presence of outliers? Are the outliers reasonable and in a plausible range? If not, can you trust them, validate their correctness, or do you need to drop them?
1. Have you checked the data for constant features or near-zero variance features? Can you drop them?
1. Have you plotted all continuous and categorical variables, including the response? Review [this Seaborn tutorial](https://seaborn.pydata.org/tutorial/distributions.html) for inspiration on plotting univariate distributions (consider combining a histogram, a [strip plot](https://seaborn.pydata.org/generated/seaborn.stripplot.html), and a box plot side by side to fully characterize a distribution).
1. Are the continuous variables approximately normally distributed? If not, how will you handle those distributions?
1. If you plan on doing classification, is there any class imbalance, and how do you plan to handle that?
1. Is it necessary to scale any explanatory or response variables (for use in neural nets, clustering, etc.)? If yes, what scaler will you use (log, standard, min/max, etc.)?
1. Have you reviewed the absolute and relative value counts of the categorical predictors? Should some sparse categories be merged? A sketch covering several of these checks follows this list.
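A sketch covering several of these checks (near-zero variance, an IQR outlier screen, univariate plots, value counts, and scaling) on illustrative synthetic data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Illustrative data: a skewed continuous feature, a near-constant feature, and a sparse categorical.
df = pd.DataFrame({
    "price": rng.lognormal(mean=3, sigma=1, size=500),
    "constant_ish": np.r_[np.zeros(495), np.ones(5)],
    "segment": rng.choice(["A", "B", "C", "rare"], p=[0.5, 0.3, 0.19, 0.01], size=500),
})

# Near-zero variance check.
print(df.select_dtypes("number").var())

# Simple outlier screen via the IQR rule (a common heuristic, not a definitive test).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"Potential outliers: {len(outliers)}")

# Univariate plots: histogram, strip plot, and box plot side by side.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(df["price"], ax=axes[0])
sns.stripplot(x=df["price"], ax=axes[1], size=2)
sns.boxplot(x=df["price"], ax=axes[2])
plt.tight_layout()
plt.show()

# Absolute and relative value counts of a categorical predictor.
print(df["segment"].value_counts())
print(df["segment"].value_counts(normalize=True))

# Scaling options: log transform for skewed data, standardization for e.g. neural nets.
df["log_price"] = np.log1p(df["price"])
df["price_scaled"] = StandardScaler().fit_transform(df[["price"]])
```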
### Correlations
1. Are the data points mutually independent and identically distributed (i.i.d.)?
1. Have you looked at contingency tables of pairs of categorical variables?
1. Have you looked at the [correlation (scatter) matrix](Spot%20correlations%20with%20a%20Pandas%20Scatter%20Matrix.md) of the explanatory variables?
1. Have you plotted scatter plots of the explanatory variables against the *response* variable(s)? Review the [Seaborn distribution visualization tutorial for bivariate distributions](https://seaborn.pydata.org/tutorial/distributions.html#visualizing-bivariate-distributions) for inspiration.
1. Have you [calculated the correlation](Measuring%20Correlations.md) of any visually correlated explanatory variables with the response(s)?
1. Can you remove redundant explanatory variables with high correlation?
1. Is it reasonable to combine existing features into new ones (km/h, rooms per square meter, area, etc.)? A sketch covering several of these checks follows this list.
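A sketch covering several of these checks (contingency table, correlation and scatter matrices, a bivariate plot against the response, dropping a redundant variable, and combining features) on illustrative synthetic data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 300
# Illustrative housing-style data with two redundant explanatory variables and a response.
df = pd.DataFrame({
    "area": rng.normal(100, 20, n),
    "rooms": rng.integers(1, 6, n),
    "city": rng.choice(["Berlin", "Paris"], n),
    "heating": rng.choice(["gas", "electric"], n),
})
df["price"] = 3_000 * df["area"] + 10_000 * df["rooms"] + rng.normal(0, 20_000, n)
df["area_sqft"] = df["area"] * 10.764          # redundant: perfectly correlated with area

# Contingency table of two categorical variables.
print(pd.crosstab(df["city"], df["heating"]))

# Correlation matrix and scatter matrix of the numeric variables.
print(df.select_dtypes("number").corr())
sns.pairplot(df.select_dtypes("number"))
plt.show()

# Scatter plot of an explanatory variable against the response.
sns.scatterplot(data=df, x="area", y="price")
plt.show()

# Drop a redundant, highly correlated variable and combine features into a new one.
df = df.drop(columns=["area_sqft"])
df["price_per_sqm"] = df["price"] / df["area"]
```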