Handling missing values in Pandas

# Replace NA with a constant

Replace missing values in a Pandas DataFrame with a constant like this:

```python
X_train.fillna(constant, inplace=True)
X_test.fillna(constant, inplace=True)
```

Replacement can work well if the missing values are meaningful. For example, in a housing dataset, a missing garage area (in m²) might indicate that there is no garage, so setting the missing values to 0 might be the best strategy.

# Drop NA values

Simply drop the rows (or columns):

```python
# drop all rows that are NA in column Y, in-place
X_train.dropna(axis='index', subset=['Y'], inplace=True)
X_valid.dropna(axis='index', subset=['Y'], inplace=True)
```

This works well when there are plenty of missing values and it is impossible to reconstruct them.

# Impute NA values

Instead of filling with a constant, various strategies (such as the mean, median, or most frequent value) can be used to "backfill" missing values:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit(X_train)  # learn the column means from the training data only

# transform() returns a NumPy array, so column names and index are lost; put them back
imputed_X_train = pd.DataFrame(imp_mean.transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imp_mean.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)
```

# Record NA values

It can be very useful to let the model know which values were imputed. Add the indicator column before imputing, while the missing values are still visible:

```python
X_train["ColnameIsNA"] = X_train["Colname"].isna()
X_valid["ColnameIsNA"] = X_valid["Colname"].isna()
```

# Predict NA values

A more demanding approach is to predict missing values using a regression model trained on the existing values:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression()
column = "Y"

# train on the rows where the column is known; predict the rows where it is missing
df_train = df[df[column].notna()].copy()
df_predi = df[df[column].isna()].copy()

y_train = df_train[column]
df_train.drop(column, axis='columns', inplace=True)
df_predi.drop(column, axis='columns', inplace=True)

# the remaining feature columns must be numeric and free of NA values
model.fit(df_train, y_train)

df_train[column] = y_train
df_predi[column] = model.predict(df_predi)

df = pd.concat([df_train, df_predi])
```

No single strategy is best in general. The right choice depends on how many values are missing and whether they are missing at random, missing in a biased way, or indicate a special state.
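Since the choice depends on how many values are missing and whether the missingness is random, it can help to inspect both before picking a strategy. The following is a minimal sketch, not a definitive recipe: the toy DataFrame and its column names (`GarageArea`, `LotArea`, `Rooms`) are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "GarageArea": [20.0, np.nan, 35.0, np.nan],
    "LotArea": [300.0, 450.0, np.nan, 500.0],
    "Rooms": [3, 4, 2, 5],
})

# Fraction of missing values per column, highest first.
na_share = df.isna().mean().sort_values(ascending=False)
print(na_share)

# Compare another column between rows with and without a missing GarageArea.
# A large difference hints that the values may not be missing at random.
rooms_by_missing = df.groupby(df["GarageArea"].isna())["Rooms"].mean()
print(rooms_by_missing)
```

A column with a small share of missing values is often a candidate for imputation, while a clear difference between the two groups suggests that recording the missingness as its own feature may be worthwhile.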