# Replace NA with a constant
Replace missing values in a pandas DataFrame with a constant (here `constant` stands for a value you choose, such as 0):
```python
X_train.fillna(constant, inplace=True)
X_test.fillna(constant, inplace=True)
```
Replacement can work well if the missing values carry meaning. For example, in a housing dataset, a missing garage area (in m²) might indicate that there is no garage, so setting the missing values to 0 might be the best strategy.
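The garage case can be sketched as follows. The column names `living_m2` and `garage_m2` and the data are made up for illustration; the point is that only the column where a missing value has a known meaning gets filled, rather than the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data: NaN in garage_m2 is assumed to mean "no garage".
X_train = pd.DataFrame({
    "living_m2": [120.0, 85.0, 150.0],
    "garage_m2": [18.0, np.nan, np.nan],
})

# Fill only the column where a missing value has a known meaning.
X_train["garage_m2"] = X_train["garage_m2"].fillna(0)

print(X_train["garage_m2"].tolist())  # [18.0, 0.0, 0.0]
```

Filling a single column this way leaves missing values in other columns untouched, so they can still be handled with a different strategy.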
# Drop NA values
Drop the rows (or columns) that contain missing values:
```python
# Drop all rows that are NA in column Y, in-place
X_train.dropna(axis='index', subset=['Y'], inplace=True)
X_valid.dropna(axis='index', subset=['Y'], inplace=True)
```
Dropping works well when a row or column has so many missing values that they cannot be reconstructed reliably. Keep in mind that every dropped row also discards the valid values in its other columns.
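Dropping columns instead of rows might look like the sketch below. The 50% cutoff and the column names are assumptions chosen for illustration, not a recommended rule. The columns to drop are chosen from the training data only, and the same columns are then removed from the validation data so both sets keep the same shape.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
X_train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "b": [1.0, np.nan, 3.0, 4.0],
})
X_valid = pd.DataFrame({
    "a": [5.0, 6.0],
    "mostly_missing": [2.0, np.nan],
    "b": [np.nan, 8.0],
})

# Share of missing values per column, computed on the training data only
missing_share = X_train.isna().mean()

# Assumed cutoff: drop columns with more than 50% missing values
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()

# Remove the same columns from both sets
X_train_reduced = X_train.drop(columns=cols_to_drop)
X_valid_reduced = X_valid.drop(columns=cols_to_drop)

print(cols_to_drop)  # ['mostly_missing']
```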
# Impute NA values
Instead of filling with a fixed constant, missing values can be imputed with a statistic learned from the training data, such as the column mean:
```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Learn the column means from the training data only
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit(X_train)

# transform() returns a NumPy array, so put the column names and index back
imputed_X_train = pd.DataFrame(imp_mean.transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(imp_mean.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)
```
# Record NA values
It can be very useful to let the model know which values were imputed. Record this before imputing, while the values are still missing:
```python
X_train["ColnameIsNA"] = X_train["Colname"].isna()
X_valid["ColnameIsNA"] = X_valid["Colname"].isna()
```
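The order of the two steps matters, which the following sketch tries to show on made-up data. It records the indicator column first and only then fills the gaps, here with the training mean as one possible choice. `Colname` is the placeholder name used above.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
X_train = pd.DataFrame({"Colname": [1.0, np.nan, 3.0]})
X_valid = pd.DataFrame({"Colname": [np.nan, 5.0]})

# 1. Record missingness first; after imputation isna() would be False everywhere
X_train["ColnameIsNA"] = X_train["Colname"].isna()
X_valid["ColnameIsNA"] = X_valid["Colname"].isna()

# 2. Impute both sets with the mean of the training data
train_mean = X_train["Colname"].mean()
X_train["Colname"] = X_train["Colname"].fillna(train_mean)
X_valid["Colname"] = X_valid["Colname"].fillna(train_mean)

print(X_train)
```

scikit-learn's `SimpleImputer` also accepts `add_indicator=True`, which appends indicator columns to the transformed output and may be more convenient inside a pipeline.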
# Predict NA values
A more demanding approach might be to predict missing values using a regression model trained on the existing values:
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression()
column = "Y"

# Train on rows where the column is known; predict rows where it is missing.
# Assumes all other columns are numeric and contain no missing values.
df_train = df[df[column].notna()]
df_predi = df[df[column].isna()].copy()

# Use every column except the target as a feature
features = df.columns.drop(column)
model.fit(df_train[features], df_train[column])

# Fill the gaps with the model's predictions and recombine
df_predi[column] = model.predict(df_predi[features])
df = pd.concat([df_train, df_predi])
```
No single strategy is best in every case. The right choice depends on how many values are missing and whether they are missing at random, missing in a biased way, or indicate a special state.
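A quick way to see how many values are missing per column is sketched below on made-up data. It only answers the "how many" part; whether values are missing at random or carry a meaning still has to be judged from knowledge of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "price": [100.0, 200.0, np.nan, 400.0],
    "garage_m2": [np.nan, np.nan, np.nan, 20.0],
    "rooms": [3, 4, 2, 5],
})

# Count and share of missing values per column, worst column first
summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "share": df.isna().mean(),
}).sort_values("share", ascending=False)

print(summary)
```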