Sometimes you want to split a dataset reproducibly on an identifier, or on a set of columns that together act as one. That way, each instance ends up in the same split group even later on, when the set of instances may have grown. The downside is that such a split cannot be stratified, so it is mainly useful for setting aside a fixed test set before you take a real look at the data.
```python
import numpy as np
from zlib import crc32


def extract_id(row):
    # TODO: define your ID extraction method
    return row['varA'] * row['varB']


def is_in_test_set(row, test_ratio, extract_id=extract_id):
    # crc32 maps the ID roughly uniformly onto [0, 2**32), so comparing
    # against test_ratio * 2**32 puts ~test_ratio of all IDs in the test set.
    bytelike = np.int64(extract_id(row))
    return crc32(bytelike) & 0xffffffff < test_ratio * 2**32


def train_test_split_on_id(df, test_ratio, extract_id=extract_id):
    in_test_set = df.apply(
        lambda row: is_in_test_set(row, test_ratio, extract_id),
        axis=1
    )
    return df.loc[~in_test_set], df.loc[in_test_set]
```
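For a self-contained demonstration, here is a sketch with the hypothetical `varA`/`varB` columns filled with made-up data (the helper definitions from above are repeated so the cell runs on its own). It also checks the key property: growing the dataset never moves an old row across the split boundary.

```python
import numpy as np
import pandas as pd
from zlib import crc32


# Helpers repeated from above so this cell runs on its own.
def extract_id(row):
    return row['varA'] * row['varB']


def is_in_test_set(row, test_ratio, extract_id=extract_id):
    bytelike = np.int64(extract_id(row))
    return crc32(bytelike) & 0xffffffff < test_ratio * 2**32


def train_test_split_on_id(df, test_ratio, extract_id=extract_id):
    in_test_set = df.apply(
        lambda row: is_in_test_set(row, test_ratio, extract_id),
        axis=1
    )
    return df.loc[~in_test_set], df.loc[in_test_set]


# Made-up frame; varA * varB happens to be unique for these values.
df = pd.DataFrame({'varA': np.arange(1, 101),
                   'varB': np.arange(1, 101) ** 2 + 1})
train, test = train_test_split_on_id(df, test_ratio=0.2)

# Append 50 new rows: every old row keeps its original split assignment,
# because its ID (and therefore its hash) is unchanged.
df_grown = pd.DataFrame({'varA': np.arange(1, 151),
                         'varB': np.arange(1, 151) ** 2 + 1})
_, test_grown = train_test_split_on_id(df_grown, test_ratio=0.2)
assert set(test.index) <= set(test_grown.index)
```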
When working in a Jupyter notebook, you can run the following snippet to sanity-check the ID extraction, especially if it is at all involved:
```python
df.apply(extract_id, axis=1).astype(np.int64).value_counts()
```
You would like to see as many distinct IDs as possible, ideally one unique ID per row. Rows that share an ID always land in the same split, so duplicate IDs can bias the split depending on how you generated the ID.
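To make the collision risk concrete: a product such as `varA * varB` can be a poor identifier, because different rows may multiply to the same value. A tiny made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'varA': [2, 3, 5, 10], 'varB': [6, 4, 4, 2]})
ids = df.apply(lambda row: row['varA'] * row['varB'], axis=1).astype(np.int64)
print(ids.value_counts())
# 2*6 and 3*4 both yield the ID 12, and 5*4 and 10*2 both yield 20,
# so each colliding pair always ends up in the same split group.
assert ids.nunique() < len(df)
```

A combination that cannot collide, such as concatenating the column values as strings, avoids this problem.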