## Preamble

- This stage applies to statistical and machine learning projects only.
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or random forests).
- Once again, try to automate these steps as much as possible.

## Steps

1. Train many quick-and-dirty models from different categories (e.g., linear, naive Bayes, SVM, random forest, neural net, etc.) using standard parameters (see the first sketch below).
2. **Put the trained models under version control.** A minimal file-based scheme is sketched below.
3. Document the following experiments and the outcomes you expect from them.
4. Measure and compare their performance:
    - For each model, use _N_-fold cross-validation and compute the mean and standard deviation of the performance measure on the _N_ folds.
5. Analyze the most significant variables for each algorithm (ablation studies); see the permutation-importance sketch below.
6. Analyze the types of errors the models make:
    - What data would a human have used to avoid these errors?
7. Perform a quick round of feature selection and engineering.
8. Document the actual outcomes and what you learned from them.
9. Perform one or two more quick iterations of the five previous steps (steps 4 to 8).
10. Shortlist the top three to five most promising models, preferring models that make different types of errors (see the error-overlap sketch below).

## Pitfalls

- Inefficient setups that prevent quick iteration.
- Inefficient tooling and infrastructure.
- Lack of model versioning.
- No documentation of the model exploration.
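## Examples

As a rough illustration of steps 1 and 4, here is a minimal sketch using scikit-learn. The `make_classification` call is a hypothetical stand-in for your own prepared feature matrix `X` and labels `y`, and accuracy stands in for whichever metric matters to your project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# One quick-and-dirty model per family, all with standard parameters.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    # N = 5 folds; report the mean and standard deviation across the folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:<20} mean={scores.mean():.3f} std={scores.std():.3f}")
```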
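For step 2, one possible low-tech versioning scheme is to serialize every trained model together with a small metadata record. The `save_versioned` helper below is hypothetical; dedicated tools such as MLflow or DVC may fit your setup better.

```python
import json
import time
from pathlib import Path

import joblib


def save_versioned(model, name: str, metrics: dict, out_dir: str = "models") -> Path:
    """Dump a fitted model plus a metadata record under a timestamped path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = Path(out_dir) / f"{name}-{stamp}"
    path.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path / "model.joblib")
    (path / "meta.json").write_text(json.dumps({"name": name, "metrics": metrics}, indent=2))
    return path
```

Storing the metrics next to the artifact also helps with steps 3 and 8, since the expected and actual outcomes live alongside the model they describe.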
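For step 5, a sketch of one way to rank variables: scikit-learn's permutation importance shuffles one feature at a time on held-out data and records the score drop, which works for any fitted estimator. The dataset is again a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# A larger mean score drop means a more important feature;
# the standard deviation hints at how stable the ranking is.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```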
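For steps 6 and 10, a sketch of how you might check whether candidate models make different types of errors: compare their out-of-fold error masks and prefer shortlists with little overlap, since diverse errors can cancel out in a later ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Out-of-fold predictions: each example is predicted by a model that never
# saw it during training, so the error masks are honest.
errors = {name: cross_val_predict(m, X, y, cv=5) != y for name, m in models.items()}

names = list(errors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        overlap = np.mean(errors[a] & errors[b])
        print(f"{a} & {b}: {overlap:.3f} of examples misclassified by both")
```

Next up is [8. Tune your models](8.%20Tune%20your%20models.md)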