- Learn more about the problem. Search for similar Kaggle competitions. Check the task on Papers with Code.
- Do basic data exploration. Try to understand the problem and get a sense of what might be important.
- Get a baseline model working.
- Design an evaluation method as close as possible to the final evaluation. Plot local evaluation metrics against the public leaderboard scores (correlation) to check how well your validation strategy works.
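A minimal sketch of that validation check, assuming you have logged a local CV score and the matching public leaderboard score for each submission (the numbers below are hypothetical):

```python
import numpy as np

# Hypothetical paired scores: one entry per submission.
local_cv = np.array([0.812, 0.825, 0.831, 0.840, 0.844])
public_lb = np.array([0.801, 0.815, 0.818, 0.833, 0.835])

# A high correlation suggests local CV improvements translate to
# leaderboard improvements, so the validation split is trustworthy.
corr = np.corrcoef(local_cv, public_lb)[0, 1]
print(f"CV/LB correlation: {corr:.3f}")
```

If the correlation is weak, fix the validation split before tuning anything else.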
- Try different approaches for preprocessing (encodings, Deep Feature Synthesis, lags, aggregations, imputers, ...). If you're working as a group, split preprocessing and feature generation between files.
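A small pandas sketch of two of the feature types mentioned above, lags and aggregations, on toy data (all column names are made up):

```python
import pandas as pd

# Toy sales data: one row per (store, day).
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [10.0, 12.0, 11.0, 20.0, 18.0, 25.0],
}).sort_values(["store", "day"])

# Lag feature: the previous day's sales for the same store.
df["sales_lag_1"] = df.groupby("store")["sales"].shift(1)

# Aggregation feature: mean sales per store, broadcast back to each row.
df["store_mean_sales"] = df.groupby("store")["sales"].transform("mean")
print(df)
```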
- Plot learning curves (sklearn or external tools) to avoid overfitting.
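With sklearn this is `learning_curve`; a minimal sketch on synthetic data (printing the curve values instead of plotting them):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Train/validation scores at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# A persistent gap between the two curves is a classic overfitting signal.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  val={va:.3f}")
```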
- Plot the real and predicted target distributions to see how well your model understands the underlying distribution. Apply any postprocessing that might fix small things.
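A plot-free sketch of the same comparison, using quantiles on toy data (a histogram or KDE plot of both arrays is the visual version):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=10.0, scale=2.0, size=1000)
y_pred = y_true + rng.normal(scale=0.5, size=1000)  # toy predictions

# Compare quantiles of the two distributions; large gaps hint that the
# model misses part of the target distribution (e.g. squashed tails).
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
for q, t, p in zip(qs, np.quantile(y_true, qs), np.quantile(y_pred, qs)):
    print(f"q={q:.2f}  true={t:6.2f}  pred={p:6.2f}")
```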
- Tune hyper-parameters once you've settled on a specific approach (hyperopt, optuna).
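hyperopt and optuna use smarter samplers (TPE, pruning, ...); the loop below is a plain random-search stand-in that sketches the same idea with only sklearn, on a made-up search space:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

best_score, best_params = -np.inf, None
for _ in range(10):  # trial budget
    # Sample a candidate configuration from the search space.
    params = {
        "n_estimators": int(rng.integers(50, 200)),
        "max_depth": int(rng.integers(2, 8)),
    }
    score = cross_val_score(
        RandomForestClassifier(**params, random_state=0), X, y, cv=3
    ).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

With optuna, the body of the loop would become an `objective(trial)` function and the library would pick the candidates.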
- Plot and visualize the predictions (target vs predicted errors, histograms, random predictions, ...) to make sure they behave as expected. Explain the predictions with SHAP.
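A sketch of the error check, plus an attribution step. SHAP itself (`shap.TreeExplainer`, etc.) gives per-prediction values; sklearn's permutation importance is used here as a lighter-weight global stand-in on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Residuals: a mean far from zero means systematic bias; histogram these.
residuals = y_te - model.predict(X_te)
print(f"residual mean={residuals.mean():.2f}  std={residuals.std():.2f}")

# Global feature attributions (SHAP would give per-row attributions).
imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
print("importances:", np.round(imp.importances_mean, 3))
```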
- Think about what postprocessing heuristics can be done to improve or correct predictions.
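Two common postprocessing heuristics as a sketch, with made-up numbers: clip predictions to the target range seen in training, and round when the target is known to be an integer count:

```python
import numpy as np

y_train = np.array([3.0, 8.0, 5.0, 10.0, 4.0])
raw_pred = np.array([-1.2, 4.7, 11.5, 6.1])

# Heuristic 1: the target never leaves the range seen in training.
clipped = np.clip(raw_pred, y_train.min(), y_train.max())

# Heuristic 2: the target is an integer count, so round.
final = np.round(clipped)  # clipped to [3, 10], then rounded: 3, 5, 10, 6
print(final)
```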
- Stack classifiers (example).
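A minimal stacking example with sklearn's `StackingClassifier` on synthetic data; the base models' out-of-fold predictions feed a simple meta-learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # base models are fit out-of-fold to avoid leakage
)
score = cross_val_score(stack, X, y, cv=3).mean()
print(f"stacked CV accuracy: {score:.3f}")
```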
- Try AutoML models.
  - Tabular: AutoGluon, AutoSklearn, Google AI Platform, PyCaret, Fast.ai.
  - Time Series: AtsPy, DeepAR, Nixtla's NBEATS, AutoTS.
- Feature Engineering Library.
- Feature Engineering Ideas.
- Deep Feature Synthesis. Simple tutorial.
- Modern Feature Engineering Ideas (code).
- Target Encoding (with cross-validation to avoid leakage).
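A sketch of out-of-fold target encoding on toy data: each row is encoded using category means computed without that row's fold, which is what prevents target leakage:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "y":    [1,   0,   1,   1,   1,   0,   0,   1],
})

df["city_te"] = np.nan
global_mean = df["y"].mean()  # fallback for categories unseen in a fold
for tr_idx, va_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means from the training part of the fold only.
    means = df.iloc[tr_idx].groupby("city")["y"].mean()
    df.loc[df.index[va_idx], "city_te"] = (
        df.iloc[va_idx]["city"].map(means).fillna(global_mean).to_numpy()
    )
print(df)
```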
- Forward Feature Selection.
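sklearn ships this as `SequentialFeatureSelector` with `direction="forward"`; a minimal example on synthetic data, greedily adding whichever feature improves CV score the most:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3, direction="forward", cv=3,
).fit(X, y)
print("selected feature mask:", sfs.get_support())
```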
- Hillclimbing.
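A sketch of hillclimbing for ensembling, on toy prediction vectors: start from the best single model, then greedily add (with repetition) whichever model's predictions most improve the blend:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
# Three toy models of varying quality (noise added to the true target).
preds = [y + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.8)]

def rmse(w):
    blend = sum(wi * p for wi, p in zip(w, preds)) / sum(w)
    return np.sqrt(np.mean((blend - y) ** 2))

n = len(preds)
# Start from the single best model.
weights = np.zeros(n)
weights[int(np.argmin([rmse(np.eye(n)[i]) for i in range(n)]))] = 1
# Greedily add whichever model improves the blend; stop when none does.
for _ in range(20):
    scores = [rmse(weights + np.eye(n)[i]) for i in range(n)]
    if min(scores) >= rmse(weights):
        break
    weights[int(np.argmin(scores))] += 1
print("weights:", weights, "rmse:", round(rmse(weights), 3))
```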
- Quick Tutorials
- Tsfresh
- Fold
- Neural Prophet or TimesFM
- Darts
- Functime
- Pytimetk
- Sktime / Aeon
- Awesome Collection
- Video with great ideas
- Tutorial Kaggle Notebook
- Think about adding external datasets like related Google Trends searches, PyPI package downloads, Statista, weather data, ...
- TabPFN for time series
- Kaggle
- MLContest. They also share a "State of Competitive Machine Learning" report every year (2023) and summaries on the state of the art for "Tabular Data".
- Humyn
- DrivenData
- Xeek
- Cryptopond