Tabular
Advice for tabular data
Neural Networks
Neural networks take the longest to train and require extra preprocessing such as normalisation; the same normalisation, fitted on the training data, must be applied at inference time as well. They can give great results and extrapolate well, but only if you choose your hyperparameters carefully and take care to avoid overfitting.
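As a minimal scikit-learn sketch of that point, fit the normalisation on the training data only and reuse the fitted scaler at inference; the data and variable names here are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy training features
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))     # toy features seen at inference

# Fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# At inference time, apply the *same* fitted scaler; never re-fit on new data
X_new_scaled = scaler.transform(X_new)
```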
Conclusion
We suggest starting your analysis with a random forest: it gives you a strong, reliable baseline. You can then use that model for feature selection and partial dependence analysis to get a better understanding of your data.
From that foundation, you can try gradient boosting and neural nets; if they give significantly better results on your validation set in a reasonable amount of time, use them.
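A rough sketch of that workflow with scikit-learn, on synthetic data and with illustrative settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Random forest baseline
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("Validation accuracy:", rf.score(X_valid, y_valid))

# Feature selection: rank features by impurity-based importance
ranking = np.argsort(rf.feature_importances_)[::-1]
print("Features ranked by importance:", ranking)

# Partial dependence of the prediction on the two most important features
PartialDependenceDisplay.from_estimator(rf, X_valid, features=list(ranking[:2]))
```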
Neural Nets
TO-DO: Read
- Tweet
- Papers with Code Newsletter #12: Deep Learning for Tabular Data (1 July 2021)
- Arxiv paper: Deep Neural Networks and Tabular Data: A Survey (5 Oct 2021)
- Factorization Machines
- Read notes on matrix factorization machines
- Code: LibFM in Keras
- TF implementation of an arbitrary-order (>=2) Factorization Machine, based on the paper “Factorization Machines with libFM”.
- DeepFM (Mar 2017)
- xDeepFM (Mar 2018)
- Neural nets for Airbnb search (Oct 2018)
- TabNet: Attentive Interpretable Tabular Learning (Aug 2019)
- NODE: Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data (Sep 2019)
- Graph NNs: DL on Relational DBs with Graph NNs (Feb 2020)
- GrowNet: Gradient Boosting Neural Networks (Feb 2020)
- Shallow NNs as “weak learners” in a gradient boosting framework
- Incorporates second-order statistics, a corrective step and a dynamic boost rate to remedy pitfalls of gradient boosted trees
- Reported to outperform XGBoost (see the boosting sketch after this list)
- TabTransformer: Tabular Data Modeling Using Contextual Embeddings (Dec 2020)
- My idea: residual connections in the MLP?
- Frameworks
- DeepTables
- PyTorch Tabular
- Microsoft Hummingbird: converts trained tree models into neural network (tensor) computations
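To make the GrowNet idea above concrete, here is a minimal sketch of gradient boosting with shallow neural networks as weak learners: with the squared loss, each stage simply fits a small MLP to the current residuals (the negative gradient). This omits GrowNet's second-order statistics and corrective step, and all names and settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.3
prediction = np.zeros_like(y, dtype=float)
ensemble = []

# Each boosting stage fits a shallow (one hidden layer) NN to the residuals,
# i.e. the negative gradient of the squared loss
for stage in range(10):
    residuals = y - prediction
    weak_learner = MLPRegressor(hidden_layer_sizes=(16,), max_iter=1000, random_state=stage)
    weak_learner.fit(X, residuals)
    prediction += learning_rate * weak_learner.predict(X)
    ensemble.append(weak_learner)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))
```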
Datasets
- UCI
- Adult: binary classification
- Glass Identification: multi-class classification
- Sklearn datasets
- Boston: regression, 506 samples
- Kaggle competitions
Benchmark
- https://forums.fast.ai/t/tabnet-with-fastai-v2/62600/19
- https://forums.fast.ai/t/some-baselines-for-other-tabular-datasets-with-fastai2/62627
- https://github.com/muellerzr/fastai2-Tabular-Baselines
Time Series
➕ Feature engineering
Get information about the current date (date variable); see the pandas sketch after the table.
| Date     | Day | Month | Year | Weekday | Weeknum | IsHoliday |
|----------|-----|-------|------|---------|---------|-----------|
| 1/1/2018 | 1   | 1     | 2018 | 2       | 1       | 1         |
| 2/1/2018 | 2   | 1     | 2018 | 3       | 1       | 0         |
| 3/1/2018 | 3   | 1     | 2018 | 4       | 1       | 0         |
| 4/1/2018 | 4   | 1     | 2018 | 5       | 1       | 0         |
| 5/1/2018 | 5   | 1     | 2018 | 6       | 1       | 0         |
| 6/1/2018 | 6   | 1     | 2018 | 7       | 1       | 0         |
| 7/1/2018 | 7   | 1     | 2018 | 1       | 2       | 0         |
| 8/1/2018 | 8   | 1     | 2018 | 2       | 2       | 0         |
| 9/1/2018 | 9   | 1     | 2018 | 3       | 2       | 0         |
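A minimal pandas sketch deriving the columns above from a raw date. Note the table uses a Sunday-first, 1-based weekday/week convention, while pandas defaults to Monday-first weekdays and ISO week numbers, so those two columns can differ on Sundays; the holiday calendar here is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2018-01-01", periods=9, freq="D")})

df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["Weekday"] = df["Date"].dt.dayofweek                       # Monday=0 (table above uses Sunday=1)
df["Weeknum"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO weeks (Monday-first)
holidays = pd.to_datetime(["2018-01-01"])                     # illustrative holiday calendar
df["IsHoliday"] = df["Date"].isin(holidays).astype(int)

print(df)
```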
Get information about the past (continuous variable)
| Date     | Sales | Lag1 | Lag2 | Moving average (2) |
|----------|-------|------|------|--------------------|
| 1/1/2018 | 100   | -    | -    | -                  |
| 2/1/2018 | 150   | 100  | -    | 100                |
| 3/1/2018 | 160   | 150  | 100  | 125                |
| 4/1/2018 | 200   | 160  | 150  | 155                |
| 5/1/2018 | 210   | 200  | 160  | 180                |
| 6/1/2018 | 150   | 210  | 200  | 205                |
| 7/1/2018 | 160   | 150  | 210  | 180                |
| 8/1/2018 | 120   | 160  | 150  | 155                |
| 9/1/2018 | 80    | 120  | 160  | 140                |
- Lag variables (autoregressive elements)
- Aggregated features on lagged variables:
- Moving Average (MA): average of the lags.
- Exponentially Weighted Moving Average (EWMA): more recent values have higher weight.
- Others, such as mean, std, sum, subtraction.
- Regression on lags (slope, intercept); see the sketch after this list.
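A minimal pandas sketch of these features on the series above; every feature is shifted one step so it only uses past values, and the column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=9, freq="D"),
    "Sales": [100, 150, 160, 200, 210, 150, 160, 120, 80],
})

past = df["Sales"].shift(1)  # yesterday's value; avoids leaking the present

df["Lag1"] = past
df["Lag2"] = df["Sales"].shift(2)
df["MA2"] = past.rolling(window=2).mean()   # moving average of the last 2 lags
df["EWMA"] = past.ewm(span=2).mean()        # recent values weigh more
df["Std3"] = past.rolling(window=3).std()   # an example aggregate on lags

# Slope of a linear fit over the last 3 lags (regression on lags)
df["Slope3"] = past.rolling(window=3).apply(
    lambda w: np.polyfit(np.arange(len(w)), w, 1)[0], raw=True
)

print(df)
```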
References
- Time Series in Driverless AI
- MLcourse.ai Time series analysis (Topic 9)