Tabular
Advice for tabular data
Neural Networks
Neural networks take the longest to train and require extra preprocessing such as normalisation; the same normalisation, fitted on the training data, must be applied at inference time as well. They can give great results and extrapolate well, but only if you choose your hyperparameters carefully and take care to avoid overfitting.
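As a minimal scikit-learn sketch of that point, fit the normalisation on the training data only and reuse the fitted scaler at inference; the data and variable names here are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy training features
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))     # toy features seen at inference

# Fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# At inference time, apply the *same* fitted scaler; never re-fit on new data
X_new_scaled = scaler.transform(X_new)
```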
Conclusion
We suggest starting your analysis with a random forest: it gives you a strong, reliable baseline. You can then use that model for feature selection and partial dependence analysis to get a better understanding of your data.
From that foundation, you can try gradient boosting and neural nets; if they give significantly better results on your validation set in a reasonable amount of time, use them.
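A rough sketch of that workflow with scikit-learn, on synthetic data and with illustrative settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Random forest baseline
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("Validation accuracy:", rf.score(X_valid, y_valid))

# Feature selection: rank features by impurity-based importance
ranking = np.argsort(rf.feature_importances_)[::-1]
print("Features ranked by importance:", ranking)

# Partial dependence of the prediction on the two most important features
PartialDependenceDisplay.from_estimator(rf, X_valid, features=list(ranking[:2]))
```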
Neural Nets
TO-DO: Read
- Tweet
- Papers with Code Newsletter #12: Deep Learning for Tabular Data (1 July 2021)
- Arxiv paper: Deep Neural Networks and Tabular Data: A Survey (5 Oct 2021)
- Factorization Machines
- Read notes on matrix factorization machines
- Code: LibFM in Keras
- TF implementation of an arbitrary-order (>=2) Factorization Machine, based on the paper “Factorization Machines with libFM”.
- DeepFM (Mar 2017)
- xDeepFM (Mar 2018)
- Neural nets for Airbnb search (Oct 2018)
- TabNet: Attentive Interpretable Tabular Learning (Aug 2019)
- NODE: Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data (Sep 2019)
- Graph NNs: DL on Relational DBs with Graph NNs (Feb 2020)
- GrowNet: Gradient Boosting Neural Networks (Feb 2020)
- Shallow NNs as “weak learners” in a gradient boosting framework
- Incorporates second-order statistics, a corrective step and a dynamic boost rate to remedy pitfalls of gradient boosted trees
- Reported to outperform XGBoost (see the boosting sketch after this list)
- TabTransformer: Tabular Data Modeling Using Contextual Embeddings (Dec 2020)
- My idea: residual connections in the MLP?
- Frameworks
- DeepTables
- PyTorch Tabular
- Microsoft Hummingbird: converts trained tree models into neural network (tensor) computations
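To make the GrowNet idea above concrete, here is a minimal sketch of gradient boosting with shallow neural networks as weak learners: with the squared loss, each stage simply fits a small MLP to the current residuals (the negative gradient). This omits GrowNet's second-order statistics and corrective step, and all names and settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.3
prediction = np.zeros_like(y, dtype=float)
ensemble = []

# Each boosting stage fits a shallow (one hidden layer) NN to the residuals,
# i.e. the negative gradient of the squared loss
for stage in range(10):
    residuals = y - prediction
    weak_learner = MLPRegressor(hidden_layer_sizes=(16,), max_iter=1000, random_state=stage)
    weak_learner.fit(X, residuals)
    prediction += learning_rate * weak_learner.predict(X)
    ensemble.append(weak_learner)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))
```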
Datasets
- UCI
- Adult: binary classification
- Glass Identification: multi-class classification
- Sklearn datasets
- Boston: regression, 506 samples
- Kaggle competitions
Benchmark
- https://forums.fast.ai/t/tabnet-with-fastai-v2/62600/19
- https://forums.fast.ai/t/some-baselines-for-other-tabular-datasets-with-fastai2/62627
- https://github.com/muellerzr/fastai2-Tabular-Baselines
Time Series
➕ Feature engineering
Get information about the current date (date variable); see the pandas sketch after the table.
| Date     | Day | Month | Year | Weekday | Weeknum | IsHoliday |
|----------|-----|-------|------|---------|---------|-----------|
| 1/1/2018 | 1   | 1     | 2018 | 2       | 1       | 1         |
| 2/1/2018 | 2   | 1     | 2018 | 3       | 1       | 0         |
| 3/1/2018 | 3   | 1     | 2018 | 4       | 1       | 0         |
| 4/1/2018 | 4   | 1     | 2018 | 5       | 1       | 0         |
| 5/1/2018 | 5   | 1     | 2018 | 6       | 1       | 0         |
| 6/1/2018 | 6   | 1     | 2018 | 7       | 1       | 0         |
| 7/1/2018 | 7   | 1     | 2018 | 1       | 2       | 0         |
| 8/1/2018 | 8   | 1     | 2018 | 2       | 2       | 0         |
| 9/1/2018 | 9   | 1     | 2018 | 3       | 2       | 0         |
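A minimal pandas sketch deriving the columns above from a raw date. Note the table uses a Sunday-first, 1-based weekday/week convention, while pandas defaults to Monday-first weekdays and ISO week numbers, so those two columns can differ on Sundays; the holiday calendar here is an illustrative assumption:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2018-01-01", periods=9, freq="D")})

df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["Weekday"] = df["Date"].dt.dayofweek                       # Monday=0 (table above uses Sunday=1)
df["Weeknum"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO weeks (Monday-first)
holidays = pd.to_datetime(["2018-01-01"])                     # illustrative holiday calendar
df["IsHoliday"] = df["Date"].isin(holidays).astype(int)

print(df)
```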
Get information about the past (continuous variable)
| Date     | Sales | Lag1 | Lag2 | Moving average (2) |
|----------|-------|------|------|--------------------|
| 1/1/2018 | 100   | -    | -    | -                  |
| 2/1/2018 | 150   | 100  | -    | 100                |
| 3/1/2018 | 160   | 150  | 100  | 125                |
| 4/1/2018 | 200   | 160  | 150  | 155                |
| 5/1/2018 | 210   | 200  | 160  | 180                |
| 6/1/2018 | 150   | 210  | 200  | 205                |
| 7/1/2018 | 160   | 150  | 210  | 180                |
| 8/1/2018 | 120   | 160  | 150  | 155                |
| 9/1/2018 | 80    | 120  | 160  | 140                |
- Lag variables (autoregressive elements)
- Aggregated features on lagged variables:
- Moving Average (MA): average of the lags.
- Exponentially Weighted Moving Average (EWMA): more recent values have higher weight.
- Others, such as mean, std, sum, subtraction.
- Regression on lags (slope, intercept); see the sketch after this list.
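A minimal pandas sketch of these features on the series above; every feature is shifted one step so it only uses past values, and the column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=9, freq="D"),
    "Sales": [100, 150, 160, 200, 210, 150, 160, 120, 80],
})

past = df["Sales"].shift(1)  # yesterday's value; avoids leaking the present

df["Lag1"] = past
df["Lag2"] = df["Sales"].shift(2)
df["MA2"] = past.rolling(window=2).mean()   # moving average of the last 2 lags
df["EWMA"] = past.ewm(span=2).mean()        # recent values weigh more
df["Std3"] = past.rolling(window=3).std()   # an example aggregate on lags

# Slope of a linear fit over the last 3 lags (regression on lags)
df["Slope3"] = past.rolling(window=3).apply(
    lambda w: np.polyfit(np.arange(len(w)), w, 1)[0], raw=True
)

print(df)
```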
References
- Time Series in Driverless AI
- MLcourse.ai Time series analysis (Topic 9)