Tabular

Advice for Tabular data

Neural Network

Neural networks take the longest to train and require extra preprocessing such as normalisation; the same normalisation statistics fitted on the training data must be reused at inference time. They can deliver great results and extrapolate well, but only if you tune the hyperparameters carefully and guard against overfitting.
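A minimal sketch of reusing training-time normalisation at inference, with scikit-learn (the data and layer sizes here are made up): bundling a `StandardScaler` with the model in a pipeline guarantees the statistics fitted on the training set are reapplied to new data automatically.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# Toy data (hypothetical): two features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0, 1000], scale=[1, 50], size=(200, 2))
y_train = X_train[:, 0] + X_train[:, 1] / 100

# The scaler is fitted ONLY on the training data, then bundled with
# the network so the same transform is applied at inference time.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

# New data is scaled with the training statistics automatically.
X_new = rng.normal(loc=[0, 1000], scale=[1, 50], size=(5, 2))
preds = model.predict(X_new)
print(preds.shape)
```

If you normalise manually instead of using a pipeline, persist the scaler alongside the model; refitting it on inference data silently shifts the inputs.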

Conclusion

We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it’s a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.
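A sketch of that baseline workflow with scikit-learn, using synthetic data as a stand-in for your own: fit a random forest, score it on a validation set, then rank features by permutation importance as a starting point for feature selection.

```python
import numpy as np
from sklearn.datasets import make_regression  # stand-in for your tabular data
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Random forest baseline with default-ish settings.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
print(f"baseline R^2: {rf.score(X_val, y_val):.3f}")

# Permutation importance on the validation set: shuffle each feature
# and measure how much the score drops.
imp = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=0)
ranked = np.argsort(imp.importances_mean)[::-1]
print("features ranked by importance:", ranked)
```

For the partial dependence analysis mentioned above, `sklearn.inspection.PartialDependenceDisplay.from_estimator(rf, X_val, features=[...])` plots how predictions respond to each feature.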

From that foundation, you can try Gradient Boosting and Neural Nets, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them.

Neural Nets

TO-DO Read

Datasets

Benchmark

  • https://forums.fast.ai/t/tabnet-with-fastai-v2/62600/19
  • https://forums.fast.ai/t/some-baselines-for-other-tabular-datasets-with-fastai2/62627
  • https://github.com/muellerzr/fastai2-Tabular-Baselines

Time Series

➕ Feature engineering

Get information about the current date (date variable)

Date       Day  Month  Year  Weekday  Weeknum  IsHoliday
1/1/2018    1     1    2018     2        1         1
2/1/2018    2     1    2018     3        1         0
3/1/2018    3     1    2018     4        1         0
4/1/2018    4     1    2018     5        1         0
5/1/2018    5     1    2018     6        1         0
6/1/2018    6     1    2018     7        1         0
7/1/2018    7     1    2018     1        2         0
8/1/2018    8     1    2018     2        2         0
9/1/2018    9     1    2018     3        2         0
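The table above can be reproduced with pandas datetime accessors. The weekday and week-number conventions (Sunday = 1, Sunday-first weeks) are assumptions inferred from the table's values, and the holiday calendar here is hypothetical:

```python
import pandas as pd

# The 9 days from the table above.
df = pd.DataFrame({"Date": pd.date_range("2018-01-01", periods=9, freq="D")})

df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
# Assumed convention from the table: Sunday=1 ... Saturday=7.
# pandas dayofweek is Monday=0 ... Sunday=6, so remap it.
df["Weekday"] = (df["Date"].dt.dayofweek + 1) % 7 + 1
# Assumed Sunday-first week number: %U counts weeks from the first
# Sunday (days before it are week 0), so add 1 to start at week 1.
df["Weeknum"] = df["Date"].dt.strftime("%U").astype(int) + 1
# Hypothetical holiday calendar; in practice use e.g. the `holidays` package.
holiday_dates = {pd.Timestamp("2018-01-01")}
df["IsHoliday"] = df["Date"].isin(holiday_dates).astype(int)
```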

Get information about the past (continuous variable)

Date      Sales  Lag1  Lag2  Moving average (2)
1/1/2018    100     -     -          -
2/1/2018    150   100     -        100
3/1/2018    160   150   100        125
4/1/2018    200   160   150        155
5/1/2018    210   200   160        180
6/1/2018    150   210   200        205
7/1/2018    160   150   210        180
8/1/2018    120   160   150        155
9/1/2018     80   120   160        140
  • Lag variables (autoregressive elements)
  • Aggregated features on lagged variables:
    • Moving Average (MA): average of the lags.
    • Exponentially Weighted Moving Average (EWMA): more recent values get higher weight.
    • Other aggregates such as mean, std, sum, difference.
    • Regression on the lags (slope, intercept).
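The lag and moving-average columns above can be built with pandas `shift` and `rolling`. Note the shift-before-aggregate pattern: it keeps the current day's sales out of its own feature, avoiding target leakage.

```python
import pandas as pd

# Sales series from the table above.
sales = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=9, freq="D"),
    "Sales": [100, 150, 160, 200, 210, 150, 160, 120, 80],
})

# Lag variables (autoregressive elements).
sales["Lag1"] = sales["Sales"].shift(1)
sales["Lag2"] = sales["Sales"].shift(2)

# Aggregates over PAST values only: shifting before aggregating keeps
# the current day's sales out of its own feature.
past = sales["Sales"].shift(1)
sales["MA2"] = past.rolling(2, min_periods=1).mean()  # moving average of lags
sales["EWMA"] = past.ewm(span=2).mean()               # recent lags weigh more
sales["Std2"] = past.rolling(2).std()                 # dispersion of lags
```

`min_periods=1` reproduces the table's behaviour on the second row, where only one lag exists yet.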

Reference