🔧 Hyperparameters
Data augmentation (High impact)
RandAugment, MixUp, CutMix
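A minimal sketch of these augmentations, assuming PyTorch/torchvision; the `num_ops`, `magnitude`, and `alpha` values are illustrative, not tuned recommendations:

```python
import torch
from torchvision import transforms

# RandAugment ships with recent torchvision versions as a regular transform.
train_tfms = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),  # illustrative settings
    transforms.ToTensor(),
])

def mixup(images, targets_onehot, alpha=0.2):
    """Minimal MixUp: blend each example (and its label) with a shuffled partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_targets = lam * targets_onehot + (1 - lam) * targets_onehot[perm]
    return mixed_images, mixed_targets

# CutMix follows the same recipe, but pastes a random rectangle from the
# shuffled batch instead of interpolating whole images.
```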
Input data representation
| Hyperparameter | Impact | Examples |
|---|---|---|
| Numerical encoding | High | Normalization: StandardScaler, QuantileTransformer, RankGauss (see the first sketch after this table) |
| Categorical encoding | High | One-hot or an embedding layer (embedding size ~ square root of the number of categories) |
| Image enc. (from scratch) | High | Standard scaling: compute the per-channel (RGB) mean and std on your own training data |
| Image enc. (transfer learning) | High | Standard scaling: reuse the mean and std of the pretraining dataset (see the second sketch after this table) |
| Image data augmentation | High | See … |
| Image size (resolution) | Medium | 120x120, 224x224, 512x512 (the higher the better) |
| CoordConv | Low | Append coordinate channels to CNN inputs so convolutions can use position (useful for spectrograms) |
| Positional Encoding | Low | Encode sequence order for CNNs or transformers |
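A sketch of the numerical and categorical rows above, using scikit-learn for the scalers and the square-root rule for the embedding size (the feature matrix and cardinality are made up):

```python
import numpy as np
import torch.nn as nn
from sklearn.preprocessing import StandardScaler, QuantileTransformer

X = np.random.rand(1000, 3)                       # fake numerical features
X_std = StandardScaler().fit_transform(X)         # zero mean, unit variance
X_gauss = QuantileTransformer(output_distribution="normal").fit_transform(X)  # RankGauss-style

num_categories = 120                              # cardinality of one categorical column
emb_dim = round(num_categories ** 0.5)            # square-root heuristic from the table
embedding = nn.Embedding(num_categories, emb_dim) # maps category ids to dense vectors
```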
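For the two image-encoding rows, the only difference is where the mean/std come from; the values below are the standard ImageNet statistics used by torchvision's pretrained models:

```python
from torchvision import transforms

# Transfer learning: reuse the statistics the backbone was pretrained with (ImageNet here).
normalize_pretrained = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                            std=[0.229, 0.224, 0.225])

# From scratch: compute your own per-channel statistics over the training set, e.g.
#   mean = images.mean(dim=(0, 2, 3)); std = images.std(dim=(0, 2, 3))   # images: (N, C, H, W)
# and pass them to transforms.Normalize instead.
```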
Model hyperparameters
| Hyperparameter | Impact | Examples |
|---|---|---|
| Layer size | High | |
| Num. of layers (depth) | Medium | |
| Weight Initialization | Medium | Xavier (Dense+tanh), Kaiming He (Dense+ReLU); see the first sketch after this table |
| Transfer Learning | High | Pretrained model frozen, new layers unfrozen |
| Nonlinearity (act fn) | Low | ReLU, GELU, Swish, Mish, GLU |
| Residual connections | Low | Needed if there are a lot of layers (>3) |
| - Stochastic Depth | Low | If there are residual connections, add this (paper) |
| Regularization | Medium | |
| - Dropout | Medium | Scale after dropout to maintain std=1 |
| - DropConnect | Low | |
| Inner normalization | | |
| - Batch normalization | | |
| - Layer normalization | | Usually before each layer |
| Weight tying | | If input and output share a representation (language models, autoencoders); see the second sketch after this table |
| Squeeze and Excitation | | paper |
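A sketch of the weight-initialization row with PyTorch's built-in init functions (the layer sizes are arbitrary):

```python
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        # Kaiming/He init pairs with ReLU; swap in nn.init.xavier_uniform_ for tanh layers.
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)   # applies init_weights recursively to every submodule
```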
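Weight tying as it appears in a language-model head, sketched in PyTorch (vocabulary and model sizes are made up):

```python
import torch.nn as nn

vocab_size, d_model = 10_000, 256
embedding = nn.Embedding(vocab_size, d_model)         # input: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output: vector -> token logits
lm_head.weight = embedding.weight                     # tie them: one shared (vocab, d_model) matrix
```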
Training hyperparameters
| Hyperparameter | Impact | Examples |
|---|---|---|
| Epochs | High | 3..6 if transfer learning, 20..40 if from scratch |
| Batch size | Medium | 16, 32, 64 |
| Learning rate | High | 0.001 is usually a good starting point, but use the Learning Rate Finder (see the LR Finder section) |
| LR schedule | Medium | Constant, or linear warmup + cosine decay (see the first sketch after this table) |
| Optimizer | Low | Adam or LAMB are good defaults |
| Early stopping | Low | Usually not necessary |
| loss (Regression) | High | MAE, MSE, MAPE, MSLE, Huber |
| loss (Classification) | High | BinaryCrossentropy, CategoricalCrossentropy, Hinge, Focal |
| loss + weight decay | Low | e.g. AdamW (first sketch after this table) |
| loss + label smoothing | Low | See the second sketch after this table |
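A sketch of the optimizer and LR-schedule rows: AdamW for decoupled weight decay, plus linear warmup followed by cosine decay implemented as a `LambdaLR` multiplier (the step counts and values are illustrative):

```python
import math
import torch

model = torch.nn.Linear(10, 1)                    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_steps, total_steps = 500, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                         # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))                # cosine decay: 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() after each optimizer.step().
```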
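The loss rows map directly onto built-in PyTorch criteria; the smoothing and delta values below are common defaults, not recommendations from the table:

```python
import torch.nn as nn

# Classification: label smoothing spreads a little probability mass over the
# non-target classes, which discourages overconfident logits.
clf_criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Regression: Huber behaves like MSE near zero and like MAE for large errors.
reg_criterion = nn.HuberLoss(delta=1.0)
```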
LR Finder
There are several ways of computing the learning rate automatically (a minimal range-test sketch follows the links below):


| Suggestion method | Developed by | Task in mind |
|---|---|---|
| Steep (green) | Leslie Smith (Fastai) | All tasks |
| Minimum (orange) | Leslie Smith (Fastai) | All tasks |
| Valley (red) | ESRI (default Fastai) | Vision tasks |
| Slide (purple) | Novetta | NLP tasks |
- https://www.novetta.com/2021/03/learning-rate/
- https://forums.fast.ai/t/automated-learning-rate-suggester/44199
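A minimal LR range test in the spirit of Leslie Smith's method and fastai's `lr_find` (the model, data loader, and loss are placeholders; the bounds and stopping rule are common conventions, not fixed values):

```python
import torch

def lr_range_test(model, loader, criterion, lr_min=1e-7, lr_max=10.0, num_steps=100):
    """Exponentially sweep the learning rate over a few batches and record the loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1 / num_steps)   # multiplicative LR increase per step
    lrs, losses = [], []
    data_iter = iter(loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:                      # restart the loader if it runs out
            data_iter = iter(loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        for group in optimizer.param_groups:       # raise the LR for the next batch
            group["lr"] *= gamma
        if losses[-1] > 4 * min(losses):           # stop once the loss blows up
            break
    return lrs, losses

# The "steep", "minimum", "valley", and "slide" suggestions in the table are
# different heuristics for picking a point on the resulting loss-vs-LR curve.
```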
References
- Paper: Revisiting ResNets: Improved Training and Scaling Strategies (Mar 2021)
- Paper: ResNet strikes back: An improved training procedure in timm (Oct 2021)