Preprocessing

	Tree-based models: Decision Tree, Random Forest, Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost	"Multiplicative" models: Linear Models (LM), Generalized Additive Model (GAM), Neural Networks (NN), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Naive Bayes (NB); dimensionality reduction models: PCA, t-SNE, UMAP
Ordinal categorical variable	Ordinal encoding. Other: frequency encoding	One-hot encoding. Other: embedding
Numerical variable	Nothing	StandardScaler (standardize) or MinMaxScaler. If the variable does not follow a normal distribution (skewed): np.log(1+x), np.sqrt(x+2/3), or Box-Cox
Text	CountVectorizer, TfidfVectorizer, HashingVectorizer, word embeddings
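
For the skewed numerical case in the table, the two closed-form transforms can be tried directly with NumPy. This is only a small sketch on synthetic log-normal data (the data and variable names are made up for illustration); it shows that both transforms reduce the sample skewness.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, non-negative feature (illustrative data only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

x_log = np.log1p(x)            # same as np.log(1 + x), more accurate for small x
x_sqrt = np.sqrt(x + 2 / 3)    # square-root transform with the 2/3 offset from the table

# Compare sample skewness before and after each transform.
print("raw:", stats.skew(x), "log1p:", stats.skew(x_log), "sqrt:", stats.skew(x_sqrt))
```

Both transforms assume non-negative inputs; np.log1p needs x > -1 and the square root needs x >= -2/3.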

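from sklearn import preprocessing, compose

# cat_vars and num_vars are assumed to be lists of categorical and numerical column names.

# Tree-based models: ordinal-encode the categorical columns and pass the rest through unchanged.
x_preprocessing_tree = compose.ColumnTransformer(transformers=[
    ('cat', preprocessing.OrdinalEncoder(),  cat_vars),
], remainder='passthrough')

# "Multiplicative" models: one-hot encode categorical columns, standardize numerical ones, drop the rest.
x_preprocessing_mult = compose.ColumnTransformer(transformers=[
    ('cat', preprocessing.OneHotEncoder(),  cat_vars),
    ('num', preprocessing.StandardScaler(), num_vars),
], remainder='drop')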
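
As a rough usage sketch, the two preprocessors can be placed in front of a model with a Pipeline. The toy DataFrame, the column names, and the choice of RandomForestClassifier and LogisticRegression are assumptions made only for this example.

```python
import pandas as pd
from sklearn import compose, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two categorical and two numerical columns.
df = pd.DataFrame({
    "color":  ["red", "blue", "green", "blue", "red", "green"],
    "size":   ["S", "M", "L", "M", "S", "L"],
    "price":  [10.0, 12.5, 30.0, 11.0, 9.5, 28.0],
    "weight": [1.0, 1.2, 3.1, 1.1, 0.9, 2.9],
})
y = [0, 0, 1, 0, 0, 1]
cat_vars = ["color", "size"]
num_vars = ["price", "weight"]

# Tree-based models: ordinal codes for categories, numerical columns untouched.
tree_prep = compose.ColumnTransformer(
    [("cat", preprocessing.OrdinalEncoder(), cat_vars)],
    remainder="passthrough",
)
# "Multiplicative" models: one-hot categories plus standardized numerical columns.
mult_prep = compose.ColumnTransformer(
    [("cat", preprocessing.OneHotEncoder(), cat_vars),
     ("num", preprocessing.StandardScaler(), num_vars)],
    remainder="drop",
)

tree_model = make_pipeline(tree_prep, RandomForestClassifier(n_estimators=10, random_state=0)).fit(df, y)
linear_model = make_pipeline(mult_prep, LogisticRegression()).fit(df, y)

X_tree = tree_prep.fit_transform(df)   # 2 ordinal columns + 2 passthrough columns
X_mult = mult_prep.fit_transform(df)   # 3 + 3 one-hot columns + 2 scaled columns
print(X_tree.shape, X_mult.shape)
```

Keeping the preprocessing inside the pipeline means the encoders and the scaler are fitted only on the data passed to fit, which matters when cross-validating.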

Numerical variables

TO-DO: Scaling and Normalization

Feature Scaling and the effect of standardization for machine learning algorithms

RankGauss (aka QuantileTransformer)

It is based on a rank transformation:

Assign a linspace to the sorted features from 0..1,
Apply the inverse of the error function (ErfInv) to shape them like Gaussians,
Subtract the mean.

This usually works much better than the standard mean/std scaler or min/max scaling.

from sklearn.preprocessing import QuantileTransformer
RankGauss = QuantileTransformer(n_quantiles=100, random_state=0, output_distribution="normal")
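
The three steps listed above can also be sketched by hand with NumPy and SciPy. This is a minimal illustration, not a reference implementation: the function name rank_gauss and the eps parameter are made up here, and the ranks are rescaled to the open interval (-1, 1) and clipped because erfinv is infinite at exactly -1 and 1.

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def rank_gauss(x, eps=1e-6):
    """Hand-rolled RankGauss sketch for a 1-D array with more than one value."""
    x = np.asarray(x, dtype=float)
    ranks = rankdata(x, method="average")        # ranks 1..n, ties share their average rank
    u = (ranks - 1) / (len(x) - 1)               # step 1: evenly spaced values in [0, 1]
    u = np.clip(2 * u - 1, -1 + eps, 1 - eps)    # rescale to (-1, 1), the domain of erfinv
    g = erfinv(u)                                # step 2: inverse error function
    return g - g.mean()                          # step 3: subtract the mean

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)                   # skewed toy feature
g = rank_gauss(x)
print(g.mean(), g.std())
```

QuantileTransformer with output_distribution="normal" applies the normal quantile function instead, which equals sqrt(2) * erfinv(2u - 1), so its output differs from this sketch mainly by a scale factor and by how the extreme ranks are handled.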

Variance Threshold

VarianceThreshold is a method of feature selection.
It removes all features whose variance doesn’t meet some threshold.
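
A small sketch with scikit-learn's VarianceThreshold on a made-up matrix follows. With threshold=0.0 (the default) only constant columns are removed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant, so its variance is 0.
X = np.array([
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 4.0],
    [0.0, 3.0, 1.0],
    [0.0, 2.0, 3.0],
])

selector = VarianceThreshold(threshold=0.0)   # keep features with variance > threshold
X_reduced = selector.fit_transform(X)
kept = selector.get_support()                 # boolean mask of the columns that were kept

print(X_reduced.shape, kept)
```

Variance depends on the scale of each feature, so a non-zero threshold is only comparable across features that share units or have been scaled the same way.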

Map data to a normal distribution: Box-Cox

A Box-Cox transformation is a generic way to transform non-normal variables into an approximately normal shape. It requires strictly positive values.

Lambda value (λ)	Transformed data
-3	Y⁻³ = 1/Y³
-2	Y⁻² = 1/Y²
-1	Y⁻¹ = 1/Y¹
-0.5	Y⁻⁰·⁵ = 1/√Y
0	log(Y)
0.5	Y⁰·⁵ = √Y
1	Y¹
2	Y²
3	Y³
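
The table gives the simplified power forms. The usual definition is (Y^λ - 1)/λ for λ ≠ 0 and log(Y) for λ = 0, which is a shifted and rescaled version of the same powers. Below is a small sketch using scipy.stats.boxcox, which estimates λ by maximum likelihood; the synthetic data and the helper box_cox are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic strictly positive, right-skewed data (illustrative only).
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)

# With no lambda given, SciPy returns the transformed data and the fitted lambda.
y_bc, lam = stats.boxcox(y)

def box_cox(y, lam):
    """Box-Cox transform for a fixed lambda (hypothetical helper for this example)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

print("lambda:", lam)
print("skew before:", stats.skew(y), "after:", stats.skew(y_bc))
```

Inside a scikit-learn pipeline, PowerTransformer(method="box-cox") performs the same kind of fit per column.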

Categorical features

Ordinal Encoding or Label Encoding	One-Hot Encoding

Target Encoding or Mean Encoding
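
Target (mean) encoding replaces each category with an average of the target for that category. The sketch below uses pandas on a made-up table; the smoothing toward the global mean and the strength m = 2.0 are assumptions added for illustration, not something these notes prescribe.

```python
import pandas as pd

# Hypothetical training data: one categorical feature and a binary target.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

global_mean = df["target"].mean()
per_city = df.groupby("city")["target"].agg(["mean", "count"])

# Smoothed mean: categories with few rows are pulled toward the global mean.
m = 2.0  # hypothetical smoothing strength
smoothed = (per_city["count"] * per_city["mean"] + m * global_mean) / (per_city["count"] + m)

df["city_te"] = df["city"].map(smoothed)
print(df)
```

Because the encoding uses the target, it should be computed on training data only (ideally out-of-fold) and then mapped onto validation and test rows; otherwise the target leaks into the feature.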

Feature engineering = CREATIVITY + DOMAIN KNOWLEDGE

Feature engineering is the creation of new features from the existing ones. This makes the job easier for our models.

If you have the price of a house and its square meters, you can add the price per square meter.
If you have the distance along the x and y axes, you can add the straight-line distance using the Pythagorean theorem.
If you have prices, you can add the fractional part, because people perceive it very subjectively.
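
The three ideas above can be sketched with pandas and NumPy. All column names and values here are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data: price, area, and offsets along the x and y axes.
houses = pd.DataFrame({
    "price":   [250000.0, 180000.0],
    "area_m2": [100.0, 60.0],
    "dx":      [3.0, 6.0],
    "dy":      [4.0, 8.0],
})
houses["price_per_m2"] = houses["price"] / houses["area_m2"]   # ratio feature
houses["distance"] = np.hypot(houses["dx"], houses["dy"])      # sqrt(dx**2 + dy**2)

# Hypothetical retail prices: keep the fractional part as its own feature.
items = pd.DataFrame({"price": [9.99, 20.00, 4.50]})
items["price_frac"] = (items["price"] % 1).round(2)            # e.g. 9.99 -> 0.99

print(houses)
print(items)
```

Rounding the fractional part to two decimals avoids floating-point residue such as 0.9900000000000002.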

Discover Feature Engineering, How to Engineer Features and How to Get Good at It

Discussion of feature engineering on Quora