Scaling features before splitting into train/test.
→Split first, then fit transformers on train only and apply (`transform`) to test. Wrap steps in a scikit-learn Pipeline.
Why: Fitting on the full dataset leaks test statistics into training and inflates evaluation scores.
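A minimal sketch of the split-then-fit order, assuming synthetic data and a `LogisticRegression` classifier chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch, not from any real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# Split first, so test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# fit() learns the scaler's mean/variance from X_train only;
# score() reuses those training statistics to transform X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
```

The fitted `scaler.mean_` matches the training-set column means, not the full-dataset means.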
A numeric column has 8% missing values.
→Impute with median (robust to skew) via `SimpleImputer`; consider a missing-indicator flag.
Why: Median resists outliers; an indicator preserves signal when missingness itself is informative.
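A small sketch of median imputation with a missing-indicator flag, using a toy column whose values are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training column with one missing value and one outlier (100.0).
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])
X_test = np.array([[np.nan], [5.0]])

# add_indicator=True appends a 0/1 column marking rows that were missing.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)

# The test set is filled with the median learned from the training data.
X_test_imp = imputer.transform(X_test)
```

Here the learned median is 3.0 (the middle of 1, 2, 4, 100), whereas the mean would have been pulled to 26.75 by the outlier.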
A categorical column has gaps.
→Impute with the mode or an explicit "Unknown" / "Missing" category.
Why: An explicit category keeps the missingness pattern as a usable signal rather than discarding rows.
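Both options can be sketched with `SimpleImputer`; the colour values below are illustrative only:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column stored as objects, with one gap.
X = np.array([["red"], [np.nan], ["blue"], ["red"]], dtype=object)

# Option 1: an explicit category that keeps missingness visible.
as_missing = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(X)

# Option 2: fill with the most frequent level (the mode).
as_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)
```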
Low-cardinality nominal feature (e.g. region with 5 values).
→Apply one-hot encoding (`OneHotEncoder`); drop one level (`drop="first"`) if the model is sensitive to perfect collinearity, as unregularised linear models are.
Why: One-hot avoids imposing a false order on nominal categories; dropping a level prevents the dummy trap.
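A short sketch comparing full one-hot output with the drop-one-level variant; the five region names are assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

regions = np.array([["north"], ["south"], ["east"], ["west"], ["central"], ["north"]])

# Full one-hot: one column per level, exactly one 1 per row.
full = OneHotEncoder().fit_transform(regions).toarray()

# drop="first" removes the first level in sorted order ("central" here),
# so the remaining columns are no longer perfectly collinear with an intercept.
reduced = OneHotEncoder(drop="first").fit_transform(regions).toarray()
```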
Feature has a natural order (low / medium / high).
→Use ordinal encoding that preserves rank.
Why: One-hot would discard the ordering; rank-aware encoding lets the model exploit it.
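A minimal sketch with `OrdinalEncoder`; passing `categories` explicitly matters, because the default alphabetical order would rank "high" below "low":

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = np.array([["low"], ["high"], ["medium"], ["low"]])

# Spell out the rank order so that low < medium < high maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(levels)
```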
Categorical with thousands of levels (e.g. ZIP code).
→Use target/frequency encoding or grouping rather than one-hot.
Why: One-hot explodes dimensionality; target encoding is compact but must be fit inside CV to avoid leakage.
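A sketch of frequency encoding with pandas, assuming a hypothetical `zip` column; the frequencies come from the training frame only, and unseen levels fall back to 0:

```python
import pandas as pd

train = pd.DataFrame({"zip": ["10001", "10001", "94105", "60601", "10001", "94105"]})
test = pd.DataFrame({"zip": ["10001", "73301"]})  # "73301" never appears in train

# Relative frequency of each level, learned from the training rows only.
freq = train["zip"].value_counts(normalize=True)

train["zip_freq"] = train["zip"].map(freq)
# Levels unseen during training have no entry in freq; treat them as frequency 0.
test["zip_freq"] = test["zip"].map(freq).fillna(0.0)
```

Target encoding follows the same map-from-training pattern but uses the label mean per level, which is why it has to be fit within each CV fold.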
Features span very different scales before a distance-based model.
→StandardScaler (zero mean, unit variance) for roughly Gaussian features; MinMaxScaler to map the training range onto [0, 1].
Why: KNN, SVM, PCA, and models trained by gradient descent are scale-sensitive; tree-based models are not.
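A small sketch of both scalers on toy numbers (assumed for illustration), fitted on training rows and applied to a test row:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test = np.array([[4.0, 1500.0]])

std = StandardScaler().fit(X_train)
mm = MinMaxScaler().fit(X_train)

X_train_std = std.transform(X_train)  # each column: mean 0, variance 1
X_test_mm = mm.transform(X_test)      # scaled with the training min/max
```

Note that `X_test_mm` is `[[1.5, 0.25]]`: a test value beyond the training maximum lands outside [0, 1], because the bounds come from the training data.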
A right-skewed positive feature hurts a linear model.
→Apply a log or Box-Cox/Yeo-Johnson power transform to compress the tail.
Why: Reducing skew stabilises variance and linearises relationships for linear and distance-based models.
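A sketch on synthetic log-normal data (an assumption for illustration), comparing skewness before and after a log and a Yeo-Johnson transform; `skewness` is a small helper defined here, not a library function:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))  # right-skewed, positive

def skewness(a):
    """Sample skewness: third central moment divided by std cubed."""
    a = np.asarray(a, dtype=float).ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

# log1p computes log(1 + x), which also tolerates zeros.
logged = np.log1p(amounts)

# Yeo-Johnson estimates its power parameter from the data it is fitted on,
# so fit it on training data only in a real workflow.
pt = PowerTransformer(method="yeo-johnson")
transformed = pt.fit_transform(amounts)

skew_raw, skew_log, skew_yj = skewness(amounts), skewness(logged), skewness(transformed)
```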
Want to capture a non-linear age effect in a linear model.
→Bin the continuous feature into ranges (equal-width or quantile) and treat as categorical.
Why: Binning lets linear models capture step changes, at the cost of some information loss.
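Equal-width and quantile binning can be sketched with pandas `cut` and `qcut`; the ages are made up, and three bins is an arbitrary choice:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 60, 67, 80])

# Equal-width: three intervals of the same length across the observed range.
equal_width = pd.cut(ages, bins=3, labels=False)

# Quantile: three intervals holding roughly the same number of rows.
quantile = pd.qcut(ages, q=3, labels=False)

# Treat the bin id as categorical, e.g. by one-hot encoding it.
age_dummies = pd.get_dummies(equal_width, prefix="age_bin")
```

Equal-width bins can end up unevenly populated (5, 2, and 3 rows here), while quantile bins stay balanced.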
Genuine extreme values destabilise model training.
→Cap/winsorise at a percentile or use a robust scaler; delete only confirmed errors.
Why: Capping limits leverage of extremes while keeping the records; deletion loses real rare-event signal.
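A minimal winsorising sketch with NumPy, assuming synthetic data and 1st/99th percentile caps (the cut-offs are a judgment call, not a rule):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly ordinary values plus a handful of genuine extremes.
x_train = np.concatenate([rng.normal(100.0, 10.0, 995), [900.0, 950.0, 1000.0, 1200.0, 5000.0]])

# Learn the caps from the training data only.
lo, hi = np.percentile(x_train, [1, 99])

# Values beyond the caps are pulled in to the cap; no rows are dropped.
x_train_capped = np.clip(x_train, lo, hi)

# Reuse the same training caps for new data.
x_new_capped = np.clip(np.array([5000.0, 100.0]), lo, hi)
```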
Positive class is only 3% of training rows.
→Resample — SMOTE/oversample minority or undersample majority — fitting only on the training fold; or set class weights.
Why: Evaluation data must keep the real class ratio, since a rebalanced test set gives misleading metrics; resampling therefore belongs inside the training pipeline.
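A sketch of the two lighter-weight options, class weights and random oversampling of the training rows, on synthetic data with a 3% positive rate. SMOTE lives in the separate imbalanced-learn package and is not shown here:

```python
import numpy as np
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.array([0] * 97 + [1] * 3)  # 3% positive class

# Option 1: "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)

# Option 2: oversample the minority class with replacement, training rows only.
minority = X_train[y_train == 1]
minority_up = resample(minority, replace=True, n_samples=97, random_state=0)
X_balanced = np.vstack([X_train[y_train == 0], minority_up])
y_balanced = np.array([0] * 97 + [1] * 97)
```

Many scikit-learn classifiers also accept `class_weight="balanced"` directly, which avoids touching the data at all.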
Raw timestamps and amounts under-perform.
→Engineer features — day-of-week, time-since-last-event, ratios, aggregates per customer.
Why: Domain-informed derived features often add more lift than swapping the algorithm.
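A sketch of the derived features with pandas, using a hypothetical transaction table (`customer`, `ts`, and `amount` are assumed column names):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer": ["a", "a", "b", "a", "b"],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-03 18:30", "2024-01-04 12:00",
        "2024-01-06 08:15", "2024-01-07 20:00",
    ]),
    "amount": [20.0, 35.0, 10.0, 50.0, 30.0],
}).sort_values(["customer", "ts"])

# Calendar part: Monday is 0, Sunday is 6.
tx["day_of_week"] = tx["ts"].dt.dayofweek

# Hours since the same customer's previous transaction (NaN for their first one).
tx["hours_since_last"] = tx.groupby("customer")["ts"].diff().dt.total_seconds() / 3600

# Per-customer aggregate and a ratio against it.
# Caution: this mean uses all of a customer's rows, including later ones; for
# time-ordered problems, restrict aggregates to data available at prediction time.
tx["cust_mean_amount"] = tx.groupby("customer")["amount"].transform("mean")
tx["amount_ratio"] = tx["amount"] / tx["cust_mean_amount"]
```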
Hundreds of features, many redundant or noisy.
→Select via filter (correlation/mutual information), wrapper (RFE), or embedded (L1/tree importances) methods.
Why: Fewer, more relevant features reduce overfitting and training cost and improve interpretability.
Many correlated numeric features slow training and cause overfitting.
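One sketch per selection family, on synthetic data from `make_classification`; keeping 5 features and the specific estimators are arbitrary choices for illustration:

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=2, random_state=0
)

# Filter: score each feature independently by mutual information with the label.
mi = partial(mutual_info_classif, random_state=0)
filter_sel = SelectKBest(score_func=mi, k=5).fit(X, y)

# Wrapper: recursively drop the weakest feature according to a fitted model.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: keep the 5 features with the highest tree importances.
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=-np.inf,
    max_features=5,
).fit(X, y)

X_filtered = filter_sel.transform(X)
```

In a real workflow each selector would be fitted on training data only, ideally as a pipeline step.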
→Apply PCA to project onto top components capturing most variance; scale first.
Why: PCA removes multicollinearity and compresses dimensionality, trading some interpretability for stability.
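A sketch of scale-then-PCA on synthetic data built from two latent factors (an assumption made so that the columns are strongly correlated); the 95% variance target is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
# Six observed columns: noisy linear mixes of the two latent factors.
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))

# A float n_components keeps just enough components to explain that share of variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
Z = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
explained = pca.explained_variance_ratio_.sum()
```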
Multiple preprocessing steps must apply identically in train and serving.
→Chain imputers, encoders, and scalers in a `Pipeline` / `ColumnTransformer` fit only on training data.
Why: A single fitted pipeline guarantees consistent transforms and prevents leakage across folds.
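A sketch of one fitted object covering imputation, scaling, encoding, and the model; the column names, toy rows, and `LogisticRegression` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    "income": [30e3, 52e3, 41e3, np.nan, 88e3, 61e3],
    "region": pd.Series(["north", "south", np.nan, "north", "east", "south"], dtype=object),
    "churned": [0, 0, 1, 1, 0, 1],
})
features = ["age", "income", "region"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Unseen levels at serving time become an all-zero row instead of an error.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["region"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# Every statistic (medians, means, category lists) is learned here, from training rows.
model.fit(train[features], train["churned"])

# Serving: the same fitted object handles a raw row, including a gap and a new region.
serving_row = pd.DataFrame({
    "age": [np.nan],
    "income": [45e3],
    "region": pd.Series(["west"], dtype=object),
})
prediction = model.predict(serving_row)
n_transformed = model.named_steps["preprocess"].transform(train[features]).shape[1]
```

Passing `model` to `cross_val_score` refits all steps inside each fold, which is what keeps validation rows out of the fitted statistics.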
A raw date column adds little predictive value.
→Decompose into year, month, day-of-week, is-weekend, and cyclical sin/cos encodings.
Why: Models cannot read calendar semantics from a raw timestamp; explicit parts expose seasonality.
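A sketch of the decomposition with pandas and NumPy; the dates are arbitrary, and the sin/cos pair is shown for month only:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-06", "2024-06-17", "2024-12-31"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,                   # Monday=0 ... Sunday=6
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
})

# Cyclical encoding: place months on a circle so December sits next to January.
angle = 2 * np.pi * (parts["month"] - 1) / 12
parts["month_sin"] = np.sin(angle)
parts["month_cos"] = np.cos(angle)

# Distance between January (row 0) and December (row 2) on the circle.
jan_dec_gap = np.hypot(
    parts.loc[0, "month_sin"] - parts.loc[2, "month_sin"],
    parts.loc[0, "month_cos"] - parts.loc[2, "month_cos"],
)
```

As raw integers, January and December are 11 apart; on the circle they are as close as any two adjacent months.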