Machine Learning Cheatsheet

This cheatsheet attempts to give a high-level overview of the incredibly large field of Machine Learning. Please contact me for corrections/omissions.

Last updated: 1 January 2024

Background

  • Artificial Intelligence is the ability of a machine to demonstrate human-like intelligence.
  • Machine Learning is the field of study that gives computers the ability to learn without explicitly being programmed.
  • Machine Learning has become possible because of:
    • Massive labelled datasets, e.g. ImageNet
    • Improved hardware and compute, e.g. GPUs
    • Algorithms advancements, e.g. backpropagation

Machine Learning Lifecycle

  1. Problem Framing
  2. Data Assembly
  3. Model Training
  4. Model Evaluation
  5. Model Deployment

Problem Framing

  • State the goal of the system
  • Define success metrics
  • Verify you have the data needed to train a model
  • Identify the model’s inputs and outputs
  • Detail the constraints of the system1
    • model performance
    • usefulness threshold
    • false negatives vs. false positives
    • inference latency
    • inference cost
    • interpretability
    • freshness requirements and training frequency
    • online or batch
      • online: generate predictions after requests arrive, e.g. speech recognition
      • batch: generate predictions periodically before requests arrive, e.g. Netflix recommendations
    • cloud vs. edge vs. hybrid
      • cloud: no energy, power or memory constraints
      • edge: can work without unreliable connections, no network latency, fewer privacy concerns, cheaper
      • hybrid: common predictions are precomputed and stored on device
    • peak requests
    • number of users
    • confidence measurement (if confidence below threshold, discard, clarify or refer to humans)
    • privacy
      • annotation: can data be shipped to outside organisations?
      • storage: what data are you allowed to store and for how long?
      • third-party: can you share data with a third-party?
      • regulations

Data Assembly

Workflow

  1. Data Collection
  2. Exploratory Data Analysis (EDA)
  3. Data Preprocessing
  4. Feature Engineering
  5. Feature selection: remove features with low variance, recursive feature elimination, sequential feature selection
  6. Sampling Strategy: sequential, random, subset, weighted
  7. Data Splits: train-test-validation split, window splitting of time-series data
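
A minimal sketch of a train-validation-test split with scikit-learn (an 80/10/10 ratio is assumed; X and y are placeholder feature/label arrays); for time-series data, split by time instead of randomly:

    from sklearn.model_selection import train_test_split

    # Carve out the test set first, then split the remainder into train/validation.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=1/9,   # 1/9 of the remaining 90% ≈ 10% overall
        random_state=42, stratify=y_trainval)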

Data Collection2

Good data should:

  • have good predictive power (an expert should be able to make a correct prediction with the data)
  • have very few missing values (when missing values do occur, they should be explainable and occur randomly)
    • Missing Not At Random (MNAR): missing due to the value itself
    • Missing At Random (MAR): missing due to another observed variable
    • Missing Completely At Random (MCAR): no pattern to which values are missing
  • be labelled
  • be correct and accurate
  • be documented
  • be unbiased

Data Biases1

  • Sampling/selection bias
  • Under/over representation of subgroups
  • Human biases embedded in the data
  • Labelling bias
  • Algorithmic bias

Data Labelling1

  • Hand-labelling, data lineage (track where data/labels come from)
  • Use Labelling Functions (LFs) to label training data programmatically using different heuristics, including pattern matching, boolean search, database lookup, predictions from a legacy system, third-party models and crowd labels. LFs can be noisy, correlated and conflicting
  • Weak supervision, semi-supervision, active learning, transfer learning

Class Imbalance1

  • Collect more data
  • Data-level methods
    • Undersample majority class (can cause loss of information), e.g. Tomek Links makes decision boundaries clearer by finding pairs of close samples from opposite classes and removing the majority-class sample
    • Oversample minority class (can cause overfitting), e.g. Synthetic Minority Oversampling Technique (SMOTE) generates new synthetic minority samples
  • Algorithm-level methods
    • Cost-sensitive learning penalises the misclassification of minority class samples more heavily than majority class samples
    • Class-balanced loss gives more weight to rare classes
    • Focal loss gives more weight to difficult samples
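
A minimal PyTorch sketch of cost-sensitive learning via per-class weights in the loss (the weight values are illustrative):

    import torch
    import torch.nn as nn

    # Give the rare class (index 1) ten times the weight of the common class (index 0).
    class_weights = torch.tensor([1.0, 10.0])
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    logits = torch.randn(8, 2)          # batch of 8 samples, 2 classes
    labels = torch.randint(0, 2, (8,))  # ground-truth class indices
    loss = criterion(logits, labels)    # misclassifying the minority class costs more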

Data Preprocessing3

  • Missing Data
    • Collect more data
    • Drop row/column
    • Constant imputation
    • Univariate imputation: replace missing values with the column mean/median/mode
    • Multivariate imputation: use all available features to estimate missing values
    • Nearest neighbours imputation: use a Euclidean distance metric to find nearest neighbours
    • Add missing indicator column
  • Structured Data
    • Categorical: ordinal encoding, one-hot encoding
    • Numeric: discretisation, min-max normalisation, z-score normalisation (when variables follow a normal distribution), log scaling (when variables follow an exponential distribution), power transform (mapping to Gaussian distribution using Yeo-Johnson or Box-Cox transforms)
  • Unstructured Data
    • Audio:
    • Images: decode, resize, normalise
    • Text: normalisation (lower-casing, punctuation removal, strip whitespaces, strip accents, lemmatisation and stemming), tokenisation, token to IDs
    • Videos:
  • Dimensionality Reduction
    • Principal Component Analysis (PCA): project the data onto a smaller set of orthogonal components (linear combinations of the original features) that capture most of the variance
    • Feature agglomeration: use Hierarchical Clustering to group features that behave similarly
  • Feature Crossing
    • Combine two or more features to create a new feature
  • Positional Embeddings
    • Can be either learned or fixed (Fourier features)
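
A sketch of a preprocessing pipeline for structured data with scikit-learn (the column names are hypothetical):

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # univariate imputation
        ("scale", StandardScaler()),                   # z-score normalisation
    ])
    categorical = OneHotEncoder(handle_unknown="ignore")

    preprocess = ColumnTransformer([
        ("num", numeric, ["age", "income"]),
        ("cat", categorical, ["country", "device"]),
    ])
    X_processed = preprocess.fit_transform(X)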

Data Leakage1

  • Splitting time-correlated data randomly instead of by time
  • Preprocessing data before splitting, e.g. using the whole dataset to generate global statistics like the mean and using it to impute missing values
  • Poor handling of data duplication before splitting
  • Group leakage: groups of examples with strongly correlated labels are divided across different splits
  • Leakage from data collection process, e.g. doctors sending high-risk patients to a better scanner
  • Detect data leakage by measuring correlation of a feature with labels, feature ablation study, monitoring model performance when new features are added
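
To avoid the preprocessing leak above, fit imputers/scalers on the training split only and reuse the fitted statistics on the other splits, e.g.:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics computed on train only
    X_val_scaled = scaler.transform(X_val)          # reuse the train statistics
    X_test_scaled = scaler.transform(X_test)        # never fit on val/test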

Feature Engineering

Example features for an event recommendation system:

  • Events: event price
  • Location: walk score, transit score, same region
  • Social: number of people attending the event, attendance by friends, invited by another user, hosted by a friend, attendance of events by the same host
  • Time: remaining time until event begins, estimated travel time
  • User: age, gender

Model Training: Overview

Workflow

  1. Decide whether to train from scratch or fine-tune existing model
  2. Choose loss function
  3. Establish a simple baseline
  4. Experiment with simple models
  5. Switch to more complex models
  6. Use an ensemble of models

Key Concepts

  • Curse of Dimensionality: As the number of features in a dataset increases, the volume of the feature space increases so fast that the available data becomes sparse. This makes it hard to have enough data to give meaningful results, leading to overfitting.
  • Overfitting and Underfitting: overfitting occurs when a model learns the training data too well and can’t generalise to unseen data, while underfitting happens when a model isn’t powerful enough to model the training data.

    [Figure: decision boundaries and regression fits illustrating underfitting, good fit and overfitting for classification and regression]
  • Bias and Variance: Bias refers to the error introduced by approximating a real-world problem with a simplified model. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting). Variance refers to the amount by which a model would change if estimated using a different training dataset. High variance can cause an algorithm to model random noise in the training data, not the intended outputs (overfitting).

    [Figure: bullseye diagrams illustrating the four combinations of low/high bias and low/high variance]
  • Bias-Variance Trade-off: As you increase the complexity of your model, you will typically decrease bias but increase variance. On the other hand, if you decrease the complexity of your model, you increase bias and decrease variance.
  • Vanishing/Exploding Gradients: When training a deep neural network, gradients are multiplied layer by layer during backpropagation. If those values are small (e.g. squashed by saturating activation functions), the products shrink towards zero, the weights of early layers stop updating and training stalls (vanishing gradients). If the gradients become too large, they “explode”, causing model weights to update too drastically and making training unstable.
  • Universal Approximation Theorem: A neural network with a single hidden layer can approximate any continuous function for inputs within a specific range.
  • Learning Curve: Model performance as a function of the number of training examples; useful for estimating whether performance could improve with more data.

Debugging

  • Overfit model on a subset of data
  • Look out for exploding gradients (use gradient clipping)
  • Turn on detect_anomaly so that any backward computation that generates NaN will raise an error.

Performance Tuning45

  • Enable asynchronous data loading and augmentation using num_workers > 0 and pin_memory = True
  • Disable bias for convolutions before batch norms
  • Use learning rate scheduler
  • Use mixed precision
  • Accumulate gradients by running a few small batches before doing a backward pass
  • Saturate GPU by maxing-out batch size (downside: higher batch sizes may cause training to get stuck in local minima)
  • Use Distributed Data Parallel (DDP) for multi-GPU training
  • Clip gradients to avoid exploding gradients
  • Disable gradient calculation for val/test/predict
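
A PyTorch sketch combining several of these tips (model, criterion, optimizer and train_dataset are assumed to exist):

    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                        num_workers=4, pin_memory=True)  # asynchronous data loading

    scaler = torch.cuda.amp.GradScaler()                 # mixed precision
    for inputs, targets in loader:
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients
        scaler.step(optimizer)
        scaler.update()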

Hyperparameter Optimisation (HPO)

  • Grid search: exhaustively search within bounds
  • Random search: randomly search within bounds
  • Bayesian search: model the objective as a Gaussian process and pick the next hyperparameters to evaluate using an acquisition function
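
A sketch of grid and random search with scikit-learn (the estimator and parameter grid are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}

    grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)  # exhaustive
    rand = RandomizedSearchCV(RandomForestClassifier(), param_grid,
                              n_iter=4, cv=5, random_state=0)        # random subset
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)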

Neural Architecture Search1 (NAS)

  • Search Space: set of operations (e.g. convolutions, fully-connected layers, pooling) and how they can be connected
  • Search Strategy: random, reinforcement learning, evolution

Cross-validation3 (CV)

  • K-fold: divide samples into \(k\) folds; train model on \(k-1\) folds and evaluate using the left out fold
  • Leave One Out (LOO): train model on all samples except one and evaluate using the left out sample
  • Stratified K-fold: similar to K-fold, but each fold contains the same class balance as the full dataset
  • Group K-fold: similar to K-fold, but ensure that groups (samples from the same data source) do not span different folds
  • Time Series Split: to ensure only past observations are used to predict future observations, train model on first \(n\) folds and evaluate on the \(n+1\)-th fold
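
A sketch of these strategies with scikit-learn (X, y and group_ids are placeholders):

    from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=0).split(X, y):
        ...  # each fold preserves the class balance of the full dataset

    for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=group_ids):
        ...  # samples from the same group never span different folds

    for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
        ...  # only past observations are used to predict future ones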

Model Training: PyTorch4

Optimisers

  • Adam
  • Momentum
  • RMSProp
  • Stochastic Gradient Descent (SGD)

Initialisations

  • Kaiming
  • Xavier
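
A PyTorch sketch of picking an optimiser and applying Kaiming initialisation (the two-layer model is a placeholder; swap in nn.init.xavier_uniform_ for Xavier):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    def init_weights(module):
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")  # Kaiming
            nn.init.zeros_(module.bias)

    model.apply(init_weights)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Alternatives: torch.optim.SGD(..., momentum=0.9), torch.optim.RMSprop(...)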

Regularisation

  • Augmentation
    • Image: random crop, saturation, flip, rotation, translation, perturb using random noise
    • Text: swap with synonyms, add degree adverbs, perturb with random word replacements
  • Data synthesis
    • Image: mixup (inputs and labels are linear combination of multiple classes)
    • Text: template-based, language model-based
  • Dropout
  • Early stopping
  • L1 regularisation (Lasso) tends to lead to sparsity because it penalises the absolute value of the weights, thereby encouraging some weights to go to zero (it can also be used for feature selection).
  • L2 regularisation (Ridge) penalises the square of the weights, thereby pushing them closer to zero but not necessarily to zero. This leads to models that are less sparse and can better manage multicollinearity.
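
A PyTorch sketch of dropout, an L2 penalty via weight decay, and a simple early-stopping loop (train_one_epoch and the patience value are hypothetical):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                          nn.Dropout(p=0.5),              # dropout
                          nn.Linear(64, 10))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 weight_decay=1e-2)       # L2 penalty on the weights

    best_val, patience, bad_epochs = float("inf"), 5, 0
    for epoch in range(100):
        val_loss = train_one_epoch(model, optimizer)      # hypothetical helper
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping
                break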

Pooling

  • Average pool
  • Min pool
  • Max pool

Normalisation

  • Batch Norm
  • Layer Norm
  • Group Norm
  • Instance Norm
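
A sketch of the pooling and normalisation layers as exposed in PyTorch (tensor shapes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 32, 28, 28)  # (batch, channels, height, width)

    pooled = nn.MaxPool2d(kernel_size=2)(x)              # max pool; nn.AvgPool2d for average pool
    bn = nn.BatchNorm2d(num_features=32)(x)              # normalise each channel over the batch
    ln = nn.LayerNorm([32, 28, 28])(x)                   # normalise over features per sample
    gn = nn.GroupNorm(num_groups=8, num_channels=32)(x)  # normalise over channel groups
    inorm = nn.InstanceNorm2d(num_features=32)(x)        # normalise each channel per sample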

Activations

  • Sigmoid: \(1 / (1 + e^{-x})\)
  • ReLU: \(\max(0, x)\)
  • Tanh: \(\tanh(x)\)
  • Leaky ReLU: \(\max(0.1x, x)\)

Loss Functions

  • Cross-Entropy measures how close the predicted probability distribution is to the true distribution

    \[l_n = - w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C \exp(x_{n,c})}\]
  • Connectionist Temporal Classification (CTC) is used where the alignment between input and output sequences is unknown

  • Kullback–Leibler (KL) Divergence measures how one probability distribution diverges from a second, expected probability distribution.

    \[L(y_{\text{pred}},\ y_{\text{true}}) = y_{\text{true}} \cdot (\log y_{\text{true}} - \log y_{\text{pred}})\]
  • Mean Absolute Error (L1) measures the average absolute difference between the estimated values and the actual values

    \[l_n = \left| x_n - y_n \right|\]
  • Mean Squared Error (Squared L2 Norm) measures the average squared difference between the estimated values and the actual value

    \[l_n = \left( x_n - y_n \right)^2\]
  • Negative Log Likelihood (NLL) measures the disagreement between the true labels and the predicted probability distributions,

    \[l_n = - w_{y_n} x_{n,y_n}\]
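
How several of these losses are expressed in PyTorch (tensor shapes are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    logits = torch.randn(8, 5)                    # 8 samples, 5 classes
    labels = torch.randint(0, 5, (8,))
    ce = nn.CrossEntropyLoss()(logits, labels)    # cross-entropy expects raw logits
    nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)  # equivalent to ce

    preds, targets = torch.randn(8), torch.randn(8)
    mae = nn.L1Loss()(preds, targets)             # mean absolute error
    mse = nn.MSELoss()(preds, targets)            # mean squared error

    log_p = F.log_softmax(torch.randn(8, 5), dim=1)     # predicted log-probabilities
    q = F.softmax(torch.randn(8, 5), dim=1)             # target probabilities
    kl = nn.KLDivLoss(reduction="batchmean")(log_p, q)  # KL divergence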

Distributed Training

  • Data parallelism: split the data across devices so that each device sees a fraction of the batch
  • Model parallelism: split the model across devices so that each device runs a fragment of the model

Model Evaluation: Responsible AI

Fairness

  • Slice-based evaluation, e.g. when working with website traffic, slice data among: gender, mobile vs. desktop, browser, location
  • Check for consistency over time
  • Determine slices by heuristics or error analysis

Explainability

  • Integrated Gradients: compute the contribution of each feature to a prediction by integrating gradients along the path from a baseline input to the actual input (see the sketch after this list)
  • LIME (Local Interpretable Model-agnostic Explanations): creates a simpler, interpretable model around a single prediction to explain how the model behaves at that specific instance.
  • Sampled Shapley: estimates the contribution of each feature by averaging over subsets of features sampled from the input data.
  • SHAP (SHapley Additive exPlanations): assigns each feature an importance value for a particular prediction, based on the concept of Shapley values from cooperative game theory
  • XRAI (eXplanation with Ranked Area Integrals): segments an input image and ranks the segments based on their contribution to the model’s prediction
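
As one concrete option, Integrated Gradients is implemented for PyTorch models in the Captum library (Captum is not mentioned above and is only an assumed choice). A minimal sketch, assuming model is a trained classifier and inputs is a batch of feature tensors:

    import torch
    from captum.attr import IntegratedGradients

    ig = IntegratedGradients(model)
    baseline = torch.zeros_like(inputs)  # all-zero baseline input
    attributions = ig.attribute(inputs, baselines=baseline,
                                target=0)  # per-feature contributions to class 0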

Compactness

  • Reduces memory footprint and increases computation speed
  • Quantisation: reduce model size by using fewer bits to represent parameters
  • Knowledge distillation: train a small model (student) to mimic the results of a larger model (teacher)
  • Pruning: remove nodes or set least useful parameters to zero
  • Low-ranked factorisation: replace convolution filters with compact blocks
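
A sketch of post-training dynamic quantisation in PyTorch, assuming model is a trained float32 network:

    import torch
    import torch.nn as nn

    # Replace Linear layers with 8-bit quantised versions for inference.
    quantised = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                    dtype=torch.qint8)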

Robustness

  • Determinism Test: ensure same outputs when predicting using same model
  • Retraining Invariance Test: ensure similar outputs when predicting using re-trained model
  • Perturbation Test: ensure small changes to numeric inputs don’t cause big changes to outputs
  • Input Invariance Test: ensure changes to certain inputs don’t cause changes to outputs
  • Directional Expectation Test: ensure changes to certain inputs cause predictable changes to outputs
  • Ablation Test: ensure all parts of the model are relevant for model performance
  • Fairness Test: ensure different slices have similar model performance
  • Model Calibration Test: ensure predicted probabilities match the observed frequency of events

Safety

  • Alignment:
  • Red Teaming: experts simulate potential attacks on a system to identify vulnerabilities, test defenses, and improve system security before actual attackers do

Model Evaluation: Metrics

Offline Metrics (Before Deployment)

  • Baseline
    • Predict at random (uniformly or following label distribution)
    • Zero rule baseline (always predict the most common class)
    • Simple heuristics
    • Human baseline
    • Existing solutions
  • Classification
    • Confusion Matrix

                         Actual Class 1       Actual Class 2
      Predicted Class 1  True-positive (TP)   False-positive (FP)
      Predicted Class 2  False-negative (FN)  True-negative (TN)
    • Type I error: FP
    • Type II error: FN
    • Precision: TP / (TP + FP), i.e. a classifier’s ability not to label a negative sample as positive
    • Recall or True-positive rate: TP / (TP + FN), i.e. a classifier’s ability to find all positive samples
    • False-positive rate: FP / (FP + TN), i.e. a classifier’s inability to find all negative samples
    • F1 score (harmonic mean of precision and recall): 2 × precision × recall / (precision + recall)
    • Precision-recall curve: trade-off between precision and recall, a higher PR-AUC indicates a more accurate model
    • Receiver operator characteristic (ROC) curve: trade-off between true-positive rate (recall) and false-positive rate, a higher ROC-AUC indicates a model better at distinguishing positive and negative classes (see the scikit-learn sketch at the end of this section)
  • Regression
    • Mean squared error (MSE): average of the squared differences between the predicted and actual values, emphasising larger errors
    • Mean absolute error (MAE): average of the absolute differences between the predicted and actual values, treating all errors equally
    • Root mean square error (RMSE): square root of the MSE, providing error in the same units as the predicted and actual values and emphasising larger errors like MSE
  • Object Recognition
    • Intersection over union (IOU): ratio of overlap area with union area
  • Ranking
    • Recall@k: proportion of relevant items that are included in the top-k recommendations
    • Precision@k: proportion of top-k recommendations that are relevant
    • Mean reciprocal rank (MRR): \(\frac{1}{m} \sum_{i=1}^m \frac{1}{\textrm{rank}_i}\), i.e. where is the first relevant item in the list of recommendations?
    • Hit rate: how often does the list of recommendations include something that’s actually relevant?
    • Mean average precision (mAP): mean of the average precision scores for each query
    • Diversity: measure of how different the recommended items are from each other
    • Coverage: what’s the percentage of items seen in training data that are also seen in recommendations?
    • Cumulative gain (CG): \(\sum_{i=1}^p rel_i\), i.e. sum of relevance scores obtained by a set of recommendations
    • Discounted cumulative gain (DCG): \(\sum_{i=1}^p \frac{\textrm{rel}_i}{\log_2(i+1)}\), i.e. CG discounted by position
    • Normalised discounted cumulative gain (nDCG): \(\frac{\textrm{DCG}_p}{\textrm{IDCG}_p}\), i.e. extension of CG that accounts for the position of the recommendations (discounting the value of items appearing lower in the list), normalised by maximum possible score
  • Image Generation
    • Fréchet inception distance (FID)
    • Inception score
  • Natural Language
    • BLEU
    • Perplexity: average “branching factor” per token
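
A sketch of the classification metrics above with scikit-learn (y_true, y_pred and y_score are placeholders for true labels, hard predictions and predicted probabilities):

    from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                                 f1_score, roc_auc_score, average_precision_score)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary case
    precision = precision_score(y_true, y_pred)         # TP / (TP + FP)
    recall = recall_score(y_true, y_pred)               # TP / (TP + FN)
    f1 = f1_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
    pr_auc = average_precision_score(y_true, y_score)   # summarises the PR curve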

Online Metrics (After Deployment)

  • Model’s impact on user behaviour and system performance
  • Event recommendation: conversion rate, bookmark rate, revenue lift
  • Safety: prevalence, harmful impressions, valid appeals, proactive rate, user reports per harmful class
  • Video recommendations: click-through-rate, video completion rate, total watch time, explicit user feedback

Model Deployment

Deployment Strategies (B to replace A)

  • Recreate strategy: stop A, start B
  • Ramped strategy: shift traffic from A to B behind same endpoint
  • Blue/green: shift traffic from A to B using different endpoint

Testing Strategies

  • Canary: targeting small set of users with latest version
  • Shadow: mirror incoming requests and route to shadow application
  • A/B: route to new application depending on rules or contextual data
  • Interleave: mix recommendations from A and B and see which recommendations are clicked on

Model Monitoring

  • Model monitoring is essential because while traditional software systems fail explicitly (error messages), Machine Learning systems fail silently (bad outputs)
  • Operation-related metrics:
    • Latency
    • Throughput
    • Requests per minute/hour/day
    • Percentage of successful requests
    • CPU/GPU utilisation
    • Memory utilisation
    • Availability
  • Machine Learning metrics1:
    • Feature and label statistics, e.g. mean, median, variance, quantiles, skewness, kurtosis, etc.
    • Task-specific online metrics

Continual Learning

  • Continually adapt models to changing data distributions

ML-specific Failures1

Train-serving skew is when a model performs well during development but poorly in production

  • Upstream Drift: caused by a discrepancy between how data is handled in the training and serving pipelines (should log features at serving time)6
  • Data Distribution Shifts: model may perform well when first deployed, but poorly over time (can be sudden, cyclic or gradual); see the drift-detection sketch after this list
    • Feature/covariate shift
      • Change in the distribution of input data, P(X), but relationship between input and output, P(Y|X), remains the same
      • E.g. when predicting sales based on weather, if weather patterns change (e.g. more rainy days), but the relationship between weather and sales remains constant (e.g. rainy days always lead to fewer sales)
      • In training, can be caused by changes to data collection, while in production, can be caused by changes to environment
    • Label shift
      • Change in the distribution of output labels, P(Y), but relationship between output and input, P(X|Y), remains the same
      • E.g. when predicting diseases, if a disease becomes more common but the symptoms for each disease remain constant
    • Concept drift
      • Change in the relationship between input and output, P(Y|X), but the distribution of input data, P(X), remains the same
      • E.g. when predicting rain from cloud patterns, if the cloud patterns remain the same but their association with rain changes (maybe due to climate change)
  • Degenerate Feedback Loops: when predictions influence the feedback, which is then used to extract labels to train the next iteration of the model
    • Recommender system example: originally, A is ranked marginally higher than B, so the model recommends A. After a while, A is ranked much higher than B. Can be detected using Average Recommended Popularity (ARP) and Average Percentage of Long Tail Items (APLT).
    • Resume screening example: originally, the model thinks X is a good feature, so it recommends resumes with X. After a while, hiring managers only hire people with X, and the model confirms X is a good feature. Can be mitigated using randomisation and positional features.
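
A minimal sketch of flagging a possible feature/covariate shift by comparing the training and serving distributions of a single feature with a two-sample Kolmogorov-Smirnov test (the threshold and variable names are illustrative):

    from scipy.stats import ks_2samp

    statistic, p_value = ks_2samp(train_feature_values, serving_feature_values)
    if p_value < 0.01:
        print("Possible covariate shift detected for this feature")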

Models: Supervised Learning

Supervised learning models make predictions after seeing lots of data with the correct answers. The model discovers the relationship between the data and the correct answers.

Models: Unsupervised Learning

Unsupervised learning involves finding patterns and structure in input data without any corresponding output data.

Models: Reinforcement Learning7

Models: Language Models

Models: Recommender Systems8

  • Behavioural principles
    • Similar items are symmetric, e.g. white polo shirts
    • Complementary items are asymmetric, e.g. after buying a television, suggest an HDMI cable

Rule-based


Embedding-based

  • Content-based Filtering
    • Item feature similarities
    • Pros: ability to recommend new items, ability to capture unique user interests
    • Cons: difficult to discover a user’s new interests, requires domain knowledge to engineer features
  • Collaborative Filtering
    • User-to-user similarities or item-to-item similarities
    • Pros: no domain knowledge needed, easy to discover users’ new areas of interest, efficient
    • Cons: cold-start problem, cannot handle niche interests
  • Hybrid Filtering
    • Parallel or sequential combination of content-based and collaborative filtering

Learning-to-Rank

  • Point-wise: model takes each item individually and learns to predict an absolute relevancy score
  • Pair-wise: model takes two ranked items and learns to predict which item is more relevant (RankNet, LambdaRank, LambdaMART)
  • List-wise: model takes optimal ordering of items and learns the ordering (SoftRank, ListNet, AdaRank)

Models: Ensembles

  • Bagging (Bootstrap Aggregation): reduces model variance by training identical models in parallel on different data subsets (random forests)
    • Pros: reduces overfitting, parallel training means little increase in training/inference time
    • Cons: not helpful for underfit models
  • Boosting: reduces model bias and variance by training several weak classifiers sequentially (Adaboost, XGBoost)
    • Pros: reduces bias and variance
    • Cons: slower training and inference
  • Stacking (Stacked Generalisation): reduces model bias and variance by training different models in parallel on the same dataset and using a meta-learner model to combine the results
    • Pros: reduces bias and variance, parallel training means little increase in training/inference time
    • Cons: prone to overfitting
  • Multiple Objective Optimisation (MOO) combines results of different models with weightings; decouple models with different objectives for easier training, tweaking and maintenance
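
A sketch of bagging, boosting and stacking with scikit-learn (X_train and y_train are placeholders):

    from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression

    bagging = RandomForestClassifier(n_estimators=300)        # bagging of decision trees
    boosting = GradientBoostingClassifier(n_estimators=300)   # sequential weak learners
    stacking = StackingClassifier(
        estimators=[("rf", bagging), ("gb", boosting)],
        final_estimator=LogisticRegression())                 # meta-learner combines base models
    stacking.fit(X_train, y_train)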