Machine Learning
This cheatsheet attempts to give a high-level overview of the incredibly large field of Machine Learning. Please contact me for corrections/omissions.
Last updated: 1 January 2024
Background
 Artificial Intelligence is the ability of a machine to demonstrate human-like intelligence.
 Machine Learning is the field of study that gives computers the ability to learn without explicitly being programmed.
 Machine Learning has become possible because of:
 Massive labelled datasets, e.g. ImageNet
 Improved hardware and compute, e.g. GPUs
 Algorithmic advancements, e.g. backpropagation
Machine Learning Lifecycle
 Problem Framing
 Data Assembly
 Model Training
 Model Evaluation
 Model Deployment
Problem Framing
 Identify business objective
 Review existing datasets
 Determine constraints (see below)
 Frame the business objective as a Machine Learning task
 Define success metrics
 Specify the model’s inputs and outputs
When to use Machine Learning?^{1}
Machine Learning is an approach to learn complex patterns from existing data and use these patterns to make predictions on unseen data.
 Learn: the system has the capacity to learn.
 Complex patterns: there are patterns to learn and they are complex.
 Existing data: data is available or it’s possible to collect data.
 Predictions: it’s a predictive problem.
 Unseen data: unseen data shares patterns with the training data.
Machine Learning systems work best when:
 Data is repetitive
 Cost of wrong predictions is low
 Problem is at scale
 Patterns are constantly changing
Constraints^{2}^{3}
Performance
 Cost of wrong predictions
 False negatives vs. false positives
 Interpretability
 Usefulness threshold
Training
 Freshness requirements
 Training frequency
Inference
 Computing power
 Confidence measurement (if confidence is below threshold: discard, clarify or refer to humans?)
 Cost
 Number of items
 Number of users
 Latency
 Peak requests
Online vs. Batch
 Online: generate predictions as requests arrive, e.g. speech recognition
 Batch: generate predictions periodically before requests arrive, e.g. Netflix recommendations
Cloud vs. Edge vs. Hybrid
 Cloud: no energy, power or memory constraints
 Edge: can work with unreliable or no network connection, no network latency, fewer privacy concerns, cheaper
 Hybrid: common predictions are precomputed and stored on device
Privacy
 Annotations: can data be shipped to outside organisations?
 Storage: what data are you allowed to store and for how long?
 Third-party: can you share data with a third party?
 Regulations: does the data comply with relevant data protection laws?
Data Assembly
 Data Collection: (see below)
 Exploratory Data Analysis (EDA): use visualisation and statistical techniques to understand the data’s structure, detect patterns and spot anomalies
 Data Preprocessing: (see below)
 Feature Engineering: (see below)
 Feature Selection: remove features with low variance, recursive feature elimination, sequential feature selection
 Sampling Strategy: sequential, random, stratified, weighted, reservoir, importance
 Data Splits: train-test-validation split, window splitting of time series data
Class Imbalance^{2}
Collect More Data
 Target underrepresented classes
Data-level Methods
 Undersample the majority class (can cause loss of information), e.g. Tomek Links makes decision boundaries clearer by finding pairs of close samples from opposite classes and removing the majority sample
 Oversample the minority class (can cause overfitting), e.g. Synthetic Minority Oversampling Technique (SMOTE)
Algorithm-level Methods
 Cost-sensitive learning penalises the misclassification of minority class samples more heavily than majority class samples
 Class-balanced loss gives more weight to rare classes
 Focal loss gives more weight to difficult samples
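As an illustration of cost-sensitive learning, a minimal plain-Python sketch (hypothetical probabilities and weights) shows how weighting scales the cross-entropy contribution of a minority-class sample:

```python
import math

# Cost-sensitive learning sketch with hypothetical numbers: weight the
# cross-entropy term so mistakes on the rare class cost more.
def weighted_ce(prob_true_class, weight):
    # Weighted cross-entropy for a single sample: -w * log(p_true)
    return -weight * math.log(prob_true_class)

# Both samples get probability 0.7 on their true class, but the
# minority-class sample (weight 5) contributes 5x the loss.
majority_loss = weighted_ce(0.7, weight=1.0)
minority_loss = weighted_ce(0.7, weight=5.0)
print(minority_loss / majority_loss)  # ~5.0
```

The same idea appears in libraries as a per-class `weight` argument on the loss function.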
Data Collection^{4}
Good data should:
 have good predictive power (an expert should be able to make a correct prediction with the data)
 have few missing values (when missing values do occur, they should be explainable and occur randomly)
 be labelled
 be correct and accurate
 be documented
 be unbiased
Data Biases^{2}
 Sampling/selection bias
 Under/over representation of subgroups
 Human biases embedded in the data
 Labelling bias
 Algorithmic bias
Data Labelling^{2}
 Hand-labelling, data lineage (track where data/labels come from)
 Use Labelling Functions (LFs) to label training data programmatically using different heuristics, including pattern matching, boolean search, database lookup, prediction from legacy system, third-party model, crowd labels
 Weak supervision, semi-supervision, active learning, transfer learning
Data Leakage^{2}
 Splitting time-correlated data randomly instead of by time
 Preprocessing data before splitting, e.g. using the whole dataset to generate global statistics like the mean and using it to impute missing values
 Poor handling of data duplication before splitting
 Group leakage: groups of examples have strongly correlated labels but are divided into different splits
 Leakage from the data collection process, e.g. doctors sending high-risk patients to a better scanner
 Detect data leakage by measuring correlation of a feature with labels, feature ablation studies, and monitoring model performance when new features are added
Data Preprocessing^{5}
Missing Data
 Collect more data
 Drop row/column
 Constant imputation
 Univariate imputation: replace missing values with the column mean/median/mode
 Multivariate imputation: use all available features to estimate missing values
 Nearest neighbours imputation: use a Euclidean distance metric to find the nearest neighbours
 Add missing indicator column
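A minimal sketch of univariate mean imputation combined with a missing-indicator column (hypothetical values, plain Python):

```python
import statistics

# Hypothetical feature column with missing values encoded as None
ages = [25, None, 31, 40, None, 28]

# Univariate imputation: replace missing entries with the column mean
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)  # mean of 25, 31, 40, 28 = 31
imputed = [a if a is not None else mean_age for a in ages]

# Missing-indicator column: 1 where the value was imputed
indicator = [1 if a is None else 0 for a in ages]

print(imputed)    # [25, 31, 31, 40, 31, 28]
print(indicator)  # [0, 1, 0, 0, 1, 0]
```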
Missing Values
 Missing Not At Random (MNAR): missing due to the value itself
 Missing At Random (MAR): missing due to another observed variable
 Missing Completely At Random (MCAR): no pattern to which values are missing
Structured Data
 Categorical: ordinal encoding, one-hot encoding
 Numeric: discretisation, min-max normalisation, z-score normalisation (when variables follow a normal distribution), log scaling (when variables follow an exponential distribution), power transform (mapping to a Gaussian distribution using Yeo-Johnson or Box-Cox transforms)
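A minimal sketch of min-max and z-score normalisation on a hypothetical numeric feature (plain Python):

```python
import statistics

# Hypothetical numeric feature
x = [2.0, 4.0, 6.0, 8.0]

# Min-max normalisation: rescale values to [0, 1]
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]

# Z-score normalisation: zero mean, unit (population) standard deviation
mu = statistics.mean(x)       # 5.0
sigma = statistics.pstdev(x)  # ~2.236
zscore = [(v - mu) / sigma for v in x]

print(minmax)  # [0.0, 0.333..., 0.666..., 1.0]
```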
Unstructured Data
 Audio: sampling, noise reduction, normalisation, feature extraction, silence removal
 Images: decoding, resizing, normalisation, augmentation
 Text: normalisation (lowercasing, punctuation removal, strip whitespaces, strip accents, lemmatisation and stemming), tokenisation, token to IDs, stopword removal
 Videos: frame extraction, resizing, normalisation, optical flow analysis, augmentation
Feature Engineering^{5}
Use domain knowledge to extract and transform predictive features from raw data into a format usable by the model.
 Dimensionality Reduction: use Principal Component Analysis (PCA) to find a subset of features that capture the variance of the original features or Hierarchical Clustering to group features that behave similarly.
 Feature Crossing: combine two or more features to create a new feature.
 Positional Embeddings: can be either learned or fixed at test time.
Event Recommendation Example Features
 Events: type, price
 Location: walk score, transit score, same region, distance to event, venue capacity
 Social: total attendance, friend attendance, invited by other user, hosted by a friend, attendance of events by same host, social media engagement
 Time: remaining time until event begins, estimated travel time, time of day, weekday vs. weekend, seasonality, frequency, duration
 User: age, gender, past attendance, event preferences, income bracket
Model Training: Overview
 Decide whether to train from scratch or fine-tune an existing model
 Choose loss function
 Establish a simple baseline
 Experiment with simple models
 Switch to more complex models
 Use an ensemble of models
 Employ distributed training
Key Concepts

Bias and Variance: Bias refers to the error introduced by approximating a real-world problem with a simplified model. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting). Variance refers to the amount by which a model would change if estimated using a different training dataset. High variance can cause an algorithm to model random noise in the training data, not the intended outputs (overfitting).
(Figure: illustrations of the four combinations of low/high bias and low/high variance)
Bias-Variance Tradeoff: As you increase the complexity of your model, you will typically decrease bias but increase variance. On the other hand, if you decrease the complexity of your model, you increase bias and decrease variance.

Curse of Dimensionality: As the number of features in a dataset increases, the volume of the feature space increases so fast that the available data becomes sparse. This makes it hard to have enough data to give meaningful results, leading to overfitting.

Learning Curve: Model performance as a function of number of training examples, can be good for estimating if performance can be improved with more data

Overfitting and Underfitting: overfitting occurs when a model learns the training data too well and can’t generalise to unseen data, while underfitting happens when a model isn’t powerful enough to model the training data.
(Figure: underfit, good fit and overfit examples for classification and regression)
Universal Approximation Theorem: A neural network with a single hidden layer can approximate any continuous function for inputs within a specific range

Vanishing/Exploding Gradients: When training a deep neural network, gradient values can get “squashed” by the activation functions; when these small values are multiplied during backpropagation they can approach zero, so the network weights stop updating and training stalls (vanishing gradients). On the other hand, if the gradients become too large, they “explode”, causing model weights to update too drastically and making model training unstable.
Crossvalidation^{5} (CV)
 K-fold: divide samples into \(k\) folds; train the model on \(k-1\) folds and evaluate using the left-out fold
 Leave One Out (LOO): train the model on all samples except one and evaluate using the left-out sample
 Stratified K-fold: similar to K-fold, but each fold contains the same class balance as the full dataset
 Group K-fold: similar to K-fold, but ensures that groups (samples from the same data source) do not span different folds
 Time Series Split: to ensure only past observations are used to predict future observations, train the model on the first \(n\) folds and evaluate on the \((n+1)\)th fold
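The Time Series Split above can be sketched as an expanding-window splitter (hypothetical fold sizes, plain Python):

```python
# Expanding-window time series split: train on the first n folds,
# evaluate on fold n+1, so only past observations predict the future.
def time_series_splits(n_samples, n_folds):
    fold = n_samples // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * fold))        # all past samples
        val_idx = list(range(i * fold, (i + 1) * fold))  # next fold
        yield train_idx, val_idx

for train_idx, val_idx in time_series_splits(8, 3):
    print(train_idx, val_idx)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```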
Hyperparameter Optimisation (HPO)
 Grid search: exhaustively search within bounds
 Random search: randomly search within bounds
 Bayesian search: modeled as Gaussian process
Model Selection
 Avoid the state-of-the-art trap; a state-of-the-art model only means that it performs better than existing models on some static datasets
 Start with the simplest models, since they: (i) are easier to deploy; (ii) can be iterated on, which aids interpretability; and (iii) can serve as a baseline
 Avoid human biases in selecting models
 Evaluate good performance now versus good performance later
 Evaluate tradeoffs, e.g. false positives vs. false negatives or compute requirement vs. accuracy
 Understand your model’s assumptions, e.g. prediction assumption, IID, smoothness, tractability, boundaries, conditional independence, normally distributed
Neural Architecture Search^{2} (NAS)
 Search Space: set of operations (e.g. convolutions, fully-connected layers, pooling) and how they can be connected
 Search Strategy: random, reinforcement learning, evolution
Model Training: PyTorch^{6}
Activations
 Sigmoid: \(1 / (1 + e^{-x})\)
 ReLU: \(\max(0, x)\)
 Tanh: \(\tanh(x)\)
 Leaky ReLU: \(\max(0.1x, x)\)
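These activations can be implemented directly from their definitions; a minimal plain-Python sketch:

```python
import math

# The four activations, implemented directly from their formulas
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def tanh(x):
    return math.tanh(x)

def leaky_relu(x):
    # Small negative slope (0.1 here) keeps gradients alive for x < 0
    return max(0.1 * x, x)

print(sigmoid(0.0))      # 0.5
print(relu(-2.0))        # 0.0
print(leaky_relu(-2.0))  # -0.2
```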
Debugging
 Overfit model on a subset of data
 Look out for exploding gradients (use gradient clipping)
 Turn on detect_anomaly so that any backward computation that generates NaN will raise an error
Distances
Cosine Distance
\[d_{\text{Cosine}}(p, q) = 1 - \frac{p \cdot q}{\|p\| \, \|q\|}\]Use for high-dimensional, sparse data where only the direction matters, e.g. document similarity, word embeddings.
Manhattan Distance (L1 Norm or Taxicab Distance)
\[d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} |p_i - q_i|\]Use when movement is constrained to a grid, e.g. pathfinding, recommendation systems.
Euclidean Distance (L2 Norm or Pairwise Distance)
\[d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\]Use when physical/geometric distance matters, e.g. dense data, clustering.
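The three distances can be implemented directly from their formulas; a minimal plain-Python sketch with hypothetical points:

```python
import math

def cosine_distance(p, q):
    # 1 minus the cosine of the angle between p and q
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences (L1 norm)
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    # Straight-line distance (L2 norm)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan_distance([0, 0], [3, 4]))  # 7
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(cosine_distance([1, 0], [0, 1]))     # 1.0 (orthogonal vectors)
```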
Distributed Training
 Data parallelism: split the data across devices so that each device sees a fraction of the batch
 Model parallelism: split the model across devices so that each device runs a fragment of the model
Initialisations
 Kaiming: sets the initial weights to account for the ReLU activation function by scaling the variance based on the number of input units, preventing vanishing gradients in deep networks.
 Xavier: initialises weights to keep the variance of activations uniform across layers by scaling the weights based on the number of input and output units, making it suitable for sigmoid and tanh activations.
Layers
Convolution Layers
 Convolution: Convolutional layers in PyTorch apply convolution operations to input data, using learnable filters (kernels) that slide over the input, detecting local features such as edges or textures in the case of images.
Linear Layers

Linear: A fully connected (dense) layer in PyTorch performs a linear transformation on the input by applying a weight matrix and adding a bias vector:
\[\text{output} = \text{input} \times W^T + b\]It’s typically used in the final layers of a neural network, allowing every input feature to connect to every output feature.
Pooling Layers
 Average pool: takes the average of values within a pooling window, preserving spatial information by smoothing feature maps.
 Max pool: selects the maximum value within a pooling window, retaining the most prominent features while reducing dimensionality.
 Adaptive max pool: adjusts the pooling window size dynamically to output a fixedsized feature map, regardless of input size.
 Fractional max pool: pools using noninteger strides, allowing for more flexible downsampling that preserves more information in deeper layers.
Recurrent Layers
 Recurrent Neural Network (RNN): The basic RNN layer processes sequence data by maintaining a hidden state across time steps. It’s useful for tasks involving timeseries, language, or any sequential data, however, they suffer from the vanishing gradient problem, which can hinder learning over long sequences.
 Long shortterm memory (LSTM): LSTMs improve upon standard RNNs by using three gates (input gate, forget gate, and output gate) and a memory cell to selectively retain or discard information over time, which helps avoid the vanishing gradient problem.
 Gated recurrent unit (GRU): GRUs improve upon standard RNNs by using two gates (reset gate and update gate) to control the flow of information, which helps solve the vanishing gradient problem, making it easier to capture dependencies over longer sequences without needing separate memory cells like LSTMs.
Transformer Layers
 Transformer: The transformer layer is designed to handle sequential data without relying on recurrent structures. It uses a selfattention mechanism to learn relationships between different positions in the sequence, making it highly parallelisable and better suited for longrange dependencies than RNNbased models.
 Transformer Encoder: The encoder is one half of the transformer architecture, which processes input sequences into a rich set of representations. It consists of multiple layers of multihead selfattention and positionwise feedforward networks, which allow the model to understand context across the entire sequence.
 Transformer Decoder: The decoder is the other half of the transformer architecture, used for tasks like sequence-to-sequence modeling (e.g., machine translation). It generates output sequences by attending to both the encoder’s output and previous tokens of the output sequence, enabling it to produce context-aware predictions.
Loss Functions

Cross-Entropy measures how close the predicted probability distribution is to the true distribution. It is widely used in classification tasks, especially for multi-class problems, where it penalizes incorrect classifications based on how confident the model was in its predictions.
\[l_n = -w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C \exp(x_{n,c})}\] 
Connectionist Temporal Classification (CTC) is used where the alignment between input and output sequences is unknown, such as in speech recognition or handwriting recognition, where the timing of outputs may not correspond directly to the inputs.

Huber Loss is a combination of Mean Squared Error and Mean Absolute Error that is less sensitive to outliers than MSE. It behaves as MSE when the error is small and as MAE when the error is large.
\[l_n = \begin{cases} \frac{1}{2}(x_n - y_n)^2 & \text{if } |x_n - y_n| < \delta, \\ \delta(|x_n - y_n| - \frac{1}{2}\delta) & \text{otherwise,}\end{cases}\]where \(\delta\) is a threshold defining when to switch between the two behaviors.

Kullback–Leibler (KL) Divergence measures how one probability distribution diverges from a second, expected probability distribution. It is used for comparing probability distributions, often in generative models.
\[L(y_{\text{pred}},\ y_{\text{true}}) = y_{\text{true}} \cdot (\log y_{\text{true}} - \log y_{\text{pred}})\] 
Mean Absolute Error (L1) measures the average of the absolute differences between the predicted and actual values. It treats all errors proportionally, making it more robust to outliers than MSE.
\[l_n = \left| x_n - y_n \right|\] 
Mean Squared Error (Squared L2 Norm) measures the average squared difference between the estimated values and the actual values. It penalizes larger errors more than smaller ones.
\[l_n = \left( x_n - y_n \right)^2\] 
Negative Log Likelihood (NLL) measures the disagreement between the true labels and the predicted probability distributions, assigning a high penalty to incorrect classifications where the predicted probability was high.
\[l_n = -w_{y_n} x_{n,y_n}\]
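Several of the element-wise losses above can be implemented directly from their formulas; a minimal plain-Python sketch (hypothetical values) showing how Huber Loss switches between the MSE-like and MAE-like regimes:

```python
# Per-sample losses implemented directly from the formulas
def mse(x, y):
    return (x - y) ** 2

def mae(x, y):
    return abs(x - y)

def huber(x, y, delta=1.0):
    err = abs(x - y)
    if err < delta:
        # Quadratic (MSE-like) regime for small errors
        return 0.5 * err ** 2
    # Linear (MAE-like) regime for large errors
    return delta * (err - 0.5 * delta)

print(huber(1.0, 1.2))  # 0.02 (quadratic regime)
print(huber(1.0, 4.0))  # 2.5 (linear regime)
```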
Normalisation
 Batch norm: normalises activations across the batch dimension for each feature, stabilising training and allowing higher learning rates
 Layer norm: normalises activations across the feature dimension for each sample, independent of batch size; common in transformers and recurrent networks
Optimisers
 Adagrad: adapts the learning rate for each parameter based on its past gradients, making frequent updates smaller and rare updates larger.
 Adam: combines the benefits of momentum and RMSProp, using moving averages of both gradients and squared gradients to adapt learning rates.
 Momentum: accelerates gradients in the relevant direction by combining the current gradient with a fraction of the previous gradient, helping to avoid local minima.
 RMSProp: adjusts the learning rate for each parameter based on the moving average of squared gradients, preventing large oscillations in the update step.
 Stochastic Gradient Descent (SGD): updates parameters using only a random subset of data, reducing computation per update but introducing noise to the gradient estimation.
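As a sketch of the momentum idea (hypothetical learning rate and decay factor, plain Python), minimising \(f(x) = x^2\) by accumulating a fraction of past gradients:

```python
# Gradient descent with momentum on f(x) = x^2 (gradient 2x):
# the velocity term accumulates a fraction of past gradients,
# accelerating movement in a consistent direction.
def sgd_momentum(x0, lr=0.1, beta=0.9, steps=100):
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2 * x
        v = beta * v + grad  # accumulate past gradients
        x = x - lr * v       # parameter update
    return x

x_final = sgd_momentum(5.0)
print(x_final)  # close to the minimum at 0
```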
Performance Tuning^{6}^{7}
 Enable asynchronous data loading and augmentation using num_workers > 0 and pin_memory = True
 Disable bias for convolutions before batch norms
 Use learning rate scheduler
 Use mixed precision
 Accumulate gradients by running a few small batches before doing a backward pass
 Saturate the GPU by maxing out batch size (downside: higher batch sizes may cause training to get stuck in local minima)
 Use Distributed Data Parallel (DDP) for multi-GPU training
 Clip gradients to avoid exploding gradients
 Disable gradient calculation for val/test/predict
Regularisation
Augmentation
 Image: random crop, saturation, flip, rotation, translation, perturb using random noise
 Text: swap with synonyms, add degree adverbs, perturb with random word replacements
Data Synthesis
 Image: mixup (inputs and labels are linear combination of multiple classes)
 Text: template-based, language model-based
Dropout
Randomly zeroes some elements of the input tensor with probability \(p\), forcing the model to learn redundant representations and reducing the risk of overfitting.
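A minimal sketch of (inverted) dropout in plain Python, assuming survivors are rescaled by \(1/(1-p)\) so the expected activation is unchanged at training time:

```python
import random

# Inverted dropout sketch: zero each element with probability p and
# scale the survivors by 1/(1-p), keeping the expected value constant.
def dropout(values, p, rng):
    return [0.0 if rng.random() < p else v / (1 - p) for v in values]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)  # roughly half the entries zeroed, survivors scaled to 2.0
```

At inference time dropout is disabled and inputs pass through unchanged.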
Early Stopping
Stops training when the model’s performance on a validation set stops improving, preventing overfitting by stopping before the model starts memorizing noise.
L1 Regularisation (Lasso)
Adds a penalty equal to the absolute value of the magnitude of the coefficients to the loss function. This results in sparse solutions by driving less important feature weights to zero, making it useful for feature selection
L2 Regularisation (Ridge)
Adds a penalty equal to the square of the magnitude of the coefficients to the loss function. This discourages large weights and helps in reducing overfitting without driving coefficients to exactly zero.
Model Evaluation: Responsible AI
Compactness
Reduces memory footprint and increases computation speed
 Quantisation: reduce model size by using fewer bits to represent parameters
 Knowledge distillation: train a small model (student) to mimic the results of a larger model (teacher)
 Pruning: remove nodes or set least useful parameters to zero
 Low-rank factorisation: replace convolution filters with compact blocks
Explainability
 Integrated Gradients: compute the contribution of each feature to a prediction by integrating gradients over the path from the baseline
 LIME (Local Interpretable Modelagnostic Explanations): creates a simpler, interpretable model around a single prediction to explain how the model behaves at that specific instance.
 Sampled Shapley: estimates the contribution of each feature by averaging over subsets of features sampled from the input data.
 SHAP (SHapley Additive exPlanations): assigns each feature an importance value for a particular prediction, based on the concept of Shapley values from cooperative game theory
 XRAI (eXplanation with Ranked Area Integrals): segments an input image and ranks the segments based on their contribution to the model’s prediction
Fairness
 Slicebased evaluation, e.g. when working with website traffic, slice data among: gender, mobile vs. desktop, browser, location
 Check for consistency over time
 Determine slices by heuristics or error analysis
Robustness
 Determinism Test: ensure same outputs when predicting using same model
 Retraining Invariance Test: ensure similar outputs when predicting using retrained model
 Perturbation Test: ensure small changes to numeric inputs don’t cause big changes to outputs
 Input Invariance Test: ensure changes to certain inputs don’t cause changes to outputs
 Directional Expectation Test: ensure changes to certain inputs cause predictable changes to outputs
 Ablation Test: ensure all parts of the model are relevant for model performance
 Fairness Test: ensure different slices have similar model performance
 Model Calibration Test: ensure events happen according to the proportion predicted
Safety
 Alignment: ensuring that AI systems’ goals and behaviors are in accordance with human values and intentions, preventing them from acting in ways that could harm or be misaligned with user interests.
 Existential Risk: the potential risk that AI systems could lead to catastrophic outcomes that threaten the longterm survival or flourishing of humanity, such as uncontrolled superintelligent AI.
 Red Teaming: experts simulate potential attacks on a system to identify vulnerabilities, test defenses, and improve system security before actual attackers do.
 Reward Hacking: when an AI system finds unintended ways to maximize its reward function, leading to harmful or suboptimal outcomes that violate the intended goals.
Model Evaluation: Metrics
Offline Metrics (Before Deployment)
Baselines
 Predict at random (uniformly or following label distribution)
 Zero rule baseline (always predict the most common class)
 Simple heuristics
 Human baseline
 Existing solutions
Classification

Confusion Matrix
                    Actual Class 1        Actual Class 2
Predicted Class 1   True positive (TP)    False positive (FP)
Predicted Class 2   False negative (FN)   True negative (TN)

 Type I error: FP
 Type II error: FN
 Precision: TP / (TP + FP), i.e. a classifier’s ability not to label a negative sample as positive
 True-positive rate (Recall or Sensitivity): TP / (TP + FN), i.e. a classifier’s ability to find all positive samples
 True-negative rate (Specificity): TN / (TN + FP), i.e. a classifier’s ability to identify all negative samples
 False-positive rate: FP / (FP + TN), i.e. a classifier’s inability to find all negative samples
 F1 score: 2 × precision × recall / (precision + recall), i.e. the harmonic mean of precision and recall
 Precision-recall curve: trade-off between precision and recall; a higher PR-AUC indicates a more accurate model
 Receiver operating characteristic (ROC) curve: trade-off between true-positive rate (recall) and false-positive rate; a higher ROC-AUC indicates a model better at distinguishing positive and negative classes
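The metrics above can be computed directly from confusion-matrix counts; a minimal plain-Python sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)    # ability not to label negatives as positive
recall = tp / (tp + fn)       # ability to find all positive samples
specificity = tn / (tn + fp)  # ability to identify all negative samples
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.8 0.667 0.727
```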
Regression
 Mean squared error (MSE): average of the squared differences between the predicted and actual values, emphasising larger errors
 Mean absolute error (MAE): average of the absolute differences between the predicted and actual values, treating all errors equally
 Root mean square error (RMSE): square root of the MSE, providing error in the same units as the predicted and actual values and emphasizing larger errors like MSE
Object Recognition
 Intersection over union (IOU): ratio of overlap area with union area
Ranking
 Recall@k: proportion of relevant items that are included in the topk recommendations
 Precision@k: proportion of topk recommendations that are relevant
 Mean reciprocal rank (MRR): \(\frac{1}{m} \sum_{i=1}^m \frac{1}{\textrm{rank}_i}\), i.e. where is the first relevant item in the list of recommendations?
 Hit rate: how often does the list of recommendations include something that’s actually relevant?
 Mean average precision (mAP): mean of the average precision scores for each query
 Diversity: measure of how different the recommended items are from each other
 Coverage: what’s the percentage of items seen in training data that are also seen in recommendations?
 Cumulative gain (CG): \(\sum_{i=1}^p rel_i\), i.e. sum of relevance scores obtained by a set of recommendations
 Discounted cumulative gain (DCG): \(\sum_{i=1}^p \frac{\textrm{rel}_i}{\log_2(i+1)}\), i.e. CG discounted by position
 Normalised discounted cumulative gain (nDCG): \(\frac{\textrm{DCG}_p}{\textrm{IDCG}_p}\), i.e. extension of CG that accounts for the position of the recommendations (discounting the value of items appearing lower in the list), normalised by maximum possible score
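DCG and nDCG can be computed directly from their formulas; a minimal plain-Python sketch with hypothetical relevance scores:

```python
import math

# DCG discounts each relevance score by its position; nDCG divides by
# the ideal ordering's DCG so a perfect ranking scores 1.0.
def dcg(relevances):
    # Position i (0-indexed) contributes rel / log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

ranking = [3, 2, 0, 1]  # hypothetical relevance scores by position
print(round(ndcg(ranking), 4))
print(ndcg([3, 2, 1, 0]))  # 1.0 for the ideal ordering
```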
Image Generation
 Fréchet Inception Distance (FID): measures the distance between the feature distributions of generated images and real images using a pretrained Inception model. Used to assess the quality and realism of generated images by comparing them with realworld images.
 Inception score: calculates how confidently the model can classify generated images and measures how diverse the generated images are across categories. Used to assess how well the generated images align with recognizable classes, balancing image quality and diversity.
Natural Language
 Bilingual Evaluation Understudy Score (BLEU): measures the overlap between generated text and reference text by comparing ngrams (sequences of words). Used to evaluate the accuracy and fluency of text generated by models in tasks like translation or summarisation.
 Perplexity: measures how well a language model predicts the next token in a sequence, representing the model’s uncertainty. Used to evaluate the fluency and coherence of a language model in predicting sequences of words or tokens.
 RecallOriented Understudy for Gisting Evaluation (ROUGE): measures the overlap between generated text and reference summaries, focusing on recall by comparing ngrams, word sequences, or longest common subsequences. Used in text summarisation tasks to assess how well the generated summary captures the key points of the reference summary.
Online Metrics (After Deployment)^{3}
Examples
 Event recommendation: conversion rate, bookmark rate, revenue lift
 Friend recommendation: number of requests per day, number of requests accepted per day
 Harmful content detection: prevalence, harmful impressions, valid appeals, proactive rate, user reports per harmful class
 Video recommendations: click-through rate, video completion rate, total watch time, explicit user feedback
Model Deployment
Continual Learning
 Continually adapt models to changing data distributions
 Challenges include access to fresh data, continuous evaluation, and algorithms suited to fine-tuning and incremental learning
Stages
 Manual, stateless retraining: initial manual workflow.
 Automated retraining: requires writing a script to automate workflow and configure infrastructure automatically, good availability and accessibility of data, and a model store to automatically version and store all the artefacts needed to reproduce a model.
 Automated, stateful retraining: requires reconfiguring the updating script and the ability to track data and model lineage.
 Continual learning: requires a mechanism to trigger model updates and a pipeline to continually evaluate model updates.
Deployment Strategies (B to replace A)
 Recreate strategy: stop A, start B
 Ramped strategy: shift traffic from A to B behind same endpoint
 Blue/green: shift traffic from A to B using different endpoint
MLspecific Failures^{2}
Train-serving skew is when a model performs well during development but poorly in production. It can be caused by:
Upstream Drift
Caused by a discrepancy between how data is handled in the training and serving pipelines (should log features at serving time)^{8}
Data Distribution Shifts
Model may perform well when first deployed, but poorly over time (can be sudden, cyclic or gradual).
 Feature/covariate shift: change in the distribution of input data, \(P(X)\), but relationship between input and output, \(P(Y \vert X)\), remains the same.
 In training, can be caused by changes to data collection, e.g. if early data is from urban customers, and later data comes from rural customers.
 In production, can be caused by changes to external factors, e.g. when predicting sales from weather, if weather patterns change (more rainy days), but the relationship between weather and sales remains constant (rainy days always lead to fewer sales).
 Label shift: change in the distribution of output labels, \(P(Y)\), but relationship between output and input, \(P(X \vert Y)\), remains the same.
 E.g. when predicting diseases, if a disease becomes more common, but symptoms for each disease remains constant.
 Concept drift: change in the relationship between input and output, \(P(Y \vert X)\), but the distribution of input data, \(P(X)\), remains the same.
 E.g. when predicting rain from cloud patterns, if the cloud patterns remain the same but their association with rain changes (maybe due to climate change).
Degenerate Feedback Loops
When predictions influence the feedback, which is then used to extract labels to train the next iteration of the model.
Examples
 Recommender system: originally, A is ranked marginally higher than B, so the model recommends A. After a while, A is ranked much higher than B. Can be detected using Average Recommended Popularity (ARP) and Average Percentage of Long Tail Items (APLT).
 Resume screening: originally, the model thinks X is a good feature, so it recommends resumes with X. After a while, hiring managers only hire people with X and the model confirms X is good. Can be mitigated using randomisation and positional features.
Model Monitoring
Model monitoring is essential because while traditional software systems fail explicitly (error messages), Machine Learning systems fail silently (bad outputs)
Operationrelated Metrics
 Latency
 Throughput
 Requests per minute/hour/day
 Percentage of successful requests
 CPU/GPU utilisation
 Memory utilisation
 Availability
MLrelated Metrics^{2}
 Feature and label statistics, e.g. mean, median, variance, quantiles, skewness, kurtosis, etc.
 Task-specific online metrics
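As a minimal sketch of statistics-based monitoring (pure Python; the drift rule and 0.25 threshold are illustrative assumptions, not a standard):

```python
import statistics

def feature_summary(values):
    """Summary statistics for one feature column."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "q1": q1,
        "q3": q3,
    }

def drifted(reference, live, rel_tol=0.25):
    """Flag drift when the live mean moves more than rel_tol reference
    standard deviations away from the reference mean."""
    shift = abs(statistics.mean(live) - statistics.mean(reference))
    return shift > rel_tol * statistics.stdev(reference)

reference = [10, 11, 9, 10, 12, 10, 11, 9]   # training-time feature values
live_ok = [10, 10, 11, 10]                   # similar distribution
live_bad = [15, 16, 14, 15]                  # shifted distribution
```

In production the same comparison would run per feature over sliding windows, with task-specific thresholds.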
Myths^{1}
 You only deploy one or two models at a time
 If we don’t do anything, model performance remains the same
 You won’t need to update your models as much
 Most ML engineers don’t need to worry about scale
Testing Strategies
 Canary: target a small set of users with the latest version.
 Shadow: mirror incoming requests and route them to a shadow application.
 A/B: route to the new application depending on rules or contextual data.
 Interleave: mix recommendations from A and B and see which recommendations are clicked on.
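A canary rollout can be sketched as a deterministic router that sends a fixed fraction of users to the new version (the hashing scheme, version names, and 5% default are illustrative assumptions):

```python
import hashlib

def route(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically route a user: the same user always sees the
    same version, and roughly canary_fraction of users see the canary."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") / 65536  # roughly uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "stable"
```

Because routing is keyed on the user, a user never flips between versions mid-session, which keeps the two groups' metrics comparable.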
Models: Supervised Learning^{9}^{5}
Supervised learning models make predictions after seeing lots of data with the correct answers. The model discovers the relationship between the data and the correct answers.
Regression
Regression models predict a numeric value.
Linear Regression
\[\hat{y}(x) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p\] Linear regression models the relationship between the target and predictors as a linear function; the error term \(\epsilon\) in \(y = \hat{y}(x) + \epsilon\) captures what the model does not explain.
 Parameters are estimated by minimising the Residual Sum of Squares (RSS): \(RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
 Use when the relationship between features and the target is approximately linear.
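In the single-predictor case the RSS-minimising coefficients have a closed form, \(\hat{\beta}_1 = \mathrm{Cov}(x, y) / \mathrm{Var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\); a minimal pure-Python sketch:

```python
def fit_simple_ols(xs, ys):
    """Fit y = b0 + b1*x by minimising the residual sum of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope = sample covariance of (x, y) / sample variance of x
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Data generated from y = 1 + 2x with no noise, so OLS recovers it exactly.
b0, b1 = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
```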
Regression Trees
\[\hat{y}(x) = \frac{1}{\lvert R_j \rvert} \sum_{i \in R_j} y_i \quad \text{if} \, x \in R_j\] Regression trees split the feature space into regions and predict the average of observations within each region.
 The tree splits are chosen to minimise the RSS in the resulting regions.
 Use when the relationship between features and the target is nonlinear, and you prefer a model that is easy to interpret.
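A one-split sketch of the RSS criterion (pure Python, exhaustive search over candidate thresholds; a real regression tree applies this recursively):

```python
def best_split(xs, ys):
    """One-level regression tree: find the split on x that minimises
    the total RSS of the two resulting regions."""
    def rss(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    for threshold in sorted(set(xs))[1:]:  # candidate split points
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best  # (threshold, total RSS)

# Two clearly separated regimes: the split lands between them.
threshold, total_rss = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
```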
Support Vector Regressor (SVR)
\[\hat{y}(x) = w^T x + b \quad \text{for points within the} \, \epsilon \, \text{margin}\] The Support Vector Regressor (SVR) fits a margin around the data points and penalises observations that fall outside the margin.
 Parameters are estimated by solving a quadratic optimisation problem that keeps the function as flat as possible while penalising points outside the \(\epsilon\)-margin.
 Use when the relationship between features and the target is complex and when outliers need to be managed by a margin.
Classification
Classification models predict the likelihood that something belongs to a category.
Nearest Neighbours
\[\hat{y}(x) = \arg\max_{c \in C} \sum_{i \in N_k(x)} I(y_i = c)\] The k-Nearest Neighbours (k-NN) algorithm assigns the class that is most frequent among the \(k\) closest observations in the feature space.
 No explicit training; classification is based on the majority class among the nearest neighbours using distance metrics.
 Use when there’s no assumption about the underlying distribution of the data and when simplicity is desired.
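A minimal k-NN sketch in pure Python (the points and labels are made up for illustration):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority class among the k nearest training points.
    `train` is a list of ((features...), label) pairs."""
    def dist2(a, b):
        # squared Euclidean distance; squaring preserves the ordering
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(train, key=lambda pair: dist2(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "blue"), ((0, 1), "blue"), ((1, 0), "blue"),
         ((5, 5), "red"), ((5, 6), "red"), ((6, 5), "red")]
```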
Logistic Regression
\[\hat{y}(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)}}\] Logistic regression models the probability that an observation belongs to a particular class, with the logistic function constraining the output between 0 and 1.
 Parameters are estimated by Maximum Likelihood Estimation (MLE), which maximises the probability of the observed data.
 Use when the relationship between the features and the probability of class membership is approximately linear on the logodds scale.
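Prediction is the logistic (sigmoid) function applied to the linear term; a sketch with hypothetical fitted coefficients (fitting itself would be done by MLE, e.g. via gradient methods):

```python
import math

def predict_proba(betas, features):
    """P(y = 1 | x) for logistic regression.
    betas[0] is the intercept; betas[1:] pair with the features."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the output in (0, 1)

# Hypothetical model: intercept 0, slope 2 on a single feature.
p = predict_proba([0.0, 2.0], [1.5])  # log-odds z = 3.0
```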
Linear Discriminant Analysis (LDA)
\[\hat{y}(x) = \arg\max_k \left( \log(\pi_k) - \frac{1}{2} \log \lvert \Sigma \rvert - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)\] Linear Discriminant Analysis (LDA) assumes that different classes generate data based on multivariate normal distributions with class-specific means and a shared covariance matrix.
 Parameters are estimated by calculating class means, the shared covariance matrix, and the prior probabilities for each class.
 Use when the data is approximately normally distributed, and the classes have similar covariance matrices.
Classification Trees
\[\hat{y}(x) = \arg\max_{c \in C} \hat{p}_{mc} \quad \text{where} \, x \in R_m\] Classification trees split the feature space into regions where each region is assigned the most common class.
 The tree splits are chosen to maximise information gain, using metrics such as Gini impurity or entropy.
 Use when you need a simple, interpretable model that can handle both nonlinear relationships and categorical features.
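The Gini impurity used to score candidate splits is simple to compute; a pure-Python sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_c^2,
    where p_c is the proportion of class c in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["a", "a", "a", "a"])    # a pure node has impurity 0
mixed = gini(["a", "a", "b", "b"])   # an even two-class mix has impurity 0.5
```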
Support Vector Classifier (SVC)
\[\hat{y}(x) = \text{sign}(w^T x + b)\] The Support Vector Classifier (SVC) finds the hyperplane that best separates the data into classes by maximising the margin between them.
 Parameters are estimated by solving a quadratic optimisation problem to maximise the margin while allowing for some misclassification via slack variables.
 Use when the data is complex and not linearly separable, and you need a robust classifier with regularisation to avoid overfitting.
Models: Unsupervised Learning
Unsupervised learning involves finding patterns and structure in input data without any corresponding output data.
Clustering
Clustering is used to group observations into clusters, where observations within the same cluster are more similar to each other than to those in different clusters.
Hierarchical Clustering
\[\hat{y}(x) = \text{Cluster assignment based on dendrogram}\] Hierarchical clustering builds a tree-like structure (dendrogram) of nested clusters either from individual points up (agglomerative) or from one large cluster down (divisive).
 No explicit parameter estimation required; clusters are formed by successively merging or splitting based on a linkage criterion (e.g., complete, single, or average linkage).
 Use when you want to explore data structure at different levels of granularity and do not need to prespecify the number of clusters.
K-Means Clustering
\[\hat{y}(x) = \arg\min_{k} \lVert x - \mu_k \rVert^2\] K-Means Clustering partitions the data into \(K\) clusters by assigning each point to the nearest cluster centroid and updating centroids iteratively to minimise the within-cluster sum of squares (WCSS).
 The objective is the within-cluster sum of squares: \(WCSS = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2\), where \(\mu_k\) is the centroid of cluster \(C_k\).
 Use when you know the number of clusters in advance and the clusters are roughly spherical and evenly sized.
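A minimal sketch of Lloyd's algorithm on 1-D data (pure Python; the data, seed, and iteration count are illustrative):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D data: assign each point to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # keep a centroid in place if its cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]  # two well-separated groups
centroids = kmeans_1d(points, k=2)
```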
Latent Dirichlet Allocation (LDA)
\[\hat{y}(x) = \arg\max_{z_i} P(z_i \mid d, \theta_d, \phi_z)\] Latent Dirichlet Allocation (LDA) assumes that documents are mixtures of topics, and topics are distributions over words. It assigns each word in a document to a latent topic.
 Parameters (topic distributions and word distributions) are estimated using variational inference or Gibbs sampling. The key parameters are \(\theta_d\) (the distribution of topics in document \(d\)) and \(\phi_z\) (the distribution of words in topic \(z\)).
 Use LDA when you want to uncover latent topics in a large corpus of text and when documents are assumed to have multiple topics.
Dimensionality Reduction
Reduce the number of features (or dimensions) in the data while retaining as much of the variance or structure as possible.
Principal Components Analysis (PCA)
\[\hat{y}(x) = Z_1 = \phi_{11} X_1 + \phi_{12} X_2 + \ldots + \phi_{1p} X_p\] Principal Components Analysis (PCA) finds a new set of uncorrelated variables (principal components) that successively explain the maximum variance in the data.
 Principal components are the eigenvectors of the covariance matrix \(\Sigma\), and the corresponding eigenvalues represent the variance explained by each component. PCA maximises the variance explained: \(\text{Maximise } \text{Var}(Z_k) = \phi_k^T \Sigma \phi_k\), subject to \(\phi_k^T \phi_k = 1\), where \(Z_k\) is the \(k\)th principal component.
 Use when you need to reduce dimensionality while preserving as much variance as possible, particularly when features are correlated.
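A sketch using NumPy (assumed available): the principal components are the eigenvectors of the covariance matrix, and projecting the centred data onto them gives the component scores.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the top principal components (eigenvectors of
    the covariance matrix with the largest eigenvalues)."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort descending
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]

# Strongly correlated features: the first component captures almost
# all of the variance.
X = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]])
scores, variances = pca(X, n_components=1)
```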
Models: Reinforcement Learning^{10}
Reinforcement Learning models help agents learn the best action to take in an environment in order to achieve their goals.
Key Concepts
AgentEnvironment Interaction
The environment is the world where the agent operates. At each step, the agent perceives a state (complete world description) or an observation (partial information) and selects an action from the action space (discrete or continuous). A policy is a rule the agent follows to choose actions, aiming to maximise rewards.
[Figure 3.1 from Sutton and Barto]
Terminology
 Trajectory: A sequence of states and actions, \(\tau = (s_0, a_0, s_1, a_1, \ldots)\).
 Reward Function: Determines the reward based on the state and action, \(r_t = R(s_t, a_t)\).
 Return: Cumulative reward over time. It can be finite-horizon (sum over a fixed window) or infinite-horizon discounted return (\(\gamma\)-discounted future rewards).
 Value Function: Measures the expected return starting from a state or state-action pair and acting according to a particular policy forever. The Bellman Equation captures the recursive nature of value estimation: \(V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \vert s) V^\pi(s')\)
 Policy: A policy \(\pi\) is a rule or function that the agent uses to decide which action to take given a state. It can be deterministic, mapping a state to a specific action, \(\mu(s)\), or stochastic, mapping a state to a probability distribution over actions, \(\pi(a \vert s)\).
 Optimisation Goal: The agent learns to select a policy that maximises expected return over time.
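The discounted return over an observed trajectory can be computed directly; a minimal sketch (the reward sequence and \(\gamma\) are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return over an observed trajectory:
    G = sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A sparse reward of 1 arriving at step 3 is worth gamma^3 today.
g = discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

Discounting makes the infinite-horizon sum finite and encodes a preference for sooner rewards.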
Kinds of Reinforcement Learning Algorithms
Model-Free
Does not use a model of the environment, which is often unavailable anyway. It focuses on learning through direct interaction.
 Policy Optimisation: Directly optimises the policy typically by estimating the expected future rewards. E.g. A3C, PPO.
 Q-Learning: Learns an action-value function to estimate the optimal value for each action. E.g. C51, DQN.
Model-Based
Uses a model to predict state transitions and rewards, enabling planning ahead. It gains sample efficiency but is prone to bias if the model is inaccurate.
 Learn the Model: The agent learns the environment’s dynamics from experience. This enables planning but can introduce bias due to model inaccuracies. E.g. I2A, MBMF, MBVE.
 Given the Model: The agent is provided with an accurate environment model, allowing for optimal planning. This is rare in realworld scenarios but useful in controlled environments like games. E.g. AlphaZero.
Models: Recommender Systems^{3}
Behavioural principles
 Similar items are symmetric, e.g. white polo shirts
 Complementary items are asymmetric, e.g. after buying a television, suggest an HDMI cable
Rule-based
Rule-based recommender systems rely on predefined rules and heuristics to make recommendations based on explicit logic and user behaviour patterns. These systems do not involve machine learning but instead use a fixed set of if-then conditions to guide recommendations.
Embedding-based
Content-based Filtering
Item feature similarities
 Pros: ability to recommend new items, ability to capture unique user interests
 Cons: difficult to discover a user’s new interests, requires domain knowledge to engineer features
Collaborative Filtering
User-to-user similarities or item-to-item similarities
 Pros: no domain knowledge needed, easy to discover users’ new areas of interest, efficient
 Cons: cold-start problem, cannot handle niche interests
Hybrid Filtering
Parallel or sequential combination of contentbased and collaborative filtering
 Pros: combines strengths of both methods for better recommendations
 Cons: more complex to implement
Learning-to-Rank
 Pointwise: model takes each item individually and learns to predict an absolute relevancy score
 Pairwise: model takes two ranked items and learns to predict which item is more relevant (RankNet, LambdaRank, LambdaMART)
 Listwise: model takes optimal ordering of items and learns the ordering (SoftRank, ListNet, AdaRank)
Models: Ensembles
Bagging (Bootstrap Aggregation)
Reduces model variance by training identical models in parallel on different data subsets (random forests).
 Pros: reduces overfitting, parallel training means little increase in training/inference time
 Cons: not helpful for underfit models
Boosting
Reduces model bias and variance by training several weak classifiers sequentially (AdaBoost, XGBoost).
 Pros: reduces bias and variance
 Cons: slower training and inference
Stacking (Stacked Generalisation)
Reduces model bias and variance by training different models in parallel on the same dataset and using a meta-learner model to combine the results.
 Pros: reduces bias and variance, parallel training means little increase in training/inference time
 Cons: prone to overfitting
Models: Tasks
Audio
 Audio generation: WaveNet, Tacotron, Jukebox
 Classification: VGGish, SoundNet, YAMNet
 Speaker diarisation: x-vector, DIHARD, VBx
 Speaker identification: SincNet, x-vector, ECAPA-TDNN
 Speech recognition: DeepSpeech, Wav2Vec, RNN-T (Recurrent Neural Network Transducer)
 Source separation: Demucs, Conv-TasNet, Open-Unmix
 Text-to-speech: Tacotron 2, FastSpeech, WaveGlow
Computer Vision
 3D reconstruction: AtlasNet, DeepVoxels, NeRF (Neural Radiance Fields)
 Action recognition: I3D (Inflated 3D ConvNet), C3D (Convolutional 3D), SlowFast Networks
 Classification: AlexNet, Inception, ResNet, VGG
 Depth estimation: Monodepth, DPT (Dense Prediction Transformers), SfM-Net (Structure from Motion Network)
 Image captioning: Show and Tell, Show, Attend and Tell, OSCAR
 Image denoising: DnCNN, N2V (Noise2Void), BM3D
 Image inpainting: DeepFill, Context Encoders, LaMa
 Image-to-image translation: Pix2Pix, CycleGAN, UNIT (Unsupervised Image-to-Image Translation)
 Object detection: Faster R-CNN, YOLO, SSD (Single Shot Multibox Detector)
 Object tracking: SORT, DeepSORT, SiamRPN (Siamese Region Proposal Network)
 Optical character recognition (OCR): CRNN (Convolutional Recurrent Neural Network), Tesseract, Rosetta
 Optical flow: FlowNet, PWC-Net, RAFT (Recurrent All-Pairs Field Transforms)
 Pose estimation: OpenPose, PoseNet, HRNet
 Semantic segmentation: UNet, DeepLab, SegNet
 Super-resolution: SRGAN, ESRGAN (Enhanced SRGAN), VDSR (Very Deep Super-Resolution)
 Text-to-image generation: DALL-E, Parti, Imagen
 Visual odometry: ORB-SLAM, VISO2, DeepVO
Graphs
 Community detection: Louvain, Label Propagation, Infomap
 Graph classification: Graph Convolutional Networks (GCNs), GraphSAGE, GIN (Graph Isomorphism Network)
 Link prediction: Node2Vec, DeepWalk, SEAL
 Node prediction: Graph Attention Networks (GAT), GCN, GraphSAGE
Natural Language Processing
 Classification: BERT, RoBERTa, XLNet
 Question answering: BERT, T5, ALBERT
 Language modelling: GPT, GPT-2, GPT-3, Transformer
 Machine translation: Transformer, MarianMT, mBART
 Named entity recognition: BERT, Flair, SpaCy’s CNN model
 Part-of-speech tagging: BiLSTM-CRF, BERT, Flair
 Sentiment analysis: BERT, XLNet, RoBERTa
 Text generation: GPT-2, GPT-3, T5
 Text summarisation: BART, PEGASUS, T5
Miscellaneous
 Anomaly detection: Isolation Forest, One-Class SVM, Autoencoders
 Autonomous driving: MobileNet, YOLO (You Only Look Once), PointPillars
 Code generation: GPT-3, Codex, AlphaCode
 Time-series forecasting: ARIMA, Prophet, LSTM (Long Short-Term Memory)