Mathematics
This cheatsheet attempts to give a high-level overview of the mathematics used in Machine Learning. Please contact me for corrections/omissions.
Last updated: 1 January 2024
Probability^{1}
Probabilities arise when we perform an experiment:
 The set of all possible outcomes of the experiment is the sample space.
 A subset of the sample space is an event.
 The union \((A \cup B)\) is the event that occurs if and only if at least one of \(A\), \(B\) occurs.
 The intersection \((A \cap B)\) is the event that occurs if and only if both \(A\) and \(B\) occur.
 The complement \((A^C)\) is the event that occurs if and only if \(A\) does not occur.
Combinatorics

Multiplication Rule: An experiment consisting of two sub-experiments with \(a\) and \(b\) possible outcomes has \(ab\) total outcomes.

Sampling with Replacement: Choosing \(k\) times from \(n\) objects with replacement gives \(n^k\) outcomes.

Sampling without Replacement: Choosing \(k\) times from \(n\) objects without replacement gives \(n(n-1)\cdots(n-k+1)\) outcomes.

Permutations: There are \(n!\) permutations of \(n\) distinct items, reflecting all possible orderings.

Binomial Coefficient Formula: The binomial coefficient \(\binom{n}{k}\) is \(\frac{n(n-1)\ldots(n-k+1)}{k!} = \frac{n!}{(n-k)!k!}\) for \(k \leq n\), and 0 otherwise.
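These counting rules map directly onto Python's standard `math` module; a quick sketch:

```python
import math

n, k = 5, 3

# Sampling with replacement: n^k ordered choices.
with_replacement = n ** k              # 5^3 = 125

# Sampling without replacement: n(n-1)...(n-k+1) ordered choices.
without_replacement = math.perm(n, k)  # 5*4*3 = 60

# Permutations of n distinct items: n!
permutations = math.factorial(n)       # 120

# Binomial coefficient: unordered choices of k items from n.
binom = math.comb(n, k)                # 10

# The binomial coefficient equals the without-replacement count
# divided by the k! orderings of the chosen items.
assert binom == without_replacement // math.factorial(k)
```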
Axioms
 For any event \(A\), \(P(A) \geq 0\) (nonnegativity).
 Probability of sample space \(S\) is \(P(S) = 1\) (completeness).
 If \(A_1, A_2, \ldots\) are disjoint events, then \(P(A_1 \cup A_2 \cup \ldots) = P(A_1) + P(A_2) + \ldots\) (countable additivity).
Consequences
Monotonicity
If \(A \subseteq B\), then \(P(A) \leq P(B)\).
Probability of the Empty Set
\[P(\emptyset) = 0\]
Complement Rule
\[P(A^C) = 1 - P(A)\]
Inclusion-Exclusion Formula \((n=2)\)
\[P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)\]
Inclusion-Exclusion Formula
\[\begin{eqnarray} P\left(\bigcup_{i=1}^{n} A_i\right) &=& \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) \\ && + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots \\ && + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n) \end{eqnarray}\]
Probability: Conditional Probability^{1}
The conditional probability of \(A\) given \(B\) is:
 \[P(A \vert B) = \frac{P(A \cap B)}{P(B)}\]

 \(P(A)\) is the prior probability of \(A\)
 \(P(A \vert B)\) is the posterior probability of \(A\)
Probability of the Intersection of Two Events
\[P(A \cap B) = P(B) P(A \vert B) = P(A) P(B \vert A)\]
Probability of the Intersection of \(n\) Events
\[P(A_1, A_2, \ldots, A_n) = P(A_1)P(A_2 \vert A_1)\cdots P(A_n \vert A_1, \ldots, A_{n-1})\]
Bayes’ Rule
Update the probability of a hypothesis based on new evidence, balancing prior beliefs with the likelihood of observed data.
\[P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)}\]
Law of Total Probability
Calculate the overall probability of an event by considering all possible ways it can occur across different conditions.
\[P(B) = \sum^{n}_{i=1} P(B \vert A_i)P(A_i)\]
Independent Events
Independent events provide no information about each other.
Independence of Two Events
Events \(A\) and \(B\) are independent if \(P(A \cap B) = P(A)P(B)\).
Independence of \(n\) Events
Events \(A_1, A_2, \ldots, A_n\) are independent if every pair satisfies \(P(A_i \cap A_j) = P(A_i)P(A_j)\), every triplet satisfies \(P(A_i \cap A_j \cap A_k) = P(A_i)P(A_j)P(A_k)\) for distinct indices, and similarly for all quadruplets, quintuplets, and so on.
Common Mistakes
 Confusing the prior probability \(P(A)\) with the posterior probability \(P(A \vert B)\).
 Confusing \(P(A \vert B)\) with \(P(B \vert A)\) (the prosecutor’s fallacy).
 Failing to condition on all the evidence (the defense attorney’s fallacy).
 Overlooking a scenario where a trend in individual groups disappears or reverses when the groups are combined (Simpson’s paradox).
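A small worked example ties Bayes’ rule and the law of total probability together, and shows why \(P(A \vert B)\) and \(P(B \vert A)\) must not be confused. The numbers below (a test’s sensitivity, specificity, and a disease’s prevalence) are hypothetical, chosen only for illustration:

```python
# Hypothetical diagnostic test: sensitivity P(+|D) = 0.95,
# specificity P(-|no D) = 0.90, disease prevalence P(D) = 0.01.
p_d = 0.01                     # prior P(D)
p_pos_given_d = 0.95           # likelihood P(+ | D)
p_pos_given_not_d = 1 - 0.90   # false-positive rate P(+ | no D)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|no D)P(no D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' rule: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(round(p_pos, 4))          # 0.1085
print(round(p_d_given_pos, 4))  # 0.0876
```

Despite the 95% sensitivity, the posterior \(P(D \vert +)\) is under 9% because the disease is rare, which is exactly the prior/posterior distinction above.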
Probability: Random Variables^{1}
A random variable is a function assigning a real number to every possible outcome of an experiment.
Probability Mass Function (PMF)
 The probability mass function (PMF) of a discrete random variable \(X\) is the function \(p(x) = P(X=x)\).
 For a PMF to be valid, it must be nonnegative and sum to 1.
Probability Density Function (PDF)
 The probability density function (PDF) of a continuous random variable \(X\) is the derivative \(f(x) = F'(x)\) of the CDF.
 For a PDF to be valid, it must be \(f(x) \geq 0\) for all \(x\) and \(\int_{\infty}^{\infty} f(x)dx = 1\).
 To obtain a probability, we need to integrate the PDF.
Cumulative Distribution Function (CDF)
 The cumulative distribution function (CDF) of a random variable \(X\) is the function \(F(x) = P(X \leq x)\).
 For a CDF to be valid, it must be non-decreasing, right-continuous, converge to 0 as \(x \rightarrow -\infty\), and converge to 1 as \(x \rightarrow \infty\).
Independence
Independence of two random variables
\[P(X \leq x, Y \leq y) = P(X \leq x)P(Y \leq y)\]
Independence of \(n\) random variables
\[P(X_1 \leq x_1, \ldots, X_n \leq x_n) = P(X_1 \leq x_1) \ldots P(X_n \leq x_n)\]
Independent and identically distributed (IID) random variables
Random variables are independent if they provide no information about each other and are identically distributed if they have the same PMF or CDF.
Expectation
Expectation
Center of mass of a distribution.
\[E(X) = \sum_x x P(X=x)\]
Variance
Spread of a distribution around its mean.
\[\textrm{Var}(X) = E\left((X - E(X))^2\right) = E(X^2) - E^2(X)\]
Standard Deviation
Square root of the variance; measures spread in the same units as \(X\).
\[\textrm{Stdev}(X) = \sqrt{\textrm{Var}(X)}\]
Law of the Unconscious Statistician (LOTUS)
The expectation of \(g(X)\) can be calculated using only the PMF of \(X\), without first having to find the distribution of \(g(X)\).
\[E(g(X)) = \sum_x g(x)P(X=x)\]
Covariance
Covariance is the tendency of two random variables to go up or down together, relative to their means.
\[\textrm{Cov}(X, Y) = E\left((X-E(X))(Y-E(Y))\right) = E(XY) - E(X)E(Y)\]
Correlation
Like covariance, but scaled to be unitless, with values always ranging from \(-1\) to \(1\).
\[\textrm{Corr}(X, Y) = \frac{\textrm{Cov}(X,Y)}{\sqrt{\textrm{Var}(X)\textrm{Var}(Y)}}\]
Law of Large Numbers
The mean of a large number of independent random samples converges to the true mean.
Central Limit Theorem
The sum of a large number of IID random variables has an approximately Normal distribution, regardless of the distribution of the individual random variables.
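Both theorems are easy to see in simulation; a minimal sketch using only the standard library (fair die rolls, seeded for reproducibility):

```python
import random
import statistics

random.seed(0)

# Law of large numbers: the sample mean of many fair-die rolls
# approaches the true mean of 3.5.
rolls = [random.randint(1, 6) for _ in range(100_000)]
sample_mean = statistics.mean(rolls)
print(sample_mean)  # close to 3.5

# Central limit theorem: sums of 50 IID rolls are approximately
# Normal, so roughly 68% of the sums fall within one standard
# deviation of their mean.
sums = [sum(random.randint(1, 6) for _ in range(50)) for _ in range(10_000)]
mu = statistics.mean(sums)
sigma = statistics.stdev(sums)
within_one_sd = sum(abs(s - mu) <= sigma for s in sums) / len(sums)
print(within_one_sd)  # roughly 0.68
```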
Discrete Random Variables
Bernoulli Distribution, \(X \sim \textrm{Bern}(p)\)
Indicator of success in a Bernoulli trial with probability of success \(p\).
\[P(X=0) = 1-p\] \[P(X=1) = p\]
Binomial Distribution, \(X \sim \textrm{Bin}(n, p)\)
Number of successes in \(n\) independent Bernoulli trials, all with the same probability \(p\) of success.
\(P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\) for \(k = 0, 1, \ldots, n\)
Hypergeometric Distribution, \(X \sim \textrm{HGeom}(w, b, n)\)
Number of white balls obtained in a sample of size \(n\) drawn without replacement from an urn of \(w\) white and \(b\) black balls.
\[P(X=k) = \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}\]
Discrete Uniform Distribution, \(X \sim \textrm{DUnif}(C)\)
Obtained by randomly choosing an element of the finite set \(C\), with equal probabilities for each element.
\[P(X=x) = \frac{1}{\lvert C \rvert}\]
Geometric Distribution, \(X \sim \textrm{Geom}(p)\)
Number of failures before the first success in a sequence of independent Bernoulli trials with probability \(p\) of success.
\(P(X=k) = p(1-p)^{k}\) for \(k = 0, 1, \ldots\)
Negative Binomial Distribution, \(X \sim \textrm{NBin}(r, p)\)
Number of failures before \(r\) successes with replacement.
\(P(X=n) = \binom{n+r-1}{r-1} p^r (1-p)^{n}\) for \(n = 0, 1, \ldots\)
Negative Hypergeometric Distribution, \(X \sim \textrm{NHGeom}(w, b, r)\)
Number of failures before \(r\) successes without replacement.
\[P(X=k) = \frac{\binom{w}{r-1}\binom{b}{k}}{\binom{w+b}{r+k-1}} \cdot \frac{w-r+1}{w+b-r-k+1}\]
Poisson Distribution, \(X \sim \textrm{Pois}(\lambda)\)
Number of events occurring in a fixed interval of time, given arrival rate \(\lambda\).
\(P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}\) for \(k = 0, 1, \ldots\)
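These PMFs are simple to implement and sanity-check: a valid PMF must be non-negative and sum to 1. A minimal sketch for the Binomial and Poisson cases:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    """P(X = k) for X ~ Pois(lambda)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# The Binomial support is finite, so the PMF sums to 1 exactly
# (up to floating-point error).
binom_total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
print(binom_total)

# The Poisson support is infinite, so truncate at a large k;
# the tail beyond it is negligible.
pois_total = sum(pois_pmf(k, 4.0) for k in range(100))
print(pois_total)
```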
Continuous Random Variables
Uniform Distribution, \(X \sim \textrm{Unif}(a, b)\)
Completely random number in the interval \((a, b)\).
\[f(x) = \frac{1}{b-a}\]
Normal Distribution, \(X \sim \mathcal{N}(\mu, \sigma^2)\)
A symmetric bellshaped distribution centered at \(\mu\), with \(\sigma\) determining the spread.
\[f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{(x-\mu)^2}{2\sigma^2} \right)\]
Exponential Distribution, \(X \sim \textrm{Expo}(\lambda)\)
Waiting time for the first success in continuous time, with arrival rate \(\lambda\).
\[f(x) = \lambda e^{-\lambda x}\]
Note: the exponential distribution has the memoryless property; given a certain wait time without success, the distribution of the remaining wait time is unchanged, i.e. \(P(X \geq s + t \vert X \geq s) = P(X \geq t)\).
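The memoryless property can be checked numerically from the exponential survival function \(P(X \geq x) = e^{-\lambda x}\); a quick sketch (the values of \(\lambda\), \(s\), and \(t\) are arbitrary):

```python
import math

def survival(x, lam):
    """P(X >= x) for X ~ Expo(lambda)."""
    return math.exp(-lam * x)

lam, s, t = 0.5, 2.0, 3.0

# Memoryless property: P(X >= s + t | X >= s) = P(X >= t)
lhs = survival(s + t, lam) / survival(s, lam)
rhs = survival(t, lam)
print(abs(lhs - rhs) < 1e-12)  # True
```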
Linear Algebra
Vector Addition
\[u + v = \left( a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n \right)\]
Example
Given \(u = \begin{bmatrix} 3 \\ 5 \end{bmatrix}\) and \(v = \begin{bmatrix} 2 \\ -1 \end{bmatrix}\), \(u + v = \begin{bmatrix} 3 \\ 5 \end{bmatrix} + \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 3 + 2 \\ 5 - 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 4 \end{bmatrix}\).
Scalar Multiplication
\[ku = \left( ka_1, ka_2, \ldots, ka_n \right)\]
Example
Given \(u = \begin{bmatrix} 3 \\ 5 \end{bmatrix}\), \(2u = 2 \begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 2 \cdot 3 \\ 2 \cdot 5 \end{bmatrix} = \begin{bmatrix} 6 \\ 10 \end{bmatrix}\).
Linear Combination
\[k_1 u_1 + k_2 u_2 + \ldots + k_n u_n\]
Example
Given \(u = \begin{bmatrix} 4 \\ 1 \end{bmatrix}\) and \(v = \begin{bmatrix} 2 \\ -1 \end{bmatrix}\), \(2u - 3v = 2 \begin{bmatrix} 4 \\ 1 \end{bmatrix} - 3 \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \end{bmatrix}\).
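Vector addition, scalar multiplication, and linear combinations translate directly into componentwise operations; a minimal sketch in plain Python:

```python
def vec_add(u, v):
    """Componentwise vector addition."""
    return [a + b for a, b in zip(u, v)]

def scalar_mul(k, u):
    """Multiply every component by the scalar k."""
    return [k * a for a in u]

u, v = [4, 1], [2, -1]

# The linear combination 2u - 3v.
result = vec_add(scalar_mul(2, u), scalar_mul(-3, v))
print(result)  # [2, 5]
```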
Linear Independence
A set of vectors is linearly independent if no vector can be represented as a linear combination of the remaining vectors.
Basis
Standard unit vectors \(\widehat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\widehat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) are basis vectors of the \(x\textrm{-}y\) coordinate system, since any 2D vector can be expressed as a linear combination of them.
Span
The span of two vectors is the set of all vectors that can be formed from their linear combinations.
Linear Transformation
A linear transformation is fully described by where the two standard basis vectors land. This transformation can be captured by a 2×2 matrix: the first column is where \(\widehat{i}\) lands, and the second column is where \(\widehat{j}\) lands. Visually, a linear transformation keeps grid lines parallel and evenly spaced and the origin fixed.
Example
Rotate 90° counterclockwise
\[\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}\]
Example
Shear
\[\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}\]
Rank
 The column rank of a matrix is the largest number of columns that constitute a linearly independent set.
 The row rank of a matrix is the largest number of rows that constitute a linearly independent set.
 Since the column rank and row rank are always equal, the rank of a matrix is the largest number of rows or columns that constitute a linearly independent set.
Properties
 \[\textrm{rank}(A) \leq \min(m, n)\]
 \[\textrm{rank}(A) = \textrm{rank}(A^T)\]
 \[\textrm{rank}(AB) \leq \min(\textrm{rank}(A), \textrm{rank}(B))\]
 \[\textrm{rank}(A+B) \leq \textrm{rank}(A) + \textrm{rank}(B)\]
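Rank can be computed by row reduction: each pivot found during Gaussian elimination corresponds to one linearly independent row. A minimal sketch, using exact rational arithmetic (`fractions`) to avoid floating-point pitfalls:

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank, n_rows, n_cols = 0, len(m), len(m[0])
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a
        # nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# The third column is the sum of the first two, so the rank is 2.
A = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 2]]
print(matrix_rank(A))  # 2
```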
Linear Algebra: Products
Dot (Inner) Product
\[u \cdot v = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\]
Properties
 Orthogonality: \(u\) and \(v\) are orthogonal if and only if \(u \cdot v = 0\)
 Angle between vectors: \(\cos \theta = \frac{v \cdot w}{\|v\| \|w\|}\)
 Projection: \(\textrm{proj}(u, v) = \frac{u \cdot v}{\|v\|^2} v = \frac{u \cdot v}{v \cdot v} v\)
 Schwarz inequality: \(\lvert v \cdot w \rvert \leq \|v\| \|w\|\)
 Triangle inequality: \(\|v + w\| \leq \|v\| + \|w\|\)
Example
Given \(u = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\) and \(v = \begin{bmatrix} 4 \\ 6 \end{bmatrix}\), \(u \cdot v = (1)(4) + (2)(6) = 4 + 12 = 16\).
Example
Given \(u = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(v = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), \(\cos \theta = \frac{u \cdot v}{\|u\| \|v\|} = \frac{1}{(1)(\sqrt{2})}\), which means \(\theta = 45^\circ\).
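The angle computation above can be reproduced in a few lines of plain Python:

```python
import math

def dot(u, v):
    """Sum of componentwise products."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

u, v = [1, 0], [1, 1]
cos_theta = dot(u, v) / (norm(u) * norm(v))
theta_degrees = math.degrees(math.acos(cos_theta))
print(round(theta_degrees, 6))  # 45.0
```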
Cross Product
The cross product between two vectors gives a third vector whose length is the parallelogram’s area and points in a direction that is orthogonal to both vectors, following the righthand rule.
\[u \times v = \begin{vmatrix} i & j & k \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}\]
Properties
 \[\|u \times v\| = \|u\| \|v\| \sin \theta\]
 \[u \times u = 0\]
 \[u \times v = -(v \times u)\]
 \[u \times (v + w) = (u \times v) + (u \times w)\]
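Expanding the determinant above gives the componentwise formula for the cross product; a minimal sketch that also checks two of the properties:

```python
def cross(u, v):
    """Cross product of two 3D vectors (cofactor expansion of the
    determinant formula)."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

u, v = [1, 0, 0], [0, 1, 0]
w = cross(u, v)
print(w)                     # [0, 0, 1] (right-hand rule: x cross y = z)

# The result is orthogonal to both inputs, and u x v = -(v x u).
print(dot(w, u), dot(w, v))  # 0 0
print(cross(v, u))           # [0, 0, -1]
```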
Outer Product
Each element of \(u\) is multiplied by each element of \(v\).
\[u \otimes v = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \ldots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \ldots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_m b_1 & a_m b_2 & \ldots & a_m b_n \end{bmatrix}\]
Properties
 \[(u \otimes v)^T = (v \otimes u)\]
 \[(v + w) \otimes u = v \otimes u + w \otimes u\]
 \[u \otimes (v + w) = u \otimes v + u \otimes w\]
 \[c (v \otimes u) = (cv) \otimes u = v \otimes (cu)\]
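The outer product and its transpose property can be sketched in a couple of list comprehensions:

```python
def outer(u, v):
    """Outer product: an m x n matrix of all pairwise products."""
    return [[a * b for b in v] for a in u]

def transpose(m):
    return [list(row) for row in zip(*m)]

u, v = [1, 2, 3], [4, 5]
print(outer(u, v))  # [[4, 5], [8, 10], [12, 15]]

# (u outer v)^T equals v outer u.
print(transpose(outer(u, v)) == outer(v, u))  # True
```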
Linear Algebra: Operations and Properties^{2}
Identity Matrix
A square matrix with ones on the diagonal and zeros everywhere else.
\[I_{ij} = \begin{cases}1 & \text{if } i = j \\ 0 & \text{if } i \neq j\end{cases}\]
Example
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
Diagonal Matrix
A square matrix where all nondiagonal elements are zero.
\[D_{ij} = \begin{cases}d_i & \text{if } i = j \\ 0 & \text{if } i \neq j\end{cases}\]
Example
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix}\]
Transpose
Flip the rows and columns.
\[A_{ij}^{T} = A_{ji}\]
Properties
 \[\left(A^T\right)^T = A\]
 \[\left(AB\right)^T = B^TA^T\]
 \[\left(A+B\right)^T = A^T + B^T\]
Example
\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}\]
Symmetric Matrix
\[A = A^T\]
Example
\[\begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{bmatrix}^T = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{bmatrix}\]
Antisymmetric Matrix
\[A = -A^T\]
Example
\[\begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{bmatrix} = -\begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{bmatrix}^T\]
Triangular Matrix
A square matrix is called lower triangular if all the entries above the main diagonal are zero. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are zero.
Example
\[L = \begin{bmatrix} 1 & 0 & 0 \\ 4 & 5 & 0 \\ 7 & 8 & 9 \end{bmatrix}\] \[U = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 5 & 6 \\ 0 & 0 & 9 \end{bmatrix}\]
Orthogonal Matrix
A square matrix \(Q\) such that its transpose is equal to its inverse.
\[Q^T Q = Q Q^T = I\]
Example
\[Q = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
Normal Matrix
A square matrix \(A\) is called a normal matrix if it commutes with its conjugate transpose.
\[AA^* = A^*A\]
Example
Given the matrix:
\[A = \begin{bmatrix} 1 & i \\ i & 1 \end{bmatrix}\]
The conjugate transpose \(A^*\) is:
\[A^* = \begin{bmatrix} 1 & -i \\ -i & 1 \end{bmatrix}\]
Since \(AA^* = A^*A = 2I\), the matrix is normal.
Trace
Sum of diagonal elements in the matrix.
\[\textrm{tr}(A) = \sum_{i=1}^{n} A_{ii}\]
Properties
 \[\textrm{tr}(A) = \textrm{tr}(A^T)\]
 \[\textrm{tr}(A+B) = \textrm{tr}(A) + \textrm{tr}(B)\]
 \[\textrm{tr}(kA) = k \textrm{tr}(A)\]
Example
\[\textrm{tr}\left(\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{bmatrix}\right) = 1 + 2 + 3 = 6\]
Norm
Length of a vector.
\[\|u\| = \sqrt{u \cdot u} = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}\]
\(u\) is a unit vector if it has length 1, i.e., \(\|u\| = 1\).
Properties
 For all \(x \in \mathbb{R}^n, f(x) \geq 0\) (nonnegativity).
 \(f(x) = 0\) if and only if \(x = 0\) (definiteness).
 For all \(x \in \mathbb{R}^n, t \in \mathbb{R}, f(tx) = \lvert t \rvert f(x)\) (homogeneity).
 For all \(x,y \in \mathbb{R}^n, f(x+y) \leq f(x) + f(y)\) (triangle inequality).
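These properties can be checked numerically for the Euclidean norm; a quick sketch with arbitrary example vectors:

```python
import math

def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

x, y, t = [3, 4], [1, -2], -2.5

print(norm(x))  # 5.0

# Homogeneity: ||t x|| = |t| ||x||
scaled = norm([t * xi for xi in x])
print(abs(scaled - abs(t) * norm(x)) < 1e-12)  # True

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
print(norm([a + b for a, b in zip(x, y)]) <= norm(x) + norm(y))  # True
```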
\(\ell_1\) Norm
\[\|x\|_1 = \sum_{i=1}^{n}\lvert x_i \rvert\]
\(\ell_2\) Norm
\[\|x\|_2 = \sqrt{\sum_{i=1}^{n}x_i^2}\]
\(\ell_\infty\) Norm
\[\|x\|_\infty = \max_i \lvert x_i \rvert\]
\(\ell_p\) Norm
\[\|x\|_p = \left( \sum_{i=1}^{n} \lvert x_i \rvert^p \right)^{1/p}\]
Inverse Matrix
The inverse of a square matrix \(A\) is denoted \(A^{-1}\) and is the unique matrix such that: \(A^{-1}A = I\).
Not all matrices have inverses. \(A\) is invertible or nonsingular if the inverse exists and noninvertible or singular otherwise.
Properties
 \[\left(A^{-1}\right)^{-1} = A\]
 \[(AB)^{-1} = B^{-1}A^{-1}\]
 \[\left(A^{-1}\right)^T = \left(A^T\right)^{-1}\]
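For the 2×2 case the inverse has a closed form: for \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\) it is \(\frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\). A minimal sketch, using exact rational arithmetic so \(A^{-1}A = I\) holds exactly:

```python
from fractions import Fraction

def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(m, n):
    """Product of two 2x2 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[Fraction(4), Fraction(7)], [Fraction(2), Fraction(6)]]
A_inv = inverse_2x2(A)

# A^-1 A = A A^-1 = I, exactly, thanks to rational arithmetic.
print(matmul(A_inv, A) == [[1, 0], [0, 1]])  # True
```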
Determinant
The determinant of a square matrix can be understood as the volume scaling factor for the linear transformation described by the matrix.
For any matrix, the absolute value of the determinant gives the volume of the shape formed by its column vectors: a parallelogram in 2D, a parallelepiped in 3D, and an \(n\)dimensional parallelotope in higher dimensions.
The sign of the determinant indicates whether the transformation preserves or reverses orientation; together with the magnitude, this gives the oriented area (or volume in higher dimensions).
A determinant of zero means the matrix is not invertible, as the linear transformation collapses the original space into a lower dimension with zero volume (such as a flat plane, a line, or a point). Once space is compressed in this way, it cannot be expanded back to its original form. For example, you cannot “unsquish” a line back into a plane, or a plane into a volume.
Properties
 \(\lvert I \rvert = 1\).
 The determinant is linear in each column separately: if one column is written as a linear combination of vectors, the determinant equals the same linear combination of the determinants obtained by substituting each vector into that column (multilinearity).
 Whenever two columns of a matrix are identical, its determinant is 0.
Example
\[\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc\]
Example
\[\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix}\]
Eigenvectors and Eigenvalues
Given a square matrix \(A\), we say that \(\lambda\) is an eigenvalue of \(A\) and \(x\) is the corresponding eigenvector if \(Ax = \lambda x\). In other words, when the matrix \(A\) is applied to the vector \(x\), the result is a new vector that points in the same direction as \(x\), but scaled by a factor of \(\lambda\).
In a typical linear transformation, most vectors are “knocked off” the line they span, meaning they change direction. However, the vectors that remain along the same line after the transformation are called eigenvectors of that transformation. The scaling factor by which these vectors are stretched or compressed is the corresponding eigenvalue. In 3D, this can be visualized as an axis of rotation.
Properties
 \[\textrm{tr}(A) = \sum_{i=1}^{n} \lambda_i\]
 \[\lvert A \rvert = \prod_{i=1}^{n} \lambda_i\]
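For a 2×2 matrix the eigenvalues are the roots of the characteristic polynomial \(\lambda^2 - \textrm{tr}(A)\lambda + \lvert A \rvert = 0\), which makes both properties easy to verify; a minimal sketch (assumes real eigenvalues, as with the symmetric example below):

```python
import math

def eig_2x2(m):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial
    lambda^2 - tr(A) lambda + det(A) = 0 (assumes real eigenvalues)."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

A = [[2, 1], [1, 2]]
l1, l2 = eig_2x2(A)
print(l1, l2)  # 3.0 1.0

# tr(A) equals the sum of the eigenvalues;
# det(A) equals their product.
print(l1 + l2 == 2 + 2)          # True
print(l1 * l2 == 2 * 2 - 1 * 1)  # True
```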