Pattern Recognition and Machine Learning -- Chris Bishop

Introduction

  • Supervised learning: classification, regression.
  • Unsupervised learning: clustering, density estimation, visualization.
  • Reinforcement learning: find suitable actions to take in a given situation by a process of trial and error in order to maximize a reward; there is a trade-off between exploration, in which the system tries out new actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward.

Polynomial Curve Fitting

Linear models: functions that are linear in the unknown parameters.

More flexible polynomials of higher order, corresponding to more complex models, become increasingly tuned to the random noise on the target values, which reduces their generalization ability and causes overfitting.

For a model of given complexity, overfitting becomes less severe as the size of the dataset increases: the larger the dataset, the more complex/flexible the model we can afford to fit to it. With a large enough dataset we may even suffer underfitting, because the model is no longer complex enough for the data.

Section 3.4 introduces how to avoid overfitting by adopting a Bayesian approach.

Regularization controls overfitting by adding a penalty term to the loss function that discourages the coefficients/weights from reaching large values. The strength of the regularization is controlled by the hyperparameter $\lambda$: the larger $\lambda$, the heavier the penalty and the more overfitting is reduced, though too large a value causes underfitting instead.
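
A minimal NumPy sketch of this (my own illustration, not from the book; the data, polynomial order, and $\lambda$ values are made up): fitting a degree-9 polynomial with and without an L2 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

M = 9                                        # polynomial order
Phi = np.vander(x, M + 1, increasing=True)   # design matrix [1, x, ..., x^M]

def fit_ridge(Phi, t, lam):
    """Solve (lam*I + Phi^T Phi) w = Phi^T t, the penalized least-squares weights."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

w_unreg = fit_ridge(Phi, t, lam=0.0)   # unregularized: wildly large coefficients
w_reg = fit_ridge(Phi, t, lam=1e-3)    # small penalty: much tamer weights
print(np.abs(w_unreg).max(), np.abs(w_reg).max())
```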

Probability Theory

Three types of probabilities: joint probability, marginal probability, conditional probability.

Two rules of probabilities:

  1. Sum rule: $P(X) = \sum_Y P(X, Y)$.
  2. Product rule: $P(X, Y) = P(Y\mid X)P(X)$.

Bayes’ theorem: $P(Y\mid X) = \frac{P(X\mid Y)P(Y)}{P(X)}$, where $P(X)$ can also be written as $\sum_Y P(X\mid Y)P(Y)$.

  1. Prior probability: $P(Y)$, which is given before the observation happens.
  2. Posterior probability: $P(Y\mid X)$, obtained after X is observed.

The expectation of $f(x)$ is the average value of $f(x)$ under a probability distribution $p(x)$, defined below for a discrete variable, a continuous variable, a finite sample, and the conditional case:

\[\begin{align} E[f] &= \sum_x p(x)f(x), \\ E[f] &= \int p(x)f(x)dx, \\ E[f] &\approx \frac1N \sum_{n=1}^N f(x_n), \\ E_x[f\mid y] &= \sum_x p(x\mid y)f(x). \end{align}\]
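
The third line above is just a sample average; a tiny sketch of it (my own, assuming NumPy and taking $p(x) = \mathcal{N}(0,1)$ as an example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # samples x_n drawn from p(x) = N(0, 1)
print(np.mean(x ** 2))         # (1/N) sum f(x_n) with f(x) = x^2; approximates E[x^2] = 1
```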

The variance of $f(x)$ (or of x itself) measures the variability of $f(x)$ around its mean value $E[f(x)]$:

\[\begin{align} var[f] &= E[(f(x)-E[f(x)])^2] \\ &= E[f(x)^2] - E[f(x)]^2, \\ var[x] &= E[x^2] - E[x]^2. \end{align}\]

The covariance of x and y measures the extent to which x and y vary together:

\[\begin{align} cov[x, y] &= E_{x,y}[(x-E[x])(y-E[y])] \\ &= E_{x,y}[xy] - E[x]E[y]. \end{align}\]

The covariance of two vectors $\textbf{x}$ and $\textbf{y}$ is a matrix computed by:

\[\begin{align} cov[x, y] &= E_{x,y}[(x-E[x])(y^T-E[y^T])] \\ &= E_{x,y}[xy^T]-E[x]E[y^T]. \end{align}\]

The Gaussian distribution is defined by two parameters, the mean $\mu$ and the variance $\sigma^2$: \(G(x\mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\}.\) Why are the parameters called mean and variance?

\[\begin{align} E[x] &= \int{G(x\mid \mu, \sigma^2) x}dx = \mu. \\ E[x^2] &= \int{G(x\mid \mu, \sigma^2) x^2}dx = \mu^2 + \sigma^2. \\ var[x] &= E[x^2] - E[x]^2 = \sigma^2. \end{align}\]

The Gaussian distribution over a D-dimensional vector $\textbf{x}$: \(G(\textbf{x}\mid \mu,\Sigma) = \frac{1}{(2\pi)^{D/2}\mid \Sigma\mid ^{1/2}}\exp\{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\},\) where the mean $\mu$ is a D-dim vector and the covariance $\Sigma$ is a $D \times D$ matrix.

Independent and identically distributed (i.i.d.) refers to data points that are drawn independently from the same distribution.

Maximum likelihood estimation (MLE): set the parameter $w$ to the value that maximizes the likelihood function $p(D\mid w)$ (the probability of the observed data).

Drawback of MLE: it underestimates the variance of the data. Suppose we are given N values of a variable x sampled from a Gaussian distribution, $x=(x_1,…,x_N)$, and our goal is to find $\mu, \sigma^2$ satisfying:

\[\begin{align} (\mu,\sigma^2) &= \arg\max_{\mu,\sigma^2}\sum_{n=1}^{N} \ln p(x_n\mid \mu,\sigma^2),\\ \mu_{ML} &= \frac1N\sum_{n=1}^{N}x_n, \\ \sigma^2_{ML} &= \frac1N\sum_{n=1}^{N}(x_n-\mu_{ML})^2, \end{align}\]

Whereas

\[\begin{align} E[\mu_{ML}] &= \mu, \\ E[\sigma^2_{ML}] &= (\frac{N-1}{N}) \sigma^2. \end{align}\]

This phenomenon is called the bias of MLE: the true variance of the observed data is systematically underestimated.
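
A quick numerical check of this bias (my own sketch, assuming NumPy; the sample size and Gaussian parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(size=(trials, N))            # true sigma^2 = 1

mu_ml = samples.mean(axis=1, keepdims=True)       # mu_ML per trial
var_ml = np.mean((samples - mu_ml) ** 2, axis=1)  # ML estimator: divides by N, not N-1

print(var_ml.mean())   # ~ (N-1)/N = 0.8, not 1.0: the variance is underestimated
```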

Maximum a posteriori (MAP): determine the parameter $w$ by finding its most probable value given the data, in other words by maximizing the posterior distribution.

Model Selection

S-fold cross-validation: partition the data into S groups; each round, S-1 groups are used for training and the remaining one for evaluation. Repeat S times, once per held-out group, and then average the performance scores. When S equals N, the size of the dataset, this is called leave-one-out.
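
A minimal sketch of the procedure (my own, assuming NumPy; `fit` and `evaluate` are hypothetical callables standing in for whatever model is being selected):

```python
import numpy as np

def cross_validate(X, t, fit, evaluate, S=5, seed=0):
    """Average the held-out score of fit/evaluate over S folds; S = len(X) is leave-one-out."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, S)
    scores = []
    for i in range(S):
        val = folds[i]                                              # held-out fold
        train = np.concatenate([folds[j] for j in range(S) if j != i])
        model = fit(X[train], t[train])
        scores.append(evaluate(model, X[val], t[val]))
    return np.mean(scores)
```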

Decision Theory

Producing an output for an input x with a trained model can be seen as two steps: the inference step, in which we use the training data to learn a model for $p(C_k\mid x)$ (or for the joint distribution $p(x, t)$); and the decision step, in which we make optimal decisions given the posterior probabilities.

Three types of decision-making approaches, in decreasing order of complexity:

Generative models

In the inference step, determine the class-conditional densities $p(x\mid C_k)$ for each class $C_k$, as well as the prior class probabilities $p(C_k)$, then use Bayes' theorem to find the posterior class probabilities $p(C_k\mid x)$:

\[\begin{align} p(C_k\mid x) &= \frac{p(x\mid C_k)p(C_k)}{p(x)}, \\ p(x) &= \sum_k p(x\mid C_k)p(C_k). \end{align}\]

In the decision step, we use decision theory to determine class membership; for instance, choose the class with the highest posterior probability, or introduce a loss matrix.

These approaches are known as generative models because they model the distribution of inputs as well as outputs; by sampling from them it is possible to generate synthetic data points in the input space.

Discriminative models

In the inference step, determine the posterior class probabilities $p(C_k\mid x)$ directly (e.g. by MLE), and then assign each new x to a class using decision theory.

Discriminant functions

Find a discriminant function f(x) that maps each input x directly onto a class label; probabilities play no role here.

Information Theory

We can view the information received when a variable x is observed to take a particular value as the 'degree of surprise': we receive more information on being told that an improbable event has happened than that a very likely event has happened (since we already expected it!).

So the amount of information, $h(x)$, is related to $p(x)$. For two independent events x and y, we should have:

\[\begin{align} h(x,y) &= h(x)+h(y),\\ p(x,y) &= p(x)p(y). \end{align}\]

It follows that $h(x)$ must be given by the logarithm of $p(x)$: \(h(x) = -\log_2p(x).\) More importantly, the average amount of information transmitted is the expectation of $h(x)$ with respect to $p(x)$: \(H[x] = -\sum_x p(x)\log_2p(x),\) where $H[x]$ is called the entropy of x.
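
A small sketch of the entropy computation (my own, assuming NumPy; the example distributions are made up): the uniform distribution maximizes entropy, a sharply peaked one lowers it.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (terms with p = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, maximal surprise
print(entropy([0.97, 0.01, 0.01, 0.01]))  # much lower: outcome nearly certain
```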

Probability Distributions

  • Density estimation: model the probability distribution $p(x)$ given a finite observation set $x_1,…,x_N$.
  • Parametric density estimation: determine the suitable values of parameters given an observed dataset, such as $(\mu,\sigma^2)$ in Gaussian distribution.
  • Nonparametric density estimation: the form of distribution depends on the size of dataset; parameters in such distributions only control the model complexity (e.g. histograms, nearest-neighbours).

Binary Variables

Bernoulli distribution: given $p(x=1\mid \mu)=\mu$, $0 \leq \mu \leq 1$, we have:

\[\begin{aligned} Bern(x\mid \mu) &= \mu^x (1-\mu)^{1-x}, \\ E[x] &= \mu, \\ var[x] &= \mu (1-\mu). \end{aligned}\]

Given a dataset $D=\{x_1,…,x_N\}$, the likelihood function is:

\[\begin{aligned} p(D\mid \mu) = \prod_{n=1}^{N}p(x_n\mid \mu)=\prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n}. \end{aligned}\]

By maximizing the (log-)likelihood (MLE), from a frequentist perspective, we have

\[\begin{aligned} \mu_{ML} &= \arg\max_{\mu} \ln p(D\mid \mu) \\ &= \arg\max_{\mu} \sum_{n=1}^{N}(x_n\ln \mu + (1-x_n)\ln (1-\mu)).\\ \end{aligned}\]

Setting the partial derivative to 0, and letting $m$ denote the number of observations with $x=1$ in the dataset, we have: \(\begin{aligned} \mu_{ML} = \frac1N \sum_{n=1}^N x_n = \frac{m}{N}. \end{aligned}\)
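
Checking the closed form $\mu_{ML}=m/N$ numerically (a made-up example, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000) < 0.3   # 1000 Bernoulli draws with true mu = 0.3
print(x.mean())              # mu_ML = m/N, close to 0.3
```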

Binomial distribution: N times of repeated and independent Bernoulli:

\[\begin{aligned} Bin(m\mid N,\mu) &= \binom{N}{m} \mu^m (1-\mu)^{N-m}, \\ E[m] &= \sum_{m=0}^N m\,Bin(m\mid N, \mu) = N\mu, \\ var[m] &= \sum_{m=0}^N (m-E[m])^2 Bin(m\mid N,\mu) = N\mu (1-\mu). \end{aligned}\]

Beta distribution

Problem of MLE for Bernoulli/Binomial: if all the observed $x_i$ equal 1, then by MLE $\mu$ will equal 1 as well, an obvious case of overfitting.

Solution: introduce a prior distribution $p(\mu)$ and perform the density estimation with a Bayesian treatment.

Conjugacy: choose a prior proportional in form to the likelihood, so that the posterior distribution, which is proportional to the product of the prior and the likelihood, has the same functional form as the prior.

Beta distribution: hyperparameters a and b are used to control the distribution of $\mu$.

\[\begin{aligned} Beta(\mu\mid a,b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}, \\ E[\mu] &= \frac{a}{a+b}, \\ var[\mu] &= \frac{ab}{(a+b)^2(a+b+1)}. \end{aligned}\]

Combining the beta distribution, as the prior, with the binomial distribution, as the likelihood, we obtain the posterior distribution of $\mu$:

\[\begin{aligned} p(\mu\mid m,l,a,b) &\propto \mu^{m+a-1}(1-\mu)^{l+b-1} \\ &= \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}. \end{aligned}\]

where $l=N-m$ is the number of observations with $x=0$.

The form of the posterior shows that it is simply an updated beta distribution given the new data, so it can in turn act as the prior when additional data are subsequently observed. This gives a sequential approach to learning from the Bayesian viewpoint.

Prediction: given the data observed so far, predict the next trial using the posterior:

\(\begin{aligned} p(x=1\mid D) &= \int_0^1 p(x=1\mid \mu)p(\mu\mid D)d\mu = \int_0^1 \mu p(\mu\mid D) d\mu = E[\mu\mid D] \\ &= \frac{m+a}{m+a+l+b}. \end{aligned}\)

Remember the expectation of beta distribution.
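
A sketch of this sequential updating (my own illustration, assuming NumPy; the prior hyperparameters and the true $\mu$ are arbitrary): each batch of observations just increments the counts, and the predictive probability is the posterior mean $\frac{m+a}{m+a+l+b}$.

```python
import numpy as np

a, b = 2.0, 2.0                               # hyperparameters of the Beta prior
rng = np.random.default_rng(0)
for batch in range(5):
    x = rng.random(20) < 0.7                  # new Bernoulli data, true mu = 0.7
    m, l = x.sum(), (~x).sum()                # counts of x=1 and x=0
    a, b = a + m, b + l                       # posterior Beta(a+m, b+l) becomes the new prior
    print(f"p(x=1 | D) = {a / (a + b):.3f}")  # predictive = posterior mean, drifts toward 0.7
```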

Multinomial Variables

x can take one of K mutually exclusive states instead of only two. We usually represent x by a one-hot encoding: a K-dimensional vector in which $x_k=1$ and all remaining elements equal 0. Letting $\mu_k$ denote the probability of $x_k=1$, we have

\[\begin{aligned} p(x\mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}. \end{aligned}\]

where $\sum_k \mu_k = 1$. This distribution can be regarded as a generalization of the Bernoulli distribution to more than 2 outcomes.

Now given a dataset $D=\{x_1,…,x_N\}$, the likelihood function is:

\[\begin{aligned} p(D\mid \mu) &= \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k}, \\ m_k &= \sum_n x_{nk}. \end{aligned}\]

Because of the constraint $\sum_k \mu_k=1$, we cannot simply set the partial derivatives to 0; instead, we use a Lagrange multiplier $\lambda$ and maximize:

\[\begin{aligned} \mu_k &= \arg\max_{\mu} \Big( \sum_k m_k \ln \mu_k + \lambda \big(\sum_k \mu_k -1\big) \Big) \\ &= -m_k / \lambda. \end{aligned}\]

Substituting this result into the constraint $\sum_k \mu_k=1$ gives $\lambda = -N$. Thus the MLE result is:

\[\begin{aligned} \mu_k = m_k / N. \end{aligned}\]

Multinomial distribution: the joint distribution of the quantities $m_1,…,m_K$ given $N$ observations:

\[\begin{aligned} Mult(m_1,...,m_K\mid \mu,N) &= \frac{N!}{m_1!...m_K!} \prod_k \mu_k^{m_k},\\ \sum_k m_k &= N. \end{aligned}\]

Dirichlet distribution

The prior distribution for the multinomial distribution.

The conjugate prior can be given by

\[\begin{aligned} p(\mu\mid \alpha) &\propto \prod_k \mu_k^{\alpha_k-1}, \\ Dir(\mu\mid \alpha) &= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)...\Gamma(\alpha_K)} \prod_k \mu_k^{\alpha_k-1},\\ \alpha_0 &= \sum_k \alpha_k. \end{aligned}\]

Again, we get sequential learning with the multinomial distribution, as can be seen from the form of the posterior below:

\[\begin{aligned} p(\mu \mid D,\alpha) &\propto p(\mu\mid \alpha)p(D\mid \mu) \propto \prod_k \mu_k^{\alpha_k+m_k-1} \\ &= Dir(\mu\mid \alpha+m) \\ &= \frac{\Gamma(\alpha_0+N)}{\Gamma(\alpha_1+m_1)...\Gamma(\alpha_K+m_K)} \prod_k \mu_k^{\alpha_k+m_k-1}. \end{aligned}\]
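
The multinomial analogue in code (again my own sketch, assuming NumPy; the prior and true $\mu$ are made up): the Dirichlet parameters are simply incremented by the observed counts $m_k$.

```python
import numpy as np

alpha = np.ones(3)                     # Dir(1,1,1): a uniform prior over mu
true_mu = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
for batch in range(5):
    m = rng.multinomial(50, true_mu)   # counts m_k from 50 new draws
    alpha = alpha + m                  # posterior Dir(alpha + m) becomes the new prior
print(alpha / alpha.sum())             # posterior mean, close to true_mu
```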

Gaussian Distribution

The Gaussian distribution for a single variable is given by:

\[\mathcal{N}(x\mid\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{1/2}} \exp \lbrace -\frac{(x-\mu)^2}{2\sigma^2}\rbrace\]

For a D-dimensional vector $\mathrm{x}$, the multivariate Gaussian distribution is given by:

\[\mathcal{N}(\mathrm{x}\mid\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\mid \Sigma \mid^{1/2}} \exp \{\frac{-1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\}.\]

where $\mu$ is a D-dim mean vector, $\Sigma$ is a $D \times D$ covariance matrix.


Exponential Family

The exponential family of distributions over $x$, given parameters $\eta$, is a set of distributions in forms of:

\[p(x \mid \eta) = h(x)g(\eta)\exp \{\eta^Tu(x)\}.\]

where $x$ may be scalar or vector; $\eta$ are called the natural parameters of the distribution; $u(x)$ is some function of $x$; and $g(\eta)$ can be seen as the coefficient that ensures the distribution is normalized, satisfying

\[g(\eta) \int h(x)\exp \{\eta^Tu(x)\} dx = 1.\]

We begin by checking that the Bernoulli distribution indeed belongs to the exponential family:

\[\begin{aligned} p(x \mid \mu) &= Bern(x \mid \mu) = \mu^x (1-\mu)^{1-x} \\ &= \exp \{x\ln\mu + (1-x)\ln(1-\mu)\}\\ &= \exp \{\ln(1-\mu)+\ln(\frac{\mu}{1-\mu})x\}\\ &= (1-\mu)\exp \{\ln(\frac{\mu}{1-\mu})x\}. \end{aligned}\]

where, by comparison with the general form above, we have $\eta = \ln(\frac{\mu}{1-\mu})$. Solving for $\mu$ gives $\mu=\sigma(\eta)$, where

\[\begin{aligned} \sigma(\eta) = \frac{1}{1+\exp(-\eta)}. \end{aligned}\]

which is called the logistic sigmoid function. Now we can rewrite the Bernoulli distribution in the form

\[\begin{aligned} p(x \mid \eta) &= \sigma(-\eta)\exp(\eta x),\\ u(x) &= x,\\ h(x) &= 1,\\ g(\eta) &= \sigma(-\eta). \end{aligned}\]
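
A numerical check (my own, assuming NumPy) that $\sigma(-\eta)\exp(\eta x)$ with $\eta=\ln(\frac{\mu}{1-\mu})$ reproduces the Bernoulli probabilities:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu = 0.3
eta = np.log(mu / (1.0 - mu))               # natural parameter
for x in (0, 1):
    print(sigmoid(-eta) * np.exp(eta * x))  # 0.7 then 0.3, matching Bern(x | mu)
```

This works because $\sigma(-\eta) = 1/(1+e^{\eta}) = 1-\mu$, so the factor $g(\eta)=\sigma(-\eta)$ really is the normalizer.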

Conjugate priors

For a given probability distribution $p(x \mid \eta)$, we can seek a prior $p(\eta)$ that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family, there exists a conjugate prior in the form of

\[\begin{aligned} p(\eta \mid \mathscr{X}, v) = f(\mathscr{X},v) g(\eta)^v \exp\{v\eta^T\mathscr{X}\}. \end{aligned}\]

Nonparametric methods

Limitation of parametric methods: the chosen density might be a poor model of the distribution that generates the data, causing poor predictive performance.

Examples of nonparametric methods: histograms, kernel density estimators, nearest-neighbour methods.
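
For instance, a minimal Gaussian kernel density estimator (my own sketch, assuming NumPy; the bandwidth h is a free smoothing parameter, and the data are made up): place a Gaussian of width h on every data point and average.

```python
import numpy as np

def kde(x_query, data, h=0.2):
    """Gaussian-kernel estimate of p(x) at the points x_query: (1/N) sum_n K((x - x_n)/h)/h."""
    diffs = (x_query[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 7)
print(kde(grid, data))   # roughly the N(0,1) density evaluated on the grid
```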


TODO:

  • 2.3 Gaussian distribution;
  • 2.4&2.5 details.

Linear Models for Regression

Linear models: linear functions of the adjustable parameters (not necessarily of the input variables; linearity in the inputs is just the simplest form of linear model).

From linear regression to linear models for regression

Generally, linear regression models have a form of

\[y(x, w) = \sum_{i=0}^{M-1}w_ix_i\]

where $x_0=1$, corresponding to the bias $w_0$.

We can extend this class of models by applying nonlinear functions to the input $x$ while keeping the linearity in the weights $w$, which gives the general linear models of the form

\[y(x, w) = \sum_{j=0}^{M-1}w_j \phi_j(x) = w^T\phi(x)\]

where the $\phi_j(x)$ are called basis functions (as before, we can define $\phi_0(x)=1$). One particular example of a linear model is polynomial regression, which has $\phi_j(x)=x^j$ as its basis functions.

There are other choices of basis function, including

\[\phi_j(x) = \exp\{-\frac{(x-\mu_j)^2}{2s^2}\}\]

where $\mu_j$ controls the location of the basis function in input space and $s$ controls the spatial scale. They are referred to as 'Gaussian basis functions' because of their similarity to the Gaussian function, except that they have no normalization coefficients; these are unimportant because each basis function is multiplied by an adjustable parameter $w_j$.

Another example is the sigmoid basis function of the form

\[\phi_j(x) = \sigma(\frac{x-\mu_j}{s})\]

where sigmoid function is defined by

\[\sigma(a) = \frac{1}{1+\exp(-a)}.\]

Regularization adds a penalty term to the cost function in order to control overfitting. A particular choice is known as weight decay and takes the form (assuming the original cost function is least squares)

\[\frac{1}{2}\sum_{n=1}^N \{t_n-w^T\phi(x_n)\}^2 + \frac{\lambda}{2}w^Tw.\]

It is so named because it encourages weights to decay towards zero unless supported by data.
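
Tying the pieces together, a sketch (my own, assuming NumPy; the basis locations, width, data, and $\lambda$ are all arbitrary) of Gaussian basis functions combined with the standard weight-decay solution $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$:

```python
import numpy as np

def gaussian_basis(x, mus, s=0.1):
    """Design matrix with phi_0(x) = 1 and Gaussian bumps centred at the locations mus."""
    phi = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

Phi = gaussian_basis(x, mus=np.linspace(0, 1, 9))
lam = 1e-3
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
print(np.max(np.abs(t - Phi @ w)))   # small residual on the training points
```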

updated_at 18-06-2021