
# Notes of COMP551 Applied Machine Learning

# Lectures 2 & 3: Linear Regression

**i.i.d. assumption**: the examples $x_i$ in the training set are *independently* and *identically distributed*:

- Independently: each $x_i$ is freshly sampled according to some probability distribution $D$ over the data domain $X$, i.e. $x_i$ does not depend on any other $x_j$ in the same training set.
- Identically: the distribution $D$ is the same for all examples.

**Solving linear regression analytically**
Suppose the model is $\hat{y}_i = \sum_{j=0}^{n} w_j x_{ij}$, where $x_{i0} = 1$ for each example so that $w_0$ acts as the bias parameter. The most common choice of loss is the *mean squared error (MSE)*:

$$L(w) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \frac{1}{m} \| Xw - y \|^2$$

We can minimize the loss function by taking the derivative with respect to $w$ and setting it to zero, which yields the normal equation:

$$\nabla_w L(w) = \frac{2}{m} X^T (Xw - y) = 0 \quad \Rightarrow \quad w = (X^T X)^{-1} X^T y$$

**Problem of the analytical solution: computationally expensive**
Overall we have 3 matrix multiplications and 1 matrix inversion. Suppose we have $m$ examples, each with $n$ features, so $X$ is $m \times n$. Computing $X^TX$ is the most expensive of the multiplications and takes $O(mn^2)$ operations; inverting the $n \times n$ matrix $(X^TX)^{-1}$ takes $O(n^3)$. The overall time complexity is therefore $O(mn^2 + n^3)$ (polynomial), which is problematic for large-scale datasets, say $m = 10^6$.
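A minimal numpy sketch of the closed-form solution (the toy data and names like `true_w` are illustrative assumptions, not from the notes):

```python
import numpy as np

# Hypothetical toy data: m = 100 examples, n = 3 features.
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend x_0 = 1 for bias w_0
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=m)  # small label noise

# Normal equation w = (X^T X)^{-1} X^T y; np.linalg.solve avoids
# forming the explicit inverse, which is cheaper and more stable.
w = np.linalg.solve(X.T @ X, X.T @ y)
```

With near-noiseless labels, the recovered `w` lands very close to `true_w`.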

**Gradient descent solution**
After deriving the gradient, instead of computing $w$ in closed form, we repeatedly update $w$ by a small step against the gradient (scaled by the learning rate), iterating until it converges to a local minimum (for the convex MSE loss, this is also the global minimum).
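The update loop can be sketched as follows (the learning rate and iteration count are illustrative choices, not values from the lecture):

```python
import numpy as np

# Toy data for the sketch: noiseless linear labels.
rng = np.random.default_rng(1)
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(n + 1)
lr = 0.1  # learning rate (assumed value)
for _ in range(2000):
    grad = (2 / m) * X.T @ (X @ w - y)  # gradient of the MSE loss w.r.t. w
    w -= lr * grad                      # step against the gradient
```

Each iteration costs only $O(mn)$, avoiding the $O(n^3)$ matrix inversion of the analytical solution.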

# Lecture 4: Linear classification

**Evaluation**: use cross-validation for model selection:

- Training set: select a hypothesis $f$ from the hypothesis space $F$ (parameter tuning, e.g. regression for a given degree).
- Validation set: compare best $f$ from each hypothesis class across different classes (e.g. different degree regression).
- Test set: get a true estimate of the generalization error.
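The three-way split above can be sketched as follows (the 60/20/20 ratios and the function name are assumptions for illustration):

```python
import numpy as np

def split_indices(m, seed=0, frac_train=0.6, frac_val=0.2):
    """Shuffle indices 0..m-1 and cut them into train/validation/test."""
    idx = np.random.default_rng(seed).permutation(m)
    n_train = int(frac_train * m)
    n_val = int(frac_val * m)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Shuffling before cutting keeps the three sets identically distributed, matching the i.i.d. assumption.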

**Ridge regression**: add L2 regularization as a penalty for model complexity in loss function to reduce overfitting.

**Lasso regression**: add L1 regularization as a penalty.
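Ridge has a closed form much like ordinary least squares, with $\lambda I$ added before inverting; lasso has no closed form and is typically solved iteratively (e.g. by coordinate descent). A sketch of the ridge solution (for brevity this version regularizes the bias weight too, which in practice is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

As $\lambda \to 0$ this recovers the ordinary least-squares solution; larger $\lambda$ shrinks the weights toward zero.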

# Lecture 5: Generative models for linear classification

Two approaches for linear classification:

- Discriminative learning: directly estimate $P(y \mid x)$.
- Generative learning: separately model $P(x \mid y)$ and $P(y)$; use Bayes rule to estimate $P(y \mid x)$.

## Linear discriminant analysis (LDA)

According to Bayes rule:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

LDA makes an explicit assumption about $p(x \mid y)$: it is a multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$. Here $x$ is an $n \times 1$ vector.

A key assumption of LDA is that both classes share the same covariance matrix $\Sigma$. The parameters to learn are $p(y)$, the class means $\mu$, and $\Sigma$.
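A hedged sketch of the maximum-likelihood estimates of these parameters (function and variable names are illustrative):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: class priors, class means, shared covariance."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}          # p(y = k)
    means = {k: X[y == k].mean(axis=0) for k in classes}    # mu_k
    # Shared covariance: pool the centered examples from every class.
    centered = np.vstack([X[y == k] - means[k] for k in classes])
    sigma = centered.T @ centered / len(X)
    return priors, means, sigma
```

Pooling the centered examples is what enforces the shared-$\Sigma$ assumption; QDA would instead estimate one covariance per class.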

## Scale up generative learning: Naive Bayes

Model $P(x \mid y)$ and $P(y)$, then compute $P(y \mid x)$ via Bayes rule.

**Naive Bayes assumption**: assume the features $x_j$ are conditionally independent given $y$, i.e. $P(x_j \mid y) = P(x_j \mid y, x_k)$ for all $j, k$.

With the Naive Bayes assumption, we have:

$$P(x \mid y) = \prod_{j=1}^{n} P(x_j \mid y)$$

## Gaussian Naive Bayes

Extending Naive Bayes to continuous inputs.

$P(y)$ is still a Bernoulli distribution, but $P(x \mid y)$ is a multivariate Gaussian (normal) distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$:

$$P(x \mid y) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

- If all classes have the same $\Sigma$, it is *LDA*.
- If $\Sigma$ is distinct between classes, it is *QDA*.
- If $\Sigma$ is diagonal (i.e. features are conditionally independent given the class), it is *Gaussian Naive Bayes*.
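In the Gaussian Naive Bayes case, a diagonal $\Sigma$ means each feature only needs a per-class mean and variance. A minimal sketch of fitting and predicting (names are illustrative; variances are assumed nonzero):

```python
import numpy as np

def gnb_fit(X, y):
    """Per class: prior p(y = k), per-feature means, per-feature variances."""
    return {k: (np.mean(y == k),
                X[y == k].mean(axis=0),
                X[y == k].var(axis=0))
            for k in np.unique(y)}

def gnb_predict(params, x):
    """Pick the class maximizing log p(y) + sum_j log N(x_j | mu_j, var_j)."""
    def log_joint(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda k: log_joint(*params[k]))
```

Working in log space turns the product over features into a sum and avoids numerical underflow.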