Notes on COMP551 Applied Machine Learning

Lectures 2 & 3: Linear Regression

i.i.d. assumption: the examples $x_i$ in the training set are independently and identically distributed, i.e., each is drawn from the same distribution and the joint factorizes as $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i)$.

Solving linear regression analytically: suppose the model is $\hat{y}_i = \sum_{j=0}^{m} w_j x_{ij}$, where $x_{i0} = 1$ for every example so that $w_0$ is the bias parameter. The most common choice of loss is the mean squared error (MSE):

$$\mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \|y - Xw\|_2^2,$$

where $X$ is the design matrix whose $i$-th row contains example $x_i$ (including the leading 1).

We can minimize the loss function by taking the derivative with respect to $w$ and setting it to zero, which gives the closed-form (normal equation) solution:

$$w = (X^T X)^{-1} X^T y.$$
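As a sanity check, a minimal NumPy sketch of the closed-form solution on synthetic data; the variable names and the data are illustrative, not from the lecture.

```python
import numpy as np

# Illustrative synthetic data: n examples, m features (values chosen here, not from the lecture).
n, m = 100, 3
rng = np.random.default_rng(0)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # leading column of 1s for the bias w_0
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Normal equation: w = (X^T X)^{-1} X^T y.
# np.linalg.solve avoids forming the explicit inverse of X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # should be close to true_w
```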

Problem of the analytical solution: it is computationally expensive. Overall we have three matrix multiplications and one matrix inversion. Suppose we have $n$ examples, each with $m$ features, so $X$ is an $n \times m$ matrix. Computing $X^TX$ is the most expensive of the multiplications and takes about $nm^2$ operations, and inverting the $m \times m$ matrix $X^TX$ takes about $m^3$ operations. So the overall time complexity is $O(nm^2 + m^3)$ (or just say polynomial), which is problematic at large scale, say $m = 10^6$.

Gradient descent solution: after computing the gradient, instead of solving for $w$ directly, we repeatedly update $w$ by a small step against the gradient, scaled by the learning rate: $w \leftarrow w - \alpha \nabla_w \mathcal{L}(w)$. We iterate until convergence to a local minimum (for the convex MSE loss this is also the global minimum).
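A minimal NumPy sketch of batch gradient descent for the MSE loss; the learning rate and iteration count are illustrative defaults, not values from the lecture.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for linear regression with the MSE loss."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iters):
        residual = X @ w - y                  # shape (n,)
        grad = (2.0 / n) * (X.T @ residual)   # gradient of the MSE w.r.t. w
        w -= lr * grad                        # step against the gradient (learning rate lr)
    return w
```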

Lecture 4: Linear classification

Evaluation: use cross-validation for model selection. In $k$-fold cross-validation, the training data is split into $k$ folds; each fold is used once as a validation set while the model is trained on the remaining $k-1$ folds, and the average validation error is used to compare models or hyperparameters.
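A minimal NumPy sketch of $k$-fold cross-validation; the generic `fit`/`predict` callables are placeholders for whatever model is being selected, not lecture code.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, seed=0):
    """Average validation MSE over k folds.

    `fit(X_train, y_train)` returns a model; `predict(model, X_val)` returns
    predictions. Both are placeholders for the model being selected.
    """
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        y_hat = predict(model, X[val])
        scores.append(np.mean((y[val] - y_hat) ** 2))
    return np.mean(scores)
```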

Ridge regression: add an L2 regularization term to the loss function as a penalty for model complexity, which reduces overfitting.

Lasso regression: add an L1 regularization term as the penalty instead; the L1 penalty tends to drive some weights exactly to zero, giving sparse solutions.
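For concreteness, the two regularized objectives, assuming the MSE loss above and a regularization strength $\lambda \ge 0$ (the symbol $\lambda$ is an assumption, not necessarily the lecture's notation):

$$\mathcal{L}_{\text{ridge}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_2^2, \qquad \mathcal{L}_{\text{lasso}}(w) = \frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_1.$$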

Lecture 5: Generative models for linear classification

Two approaches for linear classification: discriminative models, which model $p(y \mid x)$ directly (e.g., logistic regression), and generative models, which model $p(x \mid y)$ and $p(y)$ and then use Bayes rule to obtain $p(y \mid x)$.

Linear discriminant analysis (LDA)

According to Bayes rule:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}.$$

LDA makes an explicit assumption about $p(x \mid y)$:

$$p(x \mid y = k) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right),$$

which is a multivariate Gaussian with class mean $\mu_k$ and covariance matrix $\Sigma$. $x$ here is an $m \times 1$ vector.

A key assumption of LDA is that both classes share the same covariance matrix $\Sigma$. The parameters to learn are $p(y)$, the class means $\mu_k$, and $\Sigma$.
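A minimal NumPy sketch of estimating these LDA parameters for binary labels; the function name and the pooled-covariance estimator shown are standard choices, not necessarily the exact formulas from the lecture.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate LDA parameters: class priors, class means, shared covariance.

    X: (n, m) feature matrix; y: (n,) array of 0/1 labels.
    """
    n, m = X.shape
    priors, means = {}, {}
    Sigma = np.zeros((m, m))
    for k in (0, 1):
        Xk = X[y == k]
        priors[k] = len(Xk) / n                        # p(y = k)
        means[k] = Xk.mean(axis=0)                     # class mean mu_k
        Sigma += (Xk - means[k]).T @ (Xk - means[k])   # pooled scatter
    Sigma /= n                                         # shared covariance estimate
    return priors, means, Sigma
```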

Scaling up generative learning: Naive Bayes

The idea: model $P(x \mid y)$ and $P(y)$, and then compute $P(y \mid x)$ via Bayes rule.

Naive Bayes assumption: assume the features $x_j$ are conditionally independent given $y$, that is, $P(x_j \mid y) = P(x_j \mid y, x_k)$ for all $j \neq k$.

With the Naive Bayes assumption, the class-conditional likelihood factorizes and the posterior simplifies accordingly:

$$P(x \mid y) = \prod_{j=1}^{m} P(x_j \mid y), \qquad P(y \mid x) \propto P(y) \prod_{j=1}^{m} P(x_j \mid y).$$

Gaussian Naive Bayes

Extending Naive Bayes to continuous inputs.

$P(y)$ is still a Bernoulli distribution, but $P(x \mid y)$ is now a multivariate Gaussian (normal) distribution with mean $\mu \in \mathbb{R}^m$ and covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$; under the Naive Bayes assumption $\Sigma$ is diagonal, so each feature is modeled by its own univariate Gaussian per class:

$$P(x \mid y) = \prod_{j=1}^{m} \mathcal{N}\!\left(x_j;\; \mu_{j,y},\; \sigma_{j,y}^2\right).$$
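A minimal NumPy sketch of Gaussian Naive Bayes under these assumptions, with per-class, per-feature means and variances; the names and the small variance floor `eps` are illustrative choices, not lecture code.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Per-class prior and per-feature Gaussian parameters (Naive Bayes)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (
            len(Xk) / len(X),        # prior p(y = k)
            Xk.mean(axis=0),         # per-feature means mu_{j,k}
            Xk.var(axis=0) + eps,    # per-feature variances sigma^2_{j,k} (floored)
        )
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior (up to a constant)."""
    classes = sorted(params)
    scores = []
    for k in classes:
        prior, mu, var = params[k]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(np.stack(scores), axis=0)]
```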