I already covered just about all of this material during my EE pre- and post-grad studies, but since that was about 15 years ago I still watched all the content. I did fall asleep a few times... but it was worth it to refresh all the concepts.
The course has an Honour Code that you must follow, which includes not providing code answers to the problems. Since I'm not sure whether that extends to blogs outside Coursera, I will refrain just the same from posting any code related to the questions posed in the course. However, I will summarize some of the content and discuss some of the concepts here.
Week 1:
Introduction
There are two main types of Machine Learning: Supervised and Unsupervised.
Supervised learning is where you are given past datasets and want to make predictions based on the past results. There are two types, Regression and Classification. Regression is where the outcomes/predictions are real numbers (or a range of values), while Classification has a discrete outcome/prediction, such as a binary yes/no.
Unsupervised learning is where you are given unclassified data and you try to find structure in the data which can be used to classify the data into different clusters.
Linear Regression
Linear regression is where you try to fit a linear function (the hypothesis \(h(\overrightarrow{x})\)) to the data points provided. In two dimensions, if you have an x input providing a y output, this would be the straight line that passes closest to all of the data points. In the general (multi-variable) case, the hypothesis (or prediction) function for each new input sample, with input variables \(x_1..x_n\), is given by:
\[ \hat{y}=h_{\theta_0,\theta_1,\theta_2,...,\theta_n}(x_1,x_2,...,x_n)=\theta_0\cdot1 + \theta_1x_1+\theta_2x_2+...+\theta_nx_n \]
Written with \(\theta\) and \(x\) as vectors:
\[ \hat{y}=h_{\overline{ \theta}}({\overline{x}}) = \sum_{i=0}^n\theta_ix_i \begin{cases}x_0 = 1\end{cases} \]\[ \hat{y}=h_{\overline{ \theta}}({\overline{x}}) =\overline{x}^{T}\overline\theta \]If we let \( X \) be the matrix where each column holds the measurements of one feature, with the first column set to \(x_0=1\), so that each row represents one sample, we can write it as:
\[\hat{\overline y}=h_{\overline{ \theta}}({X}) =X\cdot\overline\theta\begin{cases}\hat{\overline y} = m \times 1 \space vector\\ X = m\times (n+1) \space matrix \\\overline\theta = (n+1)\times 1 \space vector \\ with \space m \space samples \\ and \space n \space features\end{cases}\]
There are many benefits to writing the equation as a linear algebra vector/matrix operation: the notation becomes much more compact, and using a linear algebra package in a programming language makes the implementation both more compact and usually more efficient.
Instead of writing multiple nested for loops to compute this function for each value of each input variable we can compute it by a simple matrix multiplication.
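To make that concrete, here is a small sketch in plain Python (the course itself uses Octave, and per the Honour Code note above this is not assignment code; the data and theta values are made up). Each row of `X` is one sample, with a leading 1 for the intercept term \(x_0\):

```python
# Sketch: computing the hypothesis y_hat = X * theta for every sample
# at once, with plain lists standing in for a linear algebra package.

def predict(X, theta):
    """Return y_hat = X * theta, where each row of X is one sample
    with a leading 1 for the intercept term x0."""
    return [sum(t * x for t, x in zip(theta, row)) for row in X]

# 3 samples, 2 features (plus the x0 = 1 column) -- made-up numbers
X = [[1.0, 2.0, 3.0],
     [1.0, 4.0, 5.0],
     [1.0, 6.0, 7.0]]
theta = [0.5, 1.0, 2.0]   # theta_0, theta_1, theta_2

print(predict(X, theta))  # -> [8.5, 14.5, 20.5]
```

In Octave or with a linear algebra library this whole function collapses to the single expression `X * theta`.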
Now that we have defined the hypothesis or prediction function for linear regression we can derive the cost function.
We will choose the cost function to be the sum of the squared differences between the predicted and measured results, divided by \(2m\). This can be thought of as the squared \(m\)-dimensional distance between the vector of predictions and the vector of measurements, normalized by the number of samples:
\[CostFunction=J(\theta_0,\theta_1,\theta_2,...,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(y^{(i)}-\hat{y}^{(i)})^2\\=\frac{1}{2m}\sum_{i=1}^m(y^{(i)} -h_{\theta_0,\theta_1,\theta_2,...,\theta_n}(x^{(i)}_1,x^{(i)}_2,...,x^{(i)}_n))^2\\=\frac{1}{2m}\sum_{i=1}^m(y^{(i)}-(\theta_0\cdot1 +\theta_1x^{(i)}_1+\theta_2x^{(i)}_2+...+\theta_nx^{(i)}_n))^2\\J(\theta_0,\theta_1,\theta_2,...,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})^2 \begin{cases}x_0 = 1\end{cases}\\J(\overline\theta)=\frac{1}{2m}\sum_{i=1}^m(y^{(i)}-\overline{x^{(i)}}^{T}\overline\theta)^2\\J(\overline\theta)=\frac{1}{2m}sum(\overline y - \hat{\overline y})^2\\J(\overline\theta)=\frac{1}{2m}sum(\overline y-X\cdot\overline\theta)^2\begin{cases}\hat{\overline y} = m \times 1 \space vector\\ X = m\times (n+1) \space matrix \\\overline\theta = (n+1)\times 1 \space vector \\ with \space m \space samples \\ and \space n \space features\end{cases}
\]
Again we have reached a matrix multiplication for calculating the cost efficiently.
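As a sketch (again with made-up data, and not course-assignment code), the cost function can be written directly from the formula above:

```python
# Sketch of the cost function J(theta) = 1/(2m) * sum((y_hat - y)^2).

def compute_cost(X, y, theta):
    """Mean squared error between predictions and measurements,
    divided by 2 as in the course's convention."""
    m = len(y)
    y_hat = [sum(t * x for t, x in zip(theta, row)) for row in X]
    return sum((yh - yv) ** 2 for yh, yv in zip(y_hat, y)) / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # leading 1 is the x0 column
y = [1.0, 2.0, 3.0]
print(compute_cost(X, y, [0.0, 1.0]))  # perfect fit: cost is 0.0
print(compute_cost(X, y, [0.0, 0.5]))  # worse fit: cost is 3.5/6
```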
With Gradient Descent we choose successive \(\theta\) values, moving progressively closer to the best \(\theta\) values, until we reach the optimal \(\theta\) where the cost \(J(\theta)\) is at a minimum.
The way we will do this is to calculate the gradient of the cost function \(J(\theta)\) and adjust the \(\theta\) values in the direction opposite to this gradient. To calculate the gradient we must take the partial derivatives of \(J(\theta)\).
\[\frac{\partial }{\partial \theta_k}J(\theta_0,\theta_1,\theta_2,...,\theta_n)=\frac{\partial }{\partial \theta_k}\frac{1}{2m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})^2 \begin{cases}x_0 = 1\end{cases}
\\
=\frac{1}{2m}\sum_{i=1}^m\frac{\partial }{\partial \theta_k}(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})^2
\\
=\frac{1}{m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})\frac{\partial }{\partial \theta_k}(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})
\\
=\frac{1}{m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})(-\sum_{j=0}^n\frac{\partial }{\partial \theta_k}\theta_jx_j^{(i)})
\\
=\frac{1}{m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})(-\frac{\partial }{\partial \theta_k}\theta_kx_k^{(i)})
\\
=\frac{1}{m}\sum_{i=1}^m(y^{(i)}-\sum_{j=0}^n\theta_jx_j^{(i)})(-x_k^{(i)})
\\
=\frac{1}{m}\sum_{i=1}^m x_k^{(i)}(\sum_{j=0}^n\theta_jx_j^{(i)}-y^{(i)})
\\
\frac{\partial }{\partial \theta_k}J(\overline\theta)=\frac{1}{m}\sum_{i=1}^m x_k^{(i)}(\hat{y}^{(i)}-y^{(i)})
\\
\frac{\partial }{\partial \overline\theta}J(\overline\theta)=\frac{1}{m}X^{T}(X\overline\theta-\overline y)\]
We get another very compact and efficient matrix operation for the partial derivatives of the cost function.
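Here is a plain-Python sketch of that matrix form, \(\frac{1}{m}X^T(X\overline\theta-\overline y)\), with lists standing in for a linear algebra package (made-up data, not course code):

```python
# Sketch: the gradient of the cost function, (1/m) * X^T * (X*theta - y).

def gradient(X, y, theta):
    m = len(y)
    # residual vector: X*theta - y
    residual = [sum(t * x for t, x in zip(theta, row)) - yv
                for row, yv in zip(X, y)]
    # k-th partial derivative: (1/m) * sum over i of x_k(i) * residual(i)
    return [sum(X[i][k] * residual[i] for i in range(m)) / m
            for k in range(len(theta))]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
print(gradient(X, y, [0.0, 0.0]))  # first entry -2.0, second entry -14/3
```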
Now, to implement gradient descent we will iterate until convergence updating our \(\theta\) values every iteration in the following way:
\[\overline\theta_{i+1}=\overline\theta_i -\alpha\frac{\partial }{\partial \overline\theta}J(\overline\theta)\]
or equivalently
\[\theta := \theta - \alpha \nabla J(\theta)\]
Where \(\nabla J(\theta)\) is:
\[\nabla J(\theta) = \begin{bmatrix}
\frac{\partial J(\theta)}{\partial \theta_0} \newline
\frac{\partial J(\theta)}{\partial \theta_1} \newline
\vdots \newline
\frac{\partial J(\theta)}{\partial \theta_n}
\end{bmatrix}\]
where \(\alpha\) is a constant, called the learning rate, which must be chosen appropriately. If the learning rate is too large, the descent will overshoot and might not converge, while if it is too small it will converge very slowly.
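Putting the update rule and the gradient together, the whole loop can be sketched like this (made-up data, a fixed iteration count instead of a convergence test, and not course-assignment code):

```python
# Sketch of gradient descent: theta := theta - alpha * grad J(theta).

def gradient_descent(X, y, theta, alpha, iters):
    m = len(y)
    for _ in range(iters):
        # residual = X*theta - y
        residual = [sum(t * x for t, x in zip(theta, row)) - yv
                    for row, yv in zip(X, y)]
        # grad = (1/m) * X^T * residual
        grad = [sum(X[i][k] * residual[i] for i in range(m)) / m
                for k in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# samples generated from y = 1 + 2*x, so the optimum is theta = [1, 2]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y, [0.0, 0.0], alpha=0.1, iters=5000)
print(theta)  # converges very close to [1.0, 2.0]
```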
One final point about "linear" regression is that one can use it for just about any type of hypothesis function by simply defining a feature \(x_k\) as a function of the actual feature that you would like to use. For instance, one could set \(x_2=x_1^2\) and \(x_3=\sin(x_1)\) and then apply exactly the same linear regression gradient descent as explained above.
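A sketch of such a feature mapping (the `expand` helper is my own hypothetical name, not something from the course):

```python
import math

def expand(x1):
    """Map one raw input x1 to the feature row [x0, x1, x2, x3],
    with x2 = x1^2 and x3 = sin(x1) as described above."""
    return [1.0, x1, x1 ** 2, math.sin(x1)]

# Build the design matrix X from raw inputs; gradient descent then
# runs on these rows exactly as in the plain linear case.
raw = [0.5, 1.0, 2.0]
X = [expand(x1) for x1 in raw]
print(X[2][:3])  # -> [1.0, 2.0, 4.0]
```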
Finally the Normal Equation was presented, which is a direct method of solving for the optimal values of \(\theta\). The equation is given by:
$$
\theta = (X^T X)^{-1}X^T y
$$
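As a sketch, for the two-parameter case the normal equation can even be solved by hand with the closed-form inverse of a 2×2 matrix (made-up data, not course code; a real implementation would hand the general case to a linear algebra solver such as Octave's `pinv`):

```python
# Sketch: theta = (X^T X)^{-1} X^T y for a 2-parameter model,
# using the closed-form 2x2 matrix inverse.

def normal_equation_2(X, y):
    # Entries of X^T X (symmetric 2x2) and X^T y (2x1)
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    p = sum(row[0] * yv for row, yv in zip(X, y))
    q = sum(row[1] * yv for row, yv in zip(X, y))
    det = a * d - b * b
    # [[a, b], [b, d]]^{-1} = (1/det) * [[d, -b], [-b, a]]
    return [(d * p - b * q) / det, (a * q - b * p) / det]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]   # exactly y = 1 + 2*x
print(normal_equation_2(X, y))  # -> [1.0, 2.0]
```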
For the second week's assignment you basically had to implement the cost function, gradient descent, and the normal equation. With the matrix representations of the calculations it is very easy to implement these algorithms in Octave.
After completing the exercises I continued to try and optimize the algorithm.
To me it seems silly that one must manually choose a good \(\alpha\), so I created a version of gradient descent with an adaptive \(\alpha\), which automatically adjusts \(\alpha\) to a good value on each iteration.
Basically my technique is as follows:
1. Calculate \(deltaTheta = \nabla J(\theta)\)
2. Calculate \(cost = J(\theta - \alpha \cdot deltaTheta)\)
3. Calculate \(newCost = J(\theta - changeFactor \cdot \alpha \cdot deltaTheta)\)
4. If \(newCost > cost\):
    - Set \(changeFactor = 1/changeFactor\)
    - Recalculate \(newCost = J(\theta - changeFactor \cdot \alpha \cdot deltaTheta)\)
5. While \(newCost < cost\):
    - Update \(\alpha\) to \(changeFactor \cdot \alpha\)
    - Set \(cost = newCost\)
    - Recalculate \(newCost = J(\theta - changeFactor \cdot \alpha \cdot deltaTheta)\)
In this way I am either increasing or decreasing the \(\alpha\) by my changeFactor (which works well with a value of 2) iteratively until the change does not result in a reduction in cost.
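The steps above can be sketched like this in Python (the assignments use Octave, but this is my own technique rather than assignment code; the toy one-dimensional cost and its gradient below are made up for illustration):

```python
def adaptive_step(J, theta, grad, alpha, change_factor=2.0):
    """One gradient descent step with the adaptive-alpha search:
    scale alpha by change_factor (or its inverse) for as long as the
    scaled step keeps reducing the cost, then take the step."""
    step = lambda a: [t - a * g for t, g in zip(theta, grad)]
    cost = J(step(alpha))
    new_cost = J(step(change_factor * alpha))
    if new_cost > cost:                      # growing alpha made it worse,
        change_factor = 1.0 / change_factor  # so try shrinking instead
        new_cost = J(step(change_factor * alpha))
    while new_cost < cost:                   # keep scaling while it helps
        alpha *= change_factor
        cost = new_cost
        new_cost = J(step(change_factor * alpha))
    return step(alpha), alpha

# Toy 1-D cost J(theta) = (theta - 3)^2 and its gradient at theta = 0
J = lambda th: (th[0] - 3.0) ** 2
new_theta, alpha = adaptive_step(J, [0.0], [-6.0], 0.01)
print(new_theta, alpha)  # alpha doubles from 0.01 up to 0.64
```

With the starting alpha of 0.01 the search doubles alpha six times before a further doubling would overshoot the minimum at 3, so a single step already lands near the optimum.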
This had a profound improvement on the test dataset provided:
[Plot: cost vs. iteration with a static alpha of 0.01]
[Plot: cost vs. iteration with an adaptive alpha starting at 0.01]
Note that in the plots above I am also counting each evaluation of the cost function (during the optimization of alpha) as an iteration. In reality the number of gradient descent iterations is much lower, and the cost is also uniformly descending.
In the case where a near-optimal alpha is chosen to start with, normal gradient descent should be a bit more efficient than the adaptive-alpha method, but in general the adaptive method will perform much better.