It all starts with y=mx+b


It all starts with the equation of a straight line, y = mx + b. The simplest form of data fitting is least squares: minimizing the sum of squared errors (the differences between the actual and predicted values).

Least squares works by taking the partial derivatives of the sum of squared residuals \(R^2\) of the line \(f(x) = a + bx\) with respect to \(a\) and \(b\), and setting them to zero:



\[\frac{\delta(R^2)}{ \delta a} = 2 \sum_{i=1}^{n}[y_i-(a+bx_i) ]= 0\] 

\[\frac{\delta(R^2)}{ \delta b} = 2 \sum_{i=1}^{n}[y_i-(a+bx_i) ]x_i = 0\] 

Skipping the algebra, solving these two equations for \(a\) and \(b\) gives


\[a = \frac {\bar{y}\left(\sum_{i=1}^{n} x_i^2\right) - \bar{x}\sum_{i=1}^{n} x_i y_i} {\sum_{i=1}^{n}x_i^2 - n \bar{x}^2}\]

\[b = \frac {\left(\sum_{i=1}^{n} x_i y_i\right) - n \bar{x} \bar{y}} {\sum_{i=1}^{n}x_i^2 - n \bar{x}^2}\]

where \(\bar{x}\) and \(\bar{y}\) are the means of the \(x_i\) and \(y_i\).
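These closed-form estimates are easy to check numerically. A minimal sketch with NumPy (the data and variable names here are illustrative, not from the text):

```python
import numpy as np

# Toy data: y is roughly 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

n = len(x)
x_mean, y_mean = x.mean(), y.mean()

# Closed-form least-squares estimates from the equations above
denom = np.sum(x**2) - n * x_mean**2
b = (np.sum(x * y) - n * x_mean * y_mean) / denom          # slope
a = (y_mean * np.sum(x**2) - x_mean * np.sum(x * y)) / denom  # intercept

print(a, b)
```

The result should agree with `np.polyfit(x, y, 1)`, which solves the same least-squares problem.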

Optimization techniques use the slope to minimize the error. In machine learning, these error measures are called cost functions: a cost function is simply the quantity you want to minimize. In the case of least squares, the cost function is the sum of squared errors.

That covers one variable, but typical data has multiple variables (it is multivariate). Enter the dragon: the great gradient descent, an iterative method for minimizing the cost function. Here the cost function is the mean of all (actual − predicted)² terms, defined as

\[Error_{(m,b)} = \frac 1 N \sum_{i=1}^{N} (y_{i}-(mx_i+b))^2\]

and the partial derivatives with respect to m and b would be

\[ \frac{\delta}{ \delta m} = \frac 2 N \sum_{i=1}^{N} -x_i(y_{i}-(mx_i+b))\]

\[ \frac{\delta}{ \delta b} = \frac 2 N \sum_{i=1}^{N} -(y_{i}-(mx_i+b))\]

The update rule subtracts a fraction of each gradient on every step:

\[m := m - \alpha \frac{\delta}{\delta m}, \qquad b := b - \alpha \frac{\delta}{\delta b}\]

where \(\alpha\) is the learning rate, typically very small, like 0.001 or 0.0001.
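The whole loop fits in a few lines of Python. A sketch (the learning rate, iteration count, and toy data are my own illustrative choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=5000):
    """Fit y ~ m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        pred = m * x + b
        # Partial derivatives of the cost, as in the equations above
        dm = (2.0 / n) * np.sum(-x * (y - pred))
        db = (2.0 / n) * np.sum(-(y - pred))
        # Update rule: step against the gradient, scaled by alpha
        m -= alpha * dm
        b -= alpha * db
    return m, b

m, b = gradient_descent(np.array([0.0, 1.0, 2.0, 3.0, 4.0]),
                        np.array([1.1, 2.9, 5.2, 6.8, 9.1]))
print(m, b)  # converges toward the closed-form least-squares solution
```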

Another way of putting it would be 



Cost Function 


\[Error_{(m,b)} = \frac 1 N \sum_{i=1}^{N} (y_{i}-(mx_i+b))^2\]

Gradient Descent 


\[\theta_j := \theta_j - \alpha \frac{\delta}{\delta\theta_j}J(\theta_0,\theta_1)\]

Update rule 


\[\theta_0 := \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})\]

\[\theta_1 := \theta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})\,x^{(i)}\]

In the multivariate case, replace \(\theta_1\) with \(\theta_j\), where \(j\) indexes each variable, and multiply by the corresponding feature \(x_j^{(i)}\).

The updates are repeated until convergence. Note the extra \(\frac{\lambda}{n}\theta_j\) term in the second update: it is L2 regularization, which shrinks the weights to reduce overfitting.

\[\theta_0 := \theta_0 - \alpha \frac{1}{n}\sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})\]

\[\theta_{j} := \theta_{j} - \alpha\left[\frac{1}{n}\sum_{i=1}^{n}(h_{\theta}(x^{(i)})-y^{(i)})\,x_{j}^{(i)} + \frac{\lambda}{n}\theta_{j}\right]\]
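The multivariate update vectorizes naturally. A sketch in NumPy, assuming the common convention where the gradient is averaged over the examples and the regularization term is scaled by the learning rate (the function name, data, and hyperparameter values are mine):

```python
import numpy as np

def ridge_gd(X, y, alpha=0.1, lam=0.1, iters=2000):
    """Gradient descent for multivariate linear regression with L2 regularization.
    X: (N, d) feature matrix; a column of ones is prepended for the intercept."""
    N = len(y)
    Xb = np.hstack([np.ones((N, 1)), X])   # add intercept column (theta_0)
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        err = Xb @ theta - y               # h_theta(x) - y for all examples
        grad = (Xb.T @ err) / N            # averaged gradient for every theta_j
        grad[1:] += (lam / N) * theta[1:]  # regularize all weights except theta_0
        theta -= alpha * grad
    return theta

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
theta = ridge_gd(X, y)
print(theta)  # [intercept, slope], slightly shrunk by the lambda penalty
```

With a single feature this reduces to the simple-regression updates; with \(\lambda = 0\) it converges to the ordinary least-squares solution.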

Some questions remain: why subtract the slope \(\nabla f(a)\)? And why a learning rate?