Normal equation
The gradient of the cost $J(\theta)$ with respect to the parameter vector $\theta \in \mathbb{R}^{n+1}$ is the column vector of its partial derivatives, shown here for $n = 2$:
$$ \nabla_{\theta} J(\theta) = \left[ \begin{array}{c} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \end{array} \right] $$
Let $A \in \mathbb{R}^{2 \times 2}$ be a $2 \times 2$ matrix:
$$ A = \left[ \begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{array} \right] $$
Let $f$ be a function that maps the matrix $A$ to a real number, for example $f(A) = A_{11} + A_{12}^2$. Then:
$$ f \left( \left[ \begin{array}{cc} 5 & 6 \\ 7 & 8 \\ \end{array} \right] \right) = 5 + 6^2 = 41 $$
The gradient of the function $f$ with respect to the matrix $A$:
$$ \begin{align} \nabla_A f(A) & = \left[ \begin{array}{cc} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} \\ \end{array} \right] \\[2ex] & = \left[ \begin{array}{cc} 1 & 2A_{12} \\ 0 & 0 \\ \end{array} \right] \end{align} $$
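As a sanity check, this gradient can be verified numerically (a minimal NumPy sketch; the function and matrix are just the example above, and the finite-difference step `eps` is an arbitrary choice):

```python
import numpy as np

# f(A) = A_11 + A_12^2, written with 0-based NumPy indexing
def f(A):
    return A[0, 0] + A[0, 1] ** 2

A = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Analytic gradient from the derivation above: [[1, 2*A_12], [0, 0]]
grad_analytic = np.array([[1.0, 2 * A[0, 1]],
                          [0.0, 0.0]])

# Finite-difference approximation of df/dA_ij
eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(2):
    for j in range(2):
        A_plus = A.copy()
        A_plus[i, j] += eps
        grad_numeric[i, j] = (f(A_plus) - f(A)) / eps

print(f(A))                                                  # 41.0
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```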
In order to minimize the cost, we want to set the gradient to zero and solve for $\theta$:
$$ \nabla_\theta J(\theta) = \vec 0 $$
Assuming $A$ is a square matrix with real entries, we define $tr(A)$ as the sum of the diagonal entries of $A$. Some properties of the trace operator (checked numerically in the sketch after this list):
- $tr(A^T) = tr(A)$
- If $f(A) = tr(AB)$, then $\nabla_A f(A) = B^T$
- $tr(AB) = tr(BA)$
- $tr(ABC) = tr(CAB)$
- $\nabla_A tr(AA^TC) = CA + C^TA$
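A minimal NumPy sketch that checks these identities on random matrices (the matrix size and the finite-difference step `eps` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# tr(A^T) = tr(A)
assert np.isclose(np.trace(A.T), np.trace(A))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# tr(ABC) = tr(CAB)  (cyclic permutation)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# grad_A tr(A A^T C) = C A + C^T A, checked by finite differences
def g(A):
    return np.trace(A @ A.T @ C)

eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        A_plus = A.copy()
        A_plus[i, j] += eps
        grad_numeric[i, j] = (g(A_plus) - g(A)) / eps

assert np.allclose(grad_numeric, C @ A + C.T @ A, atol=1e-4)
```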
Cost function
$$ J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 $$
Let $X$ be a design matrix of the form:
$$ X = \left[ \begin{array}{c} ---(x^{(1)})^T--- \\ ---(x^{(2)})^T--- \\ \vdots \\ ---(x^{(m)})^T--- \\ \end{array} \right] $$
$$ \begin{align} X\theta & = \left[ \begin{array}{c} ---(x^{(1)})^T--- \\ ---(x^{(2)})^T--- \\ \vdots \\ ---(x^{(m)})^T--- \\ \end{array} \right] \left[ \begin{array}{c} \theta_0 \\ \theta_1 \\ \theta_2 \\ \end{array} \right] \\[2ex] & = \left[ \begin{array}{c} (x^{(1)})^T \theta \\ (x^{(2)})^T \theta \\ \vdots \\ (x^{(m)})^T \theta \\ \end{array} \right] \\[2ex] & = \left[ \begin{array}{c} h_\theta (x^{(1)}) \\ h_\theta (x^{(2)}) \\ \vdots \\ h_\theta (x^{(m)}) \\ \end{array} \right] \end{align} $$
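As a concrete illustration, a minimal NumPy sketch (the data values are made up, and each $x^{(i)}$ is assumed to include the intercept entry $x_0 = 1$):

```python
import numpy as np

# Design matrix: one training example per row, x_0 = 1 for the intercept
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])        # m = 3 examples, n = 2 features

theta = np.array([0.5, -1.0, 2.0])     # theta_0, theta_1, theta_2

# X @ theta stacks the predictions h_theta(x_i) = x_i^T theta
predictions = X @ theta
row_by_row = np.array([x_i @ theta for x_i in X])

print(predictions)                             # [4.5 6.5 8.5]
print(np.allclose(predictions, row_by_row))    # True
```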
$$ y = \left[ \begin{array}{c} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \\ \end{array} \right] $$
$$ \begin{align} J(\theta) & = \frac{1}{2} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 \\ & = \frac{1}{2} (X\theta - y)^T (X\theta - y) \end{align} $$
$$ X\theta - y = \left[ \begin{array}{c} h_\theta (x^{(1)}) - y^{(1)} \\ h_\theta (x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta (x^{(m)}) - y^{(m)} \\ \end{array} \right] $$
For any vector $z$, the product $z^T z$ is the sum of its squared elements:
$$ z^T z = \sum_i z_i^2 $$
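The two forms of $J(\theta)$ can be checked against each other numerically (a minimal sketch with made-up values for $X$, $y$, and $\theta$):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.array([0.5, 0.7])

# Sum-of-squares form: (1/2) * sum_i (h_theta(x_i) - y_i)^2
J_sum = 0.5 * sum((x_i @ theta - y_i) ** 2 for x_i, y_i in zip(X, y))

# Vectorized form: (1/2) * (X theta - y)^T (X theta - y)
residual = X @ theta - y
J_vec = 0.5 * residual @ residual

print(np.isclose(J_sum, J_vec))   # True
```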
$$ \begin{align} \nabla_\theta J(\theta) & = \nabla_\theta \frac{1}{2} (X\theta - y)^T (X\theta - y) \\[2ex] & = \frac{1}{2} \nabla_\theta (\theta^T X^T - y^T) (X\theta - y) \\[2ex] & = \frac{1}{2} \nabla_\theta (\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y) \\[2ex] & = \frac{1}{2} (X^T X\theta + X^T X\theta - X^T y - X^T y) \\[2ex] & = X^T X\theta - X^T y \\ \end{align} $$
The last gradient step uses $\nabla_\theta\, \theta^T X^T X \theta = 2 X^T X \theta$ (since $X^T X$ is symmetric) and $\nabla_\theta\, \theta^T X^T y = \nabla_\theta\, y^T X \theta = X^T y$, while $y^T y$ does not depend on $\theta$.
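The closed-form gradient $X^T X\theta - X^T y$ can likewise be checked against a finite-difference gradient of $J(\theta)$ (a minimal sketch; the data and the step `eps` are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.array([0.5, 0.7])

def J(theta):
    r = X @ theta - y
    return 0.5 * r @ r

# Analytic gradient from the derivation above
grad_analytic = X.T @ X @ theta - X.T @ y

# Finite-difference gradient, one coordinate of theta at a time
eps = 1e-6
grad_numeric = np.zeros_like(theta)
for j in range(theta.size):
    theta_plus = theta.copy()
    theta_plus[j] += eps
    grad_numeric[j] = (J(theta_plus) - J(theta)) / eps

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```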
$$ \begin{align} \nabla_\theta J(\theta) = \vec 0 \implies X^T X\theta = X^T y \\ \end{align} $$
Normal equation:
$$ \theta = (X^T X)^{-1} X^T y $$
If $X^T X$ is not invertible, it means we have redundant features (the features are linearly dependent). In that case we can use the pseudoinverse to compute a valid solution.
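A minimal NumPy sketch of solving the normal equation (the data is made up; `np.linalg.pinv` covers the pseudoinverse case mentioned above, and `np.linalg.lstsq` is used only as a reference solution):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])
y = np.array([2.0, 3.0, 4.0, 6.0])

# Normal equation: theta = (X^T X)^{-1} X^T y
# (solving the linear system is preferred to forming the explicit inverse)
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse version; also usable when X^T X is singular
theta_pinv = np.linalg.pinv(X) @ y

# Reference least-squares solution
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_pinv))    # True
print(np.allclose(theta_normal, theta_lstsq))   # True
```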