Why squared error?
Why do we use least squares as a cost function?
Assume the following:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where \(\epsilon^{(i)}\) is an error term capturing unmodelled effects and random noise.
We also assume that the errors are independently and identically distributed (i.i.d.) according to a Gaussian (normal) distribution:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$
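This generative story is easy to simulate as a sanity check. A minimal sketch is below; the values of `true_theta`, `sigma`, and `m` are hypothetical, chosen only for illustration (scalar \(\theta\) for simplicity).

```python
import numpy as np

# Minimal simulation of the assumed model y = theta * x + eps.
# true_theta, sigma, and m are illustrative values, not from the notes.
rng = np.random.default_rng(0)

m = 100           # number of training examples
true_theta = 2.5  # "true" parameter (scalar case for simplicity)
sigma = 0.5       # noise standard deviation

x = rng.uniform(0, 10, size=m)        # inputs x^(i)
eps = rng.normal(0.0, sigma, size=m)  # eps^(i) ~ N(0, sigma^2), i.i.d.
y = true_theta * x + eps              # y^(i) = theta * x^(i) + eps^(i)
```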
Why Gaussian? By the Central Limit Theorem, the sum of many small independent effects tends toward a Gaussian distribution. The probability density of \(\epsilon^{(i)}\) is then
$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(\epsilon^{(i)})^2}{2 \sigma^2})$$
This implies that
$$P(y^{(i)}| x^{(i)}; \theta) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2})$$
so the distribution of a particular house price \(y^{(i)}\), for given \(x^{(i)}\) and \(\theta\), is Gaussian with mean \(\theta^T x^{(i)}\) and standard deviation \(\sigma\):
$$y^{(i)}| x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$$
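To make this concrete, the density above is straightforward to evaluate directly. A minimal sketch with plain NumPy; `theta`, `sigma`, `x_i`, and `y_i` are hypothetical values for illustration.

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    """The normal density N(mean, sigma^2) evaluated at y."""
    return np.exp(-((y - mean) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# P(y^(i) | x^(i); theta) is this density with mean theta * x^(i).
# theta, sigma, x_i, y_i are illustrative values.
theta, sigma = 2.5, 0.5
x_i, y_i = 4.0, 10.2
print(gaussian_pdf(y_i, theta * x_i, sigma))  # a density value, not a probability
```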
The likelihood of \(\theta\) is
$$ \begin{align} \mathcal{L}(\theta) &= p(\vec y | x; \theta)\\[2ex] &= \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}) \end{align}$$
The log likelihood is
$$ \begin{align} \ell (\theta) &= \log \mathcal{L} (\theta)\\[2ex] &= \log \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \sum_{i=1}^m \left(\log \frac{1}{\sqrt{2 \pi} \, \sigma} + \log \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}) \right)\\[2ex] &= m \log \frac{1}{\sqrt{2 \pi} \, \sigma} - \frac{1}{2 \sigma^2} \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 \end{align} $$
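A quick numerical check of this algebra: summing the per-example log densities directly should match the closed form on the last line. A minimal sketch, reusing the same hypothetical simulated data as above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, theta, sigma = 100, 2.5, 0.5          # illustrative values
x = rng.uniform(0, 10, size=m)
y = theta * x + rng.normal(0.0, sigma, size=m)

res = y - theta * x  # residuals y^(i) - theta * x^(i)

# Log likelihood as a direct sum of per-example log densities ...
ell_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - res**2 / (2 * sigma**2))

# ... and via the closed form: m*log(1/(sqrt(2*pi)*sigma)) - sum(res^2)/(2*sigma^2)
ell_closed = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - res @ res / (2 * sigma**2)

assert np.isclose(ell_direct, ell_closed)
```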
Maximum likelihood estimation (MLE) chooses \(\theta\) to maximize the likelihood \(\mathcal{L} (\theta)\). Since \(\log\) is monotonically increasing, this is equivalent to maximizing the log likelihood \(\ell (\theta)\), which is easier to work with. The first term of \(\ell (\theta)\) does not depend on \(\theta\), and the constant factor \(\frac{1}{\sigma^2}\) does not affect the minimizer, so maximizing \(\ell (\theta)\) is equivalent to minimizing
$$\frac 12 \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 = J(\theta)$$
which is exactly our least squares cost function \(J(\theta)\). Under the Gaussian noise assumption, minimizing squared error is therefore maximum likelihood estimation, regardless of the value of \(\sigma^2\).
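To see this equivalence numerically, the sketch below scans a grid of candidate \(\theta\) values on simulated data (the same hypothetical setup as before) and checks that the log-likelihood maximizer and the least-squares minimizer coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
m, true_theta, sigma = 100, 2.5, 0.5     # illustrative values
x = rng.uniform(0, 10, size=m)
y = true_theta * x + rng.normal(0.0, sigma, size=m)

thetas = np.linspace(0, 5, 1001)                 # candidate parameter values
res = y[None, :] - thetas[:, None] * x[None, :]  # residuals for each candidate
sq = (res ** 2).sum(axis=1)                      # sum of squared residuals

J = 0.5 * sq                                                              # least squares cost
ell = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - sq / (2 * sigma**2)  # log likelihood

# The same theta maximizes the likelihood and minimizes J.
assert np.argmax(ell) == np.argmin(J)
print(thetas[np.argmax(ell)])  # close to true_theta = 2.5
```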