Why squared error?
Why do we use least squares as a cost function?
Assume the following:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where \(\epsilon^{(i)}\) is an error term capturing unmodelled effects and random noise.
We also assume that the errors are independently and identically distributed (i.i.d.) according to a Gaussian (normal) distribution:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$
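This generative story is easy to simulate as a sanity check. A minimal sketch is below; the values of `true_theta`, `sigma`, and `m` are hypothetical, chosen only for illustration (scalar \(\theta\) for simplicity).

```python
import numpy as np

# Minimal simulation of the assumed model y = theta * x + eps.
# true_theta, sigma, and m are illustrative values, not from the notes.
rng = np.random.default_rng(0)

m = 100           # number of training examples
true_theta = 2.5  # "true" parameter (scalar case for simplicity)
sigma = 0.5       # noise standard deviation

x = rng.uniform(0, 10, size=m)        # inputs x^(i)
eps = rng.normal(0.0, sigma, size=m)  # eps^(i) ~ N(0, sigma^2), i.i.d.
y = true_theta * x + eps              # y^(i) = theta * x^(i) + eps^(i)
```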
Why Gaussian? By the Central Limit Theorem, the sum of many small independent effects tends toward a Gaussian distribution. The probability density of \(\epsilon^{(i)}\) is then
$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(\epsilon^{(i)})^2}{2 \sigma^2})$$
This implies that
$$P(y^{(i)}| x^{(i)}; \theta) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2})$$
so the distribution of a particular house price \(y^{(i)}\), for given \(x^{(i)}\) and \(\theta\), is Gaussian with mean \(\theta^T x^{(i)}\) and standard deviation \(\sigma\):
$$y^{(i)}| x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$$
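To make this concrete, the density above is straightforward to evaluate directly. A minimal sketch with plain NumPy; `theta`, `sigma`, `x_i`, and `y_i` are hypothetical values for illustration.

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    """The normal density N(mean, sigma^2) evaluated at y."""
    return np.exp(-((y - mean) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# P(y^(i) | x^(i); theta) is this density with mean theta * x^(i).
# theta, sigma, x_i, y_i are illustrative values.
theta, sigma = 2.5, 0.5
x_i, y_i = 4.0, 10.2
print(gaussian_pdf(y_i, theta * x_i, sigma))  # a density value, not a probability
```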
The likelihood of \(\theta\) is
$$ \begin{align} \mathcal{L}(\theta) &= p(\vec y | x; \theta)\\[2ex] &= \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}) \end{align}$$
The log likelihood is
$$ \begin{align} \ell (\theta) &= \log \mathcal{L} (\theta)\\[2ex] &= \log \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \sum_{i=1}^m \left(\log \frac{1}{\sqrt{2 \pi} \, \sigma} + \log \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}) \right)\\[2ex] &= m \log \frac{1}{\sqrt{2 \pi} \, \sigma} - \frac{1}{2 \sigma^2} \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 \end{align} $$
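A quick numerical check of this algebra: summing the per-example log densities directly should match the closed form on the last line. A minimal sketch, reusing the same hypothetical simulated data as above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, theta, sigma = 100, 2.5, 0.5          # illustrative values
x = rng.uniform(0, 10, size=m)
y = theta * x + rng.normal(0.0, sigma, size=m)

res = y - theta * x  # residuals y^(i) - theta * x^(i)

# Log likelihood as a direct sum of per-example log densities ...
ell_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - res**2 / (2 * sigma**2))

# ... and via the closed form: m*log(1/(sqrt(2*pi)*sigma)) - sum(res^2)/(2*sigma^2)
ell_closed = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - res @ res / (2 * sigma**2)

assert np.isclose(ell_direct, ell_closed)
```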
Maximum likelihood estimation (MLE) chooses \(\theta\) to maximize the likelihood \(\mathcal{L} (\theta)\). Since \(\log\) is monotonically increasing, this is equivalent to maximizing the log likelihood \(\ell (\theta)\), which is easier to work with. The first term of \(\ell (\theta)\) does not depend on \(\theta\), and the constant factor \(\frac{1}{\sigma^2}\) does not affect the minimizer, so maximizing \(\ell (\theta)\) is equivalent to minimizing
$$\frac 12 \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 = J(\theta)$$
which is exactly our least squares cost function \(J(\theta)\). Under the Gaussian noise assumption, minimizing squared error is therefore maximum likelihood estimation, regardless of the value of \(\sigma^2\).
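To see this equivalence numerically, the sketch below scans a grid of candidate \(\theta\) values on simulated data (the same hypothetical setup as before) and checks that the log-likelihood maximizer and the least-squares minimizer coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
m, true_theta, sigma = 100, 2.5, 0.5     # illustrative values
x = rng.uniform(0, 10, size=m)
y = true_theta * x + rng.normal(0.0, sigma, size=m)

thetas = np.linspace(0, 5, 1001)                 # candidate parameter values
res = y[None, :] - thetas[:, None] * x[None, :]  # residuals for each candidate
sq = (res ** 2).sum(axis=1)                      # sum of squared residuals

J = 0.5 * sq                                                              # least squares cost
ell = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - sq / (2 * sigma**2)  # log likelihood

# The same theta maximizes the likelihood and minimizes J.
assert np.argmax(ell) == np.argmin(J)
print(thetas[np.argmax(ell)])  # close to true_theta = 2.5
```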