Why squared error?
Why do we use least squares as a cost function?
Assume the following:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term capturing unmodelled effects and random noise.
We also assume that the errors are independently and identically distributed (i.i.d.), each with a Gaussian (normal) distribution:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$
Why Gaussian? Because of the Central Limit Theorem: the error term aggregates many small, independent unmodelled effects, and the sum of many such effects is approximately normally distributed. Then the probability density of $\epsilon^{(i)}$ is
$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(\epsilon^{(i)})^2}{2 \sigma^2})$$
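As a quick sanity check (not part of the original derivation), the sketch below evaluates this density formula directly and compares it against `scipy.stats.norm.pdf`; the value of `sigma` and the sample errors are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0                                   # made-up noise standard deviation
eps = np.array([-3.0, -1.0, 0.0, 0.5, 2.5])   # made-up sample errors

# Density exactly as written above: 1/(sqrt(2*pi)*sigma) * exp(-eps^2 / (2*sigma^2))
p_manual = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
p_scipy = norm.pdf(eps, loc=0.0, scale=sigma)

assert np.allclose(p_manual, p_scipy)
```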
This implies that
$$P(y^{(i)}| x^{(i)}; \theta) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2})$$
so that, for given $x^{(i)}$ and $\theta$, the house price $y^{(i)}$ is distributed as a Gaussian with mean $\theta^T x^{(i)}$ and standard deviation $\sigma$ (the semicolon indicates that $\theta$ is a fixed parameter we are estimating, not a random variable we condition on):
$$y^{(i)}| x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$$
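To make this modelling assumption concrete, here is a minimal simulation sketch of the assumed data-generating process; `theta_true`, the number of examples `m`, the noise level `sigma`, and the feature matrix `X` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 0.5                     # made-up sample size and noise level
theta_true = np.array([1.5, -2.0])      # made-up "true" parameters

# One intercept column plus one feature, so theta^T x = theta_0 + theta_1 * x_1
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, size=m)])
eps = rng.normal(0.0, sigma, size=m)    # i.i.d. Gaussian noise, eps ~ N(0, sigma^2)
y = X @ theta_true + eps                # y^(i) = theta^T x^(i) + eps^(i)
```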
The likelihood of $\theta$ is the probability of the observed data viewed as a function of the parameters:
$$ \begin{align} \mathcal{L}(\theta) &= p(\vec y | X; \theta)\\[2ex] &= \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \, \sigma} \exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}\right) \end{align}$$
where the factorization into a product follows from the independence assumption on the errors.
The log likelihood is
$$ \begin{align} \ell (\theta) &= \log \mathcal{L} (\theta)\\ &= \log \prod_{i=1}^m p(y^{(i)}| x^{(i)}; \theta)\\[2ex] &= \sum_{i=1}^m \left(\log \frac{1}{\sqrt{2 \pi} \, \sigma} + \log \exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2 \sigma^2}\right) \right)\\[2ex] &= m \log \frac{1}{\sqrt{2 \pi} \, \sigma} - \frac{1}{2 \sigma^2} \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 \end{align} $$
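As a numerical check of this simplification, the sketch below (again just an illustration, reusing the made-up data-generating setup from the simulation above) evaluates the log likelihood both from the final closed form and as a sum of per-example log densities; the two agree.

```python
import numpy as np
from scipy.stats import norm

# Made-up data drawn from the assumed model, as in the simulation sketch above
rng = np.random.default_rng(0)
m, sigma = 100, 0.5
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, size=m)])
y = X @ np.array([1.5, -2.0]) + rng.normal(0.0, sigma, size=m)

def log_likelihood_closed_form(theta):
    # m * log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum of squared residuals
    r = y - X @ theta
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - np.sum(r**2) / (2.0 * sigma**2)

def log_likelihood_per_example(theta):
    # sum_i log p(y^(i) | x^(i); theta) with y^(i) ~ N(theta^T x^(i), sigma^2)
    return np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma))

theta = np.array([1.0, -1.0])  # arbitrary candidate parameters
assert np.isclose(log_likelihood_closed_form(theta), log_likelihood_per_example(theta))
```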
Maximum likelihood estimation (MLE): choose $\theta$ to maximize the likelihood $\mathcal{L} (\theta)$. Since $\log$ is strictly increasing, this is equivalent to maximizing the log likelihood $\ell (\theta)$, which is easier to work with. The first term does not depend on $\theta$, and $\frac{1}{2\sigma^2}$ is a positive constant, so maximizing $\ell (\theta)$ is the same as choosing $\theta$ to minimize
$$\frac 12 \sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2 = J(\theta)$$
which is exactly our least squares cost function $J(\theta)$: under the i.i.d. Gaussian noise assumption, least squares regression is maximum likelihood estimation of $\theta$.
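To close the loop, here is a small sketch (my own illustration, with invented data) showing that numerically maximizing the Gaussian log likelihood and solving the least squares problem give the same estimate of $\theta$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Made-up data generated from the assumed model
rng = np.random.default_rng(0)
m, sigma = 200, 1.0
theta_true = np.array([4.0, 3.0])
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, size=m)])
y = X @ theta_true + rng.normal(0.0, sigma, size=m)

# MLE: minimize the negative Gaussian log likelihood
def neg_log_likelihood(theta):
    return -np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

# Least squares: minimize J(theta) = (1/2) * sum of squared residuals
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_mle)  # the two estimates agree (up to optimizer tolerance)
print(theta_ls)
```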