# Introduction

Human learning and machine learning are similar. Learning is the process by which humans gain knowledge through experience and consequently adjust their behavior. Many machine learning techniques grew out of the work of psychologists and biologists who tried to make sense of human learning by building computational models.

Humans use symbols to simplify problem solving. Symbols denote other objects in the real world. In symbolic learning, the system learns a symbolic representation by analyzing positive and negative examples of a concept.

Yet another approach is to represent objects directly using features. This approach is motivated by the way perceptual organs sense the world through receptors: we create a vector in which each dimension corresponds to the value measured by one receptor. To measure the dissimilarity between two objects, we compute the distance between the two $D$-dimensional vectors that represent them. *Feature extraction* is the process of choosing the right features to represent an object.
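As a minimal sketch of this idea in plain Python (the function name and the Euclidean choice of distance are illustrative assumptions, not prescribed by the text), the dissimilarity between two feature vectors can be computed as:

```python
import math

def euclidean_distance(a, b):
    # Dissimilarity between two D-dimensional feature vectors,
    # measured as the Euclidean distance.
    assert len(a) == len(b), "vectors must have the same dimension D"
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two objects described by D = 2 features (a 3-4-5 right triangle):
d = euclidean_distance([1.0, 2.0], [4.0, 6.0])  # -> 5.0
```

Other distance measures (e.g. Manhattan distance) fit the same pattern; only the per-dimension term changes.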

Machine learning distinguishes between supervised learning and unsupervised learning. In supervised learning, we give a machine learning algorithm examples together with the corresponding desired answers, so that the algorithm learns a general rule that maps examples to answers. In unsupervised learning, a machine learning algorithm groups information by finding hidden structure in the data.

An artificial neuron is modeled after the biological neuron in the human brain. In a biological neuron, nervous signals enter the neuron through the dendrites, the soma processes the information, and the axon transmits the output to other neurons. The synapses are the points of connection between neurons.

Given two $D$-dimensional vectors $x$ and $w$, we call $x$ an input pattern and $w$ a stored pattern. The dot product of $x$ and $w$, which measures the projection of one vector onto the other, defines the value $net$:

$$net = \sum_{j=1}^D w_j \cdot x_j = \Vert x \Vert \cdot \Vert w \Vert \cdot \cos \omega$$

where $\omega$ is the angle between $x$ and $w$.

The components $w_j$ are called weights, and they represent the memory of the neuron. The output of the neuron is computed by passing the value $net$ through a nonlinear thresholding function:

$$o = \text{sgn}(net) = \begin{cases} 1 & \text{if $net \ge 0$} \\ -1 & \text{if $net \lt 0$} \end{cases} $$

The $\text{sgn}(net)$ operation is related to the threshold operation of a real biological neuron with $-1$ indicating not firing and $1$ firing. Firing means that the neuron sends output information to other neurons, whereas not firing means no information is sent. Such a model is known as the linear threshold unit (LTU).
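The LTU described above can be sketched directly in plain Python (the function names `net`, `sgn`, and `ltu` are illustrative, chosen to match the symbols in the text):

```python
def net(w, x):
    # Dot product: projection of input pattern x onto stored pattern w.
    return sum(wj * xj for wj, xj in zip(w, x))

def sgn(value):
    # Threshold: fire (+1) when net >= 0, otherwise do not fire (-1).
    return 1 if value >= 0 else -1

def ltu(w, x):
    # Linear threshold unit: threshold the dot product of w and x.
    return sgn(net(w, x))

# net = 0.5*1.0 + (-0.5)*2.0 = -0.5 < 0, so the neuron does not fire:
output = ltu([0.5, -0.5], [1.0, 2.0])  # -> -1
```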

Hypothesis function

$$ h_\theta (x) = \sum_{j=0}^{n} \theta_j x_j $$

where, with $n = 2$ features and the convention $x_0 = 1$,

$$ \theta = \left( \begin{array}{c} \theta_0 \\ \theta_1 \\ \theta_2 \\ \end{array} \right) $$

$$ x = \left( \begin{array}{c} x_0 \\ x_1 \\ x_2 \\ \end{array} \right) $$

- $\theta$ = parameters
- $m$ = # of training examples
- $x$ = inputs/features
- $y$ = output/target
- $(x,y)$ = training example
- $n$ = # of features

Choose $\theta$ s.t. $h_\theta (x) \approx y$ for training examples. For brevity, we will write $h_\theta (x)$ as $h(x)$.

Cost function

$$ J(\theta) = \frac{1}{2} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2 $$

We want to minimize $J(\theta)$ by choosing values of $\theta$. The reason we use the squared error is that it corresponds to the Gaussian case in the family of generalized linear models, of which linear regression is a special case.
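The hypothesis and cost function can be sketched in plain Python as follows (the helper names `h` and `cost` mirror the notation above; the toy data is an illustrative assumption):

```python
def h(theta, x):
    # Hypothesis: sum of theta_j * x_j, with x[0] == 1 by convention.
    return sum(t * xj for t, xj in zip(theta, x))

def cost(theta, xs, ys):
    # J(theta) = 1/2 * sum of squared errors over the m training examples.
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys))

# Toy training set: y = 1 + x1, so theta = [1, 1] fits exactly.
xs = [[1.0, 1.0], [1.0, 2.0]]
ys = [2.0, 3.0]
perfect = cost([1.0, 1.0], xs, ys)  # -> 0.0 (every prediction matches y)
```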

Gradient descent algorithm:

- Start with some value $\theta$ (for example, the zero vector)
- Keep changing $\theta$ to reduce $J(\theta)$

$\theta$ update rule:

$$ \theta_j := \theta_j - \eta \frac{\partial J(\theta)}{\partial \theta_j} $$

where $j = 0, 1, \ldots, n$ indexes the features and $\eta$ is the learning rate. The update is performed simultaneously for all $j$.

$$ \begin{align} \frac{\partial J(\theta)}{\partial \theta_j} & = \frac{\partial}{\partial \theta_j}\frac{1}{2} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2 \\ & = \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_j} (h(x^{(i)}) - y^{(i)}) \\ & = \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) x_j^{(i)} \end{align} $$

$\theta$ update rule:

$$ \theta_j := \theta_j - \eta \sum_{i=1}^m (h(x^{(i)}) - y^{(i)}) x_j^{(i)} $$

where we sum over $i=1 \ldots m$ training examples.

If we choose $\eta$ too large, gradient descent will overshoot the minimum; if we choose it too small, we will need too many iterations to converge.

Batch gradient descent updates $\theta$ only after it has computed the gradient over all of the training examples. If the number of training examples is very large, batch gradient descent becomes very slow.
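A minimal sketch of batch gradient descent in plain Python, using the update rule above (the toy data set and the function names are illustrative assumptions; $\eta$ and the iteration count were chosen so the example converges):

```python
def h(theta, x):
    # Hypothesis: sum of theta_j * x_j, with x[0] == 1 by convention.
    return sum(t * xj for t, xj in zip(theta, x))

def batch_gradient_descent(xs, ys, eta=0.05, iterations=2000):
    theta = [0.0] * len(xs[0])  # start from the zero vector
    for _ in range(iterations):
        # Gradient of J: sum of (h(x) - y) * x_j over ALL m examples,
        # computed with theta held fixed, then applied in one update.
        grad = [sum((h(theta, x) - y) * x[j] for x, y in zip(xs, ys))
                for j in range(len(theta))]
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta

# Toy data generated by y = 2 * x1 (with x0 = 1 as the intercept feature),
# so theta should approach [0, 2].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 4.0, 6.0]
theta = batch_gradient_descent(xs, ys)
```

Note that the full gradient is computed with $\theta$ held fixed and all components are then updated together, matching the simultaneous update of the rule above.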

Stochastic gradient descent updates $\theta$ after each training example:

- Repeat until convergence:
  - For $i = 1$ to $m$:
    - $\theta_j := \theta_j - \eta \, (h(x^{(i)}) - y^{(i)}) \, x_j^{(i)}$ (for every $j$)
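The stochastic variant differs from batch gradient descent only in where the update happens: inside the loop over examples, using a single example's error. A sketch in plain Python, reusing the same illustrative toy data as above:

```python
def h(theta, x):
    # Hypothesis: sum of theta_j * x_j, with x[0] == 1 by convention.
    return sum(t * xj for t, xj in zip(theta, x))

def stochastic_gradient_descent(xs, ys, eta=0.05, epochs=500):
    theta = [0.0] * len(xs[0])  # start from the zero vector
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Update theta immediately, using only this one example.
            error = h(theta, x) - y
            theta = [t - eta * error * xj for t, xj in zip(theta, x)]
    return theta

# Same toy data as before (y = 2 * x1), so theta should approach [0, 2].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 4.0, 6.0]
theta = stochastic_gradient_descent(xs, ys)
```

Because each update looks at one example, the cost of a single step no longer grows with $m$, which is why this variant is preferred for very large training sets.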