Image fundamentals
Light and the electromagnetic spectrum
Visible light occupies the band of the electromagnetic spectrum from approximately 0.43 \(\mu\)m (violet) to 0.79 \(\mu\)m (red). Wavelength \(\lambda\) and frequency \(\nu\) are related by the formula
$$\lambda = \frac{c}{\nu}$$
where \(c\) is the speed of light (2.998 \(\times\) \(10^8\) m/s). The energy of one photon depends on the frequency \(\nu\) according to the formula:
$$E = h\nu$$
where \(h\) is Planck's constant (6.62607004 \(\times\) \(10^{-34}\) m\(^2\) kg/s).
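As a quick numerical check of the two formulas above, here is a short Python sketch; the wavelength (green light at 0.55 \(\mu\)m) is just an illustrative choice.

```python
# Frequency and photon energy for a given wavelength (here: green light, 0.55 um).
C = 2.998e8          # speed of light, m/s
H = 6.62607004e-34   # Planck's constant, J*s

wavelength = 0.55e-6         # m
frequency = C / wavelength   # nu = c / lambda, in Hz
energy = H * frequency       # E = h * nu, in joules

print(f"frequency: {frequency:.3e} Hz")  # ~5.451e+14 Hz
print(f"energy:    {energy:.3e} J")      # ~3.612e-19 J
```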
The quality of chromatic light is determined by the following factors:
- Wavelength in the electromagnetic spectrum;
- Radiance (measured in watts), which measures the amount of energy that the light source emits;
- Luminance (measured in lumens), which measures the amount of energy that an observer perceives from a light source;
- Brightness, a subjective descriptor of light perception.
Image formation model
We denote an image by a two-dimensional function \(f(x,y)\) which is positive and finite, i.e. \(0 \lt f(x,y) \lt \infty\). The function \(f(x,y)\) is the product of the amount of illumination incident on the scene (the illumination \(i(x,y)\)) and the amount of illumination reflected by the objects in the scene (the reflectance \(r(x,y)\)):
$$f(x,y) = i(x,y)r(x,y)$$
where \(0 \lt i(x,y) \lt \infty\) and \(0 \lt r(x,y) \lt 1\); the limiting values \(r(x,y) = 0\) and \(r(x,y) = 1\) correspond to total absorption and total reflectance, respectively. In the case of images formed by transmission of the illumination through a medium (e.g., X-ray imaging), the reflectance function is replaced by a transmissivity function. The unit of the illumination function is \(lm/m^2\), while the reflectance function is unitless (it is a ratio). Illumination levels vary from about 0.1 \(lm/m^2\) under a full moon to about 90,000 \(lm/m^2\) under direct sunlight on the surface of the Earth. Different materials have different reflectance ratios, ranging from 0.01 for black velvet to 0.93 for snow.
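A minimal sketch of the illumination-reflectance product with synthetic NumPy arrays; the uniform illumination level is an arbitrary choice, and the reflectance range follows the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illumination (lm/m^2) and reflectance (unitless, strictly between 0 and 1).
i = np.full((4, 4), 500.0)                 # a uniform illumination field
r = rng.uniform(0.01, 0.93, size=(4, 4))   # reflectances between black velvet and snow

f = i * r                                  # f(x, y) = i(x, y) * r(x, y)
print(f.min(), f.max())                    # intensities stay positive and finite
```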
Let the intensity level of a monochrome image be denoted as
$$\mathscr{l} = f(x_0,y_0)$$
We observe that \(\mathscr{l}\) lies in the range \([\mathscr{l} _ {min}, \mathscr{l} _ {max}]\) which is called the gray scale.
Sampling and quantization
The output of most sensors is a continuous voltage waveform that we need to convert into a digital form, which involves two processes: sampling and quantization.
Sampling is the process of partitioning the \(xy\)-plane of the image coordinates into a grid. Quantization is the process of approximating the real-valued amplitudes of the image with integer values (called intensity levels), similar to transforming a continuous function into an equally spaced step function representing shades of gray.
The spatial domain is the section of the real plane spanned by the coordinates of the image, with \(x\) and \(y\) called spatial variables. An image can be represented as an \(m \times n\) matrix \(A\) in which each element \(a_{ij} = f(x=i,y=j) = f(i,j)\) holds the intensity value of the corresponding pixel. The number of intensity levels is an integer power of 2:
$$L = 2^k$$
where \(k\) is the number of bits used to represent each intensity value.
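A small sketch of quantization to \(L = 2^k\) levels; the function name and the 3-bit example are my own choices.

```python
import numpy as np

def quantize(image, k):
    """Quantize a float image with values in [0, 1] to L = 2**k intensity levels."""
    L = 2 ** k
    levels = np.clip(np.round(image * (L - 1)), 0, L - 1)
    return levels.astype(np.uint16)

# A smooth horizontal gradient quantized to 3 bits (8 gray levels).
gradient = np.linspace(0.0, 1.0, 256).reshape(1, -1)
print(np.unique(quantize(gradient, k=3)))   # [0 1 2 3 4 5 6 7]
```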
The range of values spanned by the gray scale is called the dynamic range, which is the ratio of the maximum measurable intensity to the minimum detectable intensity. The upper limit is determined by saturation and the lower limit by noise. Image contrast is the difference in intensity between the highest and lowest intensity levels in the image.
The number of bits required to store an image on disk is calculated by the formula
$$b = mnk$$
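For example, the formula can be evaluated directly; the 1024 \(\times\) 1024, 8-bit image below is an arbitrary example.

```python
def image_storage_bits(m, n, k):
    """b = m * n * k: bits needed to store an m x n image with 2**k intensity levels."""
    return m * n * k

bits = image_storage_bits(1024, 1024, 8)
print(bits, "bits =", bits // 8, "bytes =", bits // (8 * 1024 * 1024), "MiB")
```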
We use 8 bits to store a gray-scale image because of the memory architecture of modern computers: RAM is addressed in units of 8 bits (bytes).
A true-color image uses 24 bits per pixel: 8 bits for the red channel, 8 bits for the green channel, and 8 bits for the blue channel. 16-bit color images were used on older computers: 5 bits for red, 6 bits for green (the human eye is more sensitive to differences in green), and 5 bits for blue. We should not confuse a 16-bit color image with an Adobe Photoshop 16-bit image: in Photoshop, a 16-bit image encodes 16 bits for each color channel, for a total of \(2^{48}\) colors. In indexed-color images we store only the index of the color in a palette, not the color itself.
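As an illustration of the 5-6-5 layout, here is a sketch of packing 8-bit channels into a 16-bit color value; the helper names and sample values are my own.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B values into a 16-bit 5-6-5 color."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(c):
    """Recover approximate 8-bit channels from a 16-bit 5-6-5 color."""
    return (((c >> 11) & 0x1F) << 3, ((c >> 5) & 0x3F) << 2, (c & 0x1F) << 3)

c = pack_rgb565(200, 120, 40)
print(hex(c), unpack_rgb565(c))   # the low-order bits of each channel are discarded
```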
Spatial resolution is measured in line pairs per unit distance or dots per unit distance. Intensity resolution refers to the smallest discernible change in intensity level and is measured in bits. False contouring appears when too few intensity levels are used in an image.
Image interpolation is the process of estimating values at unknown locations of an image. It is used in image resampling operations such as resizing. Nearest-neighbor interpolation assigns to each new pixel the intensity of the closest pixel in the original image. Bilinear interpolation uses the four nearest neighbors to estimate the intensity of a new pixel by the formula:
$$v(x,y) = ax + by + cxy + d$$
where \(a\), \(b\), \(c\), \(d\) are the four coefficients determined from the four nearest neighbors of a given point.
Bicubic interpolation uses 16 nearest neighbors of a point to estimate the intensity of a new point:
$$v(x,y) = \sum_{i=0}^3 \sum_{j=0}^3 a_{ij}x^iy^j$$
where \(a_{ij}\) are the 16 coefficients determined from the sixteen nearest neighbors of a given point.
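A NumPy sketch of the bilinear (four-neighbor) case applied to image resizing; the function name and array sizes are illustrative, and this is only one of several equivalent ways to arrange the weighting.

```python
import numpy as np

def bilinear_resize(img, new_h, new_w):
    """Resize a 2-D grayscale image using bilinear (four-neighbor) interpolation."""
    h, w = img.shape
    # Location in the input image corresponding to each output pixel.
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    dy, dx = (ys - y0)[:, None], (xs - x0)[None, :]

    # Weighted combination of the four nearest neighbors.
    top = img[np.ix_(y0, x0)] * (1 - dx) + img[np.ix_(y0, x1)] * dx
    bottom = img[np.ix_(y1, x0)] * (1 - dx) + img[np.ix_(y1, x1)] * dx
    return top * (1 - dy) + bottom * dy

img = np.arange(16, dtype=float).reshape(4, 4)
out = bilinear_resize(img, 8, 8)
print(out.shape, out[0, 0], out[-1, -1])   # (8, 8) 0.0 15.0 -- corners are preserved
```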
The Nyquist criterion (the sampling theorem) states that a continuous function can be reconstructed from its samples provided that the sampling frequency is at least twice the highest frequency present in the function. Sampling an image amounts to sampling a continuous function of two variables.
Aliasing refers to the artifacts, such as jagged edges, that appear in an undersampled image.
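A one-dimensional illustration of the Nyquist criterion and aliasing; the frequencies are arbitrary choices.

```python
import numpy as np

# A 50 Hz sinusoid sampled at 60 Hz (below the Nyquist rate of 100 Hz)
# is indistinguishable, at the sample points, from a 10 Hz sinusoid.
fs = 60.0
t = np.arange(0, 1, 1 / fs)
undersampled = np.sin(2 * np.pi * 50 * t)
alias = np.sin(2 * np.pi * (-10) * t)      # 50 Hz folds to |50 - 60| = 10 Hz

print(np.allclose(undersampled, alias))    # True
```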
Basic relationships between pixels
A pixel \(p\) has four horizontal and vertical neighbors, denoted by \(N_4(p)\). The four diagonal neighbors of \(p\) are denoted by \(N_D(p)\). Together these points are called 8-neighbors of \(p\), denoted by \(N_8(p)\).
Let \(V\) be the set of intensity values that we use to define adjacency. There are three types of adjacency (see the sketch after this list):
- 4-adjacency: two pixels \(p\) and \(q\) with values from \(V\) are 4-adjacent if \(q\) is in the set \(N_4(p)\).
- 8-adjacency: two pixels \(p\) and \(q\) with values from \(V\) are 8-adjacent if \(q\) is in the set \(N_8(p)\).
- mixed-adjacency: two pixels \(p\) and \(q\) with values from \(V\) are \(m\)-adjacent if:
  - \(q\) is in \(N_4(p)\), or
  - \(q\) is in \(N_D(p)\) and the set \(N_4(p) \cap N_4(q)\) has no pixels whose values are from \(V\).
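A small sketch of these neighborhood and adjacency definitions; the helper names and the test image are my own.

```python
import numpy as np

def n4(p):
    """4-neighbors of pixel p = (row, col)."""
    r, c = p
    return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}

def nd(p):
    """Diagonal neighbors of pixel p."""
    r, c = p
    return {(r - 1, c - 1), (r - 1, c + 1), (r + 1, c - 1), (r + 1, c + 1)}

def n8(p):
    """8-neighbors of pixel p."""
    return n4(p) | nd(p)

def m_adjacent(img, p, q, V):
    """Mixed adjacency of pixels p and q with respect to the intensity set V."""
    if img[p] not in V or img[q] not in V:
        return False
    if q in n4(p):
        return True
    if q in nd(p):
        # No shared 4-neighbor (inside the image) may have a value in V.
        shared = {s for s in n4(p) & n4(q)
                  if 0 <= s[0] < img.shape[0] and 0 <= s[1] < img.shape[1]}
        return all(img[s] not in V for s in shared)
    return False

img = np.array([[0, 1, 1],
                [0, 1, 0],
                [0, 0, 1]])
print(m_adjacent(img, (0, 2), (1, 1), V={1}))  # False: (0, 1) is a shared 4-neighbor in V
print(m_adjacent(img, (1, 1), (2, 2), V={1}))  # True: no shared 4-neighbor has a value in V
```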
A path between pixel \(p(x,y)\) and pixel \(q(s,t)\) is a sequence of distinct pixels in which consecutive pixels are adjacent; if all of its pixels lie in a set \(S\), the pixels it links belong to the same connected component of \(S\). There are 4-, 8-, and \(m\)-paths, depending on the adjacency used. If the initial pixel equals the final pixel of the path, the path is called a closed path.
\(R\) is a region of the image if \(R\) is a connected set. Two regions are adjacent if their union forms a connected set; otherwise they are disjoint. The boundary (also called the inner border) of a region \(R\) is the set of its points that are adjacent to points in the complement of \(R\) (the background). The boundary depends on how we specify the connectivity used to define adjacency. The outer border is the corresponding border in the background.
An edge is a set of points formed from pixels whose derivative values exceed a preset threshold.
Distance measures
The Euclidean distance between the pixels \(p(x,y)\) and \(q(s,t)\) is defined as
$$D_e(p,q) = \sqrt{(x-s)^2 + (y-t)^2}$$
The city-block distance between the pixels \(p(x,y)\) and \(q(s,t)\) is defined as
$$D_4(p,q) = |x-s| + |y-t|$$
The chessboard distance between the pixels \(p(x,y)\) and \(q(s,t)\) is defined as
$$D_8(p,q) = \max(|x-s|, |y-t|)$$
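The three distance measures translate directly to code; a short sketch with an arbitrary pair of pixels.

```python
import numpy as np

def d_euclidean(p, q):
    return np.hypot(p[0] - q[0], p[1] - q[1])

def d_city_block(p, q):    # D4
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def d_chessboard(p, q):    # D8
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

p, q = (2, 3), (5, 7)
print(d_euclidean(p, q), d_city_block(p, q), d_chessboard(p, q))   # 5.0 7 4
```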
Linear operations
A general operator \(H\) is a linear operator if
$$\begin{align} H[a_i f_i (x,y) + a_j f_j (x,y)] &= a_i H[f_i (x,y)] + a_j H[f_j (x,y)]\\[2ex] &= a_i g_i (x,y) + a_j g_j (x,y) \end{align}$$
where \(a_i\) and \(a_j\) are arbitrary constants and \(f_i (x,y)\) and \(f_j (x,y)\) are arbitrary images.
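A quick numerical check of this definition: the sum of all pixel intensities is a linear operator, while the maximum intensity is not. The arrays and constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f1, f2 = rng.random((2, 4, 4))    # two arbitrary "images"
a1, a2 = 3.0, -2.0                # arbitrary constants

# The sum of all pixel intensities is a linear operator ...
H = lambda f: f.sum()
print(np.isclose(H(a1 * f1 + a2 * f2), a1 * H(f1) + a2 * H(f2)))   # True

# ... while the maximum intensity is not (the equality fails in general).
G = lambda f: f.max()
print(np.isclose(G(a1 * f1 + a2 * f2), a1 * G(f1) + a2 * G(f2)))   # False
```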
In digital image processing, although we represent images as matrices, arithmetic operations on images such as addition, multiplication, and exponentiation are performed element-wise rather than as standard matrix operations.
Averaging of noisy images for noise reduction
Let \(g(x,y)\) denote a corrupted image formed by addition of noise \(\eta(x,y)\) to a noiseless image \(f(x,y)\)
$$g(x,y) = f(x,y) + \eta(x,y)$$
where we assume that the noise is uncorrelated at every pair of coordinates \((x,y)\) and has zero average value. If we form the average \(\bar g(x,y)\) of \(K\) such noisy images \(g_i(x,y)\), then
$$\bar g(x,y) = \frac{1}{K} \sum_{i=1}^{K} g_i(x,y), \qquad E\{\bar g(x,y)\} = f(x,y)$$
and
$$\sigma_{\bar g(x,y)}^2 = \frac{1}{K} \sigma_{\eta(x,y)}^2$$
As \(K\) increases, the variance of the pixel values at each point decreases; therefore \(\bar g(x,y)\) approaches \(f(x,y)\) as the number of noisy images increases. Note that the images \(g_i(x,y)\) must be registered (aligned) in order to avoid blurring and other artifacts.
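A short simulation of this effect; the image size, intensity, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.full((64, 64), 100.0)     # noiseless image
sigma = 20.0                     # standard deviation of the additive noise

for K in (1, 10, 100):
    noisy = f + rng.normal(0.0, sigma, size=(K, 64, 64))   # K registered noisy images
    g_bar = noisy.mean(axis=0)                             # averaged image
    print(K, round(g_bar.std(), 2))   # spread around f shrinks roughly as sigma / sqrt(K)
```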
Note on image difference arithmetic
Sometimes the values in the difference of two images span the range from -255 to 255, while the values in the sum of two images can fall between 0 and 510. Many software packages clip negative values to 0 and values exceeding 255 to 255. If we want to preserve the full range of an arithmetic operation, we do the following. First we create an image whose minimum value is 0:
$$f_m = f - \min(f)$$
After that we create a scaled image whose values are in the range \([0,K]\):
$$f_s = K[f_m/\max(f_m)]$$
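A minimal sketch of this two-step rescaling (assuming the image is not constant, so that \(\max(f_m) > 0\)):

```python
import numpy as np

def scale_full_range(f, K=255):
    """Shift and scale an image so that its values span [0, K]."""
    fm = f - f.min()              # minimum becomes 0
    return K * (fm / fm.max())    # maximum becomes K (assumes f is not constant)

diff = np.array([[-120.0,   0.0],
                 [  35.0, 255.0]])   # e.g. the difference of two 8-bit images
print(scale_full_range(diff))
```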
Set operations
Let \(A\) be a set whose elements are triplets of the form \((x,y,z)\) where \(x\) and \(y\) are spatial coordinates and \(z\) denotes intensity values. The union of two gray-scale sets \(A\) and \(B\) is defined as the following:
$$A \cup B = \{\max_z(a,b) | a \in A, b \in B\}$$
The intersection of two gray-scale sets \(A\) and \(B\) is defined as the following:
$$A \cap B = \{\min_z(a,b) | a \in A, b \in B\}$$
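For images stored as arrays, these definitions reduce to element-wise maximum and minimum of the intensities at each \((x,y)\); a short sketch with arbitrary sample arrays:

```python
import numpy as np

A = np.array([[10, 200], [50,  90]])
B = np.array([[30, 120], [50, 255]])

union = np.maximum(A, B)          # element-wise max of intensities
intersection = np.minimum(A, B)   # element-wise min of intensities
print(union)
print(intersection)
```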
Geometric spatial transformations
The transformation of coordinates is expressed as
$$(x,y) = T\{(v,w)\}$$
where \((v,w)\) are pixel coordinates in the original image and \((x,y)\) are the corresponding pixel coordinates in the transformed image. One of the most commonly used spatial coordinate transformations is the affine transform:
$$\begin{bmatrix} x & y & 1 \end{bmatrix} = \begin{bmatrix} v & w & 1 \end{bmatrix} T$$
Identity transformation matrix:
$$T = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$$
Scaling transformation matrix:
$$T = \begin{bmatrix} c_x & 0 & 0\\ 0 & c_y & 0\\ 0 & 0 & 1 \end{bmatrix}$$
Rotation transformation matrix:
$$T = \begin{bmatrix} \cos\theta & \sin\theta & 0\\ -\sin\theta & \cos\theta & 0\\ 0 & 0 & 1 \end{bmatrix}$$
Translation transformation matrix:
$$T = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ t_x & t_y & 1 \end{bmatrix}$$
Vertical shear transformation matrix:
$$T = \begin{bmatrix} 1 & 0 & 0\\ s_v & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$$
Horizontal shear transformation matrix:
$$T = \begin{bmatrix} 1 & s_h & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$$
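A sketch of building and composing these matrices under the row-vector convention \([x \; y \; 1] = [v \; w \; 1]\,T\); the helper names and the sample point are my own.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, 0], [0, 1, 0], [tx, ty, 1]], dtype=float)

def scaling(cx, cy):
    return np.array([[cx, 0, 0], [0, cy, 0], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]], dtype=float)

# Row-vector convention [x y 1] = [v w 1] @ T, so transforms compose left to right:
# scale first, then rotate, then translate.
T = scaling(2, 2) @ rotation(np.pi / 2) @ translation(5, 0)
v = np.array([1.0, 0.0, 1.0])
print(v @ T)   # [5. 2. 1.]: the point (1, 0) scaled, rotated, and translated
```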
We perform spatial transformations in two ways. The first is called forward mapping: we scan the pixels of the input image and compute the location of the corresponding pixel in the output image according to the formula above. The problem is that two or more pixels of the input image may be mapped to the same location in the output image, while other output locations may receive no pixel at all.
A better approach is called inverse mapping, where we scan the output pixel locations, and at each location \((x,y)\) we compute the corresponding location in the input image by the formula:
$$(v,w) = T^{-1}(x,y)$$
Then we interpolate among the nearest input pixels to determine the intensity of the output pixel value.
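A minimal sketch of inverse mapping for a rotation about the image center with nearest-neighbor interpolation; the function name and the centering choice are my own (the matrices above rotate about the origin).

```python
import numpy as np

def rotate_inverse_map(img, theta):
    """Rotate an image about its center using inverse mapping and
    nearest-neighbor interpolation; unmapped output pixels stay 0."""
    h, w = img.shape
    cx, cy = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros_like(img)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    for x in range(h):
        for y in range(w):
            # (v, w) = T^{-1}(x, y): rotate the output location back by -theta.
            v = cx + cos_t * (x - cx) + sin_t * (y - cy)
            u = cy - sin_t * (x - cx) + cos_t * (y - cy)
            vi, ui = int(round(v)), int(round(u))
            if 0 <= vi < h and 0 <= ui < w:
                out[x, y] = img[vi, ui]
    return out

img = np.eye(5)
print(rotate_inverse_map(img, np.pi / 2))   # the main diagonal becomes the anti-diagonal
```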