A Guide to Convolutional Neural Networks for Computer Vision. Salman Khan
Чтение книги онлайн.

Читать онлайн книгу A Guide to Convolutional Neural Networks for Computer Vision - Salman Khan страница 9

СКАЧАТЬ Cell Orientation Histogram

      The second step is the calculation of the cell histograms. First, the image is divided into small (usually 8 × 8 pixels) cells. Each cell has a fixed number of gradient orientation bins, which are evenly spread over 0–180° or 0–360°, depending on whether the gradient is unsigned or signed. Each pixel within the cell casts a weighted vote for a gradient orientation bin based on the gradient magnitude at that pixel. For the vote weight, the pixel contribution can be the gradient magnitude, or the square root of the gradient magnitude or the square of the gradient magnitude.

       Block Descriptor

      To deal with changes in illumination and contrast, the gradient strengths are locally normalized by grouping the cells together into larger, spatially connected blocks. The HOG descriptor is then the vector of the components of the normalized cell histograms from all of the block regions.

       Block Normalization

      The final step is the normalization of the block descriptors. Let v be the non-normalized vector containing all histograms in a given block, ||v||k be its k-norm for k = 1, 2, and ϵ be a small constant. Then the normalization factor can be one of the following:

Image

      or

Image

      or

Image

      There is another normalization factor, L2-Hys, which is obtained by clipping the L2-norm of v (i.e., limiting the maximum values of v to 0.2) and then re-normalizing.

      The final image/RoI descriptor is formed by concatenating all normalized block descriptors. The experimental results in Triggs and Dalal [2005] show that all four block normalization methods achieve a very significant improvement over the non-normalized one. Moreover, the L2-norm, L2-Hys, and L1-sqrt normalization approaches provide a similar performance, while the L1-norm provides a slightly less reliable performance.

Image

      Figure 2.3: HOG descriptor. Note that for better visualization, we only show the cell orientation histogram for four cells and a block descriptor corresponding to those four cells.

      SIFT [Lowe, 2004] provides a set of features of an object that are are robust against object scaling and rotations. The SIFT algorithm consists of four main steps, which are discussed in the following subsections.

       Scale-space Extrema Detection

      The first step aims to identify potential keypoints that are invariant to scale and orientation. While several techniques can be used to detect keypoint locations in the scale-space, SIFT uses the Difference of Gaussians (DoG), which is obtained as the difference of Gaussian blurring of an image with two different scales, σ, one with scale k times the scale of the other, i.e., k × σ. This process is performed for different octaves of the image in the Gaussian Pyramid, as shown in Fig. 2.4a.

      Then, the DoG images are searched for local extrema over all scales and image locations. For instance, a pixel in an image is compared with its eight neighbors in the current image as well as nine neighbors in the scale above and below, as shown in Fig. 2.4b. If it is the minimum or maximum of all these neighbors, then it is a potential keypoint. It means that a keypoint is best represented in that scale.

      Figure 2.4: Scale-space feature detection using a sub-octave DoG pyramid. (a) Adjacent levels of a sub-octave Gaussian pyramid are subtracted to produce the DoG; and (b) extrema in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors. (Figure from Lowe [2004], used with permission.)

       Accurate Keypoint Localization

      This step removes unstable points from the list of potential keypoints by finding those that have low contrast or are poorly localized on an edge. In order to reject low contrast keypoints, a Taylor series expansion of the scale space is computed to get more accurate locations of extrema, and if the intensity at each extrema is less than a threshold value, the keypoint is rejected.

      Moreover, the DoG function has a strong response along the edges, which results in a large principal curvature across the edge but a small curvature in the perpendicular direction in the DoG function. In order to remove the keypoints located on an edge, the principal curvature at the keypoint is computed from a 2 × 2 Hessian matrix at the location and scale of the keypoint. If the ratio between the first and the second eigenvalues is greater than a threshold, the keypoint is rejected.

      Remark: In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function. Specifically, suppose f(x1, x2, …, xn) is a function outputting a scalar, i.e., Image; if all the second partial derivatives of f exist and are continuous over the domain of the function, then the Hessian H of f is a square n × n matrix, defined as follows:

Image

       Orientation Assignment

      In order to achieve invariance to image rotation, a consistent orientation is assigned to each keypoint based on its local image properties. The keypoint descriptor can then be represented relative to this orientation. The algorithm used to find an orientation consists of the following steps.

      1. The scale of the keypoint is used to select the Gaussian blurred image with the closest scale.

      2. The gradient magnitude and orientation are computed for each image pixel at this scale.

      3. As shown in Fig. 2.5, an orientation histogram, which consists of 36 bins covering the 360° range of orientations, is built from the gradient orientations of pixels within a local region around the keypoint.

      4. The highest peak in the local orientation histogram corresponds to the dominant direction of the local gradients. Moreover, any other local peak that is within 80% of the highest peak is also considered as a keypoint with that orientation.

       Keypoint Descriptor

      The dominant direction (the highest peak in the histogram) of the local gradients is also used to create keypoint descriptors. The gradient orientations are rotated relative to the orientation of the keypoint and then weighted by a Gaussian with a variance of 1.5 × keypointscale. Then, a 16 × 16 neighborhood around the keypoint is divided into 16 sub-blocks of size 4 × 4. For each sub-block, an 8 bin orientation histogram is created. This results in a feature vector, called SIFT descriptor, containing 128 elements. СКАЧАТЬ