A Guide to Convolutional Neural Networks for Computer Vision. Salman Khan

      Figure 2.6 illustrates the SIFT descriptors for keypoints extracted from an example image.

       Complexity of SIFT Descriptor

      In summary, SIFT tries to standardize all images (if the image is blown up, SIFT shrinks it; if the image is shrunk, SIFT enlarges it). This corresponds to the idea that if a keypoint can be detected in an image at scale σ, then a larger scale kσ would be needed to capture the same keypoint if the image were up-scaled. However, the mathematical ideas behind SIFT and many other hand-engineered features are quite complex and took many years of research to develop. For example, Lowe [2004] spent almost 10 years on the design and tuning of the SIFT parameters. As we will show in Chapters 4, 5, and 6, CNNs also perform a series of transformations on the image by incorporating several convolutional layers. However, unlike SIFT, CNNs learn these transformations (e.g., scale, rotation, translation) from image data, without the need for complex mathematical ideas.

      Figure 2.5: A dominant orientation estimate is computed by creating a histogram of all the gradient orientations weighted by their magnitudes and then finding the significant peaks in this distribution.

      Figure 2.6: An example of the SIFT detector and descriptor: (left) an input image, (middle) some of the detected keypoints with their corresponding scales and orientations, and (right) SIFT descriptors, where a 16 × 16 neighborhood around each keypoint is divided into 16 sub-blocks of 4 × 4 size.

      SURF [Bay et al., 2008] is a sped-up version of SIFT. In SIFT, the Laplacian of Gaussian (LoG) is approximated with the Difference of Gaussians (DoG) to construct a scale-space. SURF speeds up this process by approximating the LoG with a box filter. A convolution with a box filter can be computed easily with the help of integral images, and it can be performed in parallel for different scales.
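      The reason box filters are cheap can be sketched in a few lines. The example below is a minimal illustration using NumPy, not the SURF implementation itself; the function names `integral_image` and `box_sum` are hypothetical. Once the integral image is built, the sum over any rectangle costs four lookups, regardless of the rectangle size.

```python
import numpy as np

def integral_image(img):
    """Summed-area table, with a zero row and column prepended for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Small example image (values are illustrative only).
img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
total = box_sum(ii, top=1, left=1, height=2, width=2)  # same as img[1:3, 1:3].sum()
```

      Because the cost of `box_sum` does not depend on the box size, larger filters are no more expensive than smaller ones, which is what makes varying the filter size (rather than resizing the image) practical.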

       Keypoint Localization

      In the first step, a blob detector based on the Hessian matrix is used to localize keypoints. The determinant of the Hessian matrix is used to select both the location and scale of the potential keypoints. More precisely, for an image I with a given point p = (x, y), the Hessian matrix H(p, σ) at point p and scale σ, is defined as follows:

      \[
      H(p, \sigma) =
      \begin{pmatrix}
      L_{xx}(p, \sigma) & L_{xy}(p, \sigma) \\
      L_{xy}(p, \sigma) & L_{yy}(p, \sigma)
      \end{pmatrix}
      \]

      where Lxx (p, σ) is the convolution of the second-order derivative of the Gaussian, ∂²g(σ)/∂x², with the image I at point p, and Lxy (p, σ) and Lyy (p, σ) are defined analogously. However, instead of using Gaussian filters, SURF uses approximated Gaussian second-order derivatives, which can be evaluated using integral images at a very low computational cost. Thus, unlike SIFT, SURF does not need to iteratively apply the same filter to the output of a previously filtered layer, and scale-space analysis is done by keeping the same image and varying the filter size, i.e., 9 × 9, 15 × 15, 21 × 21, and 27 × 27.
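      To make the determinant-of-Hessian response concrete, the sketch below computes det H = Lxx · Lyy − Lxy² with sampled Gaussian derivative kernels. It is the plain Gaussian version of the definition, not SURF's box-filter approximation, and the helper names (`gaussian_kernel`, `filter_axis`, `hessian_determinant`) and the synthetic blob image are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, order):
    """Sampled 1D Gaussian (order 0) or its first/second derivative."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    if order == 1:
        return -x / sigma**2 * g
    if order == 2:
        return (x**2 / sigma**4 - 1 / sigma**2) * g
    return g

def filter_axis(img, kernel, axis):
    """Convolve the image with a 1D kernel along the given axis."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img
    )

def hessian_determinant(img, sigma):
    """det H(p, sigma) = Lxx * Lyy - Lxy**2 at every pixel p."""
    g0, g1, g2 = (gaussian_kernel(sigma, k) for k in range(3))
    lxx = filter_axis(filter_axis(img, g2, axis=1), g0, axis=0)
    lyy = filter_axis(filter_axis(img, g0, axis=1), g2, axis=0)
    lxy = filter_axis(filter_axis(img, g1, axis=1), g1, axis=0)
    return lxx * lyy - lxy**2

# Synthetic image: one bright Gaussian blob centered at row 20, column 20.
rows, cols = np.mgrid[0:41, 0:41]
blob = np.exp(-((rows - 20) ** 2 + (cols - 20) ** 2) / (2 * 3.0**2))
response = hessian_determinant(blob, sigma=3.0)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
```

      For this synthetic blob the response is largest at the blob center, which is the behavior a blob detector based on the Hessian determinant relies on.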

      Then, non-maximum suppression in a 3 × 3 × 3 neighborhood (two image dimensions and scale) is applied at each point to localize the keypoints in the image. The maxima of the determinant of the Hessian matrix are then interpolated in the scale and image space, using the method proposed by Brown and Lowe [2002].
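      A minimal sketch of the 3 × 3 × 3 non-maximum suppression step follows, assuming the responses are stored as a NumPy array of shape (scales, rows, columns). The function name `local_maxima_3d` and the `threshold` parameter are hypothetical, and the sub-pixel interpolation step is not included.

```python
import numpy as np

def local_maxima_3d(stack, threshold=0.0):
    """Return (scale, row, col) indices whose response exceeds all 26 neighbors."""
    s, h, w = stack.shape
    center = stack[1:-1, 1:-1, 1:-1]
    is_max = center > threshold
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue  # skip the point itself
                neighbor = stack[1 + ds:s - 1 + ds, 1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
                is_max &= center > neighbor
    # Shift indices back by one to account for the trimmed border.
    return [tuple(int(v) for v in idx) for idx in np.argwhere(is_max) + 1]

# Toy response stack: a strong peak and a weaker adjacent value.
stack = np.zeros((3, 5, 5))
stack[1, 2, 2] = 5.0
stack[1, 3, 3] = 1.0  # suppressed: its neighbor at (1, 2, 2) is larger
keypoints = local_maxima_3d(stack)
```

      Border points (first and last scale, and the outermost rows and columns) are never reported here because they lack a full 26-point neighborhood.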

       Orientation Assignment

      In order to achieve rotational invariance, the Haar wavelet responses in both the horizontal x and vertical y directions are computed within a circular neighborhood of radius 6s around the keypoint, where s is the scale at which the keypoint is detected. These horizontal dx and vertical dy responses are then weighted with a Gaussian centered at the keypoint and represented as points in a 2D space. The dominant orientation of the keypoint is estimated with a sliding orientation window of angle 60°: the horizontal and vertical responses within the window are summed, and the two summed responses are treated as a local orientation vector. The longest such vector over all the windows determines the orientation of the keypoint. To balance robustness against angular resolution, the size of the sliding window needs to be chosen carefully.
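      The sliding-window search can be sketched as follows. This is an illustrative version only: it assumes the responses dx and dy have already been Gaussian-weighted, and the function name `dominant_orientation` and the 5° step between window positions are assumptions, not values taken from the text.

```python
import numpy as np

def dominant_orientation(dx, dy, window=np.pi / 3, step=np.pi / 36):
    """Angle (radians) of the longest summed response vector over all windows."""
    angles = np.arctan2(dy, dx)
    best_length, best_angle = -1.0, 0.0
    for start in np.arange(-np.pi, np.pi, step):
        # Angular offset of each response from the window start, wrapped to [0, 2*pi).
        offset = (angles - start) % (2 * np.pi)
        inside = offset < window
        sum_x, sum_y = dx[inside].sum(), dy[inside].sum()
        length = np.hypot(sum_x, sum_y)
        if length > best_length:
            best_length, best_angle = length, np.arctan2(sum_y, sum_x)
    return float(best_angle)

# Three responses clustered around 45 degrees, plus one weak outlier at 180 degrees.
dx = np.array([1.0, 1.0, 1.0, -0.1])
dy = np.array([1.0, 0.9, 1.1, 0.0])
theta = dominant_orientation(dx, dy)
```

      A wider window averages over more responses and is more robust to noise, while a narrower one resolves the angle more finely; this is the trade-off the window size controls.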

       Keypoint Descriptor

      To describe the region around each keypoint p, a 20s × 20s square region around p is extracted and then oriented along the orientation of p. The normalized orientation region around p is split into 4 × 4 smaller square sub-regions. The Haar wavelet responses in both the horizontal dx and vertical dy directions are extracted at 5 × 5 regularly spaced sample points for each sub-region. In order to achieve more robustness to deformations, noise, and translation, the Haar wavelet responses are weighted with a Gaussian. Then, dx and dy are summed up over each sub-region, and the results form the first set of entries in the feature vector. The sums of the absolute values of the responses, |dx| and |dy|, are also computed and added to the feature vector to encode information about the intensity changes. Since each sub-region has a 4D feature vector, concatenating all 4 × 4 sub-regions results in a 64D descriptor.
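      The arithmetic behind the 64D descriptor (4 × 4 sub-regions, 5 × 5 samples each, 4 values per sub-region) can be sketched as below. The sketch assumes the 20 × 20 grids of Haar responses have already been sampled, rotated, and Gaussian-weighted; the function name `surf_like_descriptor` and the ordering of the four values within each sub-region are assumptions for illustration.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Build a 64D vector from 20x20 grids of Haar responses dx and dy."""
    assert dx.shape == (20, 20) and dy.shape == (20, 20)
    features = []
    for i in range(4):          # sub-region row
        for j in range(4):      # sub-region column
            block_x = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            block_y = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            # Four values per sub-region: sum dx, sum dy, sum |dx|, sum |dy|.
            features.extend([
                block_x.sum(),
                block_y.sum(),
                np.abs(block_x).sum(),
                np.abs(block_y).sum(),
            ])
    return np.array(features)

# Constant responses, chosen only to make the sums easy to check by hand.
dx = np.ones((20, 20))
dy = -np.ones((20, 20))
descriptor = surf_like_descriptor(dx, dy)
```

      With these constant inputs each sub-region contributes (25, −25, 25, 25): the signed sum of dy is negative while the sum of |dy| is positive, which shows why the absolute-value terms carry information that the signed sums alone would lose.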

      Until recently, progress in computer vision was based on hand-engineered features. However, feature engineering is difficult, time-consuming, and requires expert knowledge of the problem domain. The other issue with hand-engineered features such as HOG, SIFT, SURF, or other algorithms like them, is that they are too sparse in terms of the information that they are able to capture from an image. This is because first-order image derivatives are not sufficient features for most computer vision tasks, such as image classification and object detection. Moreover, the choice of features often depends on the application, and these features do not facilitate building on previously learned representations (transfer learning). In addition, the design of hand-engineered features is limited by the complexity that humans can put into it. All these issues are resolved using automatic feature learning algorithms such as deep neural networks, which will be addressed in the subsequent chapters (i.e., Chapters 3, 4, 5, and 6).

      Machine learning is usually divided into three main areas, namely supervised, unsupervised, and semi-supervised. In the case of the supervised learning approach, the goal is to learn a mapping from inputs to outputs, given a labeled set of input-output pairs. The second type of machine learning is the unsupervised learning approach, where we are only given inputs, and the goal is to automatically find interesting patterns in the data. This problem is not well defined, because we are not told what kind of patterns to look for. Moreover, unlike supervised learning, where we can compare our label prediction for a given sample to the observed value, there is no obvious error metric to use. The third type of machine learning