Название: A Guide to Convolutional Neural Networks for Computer Vision
Автор: Salman Khan
Издательство: Ingram
Жанр: Программы
Серия: Synthesis Lectures on Computer Vision
isbn: 9781681732824
isbn:
The combination of n features can be represented as a n-dimensional vector, called a feature vector. The quality of a feature vector is dependent on its ability to discriminate image samples from different classes. Image samples from the same class should have similar feature values and images from different classes should have different feature values. For the example shown in Fig. 2.1, all cars shown in Fig. 2.2 should have similar feature vectors, irrespective of their models, sizes, positions in the images, etc. Thus, a good feature should be informative, invariant to noise and a set of transformations (e.g., rotation and translation), and fast to compute. For instance, features such as the number of wheels in the images, the number of doors in the images could help to classify the images into two different categories, namely “car” and “non-car.” However, extracting such features is a challenging problem in computer vision and machine learning.
Figure 2.1: (a) The aim is to design an algorithm which classifies input images into two different categories: “Car” or “non-Car.” (b) Humans can easily see the car and categorize this image as “Car.” However, computers see pixel intensity values as shown in (c) for a small patch in the image. Computer vision methods process all pixel intensity values and classify the image. (d) The straightforward way is to feed the intensity values to the classifiers and the learned classifier will then perform the classification job. For better visualization, let us pick only two pixels, as shown in (e). Because pixel 1 is relatively bright and pixel 2 is relatively dark, that image has a position shown in blue plus sign in the plot shown in (f). By adding few positive and negative samples, the plot in (g) shows that the positive and negative samples are extremely jumbled together. So if this data is fed to a linear classifier, the subdivision of the feature space into two classes is not possible. (h) It turns out that a proper feature representation can overcome this problem. For example, using more informative features such as the number of wheels in the images, the number of doors in the images, the data looks like (i) and the images become much easier to classify.
Figure 2.2: Images of different classes of cars captured from different scenes and viewpoints.
2.1.2 CLASSIFIERS
Classification is at the heart of modern computer vision and pattern recognition. The task of the classifier is to use the feature vector to assign an image or region of interest (RoI) to a category. The degree of difficulty of the classification task depends on the variability in the feature values of images from the same category, relative to the difference between feature values of images from different categories. However, a perfect classification performance is often impossible. This is mainly due to the presence of noise (in the form of shadows, occlusions, perspective distortions, etc.), outliers (e.g., images from the category “buildings” might contain people, animal, building, or car category), ambiguity (e.g., the same rectangular shape could correspond to a table or a building window), the lack of labels, the availability of only small training samples, and the imbalance of positive/negative coverage in the training data samples. Thus, designing a classifier to make the best decision is a challenging task.
2.2 TRADITIONAL FEATURE DESCRIPTORS
Traditional (hand-engineered) feature extraction methods can be divided into two broad categories: global and local. The global feature extraction methods define a set of global features which effectively describe the entire image. Thus, the shape details are ignored. The global features are also not suitable for the recognition of partially occluded objects. On the other hand, the local feature extraction methods extract a local region around keypoints and, thus, can handle occlusion better [Bayramoglu and Alatan, 2010, Rahmani et al., 2014]. On that basis, the focus of this chapter is on local features/descriptors.
Various methods have been developed for detecting keypoints and constructing descriptors around them. For instance, local descriptors, such as HOG [Triggs and Dalal, 2005], SIFT [Lowe, 2004], SURF [Bay et al., 2008], FREAK [Alahi et al., 2012], ORB [Rublee et al., 2011], BRISK [Leutenegger et al., 2011], BRIEF [Calonder et al., 2010], and LIOP [Wang et al., 2011b] have been used in most computer vision applications. The considerable recent progress that has been achieved in the area of recognition is largely due to these features, e.g., optical flow estimation methods use orientation histograms to deal with large motions; image retrieval and structure from motion are based on SIFT descriptors. It is important to note that CNNs, which will be discussed in Chapter 4, are not that much different than the traditional hand-engineered features. The first layer in the CNNs learn to utilize gradients in a way that is similar to hand-engineered features such as HOG, SIFT and SURF. In order to have a better understanding of CNNs, we describe next, three important and widely used feature detectors and/or descriptors, namely HOG [Triggs and Dalal, 2005], SIFT [Lowe, 2004], and SURF [Bay et al., 2008] in some details. As you will see in Chapter 4, CNNs are also able to extract similar hand-engineered features (e.g., gradients) in their lower layers but through an automatic feature learning process.
2.2.1 HISTOGRAM OF ORIENTED GRADIENTS (HOG)
HOG [Triggs and Dalal, 2005] is a feature descriptor that is used to automatically detect objects from images. The HOG descriptor encodes the distribution of directions of gradients in localized portions of an image.
HOG features have been introduced by Triggs and Dalal [2005] who have studied the influence of several variants of HOG descriptors (R-HOG and C-HOG), with different gradient computation and normalization methods. The idea behind the HOG descriptors is that the object appearance and the shape within an image can be described by the histogram of edge directions. The implementation of these descriptors consists of the following four steps.
Gradient Computation
The first step is the computation of the gradient values. A 1D centered point discrete derivative mask is applied on an image in both the horizontal and vertical directions. Specifically, this method requires the filtering of the gray-scale image with the following filter kernels:
Thus, given an image I, the following convolution operations (denoted by *) result in the derivatives of the image I in the x and y directions:
Thus, the orientation θ and the magnitude |g| of the gradient are calculated as follows:
As you will see in Chapter 4, just like the HOG descriptor, CNNs also use convolution operations in their layers. However, the main difference is that instead of using hand-engineered filters, e.g., fx, fy in Eq. (2.1), CNNs use trainable filters which make them highly adaptive. That is why they can achieve high accuracy levels in most applications such as image recognition.