A Guide to Convolutional Neural Networks for Computer Vision. Salman Khan
Чтение книги онлайн.

Читать онлайн книгу A Guide to Convolutional Neural Networks for Computer Vision - Salman Khan страница 5

СКАЧАТЬ Thus, acquiring computer vision and machine learning (e.g., deep learning) knowledge is an important skill that is required in many modern innovative businesses and is likely to become even more important in the near future.

      Humans use their eyes and their brains to see and understand the 3D world around them. For example, given an image as shown in Fig. 1.1a, humans can easily see a “cat” in the image and thus, categorize the image (classification task); localize the cat in the image (classification plus localization task as shown in Fig. 1.1b); localize and label all objects that are present in the image (object detection task as shown in Fig. 1.1c); and segment the individual objects that are present in the image (instance segmentation task as shown in Fig. 1.1d). Computer vision is the science that aims to give a similar, if not better, capability to computers. More precisely, computer vision seeks to develop methods which are able to replicate one of the most amazing capabilities of the human visual system, i.e., inferring characteristics of the 3D real world purely using the light reflected to the eyes from various objects.

      However, recovering and understanding the 3D structure of the world from two-dimensional images captured by cameras is a challenging task. Researchers in computer vision have been developing mathematical techniques to recover the three-dimensional shape and appearance of objects/scene from images. For example, given a large enough set of images of an object captured from a variety of views (Fig. 1.2), computer vision algorithms can reconstruct an accurate dense 3D surface model of the object using dense correspondences across multiple views. However, despite all of these advances, understanding images at the same level as humans still remains challenging.

      Due to the significant progress in the field of computer vision and visual sensor technology, computer vision techniques are being used today in a wide variety of real-world applications, such as intelligent human-computer interaction, robotics, and multimedia. It is also expected that the next generation of computers could even understand human actions and languages at the same level as humans, carry out some missions on behalf of humans, and respond to human commands in a smart way.

      Figure 1.1: What do we want computers to do with the image data? To look at the image and perform classification, classification plus localization (i.e., to find a bounding box around the main object (CAT) in the image and label it), to localize all objects that are present in the image (CAT, DOG, DUCK) and to label them, or perform semantic instance segmentation, i.e., the segmentation of the individual objects within a scene, even if they are of the same type.

      Figure 1.2: Given a set of images of an object (e.g., upper human body) captured from six different viewpoints, a dense 3D model of the object can be reconstructed using computer vision algorithms.

       Human-computer Interaction

      Nowadays, video cameras are widely used for human-computer interaction and in the entertainment industry. For instance, hand gestures are used in sign language to communicate, transfer messages in noisy environments, and interact with computer games. Video cameras provide a natural and intuitive way of human communication with a device. Therefore, one of the most important aspects for these cameras is the recognition of gestures and short actions from videos.

       Robotics

      Integrating computer vision technologies with high-performance sensors and cleverly designed hardware has given rise to a new generation of robots which can work alongside humans and perform many different tasks in unpredictable environments. For example, an advanced humanoid robot can jump, talk, run, or walk up stairs in a very similar way a human does. It can also recognize and interact with people. In general, an advanced humanoid robot can perform various activities that are mere reflexes for humans and do not require a high intellectual effort.

       Multimedia

      Computer vision technology plays a key role in multimedia applications. These have led to a massive research effort in the development of computer vision algorithms for processing, analyzing, and interpreting multimedia data. For example, given a video, one can ask “What does this video mean?”, which involves a quite challenging task of image/video understanding and summarization. As another example, given a clip of video, computers could search the Internet and get millions of similar videos. More interestingly, when one gets tired of watching a long movie, computers would automatically summarize the movie for them.

      Image processing can be considered as a preprocessing step for computer vision. More precisely, the goal of image processing is to extract fundamental image primitives, including edges and corners, filtering, morphology operations, etc. These image primitives are usually represented as images. For example, in order to perform semantic image segmentation (Fig. 1.1), which is a computer vision task, one might need to apply some filtering on the image (an image processing task) during that process.

      Unlike image processing, which is mainly focused on processing raw images without giving any knowledge feedback on them, computer vision produces semantic descriptions of images. Based on the abstraction level of the output information, computer vision tasks can be divided into three different categories, namely low-level, mid-level, and high-level vision.

       Low-level Vision

      Based on the extracted image primitives, low-level vision tasks could be preformed on images/videos. Image matching is an example of low-level vision tasks. It is defined as the automatic identification of corresponding image points on a given pair of the same scene from different view points, or a moving scene captured by a fixed camera. Identifying image correspondences is an important problem in computer vision for geometry and motion recovery.

      Another fundamental low-level vision task is optical flow computation and motion analysis. Optical flow is the pattern of the apparent motion of objects, surfaces, and edges in a visual scene caused by the movement of an object or camera. Optical flow is a 2D vector field where each vector corresponds to a displacement vector showing the movement of points from one frame to the next. Most existing methods which estimate camera motion or object motion use optical flow information.

       Mid-level Vision

      Mid-level vision provides a higher level of abstraction than low-level vision. For instance, inferring the geometry of objects is one of the major aspects of mid-level vision. Geometric vision includes multi-view geometry, stereo, and structure from motion (SfM), which infer the 3D scene information from 2D images such that 3D reconstruction could be made possible. Another task of mid-level vision is visual motion capturing and tracking, which estimate 2D and 3D motions, including deformable and articulated motions. In order to answer the question “How does the object move?,” image segmentation is required to find areas in the images which belong to the object.

       High-level СКАЧАТЬ