Title: Data Science For Dummies
Author: Lillian Pierson
Publisher: John Wiley & Sons Limited
Genre: Databases
ISBN: 9781119811619
Incorporating MapReduce, the HDFS, and YARN
MapReduce is a parallel distributed processing framework that can process tremendous volumes of data in batch, meaning data is collected and then processed as one unit, with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. (With respect to MapReduce, the term tuples refers to the key-value pairs by which data is grouped, sorted, and processed.) In layperson's terms, MapReduce uses parallel distributed computing to transform big data into data of a manageable size.
In Hadoop, parallel distributed processing refers to a powerful framework in which data is processed quickly via the distribution and parallel processing of tasks across clusters of commodity servers.
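To make the map, shuffle, and reduce phases concrete, here is a toy, single-machine Python sketch of the classic word-count job. It is an illustration only: the function names are invented for this example, and in real Hadoop each phase runs in parallel across the cluster's nodes rather than sequentially on one machine.

    from collections import defaultdict

    def map_phase(lines):
        # Map: emit a (key, value) tuple for every word in the input.
        return [(word, 1) for line in lines for word in line.split()]

    def shuffle_phase(tuples):
        # Shuffle/sort: group all values that share the same key.
        groups = defaultdict(list)
        for key, value in tuples:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: collapse each key's values into a single, smaller result.
        return {key: sum(values) for key, values in groups.items()}

    lines = ["big data is big", "big data is processed in parallel"]
    print(reduce_phase(shuffle_phase(map_phase(lines))))
    # {'big': 3, 'data': 2, 'is': 2, 'processed': 1, 'in': 1, 'parallel': 1}

Note how the reduce phase returns a much smaller set of tuples than the map phase emitted; that shrinking step is the sense in which MapReduce transforms big data into data of a manageable size.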
Storing data on the Hadoop distributed file system (HDFS)
The HDFS uses clusters of commodity hardware for storing data. The hardware in each cluster is connected and is composed of commodity servers: low-cost, low-performance generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.
The HDFS is characterized by these three key features:
HDFS blocks: In data storage, a block is a storage unit with a fixed maximum size. HDFS blocks can store up to 64 megabytes of data by default.
Redundancy: Datasets stored in HDFS are broken up and stored on blocks. These blocks are then replicated (three times, by default) and stored on several different servers in the cluster as redundancy, or backup (see the storage sketch after this list).
Fault-tolerance: As mentioned earlier, a system is described as fault-tolerant if it’s built to continue successful operations despite the failure of one or more of its subcomponents. Because the HDFS has built-in redundancy across multiple servers in a cluster, if one server fails, the system simply retrieves the data from another server.
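To see what the block-size and replication defaults above imply, here is a back-of-the-envelope Python sketch for a hypothetical 1 GB file. The figures are assumptions chosen for illustration; note that newer Hadoop releases default to 128 MB blocks rather than 64 MB.

    import math

    # Hypothetical 1 GB file, with the 64 MB default block size cited
    # above and the default replication factor of 3.
    file_size_mb = 1024
    block_size_mb = 64
    replication = 3

    num_blocks = math.ceil(file_size_mb / block_size_mb)  # 16 blocks
    # HDFS does not pad partial blocks, so raw usage is roughly
    # the file size times the number of replicas.
    raw_storage_mb = file_size_mb * replication           # 3072 MB

    print(f"{num_blocks} blocks, {num_blocks * replication} block replicas, "
          f"~{raw_storage_mb} MB of raw cluster storage")

The threefold storage overhead is the price of fault-tolerance: any single block can be lost and recovered from another node.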
Putting it all together on the Hadoop platform
The Hadoop platform was designed for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN (a resource manager), all working together.
Within a Hadoop platform, the workloads of applications that run on the HDFS (like MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close as possible to the data: the task processors are positioned as closely as possible to the data that needs to be processed. This design, known as data locality, minimizes network transfer by moving the computation to the data rather than moving the data to the computation.
Introducing massively parallel processing (MPP) platforms
Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional on-premises data warehouse, an MPP platform may be the perfect solution.
To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP's processing capabilities come at a restrictive cost. In exchange, MPP platforms are quicker and easier to use than standard MapReduce jobs, because MPP platforms can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are written in the more complicated Java programming language.
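To illustrate that usability gap, the following sketch submits an aggregation (the same kind of grouping and counting that the earlier word-count MapReduce job performs) as a single SQL statement from Python over ODBC. The data source, table, and column names are hypothetical placeholders, not a specific vendor's API.

    import pyodbc

    # Connect to a hypothetical MPP warehouse through an assumed
    # ODBC data source name.
    conn = pyodbc.connect("DSN=my_mpp_warehouse")
    cursor = conn.cursor()

    # One declarative statement; the warehouse parallelizes it
    # across its nodes automatically.
    cursor.execute("""
        SELECT word, COUNT(*) AS n
        FROM word_events
        GROUP BY word
    """)
    for word, n in cursor.fetchall():
        print(word, n)
    conn.close()

The analyst writes what result is wanted, not how to partition, shuffle, and combine it; the MPP engine handles the parallelism behind the scenes.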
Processing big data in real time
A real-time processing framework is, as its name implies, a framework that processes data in real time (or near-real time) as the data streams and flows into the system. Real-time frameworks process data in micro-batches, returning results in a matter of seconds rather than the hours or days that batch-processing frameworks like MapReduce typically take. Real-time processing frameworks do one of the following:
Increase the overall time efficiency of the system: Solutions in this category include Apache Storm and Apache Spark for near-real-time stream processing.
Deploy innovative querying methods to facilitate the real-time querying of big data: Some solutions in this category are Google’s Dremel, Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.
In-memory refers to processing data within the computer's memory, without writing intermediate computational results out to disk along the way. In-memory computing delivers results a lot faster, but it limits how much data can be processed per processing interval.
Apache Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming streaming data in near-real-time. Its power lies in its processing speed: The ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter.
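As a minimal, hedged sketch (not an example from this book or the Spark documentation), the PySpark Structured Streaming job below counts words arriving on a local network socket in near-real-time micro-batches. The host and port are assumptions for illustration, such as a terminal running "nc -lk 9999"; any streaming source works the same way.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Read an unbounded stream of text lines from a local socket
    # (hypothetical source chosen for illustration).
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split each line into words and keep a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Re-emit the full, updated counts to the console after each micro-batch.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()

Because the running counts stay in memory between micro-batches, updated results appear within seconds of new data arriving, which is the in-memory speed advantage described above.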
Real-time, stream-processing frameworks are quite useful in a multitude of industries, from stock and financial market analysis to e-commerce optimization, and from real-time fraud detection to optimized order logistics. Regardless of the industry in which you work, if your business is affected by real-time data streams generated by humans, machines, or sensors, a real-time processing framework can help you optimize operations and generate value for your organization.
Part 2
Using Data Science to Extract Meaning from Your Data
IN THIS PART …
Master the basics behind machine learning approaches.
Explore the importance of math and statistics for data science.
Work with clustering and instance-based learning algorithms.
Chapter 3
Machine Learning Means … Using a Machine to Learn from Data
IN THIS CHAPTER
Grasping the machine learning process
Exploring machine learning styles and algorithms