Data Lakes For Dummies. Alan R. Simon

Title: Data Lakes For Dummies

Author: Alan R. Simon

Publisher: John Wiley & Sons Limited

Genre: Databases

ISBN: 9781119786184

TABLE 1-1 Data Lake Zones

Recommended Zone Name    Other Names
Bronze zone              Raw zone, landing zone
Silver zone              Cleansed zone, refined zone
Gold zone                Performance zone, curated zone, data model zone
Sandbox                  Experimental zone, short-term analytics zone

      

The boundaries and borders between your data lake zones can be fluid (Fluid? Get it?), especially with streaming data, as I explain in Part 2.

      The bronze zone

      You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E part of ELT), and then the data is transmitted into the bronze zone in raw form (thus, one of the alternative names for this zone). You don’t correct any errors or otherwise transform or modify the data at all. The original operational data should look identical to the copy of that data now in the bronze zone.

      

Your catchphrase for loading data into the bronze zone is “the need for speed.” You may be trickling one piece of data at a time or bulk-loading hundreds of gigabytes or even terabytes of data. Your objective is to transmit the data into the data lake environment as quickly as possible. You’ll worry about checking out and refining that data later.
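
      To make the "raw form" idea concrete, here is a minimal Python sketch of landing an extract in the bronze zone. The file name orders.json, the bronze directory, and the helper names are all assumptions made up for illustration; a real data lake would use object storage and an ingestion service rather than a local folder. The point is only that the landed copy is byte-for-byte identical to the source, bad data included.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def land_in_bronze(source_file: Path, bronze_dir: Path) -> Path:
    """Copy a source extract into the bronze zone unchanged (no cleansing)."""
    bronze_dir.mkdir(parents=True, exist_ok=True)
    target = bronze_dir / source_file.name
    shutil.copyfile(source_file, target)
    return target


def demo() -> bool:
    """Land one hypothetical extract and check that the copy matches the source."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "orders.json"
        # The negative quantity is an error, and it is deliberately left alone.
        source.write_text('[{"order_id": 1, "qty": -3}]', encoding="utf-8")
        landed = land_in_bronze(source, root / "bronze")
        return sha256_of(source) == sha256_of(landed)


if __name__ == "__main__":
    print(demo())
```

      Comparing checksums is one simple way to confirm that nothing was altered on the way in. Any error checking waits until the data moves toward the silver zone.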

      The silver zone

      The silver zone consists of data that has been error-checked and cleansed but still remains in its original format. For example, data may be copied from a source application in JavaScript Object Notation (JSON) format and land in the bronze zone in raw form, looking exactly as it did in the source system itself, errors and all.

      You’ll patch up any known errors, handle missing data, and otherwise cleanse the data. Then you’ll store the cleansed data in the silver zone, still in JSON format.
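
      The cleansing step might look something like the following Python sketch. The sample records, the STATE_FIXES lookup, and the age_known flag are hypothetical choices made for illustration, not rules from any particular tool. The sketch patches known errors, handles missing data, and writes the result back out as JSON, so the format doesn't change between bronze and silver.

```python
import json

# Hypothetical raw records as they might sit in the bronze zone.
BRONZE_JSON = """[
  {"customer_id": 1, "state": "pa", "age": 34},
  {"customer_id": 2, "state": "Penn.", "age": null},
  {"customer_id": 3, "state": "AZ", "age": -1}
]"""

# Assumed lookup of known bad values and their corrections.
STATE_FIXES = {"pa": "PA", "Penn.": "PA"}


def cleanse(record: dict) -> dict:
    """Return a cleansed copy; the bronze record itself is left untouched."""
    clean = dict(record)
    clean["state"] = STATE_FIXES.get(clean["state"], clean["state"])
    # Treat missing and impossible ages the same way: mark them as unknown.
    age = clean.get("age")
    if age is None or age < 0:
        clean["age"] = None
        clean["age_known"] = False
    else:
        clean["age_known"] = True
    return clean


def bronze_to_silver(bronze_json: str) -> str:
    """Cleanse every record and return the result, still as JSON text."""
    records = json.loads(bronze_json)
    return json.dumps([cleanse(r) for r in records], indent=2)


if __name__ == "__main__":
    print(bronze_to_silver(BRONZE_JSON))
```

      Flagging an unknown age rather than guessing a value is just one possible policy. Your own cleansing rules depend on what the analytics need.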

      

Not all data from your bronze zone will be cleansed and copied into your silver zone. The data lake model calls for loading massive amounts of data into the bronze zone without having to do upfront analysis to determine which data is definitely or likely needed for analysis. When you decide what data you need, you do the necessary data cleansing and move only the cleansed data into the silver zone.

      The gold zone

      The gold zone is the final home for your most valuable analytical data. You’ll curate data coming from the silver zone, meaning that you’ll group and restructure data into “packages” dedicated to your organization’s high-value analytical needs.
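
      As a rough illustration of curating data into "packages," the Python sketch below restructures order-level records from the silver zone into two summary packages. The sample orders, the package names, and the summarize_by helper are assumptions invented for this example; real gold zone packages are typically tables or data models built with your platform's own tools.

```python
from collections import defaultdict

# Hypothetical cleansed order records from the silver zone.
SILVER_ORDERS = [
    {"order_id": 1, "customer_id": 10, "region": "East", "amount": 120.0},
    {"order_id": 2, "customer_id": 11, "region": "West", "amount": 80.0},
    {"order_id": 3, "customer_id": 10, "region": "East", "amount": 50.0},
]


def summarize_by(orders: list, key: str) -> list:
    """Group orders by one field and return one summary row per group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for order in orders:
        totals[order[key]] += order["amount"]
        counts[order[key]] += 1
    return [
        {key: group, "order_count": counts[group], "total_amount": totals[group]}
        for group in sorted(totals)
    ]


def build_gold_packages(orders: list) -> dict:
    """Build two packages, each shaped for a different analytical need."""
    return {
        "sales_by_region": summarize_by(orders, "region"),
        "sales_by_customer": summarize_by(orders, "customer_id"),
    }


if __name__ == "__main__":
    for name, rows in build_gold_packages(SILVER_ORDERS).items():
        print(name, rows)
```

      Both packages are built from the same order amounts, so the same underlying data ends up in more than one place. That's expected in the gold zone, as long as you control where each package comes from.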

      LINKING THE DATA LAKE ZONES TOGETHER

      The following figure shows the progressive pipelines of data among the various zones, including the sandbox. Notice how not every piece or group of data is cleansed and then sent from the bronze zone to the silver zone. You’ll spend time refurbishing, refining, and transmitting to the silver zone only the data that you definitely or likely need for analytics.

[Figure: Schematic illustration of the progressive pipelines of data among the various zones, including the sandbox.]

      Likewise, select data sets are sent from the silver zone to the gold zone. Remember that another name for the gold zone is the curated zone, meaning that you’ve especially selected certain data to be consolidated and then placed in “packages” within the gold zone.

      You might transmit raw, uncleansed data from the bronze zone into the sandbox along with data from the silver zone, depending on the specifics of your experimental or short-term analytical needs.

You will almost certainly replicate data across the various gold zone packages, but that’s not a problem at all. As long as you carefully control the data flows and the replicated data, you’re unlikely to run into problems with uncontrolled data proliferation.

      The sandbox

      But what about shorter-term analytical needs or experiments that you want to run with your data? You may be building new machine learning models to predict customer behavior, optimize your supply chain, or determine new treatment plans for a hospital system’s patients. You need to experiment with different machine learning techniques, and you need actual data for your work.

      Head over to the sandbox and start playing. You’ll load whatever data you need for your short-term or experimental work and do your thing. The data lake isolates the sandbox from the data pipeline, so you can do whatever you need without interfering with your organization’s primary analytical work.
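
      Here's a toy Python sketch of what that isolation buys you. The sample customer records and the function names are hypothetical, and a real data lake enforces isolation at the storage and compute level rather than with in-memory copies. The idea is the same, though: the experiment works on its own copy, so nothing it does leaks back into the pipeline's data.

```python
import copy

# Hypothetical cleansed customer records that the main pipeline relies on.
SILVER_CUSTOMERS = [
    {"customer_id": 1, "monthly_spend": 40.0},
    {"customer_id": 2, "monthly_spend": 95.0},
]


def load_into_sandbox(datasets: dict) -> dict:
    """Return independent deep copies so experiments can't alter the originals."""
    return copy.deepcopy(datasets)


def run_experiment(sandbox: dict) -> dict:
    """Add an experimental feature column, inside the sandbox only."""
    for row in sandbox["customers"]:
        row["high_value"] = row["monthly_spend"] > 50.0
    return sandbox


if __name__ == "__main__":
    sandbox = run_experiment(load_into_sandbox({"customers": SILVER_CUSTOMERS}))
    print(sandbox["customers"])
    print(SILVER_CUSTOMERS)  # still has no "high_value" field
```

      When the experiment is over, you can throw the sandbox copy away. If the new feature turns out to be valuable, you'd add it to the pipeline deliberately rather than letting it leak in from the sandbox.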

      Turn the clock back to the early 2010s, when big data burst onto the scene. Almost every organization was exploring how this new generation of data management technology could overcome many of the barriers and constraints of relational databases, particularly for analytical storage.

      Big data promised — and delivered — significantly greater capacity than was possible with relational databases. With big data, you can store unstructured and semi-structured data alongside your structured data. You can also bring new data into a big data environment with lower latency than with relational databases.

      Wait a minute! That sounds just like the description of a data lake! So, is a data lake just another name for big data?

      Well, sort of … possibly … or maybe not …

      The best way to think of the two disciplines in relation to one another is as follows:

       Big data is the underlying core technology used to build a data lake.

       A data lake is an environment that includes big data but also potentially other data management technologies along with services for data transmission and data governance.

      THE THREE (OR FOUR OR FIVE OR MORE) VS OF BIG DATA AND DATA LAKES
