Название: Official Google Cloud Certified Professional Data Engineer Study Guide
Автор: Dan Sullivan
Издательство: John Wiley & Sons Limited
Жанр: Зарубежная компьютерная литература
isbn: 9781119618454
isbn:
Semi-Structured Data Semi-structured data has attributes like structured data, but the set of attributes can vary from one instance to another. For example, a product description of an appliance might include length, width, height, weight, and power consumption. A chair in the same catalog might have length, width, height, color, and style as attributes. Semi-structured data may be organized using arrays or sets of key-value pairs.
Unstructured Data Unstructured data does not fit into a tabular structure. Images and audio files are good examples of unstructured data. In between these two extremes lies semi-structured data, which has characteristics of both structured and unstructured.
Structured: Transactional vs. Analytical
Structured data can be represented in tables of columns and rows, where columns are attributes and rows are records or entities. Table 1.1 showed an example of structured data. Structured data may be oriented to transactional processes or analytical use cases.
Transactional structured data is often operated on one row at a time. For example, a business application may look up a customer’s account information from a customer table when displaying data about the customer’s shipping address. Multiple columns from a single row will be used, so it is efficient to store all row attributes together in a data block. Retrieving a single data block will retrieve all the needed data. This is a common pattern in transactional databases such as Cloud SQL and Cloud Spanner, which use row-oriented storage.
Now consider a data warehousing example. A business analyst is working with a sales data mart and wants to understand how sales last month compare to the same period last year. The data mart has one row for each product on each date, which include the following attributes in addition to product and date: number of units sold, total revenue for units sold, average unit price, average marginal revenue, and total marginal revenue. The analyst is only interested in the monthly sums of total revenue for units sold for each product. In this case, the analyst would query many rows and only three columns. Instead of retrieving the full row for all rows selected, it is more efficient to retrieve only the date, product, and total revenue of units sold columns. This is a common pattern in analytical applications and the reason why BigQuery uses a column-oriented storage mechanism.
Semi-Structured: Fully Indexed vs. Row Key Access
Semi-structured data, as noted earlier, does not follow a fixed tabular format and instead stores schema attributes along with the data. In the case of document databases, this allows developers to add attributes as needed without making changes to a fixed database schema. Two ways of storing semi-structured data are as documents or as wide columns. An important distinction between the two is how data is retrieved from them.
Fully Indexed, Semi-Structured Data
Let’s consider the simple product catalog example again. There are many ways that shoppers might want to search for information about products. If they are looking for a dishwasher, for example, they might want to search based on size or power consumption. When searching for furniture, style and color are important considerations.
{ {’id’: ’123456’, ’product_type’: ’dishwasher’, ’length’: ’24 in’, ’width’: ’34 in’, ’weight’: ’175 lbs’, ’power’: ’1800 watts’ } {’id’:’987654’, ’product_type’: ’chair’, ’weight’: ’15 kg’, ’style’: ’modern’, ’color’: ’brown’ } }
To search efficiently by attributes, document databases allow for indexes. If you use Cloud Datastore, for example, you could create indexes on each of the attributes as well as a combination of attributes. Indexes should be designed to support the way that data is queried. If you expect users to search for chairs by specifying style and color together, then you should create a style and color index. If you expect customers to search for appliances by their power consumption, then you should create an index on power.
Creating a large number of indexes can significantly increase the amount of storage used. In fact, it is not surprising to have total index storage greater than the amount of storage used to store documents. Also, additional indexes can negatively impact performance for insert, update, and delete operations, because the indexes need to be revised to reflect those operations.
Row Key Access
Wide-column databases usually take a different approach to querying. Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational databases, to retrieve data. This has two implications.
Tables in wide-column databases are designed to respond to particular queries. Although relational databases are designed according to forms of normalization that minimize the risk of data anomalies, wide-column databases are designed for low-latency reads and writes at high volumes. This can lead to duplication of data. Consider IoT sensor data stored in a wide-column database. Table 1.2 shows IoT data organized by sensor ID and timestamp (milliseconds since January 1, 1970 00:00:00 UTC). Future rows would feature the same sensor ID but different corresponding timestamps, and the row key would be determined by both.
Table 1.2 IoT data by sensor ID and timestamp
Sensor ID | Timestamp | Temperature | Relative humidity | Pressure |
789 | 1571760690 | 40 | 35 | 28.2 |
790 | 1571760698 | 42.5 | 50 | 29.1 |
791 | 1571760676 | 37 | 61 | 28.6 |
Table 1.2 is organized to answer queries that require looking up data by sensor ID and then time. It is not well suited for looking up data by time—for example, all readings over the past hour. Rather than create an index on timestamp, wide-column databases duplicate data in a different row key order. Table 1.3, for example, is designed to answer time range queries. Note that a new table must be created with the desired schema to accomplish this—there is no index that is used to support the query pattern.
Table 1.3 IoT data by timestamp and sensor ID
Timestamp | Sensor ID | Temperature | Relative humidity | Pressure |