Title: Practical Data Analysis with JMP, Third Edition
Author: Robert Carver
Publisher: Ingram
Genre: Software
ISBN: 9781642956122
As an example of a well-designed survey data table, we have a small portion of the responses provided in the 2016 administration of the National Health and Nutrition Examination Survey (NHANES). NHANES is an annual survey of people in the U.S. that looks at their diet, health, and lifestyle choices. The Centers for Disease Control (CDC) posts the raw survey data for public access.2
12. Open the NHANES 2016 data table.
As shown in Figure 2.7, this data table contains some features that we have not seen before. As usual, each column holds observations for a single variable, but these column names are not very informative. Typical of many large-scale surveys, the NHANES website provides both the data and a data dictionary that defines each variable, including information about coding and units of measurement. For our use in this book, we have added some notations to each column.
Figure 2.7: NHANES Data Table
13. In the Columns panel, move your cursor to highlight the column named RIAGENDR and right-click. Select Column Info… to open the dialog box shown in Figure 2.8.
Figure 2.8: Customizing Column Metadata
When this dialog box opens, we find that this column holds numeric data and that its modeling type is nominal, which might seem surprising since the rows in the data table all say Male or Female. In fact, the lower portion of the Column Info dialog box shows us what is going on here. This is an example of coded data: the actual values within the column are 1s and 2s, but when displayed in the data table, a 1 appears as the word Male and a 2 as the word Female. This recoding is a column property within JMP.
The asterisk next to RIAGENDR in the Columns panel indicates that this column has one or more special column properties defined.
14. At the bottom of the dialog box, clear the check box marked Use Value Labels and click Apply. Now, look in the Data Table (NHANES 2016 tab) and see that the column displays 1s and 2s rather than the value labels.
15. Click the word Notes under Column Properties (middle left of the dialog box). You will see a brief description of the contents of this column. As you work with data tables and create your own columns, you should get into the habit of annotating columns with informative notes.
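The distinction between stored codes and displayed value labels can be sketched outside JMP. The short Python example below is only an illustration under stated assumptions: the five stored codes are made up rather than taken from NHANES, and the function name display_column is hypothetical. It mimics the effect of the Use Value Labels check box by switching between the labeled view and the raw codes without changing the stored data.

```python
# Made-up stored codes for a few rows of RIAGENDR (illustrative, not NHANES data).
stored_codes = [1, 2, 2, 1, 2]

# Value labels as described in the text: 1 displays as Male, 2 as Female.
value_labels = {1: "Male", 2: "Female"}


def display_column(codes, labels, use_value_labels=True):
    """Return what a viewer would see: labels when enabled, raw codes otherwise.

    Hypothetical helper; it imitates the idea of a value-label display,
    not JMP's actual implementation.
    """
    if use_value_labels:
        # Fall back to the raw code if no label is defined for it.
        return [labels.get(code, code) for code in codes]
    return list(codes)


labeled_view = display_column(stored_codes, value_labels)
raw_view = display_column(stored_codes, value_labels, use_value_labels=False)

print(labeled_view)
print(raw_view)
```

In both views the underlying list stored_codes is untouched, which is the point of coded data: the labels affect only the display.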
In this table, we also encounter missing observations once again. Missing data is a common issue in survey data and it crops up whenever a respondent does not complete a question or an interviewer does not record a response. In a JMP data table, a black dot (·) indicates missing numeric data; missing character data is just a blank. As a general rule, we want to think carefully about missing observations and why they occur in a particular set of data.
For instance, look at row 1 of this data table. We have missing observations for several variables, including the columns representing respondent’s age in months (RIDAGEMN) and pregnancy status (RIDEXPRG). Further inspection shows that respondent #83732 was a 62-year-old male, so the pregnancy status question did not apply to him.
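As a rough illustration of thinking about missing observations, the Python sketch below counts missing values per column. It rests on several assumptions: None stands in for JMP's missing-value dot, the keys respondent_id and age_years are hypothetical names, the first row reflects only what the text says about respondent #83732, and the second row is entirely made up.

```python
# None plays the role of a missing value. Column names RIAGENDR, RIDAGEMN and
# RIDEXPRG come from the text; respondent_id and age_years are hypothetical.
rows = [
    # Respondent described in the text: 62-year-old male, two values missing.
    {"respondent_id": 83732, "RIAGENDR": 1, "age_years": 62,
     "RIDAGEMN": None, "RIDEXPRG": None},
    # Entirely made-up second row, included only so the counts differ.
    {"respondent_id": 0, "RIAGENDR": 2, "age_years": 30,
     "RIDAGEMN": None, "RIDEXPRG": 2},
]


def count_missing(table):
    """Count missing (None) entries in each column of a list of row dicts."""
    counts = {}
    for row in table:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(value is None)
    return counts


missing_counts = count_missing(rows)
for column, n_missing in missing_counts.items():
    print(column, n_missing)
```

A count like this only shows where values are missing. Deciding why they are missing, as with the pregnancy status of a male respondent, still requires reading the data dictionary.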
Creating a Data Table
In this book, we will almost always analyze the data tables that are available at https://support.sas.com/en/books/authors/robert-carver.html. Once you start to practice data analysis on your own, you will often need to create your own data tables. Refer to Appendix B (“Data Management”) for details about entering data from a keyboard, transferring data from the web, reading in an Excel worksheet, or assembling a data table from several other data tables.
Raw Case Data and Summary Data
This book is accompanied by more than 50 data tables, most of which contain “casewise” data, meaning that each row of the data table refers to one observational unit. For example, each row of the NHANES table represents a person. Each row of the Concrete table is one batch of concrete at a moment in time. Like most statistical software, JMP is intended to analyze raw data. However, sometimes data come to us in a partially processed or summarized state. It is still possible to construct a data table and do some limited analysis of this type of data.
For example, consider public opinion surveys that have been reported in the news. Yale University and George Mason University, collaborating in the Yale Program on Climate Change Communication, published “Politics & Global Warming, April 2019.”3 The research team used “a nationally representative survey (N = 1,291 registered voters)” (Leiserowitz et al., p. 4). Among other issues, they asked respondents if they think that global warming is happening. Among all voters, 70% expressed the view that global warming is real. The researchers broke down the total sample, as shown in Table 2.1 (created from the report’s Executive Summary).
Table 2.1: Sample of U.S. Voters’ Belief that Global Warming Is Happening
Voter Group | Percentage Agreeing (%) |
Liberal Democrats | 95 |
Moderate/Conservative Democrats | 87 |
Liberal/Moderate Republicans | 63 |
Conservative Republicans | 38 |
Tables like this are common in news reports. We could easily transfer this into a data table with one major caveat that often confuses introductory students. It is crucial to understand how the layout of this table relates to its content. In JMP, we usually expect each row to represent one observational unit, each column to represent one variable, and each cell to contain one data value. This table does not satisfy these assumptions.
To see why this is not a raw data table, we should go back and think about how the raw data were generated. We know that respondents to this question—the observational units—were the 1,291 registered voters in the survey sample. The interviewers asked them many questions, and Table 2.1 tallies the responses to two of the questions. First, people reported their voting habits. The second variable is the responses those people gave when asked the question, “Do you think that global warming is happening?” Respondents could agree, disagree, or offer no opinion. Because their responses were categorical rather than numeric, this is a nominal variable.
So, how does this clarify the contents of Table 2.1? The first column lists four voter groups, ordered from most liberal to most conservative; the cell entries are unambiguously categorical. The second column is a little tricky. It appears to be continuous data, looking very much like numbers. However, these numbers are not measurements of observational units (people). They summarize some of the answers provided by the respondents, indicating the fraction of each voter group that agreed with the global warming question. More precisely, the values are the relative frequencies of agreement within each “level” of the ordinal voter group variable. In short, this table represents one ordinal and one nominal variable, summarizing the responses of nearly 1,300 individuals who are “invisible” when the information is presented this way. In later chapters, we will learn how to use columns of frequencies. For now, our introduction to data types and sources is complete.
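To make the link between casewise data and a summary table concrete, here is a minimal Python sketch. The eight casewise rows are invented for illustration and do not reproduce the survey's data or the percentages in Table 2.1, and the function name percent_agreeing is hypothetical. Each tuple is one respondent (one observational unit) with an ordinal voter group and a nominal response; the function collapses those rows into one relative frequency per group.

```python
from collections import Counter

# Invented casewise data: one (voter_group, response) pair per respondent.
# These rows do not come from the survey; they only illustrate the layout.
casewise = [
    ("Liberal Democrats", "agree"),
    ("Liberal Democrats", "agree"),
    ("Liberal Democrats", "agree"),
    ("Liberal Democrats", "no opinion"),
    ("Conservative Republicans", "agree"),
    ("Conservative Republicans", "disagree"),
    ("Conservative Republicans", "disagree"),
    ("Conservative Republicans", "no opinion"),
]


def percent_agreeing(cases):
    """Summarize casewise rows as the percentage agreeing within each group."""
    totals = Counter(group for group, _ in cases)
    agrees = Counter(group for group, response in cases if response == "agree")
    return {group: 100 * agrees[group] / totals[group] for group in totals}


summary = percent_agreeing(casewise)
for group, pct in summary.items():
    print(group, pct)
```

The summary has one row per voter group rather than one row per person, so the individual respondents are no longer visible in it.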
Application
Now that you have completed all the activities in this chapter, use the concepts and techniques that you have learned to respond to these questions.
1. Use your primary textbook or the daily newspaper to locate a table that summarizes some data. Read the article accompanying the table and identify the variable, data type, and observational units.
2. Return to the Concrete data table. Browse through the column notes, and explain