End-to-end Data Analytics for Product Development. Chris Jones

Book information:

Publisher: John Wiley & Sons Limited

Genre: Mathematics

ISBN: 9781119483700

For symmetric data, mean and median tend to be close in value (Figure 1.5):

Figure 1.5 Mean and median in symmetric distributions.

In skewed data or data with extreme values, the mean and median can be quite different. For such data, the median is usually a better indicator of central tendency than the mean: the mean is pulled in the direction of the skew, while the median remains closer to the majority of the observations (Figure 1.6).

Figure 1.6 Mean and median in skewed distributions.
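The effect of an extreme value on the two measures can be checked numerically. The sketch below uses Python's standard `statistics` module; both samples are hypothetical and chosen only for illustration, they are not taken from the book.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric sample (illustrative, not from the book).
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]

# The same sample with its largest observation replaced by an extreme value.
skewed = [4, 5, 5, 6, 6, 6, 7, 7, 40]

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    # The mean uses every value, so one extreme value shifts it;
    # the median depends only on the middle of the sorted list.
    print(f"{name}: mean = {mean(values):.2f}, median = {median(values)}")
```

In the symmetric sample the mean and median coincide, while in the second sample the mean moves toward the extreme value and the median does not change.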

Stat Tool 1.7 Measures of Non‐Central Tendency: Quartiles

When numeric data do not concentrate around a single central value (e.g. fairly uniform distributions), more than one descriptive measure is needed to summarize the data distribution. Quantiles serve this purpose: they are values that mark specific positions in the sorted data.

The most common quantiles are quartiles, which are three values (first quartile Q₁, second quartile Q₂, and third quartile Q₃) corresponding to specific positions in the sorted list of data values (Figure 1.7).

75% of the data are less than Q₃ and 25% are greater than Q₃.

50% of the data are less than Q₂ and 50% are greater than Q₂.

25% of the data are less than Q₁ and 75% are greater than Q₁.

Figure 1.7 Quartiles.

The first quartile is also known as the 25th percentile, the second quartile (the median) as the 50th percentile, and the third quartile as the 75th percentile.
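As a rough illustration, the quartiles of a small sample can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sample is hypothetical, and the choice of `method="inclusive"` is an assumption: statistical packages use slightly different interpolation rules, so Q₁ and Q₃ may differ a little from one tool to another.

```python
from statistics import median, quantiles

# Hypothetical sample of nine values (illustrative, not from the book).
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 asks for the three cut points that split the sorted data into quarters.
# method="inclusive" is an assumed convention; other rules exist.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print("Q2 equals the median:", q2 == median(data))
```

For this sample and this rule, the three quartiles are 4, 7, and 9, and the second quartile coincides with the median.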

Stat Tool 1.8 Measures of Variability: Range and Interquartile Range

Variability refers to how spread out a set of data values is.

Consider the following graphs (see Figure 1.8):

The two data distributions are quite different in terms of variability: the graph on the left shows more densely packed values (less variability), while the graph on the right reveals more spread out data (higher variability).

The terms variability, spread, variation, and dispersion are synonyms, and refer to how spread out a distribution is.

Figure 1.8 Frequency distributions and variability.

How can the spread of a set of numeric values be quantified?

The range, commonly represented as R, is a simple way to describe the spread of data values. It is the difference between the maximum value and the minimum value in a data set. The range can also be represented as the interval: (minimum value; maximum value).

A large range value (or a wide interval) indicates greater dispersion in the data. A small range value (or a narrow interval) indicates that there is less dispersion in the data.

Note that the range uses only two data values, the minimum and the maximum. For this reason, it is most useful as a measure of dispersion when the data do not include outliers, since a single extreme value changes it directly.

A second measure of variation is the interquartile range, commonly represented as IQR. It is the difference between the third quartile Q₃ and the first quartile Q₁ in a data set. The IQR can also be represented as the interval (Q₁; Q₃). The middle 50% of the data lie within this interval: as the spread of these data increases, the IQR becomes larger.

Because it depends only on the quartiles, the IQR is largely unaffected by the presence of outliers.
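The different sensitivity of the two measures to outliers can be seen in a short sketch. The helper names `value_range` and `iqr` are hypothetical, the samples are made up for illustration, and the quartiles use the inclusive rule of `statistics.quantiles`, which is an assumed convention rather than the book's.

```python
from statistics import quantiles


def value_range(values):
    """Range R: maximum value minus minimum value."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range: Q3 minus Q1 (inclusive quartile rule, an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample, and the same sample with its maximum replaced by an outlier.
clean = [2, 4, 4, 5, 7, 8, 9, 11, 12]
with_outlier = [2, 4, 4, 5, 7, 8, 9, 11, 60]

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: R = {value_range(values)}, IQR = {iqr(values)}")
```

Here the outlier raises the range from 10 to 58, while the IQR stays at 5 in both samples.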

Stat Tool 1.9 Measures of Variability: Variance and Standard Deviation

For numeric data, spread can also be measured by the variance. It accounts for all the data by measuring the distance or difference between each value and the mean. These differences are called deviations. The variance is the sum of squared deviations, divided by the number of values minus one. Roughly speaking, the variance (usually denoted by S²) is the average of the squared deviations from the mean.

Example 1.2. Suppose you observed the following numeric data with their dotplot (Figure 1.9):

8.1, 8.2, 7.6, 9.0, 7.5, 6.9, 8.1, 9.0, 8.3, 8.1, 8.2, 7.6

Figure 1.9 Dotplot.

The mean is equal to 8.05. Let's calculate the deviations from the mean and their squares:
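This calculation can also be sketched in Python using the twelve values of Example 1.2. The variable names are illustrative, and the standard deviation is taken here as the square root of the variance.

```python
from math import sqrt

# Data from Example 1.2.
data = [8.1, 8.2, 7.6, 9.0, 7.5, 6.9, 8.1, 9.0, 8.3, 8.1, 8.2, 7.6]

n = len(data)
mean_value = sum(data) / n

# Deviation of each value from the mean, and its square.
deviations = [x - mean_value for x in data]
squared_deviations = [d ** 2 for d in deviations]

# Variance: sum of squared deviations divided by (n - 1).
sum_of_squares = sum(squared_deviations)
variance = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(f"mean = {mean_value:.2f}")
print("value  deviation  squared")
for x, d, sq in zip(data, deviations, squared_deviations):
    print(f"{x:5.1f}  {d:9.2f}  {sq:7.4f}")
print(f"sum of squared deviations = {sum_of_squares:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```

The deviations sum to zero (up to floating-point rounding), which is why they are squared before being averaged.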