Statistical Significance Testing for Natural Language Processing. Rotem Dror

test whether δ(X) > 0, which would indicate a higher BLEU score (i.e., better performance) for the LSTM. However, we would also like to assess whether this result is likely to happen again in a new experiment, or whether the current experiment does not reflect the actual relationship between the algorithms.
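
      To make the quantity being tested concrete, here is a minimal sketch of δ(X) as the difference between the metric values of two systems on the same test set. Everything in it is an illustrative assumption: the function names `exact_match_accuracy` and `delta`, the toy outputs, and the use of exact-match accuracy as a stand-in for the evaluation measure M (the running example uses BLEU, which is not computed here).

```python
from typing import Callable, Sequence

# Hypothetical type for an evaluation measure M(outputs, references) -> score.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def exact_match_accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Toy stand-in for M: fraction of outputs identical to their reference."""
    if len(outputs) != len(references) or not references:
        raise ValueError("outputs and references must be non-empty and aligned")
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def delta(metric: Metric,
          outputs_a: Sequence[str],
          outputs_b: Sequence[str],
          references: Sequence[str]) -> float:
    """Test statistic: M(A, X) - M(B, X) on the same test set X."""
    return metric(outputs_a, references) - metric(outputs_b, references)


# Made-up data: system A gets 3 of 4 right, system B gets 2 of 4 right.
references = ["the cat sat", "a dog barked", "birds fly", "fish swim"]
outputs_a = ["the cat sat", "a dog barked", "birds fly", "fish swam"]
outputs_b = ["the cat sat", "a dog barks", "bird fly", "fish swim"]

delta_observed = delta(exact_match_accuracy, outputs_a, outputs_b, references)
print(delta_observed)  # 0.25
```

      A positive value favors system A on this particular sample; whether that advantage would persist on new data is the question the rest of the section addresses.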

      We will refer to δ(X) as our test statistic: a quantity derived from the experiment and used for the statistical hypothesis testing. Using this notation we formulate the following statistical hypothesis testing problem (see footnote 3):

$$H_0: \delta(X) \le 0 \qquad\qquad H_1: \delta(X) > 0$$

      The null hypothesis, H0, is that δ(X) is smaller than or equal to zero, meaning that algorithm B is better than A, or that B is as good as A. In contrast, the alternative hypothesis, H1, is that there is in fact a difference in performance and that algorithm A is superior. In order to decide whether or not to reject the null hypothesis, we can ask the following question.

      Considering the test statistic that we chose and its distribution under the null hypothesis, how likely would it be to encounter the δ(X) value that we have observed in our test, given that the null hypothesis is indeed correct?

      After all, if δ(X) is a very large number, then algorithm A strongly outperformed algorithm B, and that would be unlikely under the hypothesis that algorithm B is better. To answer this question we need to compute a probability in which δ(X) is treated as a random variable, which requires some prior knowledge of its distribution under the null hypothesis; we will discuss this further later in this book. We therefore phrase our decision in terms of the probability of observing the δobserved value if the null hypothesis were in fact true. This probability is exactly the p-value of the test.

      The p-value is defined as the probability, under the null hypothesis H0, of obtaining a result equal to or even more extreme than what was actually observed. For the hypothesis testing framework defined here, the p-value is defined as:

$$p\text{-value} = \Pr\big(\delta(X) \ge \delta_{observed} \mid H_0\big)$$

      where δobserved is the performance difference between the algorithms (according to M) when they are applied to X. Going back to our example, we could describe the p-value as the probability that the LSTM shows an advantage at least this large in this setting (i.e., that we observe such a δobserved) when the phrase-based MT system is actually the better model. If δobserved is small, meaning the LSTM’s BLEU score is only slightly better than that of the phrase-based system, it may very well be a statistical “fluke”: if we were to repeat the experiment with a slightly different dataset from the same distribution, we could quite plausibly encounter the opposite result, with the phrase-based MT system performing better. However, as δobserved increases, the probability of encountering such values under the assumption that the phrase-based MT system is better becomes smaller and smaller.
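
      One way to approximate this probability without assuming a particular distribution is a paired sign-flip (permutation) test. The sketch below is an illustration under stated assumptions, not necessarily the procedure the book uses: it assumes that M is an average of per-sentence scores (which does not hold for corpus-level BLEU), it treats the boundary case of the null hypothesis (no difference between the systems) as the reference distribution, and the name `sign_flip_p_value` and all score values are hypothetical.

```python
import random
from typing import Sequence


def sign_flip_p_value(scores_a: Sequence[float],
                      scores_b: Sequence[float],
                      n_resamples: int = 10_000,
                      seed: int = 0) -> float:
    """Estimate Pr(delta >= delta_observed | H0) with a paired sign-flip test.

    If the two systems are interchangeable, each per-sentence difference is
    equally likely to carry either sign, so the signs are flipped at random.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and aligned")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    delta_observed = sum(diffs) / n
    tolerance = 1e-12  # guards against floating-point ties
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= delta_observed - tolerance:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Made-up per-sentence scores for system B and two variants of system A.
scores_b = [0.55, 0.60, 0.52, 0.61, 0.63, 0.60, 0.59, 0.62, 0.58, 0.60]
large_gap_a = [0.62, 0.71, 0.58, 0.66, 0.74, 0.69, 0.63, 0.70, 0.65, 0.68]
small_gap_a = [0.56, 0.59, 0.53, 0.60, 0.64, 0.61, 0.58, 0.63, 0.57, 0.61]

p_large_gap = sign_flip_p_value(large_gap_a, scores_b)
p_small_gap = sign_flip_p_value(small_gap_a, scores_b)
print(f"large gap: p ~ {p_large_gap:.4f}")
print(f"small gap: p ~ {p_small_gap:.4f}")
```

      With these toy numbers the consistently large advantage yields a much smaller estimated p-value than the small, inconsistent one, which mirrors the "fluke" argument above.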

      The smaller the p-value, the stronger the indication that the observed outcome is unlikely under the null hypothesis, H0. In order to decide whether H0 should be rejected, the researcher should pre-define an arbitrary, fixed threshold value α, also known as the significance level. The null hypothesis is rejected only if p-value < α.

      For example, let us say that the probability of encountering a difference of at least 10 points between BLEU(LSTM) and BLEU(phrase-based), under the assumption that the phrase-based MT system is better, is 0.05. For a significance level of 0.1 we would reject the null hypothesis, since p-value < α. For a significance level of 0.03 we would not reject the null hypothesis. A lower α is a stronger demand, equivalent to saying “We need to see a stronger, more extreme improvement in the LSTM in order to determine that it is a superior model. We want to see an improvement so strong (a δobserved so large) that it would only have a probability of 0.03 or less under the null hypothesis.”
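
      The decision rule in this example can be written out directly. This is a minimal sketch; the function name `reject_null` is hypothetical, and the numbers are the ones used in the example above.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Reject H0 only if the p-value is strictly below the significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return p_value < alpha


p_value = 0.05
decisions = {alpha: reject_null(p_value, alpha) for alpha in (0.1, 0.03)}
print(decisions)  # {0.1: True, 0.03: False}
```

      The same p-value leads to opposite decisions under the two thresholds, which is why α has to be fixed before the experiment is run.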

      How should we choose an α? As noted above, it is impossible to actually know which hypothesis is correct, H0 or H1, and hence we can only strive to minimize the probability of choosing the wrong hypothesis. A small α ensures that we do not reject the null hypothesis easily, but it may also cause us to not reject the null hypothesis when we should. More technically, a small α yields a lower probability of a type I error and a higher probability of a type II error. A common practice is to choose an α that guarantees that the probability of making a type I error is upper bounded by a pre-defined desired value, while achieving the highest possible power, i.e., the lowest possible probability of making a type II error. Popular α values in the literature are 0.05 and 0.01.
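
      The trade-off between the two error types can be seen in a small simulation. The sketch below rests on assumptions that are not in the text: per-example score differences are drawn from a normal distribution with known standard deviation, a one-sided z-test supplies the p-value, and the sample size, effect size (0.4), and function names are hypothetical. The type I rate is measured at the boundary of the null hypothesis (true difference of zero), where it is largest.

```python
import random
from statistics import NormalDist, fmean
from typing import Sequence


def one_sided_z_p_value(diffs: Sequence[float], sigma: float) -> float:
    """p-value of a one-sided z-test of H0: mean <= 0, with known sigma."""
    z = fmean(diffs) * len(diffs) ** 0.5 / sigma
    return 1.0 - NormalDist().cdf(z)


def rejection_rate(true_mean: float, alpha: float, n: int = 30,
                   sigma: float = 1.0, trials: int = 5000, seed: int = 0) -> float:
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        diffs = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if one_sided_z_p_value(diffs, sigma) < alpha:
            rejections += 1
    return rejections / trials


rates = {}
for alpha in (0.05, 0.01):
    rates[alpha] = {
        # H0 is true (no difference): every rejection is a type I error.
        "type_1": rejection_rate(true_mean=0.0, alpha=alpha),
        # H1 is true (A is better): every non-rejection is a type II error.
        "type_2": 1.0 - rejection_rate(true_mean=0.4, alpha=alpha, seed=1),
    }
    print(f"alpha={alpha}: type I ~ {rates[alpha]['type_1']:.3f}, "
          f"type II ~ {rates[alpha]['type_2']:.3f}")
```

      Under these assumptions, lowering α from 0.05 to 0.01 reduces the simulated type I error rate and raises the type II error rate, in line with the argument above.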

      1 In this book we use the terms evaluation metric and evaluation measure interchangeably.

      2 To keep the discussion concise, throughout this book we assume that only one evaluation measure is used. Our framework can be easily extended to deal with multiple measures.

      3 For simplicity we consider a one-sided hypothesis; it can be easily reformulated as a two-sided hypothesis.

      Statistical Significance Tests

      In this book, we are interested in the process of comparing performance of different NLP algorithms in a statistically sound manner. How is this goal related to the calculation of the p-value? Well, calculating the p-value is inextricably linked to statistical significance testing, as we will attempt to explain next. Recall the definition of δ(X) in Equation (2.3). δ(X) is our test statistic for the hypothesis test defined in Equation (2.3).

      δ(X) is computed based on X, a specific data sample. In general, one can claim that if our data sample is representative of the data population, extreme values of δ(X) (either negative or positive) are less likely. In other words, the far left and right tails of the δ(X) distribution curve represent the unlikely events in which δ(X) obtains extreme values. What is the chance, given the null hypothesis is true, to have our δ(X) value land in those extreme tails? That probability is exactly the p-value obtained in the statistical test.

      Suppose the probability of obtaining a δ(X) this high (or higher) turns out to be very low under the null hypothesis. Is the null hypothesis then plausible given this δ(X)? Most likely, no. It is much more plausible that the performance of algorithm A is better. To summarize, because the probability of seeing such a δ(X) under the null hypothesis (i.e., the p-value) is very low (< α), we reject the null hypothesis and conclude that there is a statistically significant difference between the performance of the two algorithms. This shows that statistical significance tests and the calculation of the p-value are parallel tools that help quantify the likelihood of the observed results under the null hypothesis.

      In this chapter we move from describing the general framework of statistical significance testing to the specific considerations involved in the selection of a statistical significance test for an NLP application. We shall define the difference between parametric and nonparametric tests, and explore another important characteristic of the sample of scores that we work with, one that is highly critical for the design of a valid statistical test. We will present prominent tests useful for NLP setups, and conclude our discussion by providing a simple decision tree that aims to guide the process of selecting a significance