
      Figure 2.4: Illustrations of various uncertainty measures. (a–c) Binary classification tasks. Each plot shows the utility score as a function of Pθ(⊕|x), which is the posterior probability of the positive class. (d–f) Ternary classification tasks (three labels). Heatmap corners represent posterior distributions where one label is very likely, with the opposite edge plotting the probability range between the other two classes (when that label has low probability). The center of each heatmap is the uniform distribution.

      Figure 2.4 visualizes the implicit relationship among these uncertainty measures. For binary classification (top row), all three strategies are monotonic functions of one another. They are symmetric with a peak about Pθ(⊕|x) = 0.5. In effect, they all reduce to querying the instance that is closest to the decision boundary. For a three-label classification task (bottom row), the relationship begins to change. For all three measures, the most informative instance lies at the center of the triangular simplex, because this represents where the posterior label distribution is most uniform (and therefore most uncertain under the model). Similarly, the least informative instances are at the three corners, where one of the classes has extremely high probability (and thus little model uncertainty). The main differences lie in the rest of the probability space. For example, the entropy measure does not particularly favor instances for which only one of the labels is highly unlikely (i.e., along the outer side edges of the simplex), because the model is fairly certain that it is not the true label. The least confident and margin measures, on the other hand, consider such instances to be useful if the model has trouble distinguishing between the remaining two classes (i.e., at the midpoint of an outer edge). Intuitively, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.
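      To make these measures concrete, the following minimal sketch (ours, not from the text) computes the least confident, margin, and entropy utilities for a single posterior label distribution using NumPy; the function names and the example posterior are purely illustrative.

      import numpy as np

      def least_confident(probs):
          # 1 minus the probability of the most likely label.
          return 1.0 - np.max(probs)

      def margin(probs):
          # Difference between the two most probable labels,
          # negated so that higher scores mean greater uncertainty.
          second, best = np.sort(probs)[-2:]
          return -(best - second)

      def entropy(probs):
          # Shannon entropy of the posterior label distribution.
          probs = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
          return -np.sum(probs * np.log(probs))

      # Ternary example: one label is unlikely, the other two are nearly tied.
      posterior = np.array([0.48, 0.47, 0.05])
      print(least_confident(posterior), margin(posterior), entropy(posterior))

      For this example posterior, the margin score is close to its maximum because the top two labels are nearly indistinguishable, whereas the entropy score falls noticeably below that of the uniform distribution; this mirrors the difference along the outer edges of the simplex described above.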

      Not all machine learning is classification. In particular, we might want to use active learning to reduce the cost of training a model that predicts structured output, such as label sequences or trees. For example, Figure 2.5 illustrates how information extraction—the task of automatically extracting structured information like database entries from unstructured text—is framed as a sequence-labeling task. Let x = 〈x1, …, xT〉 be an observation sequence of length T with a corresponding label sequence y = 〈y1, …, yT〉. Words in a sentence correspond to tokens in x, which are mapped to labels in y.


      Figure 2.5: An information extraction example viewed as a sequence labeling task. (a) A sample input sequence x and corresponding label sequence y. (b) A probabilistic sequence model represented as a finite state machine, illustrating the path of 〈x, y〉 through the model.

      Figure 2.5(a) presents an example 〈x, y〉 pair. The labels indicate whether a given word belongs to an entity class of interest (org and loc in this case, for “organization” and “location,” respectively) or null otherwise. Unlike simple classification, x is not represented by a single feature vector, but rather a sequence of feature vectors: one for each token (i.e., word). One approach is to treat each token as an instance, and train a classifier that scans through the input sequence, assigning output labels to tokens independently. However, the word “Madison,” devoid of context, might refer to a location (city), an organization (university), or even a person. For tasks such as this, sequence models based on probabilistic finite state machines, such as hidden Markov models or linear-chain conditional random fields, are considered the state of the art. An example sequence model is shown in Figure 2.5(b). Such models define a probability distribution over all possible label sequences y, the number of which can grow exponentially with the sequence length T.
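      As a small illustration (ours; the sentence below is hypothetical and is not the one shown in Figure 2.5), such an 〈x, y〉 pair can be represented as two parallel sequences of tokens and labels:

      # A hypothetical <x, y> pair in the style of Figure 2.5(a).
      x = ["The", "ACME", "Inc.", "office", "in", "Madison", "is", "hiring"]
      y = ["null", "org", "org", "null", "null", "loc", "null", "null"]

      # Each token x[t] receives a label y[t]; here "Madison" can only be
      # resolved as a location (rather than an organization or a person)
      # by looking at its surrounding context.
      assert len(x) == len(y)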

      Fortunately, uncertainty sampling generalizes fairly easily to probabilistic structured prediction models. For example, the least confident strategy is popular for information extraction using sequences (Culotta and McCallum, 2005; Settles and Craven, 2008), because the most likely output sequence ŷ and the associated Pθ(ŷ|x) can be efficiently computed with dynamic programming. Selecting the best query is generally no more complicated or expensive than the standard inference procedure. The Viterbi algorithm (Corman et al., 1992) requires O(TM) time, for example, where T is the sequence length and M is the number of label states. It is often possible to perform “N-best” inference using a beam search as well (Schwartz and Chow, 1990), which finds the N most likely output structures under the model. This makes it simple to compute the necessary probabilities for ŷ1 and ŷ2 in the margin strategy, and comes at little extra computational expense: the complexity is O(TMN) for sequences, which for N = 2 merely doubles the runtime compared to the least confident strategy. Dynamic programs have also been developed to compute the entropy over all possible sequences (Mann and McCallum, 2007) or trees (Hwa, 2004), although this approach is significantly more expensive. The fastest entropy algorithm for sequence models requires O(TM²) time, which can be very slow when the number of label states is large. Furthermore, some structured models are so complex that they require approximate inference techniques, such as loopy belief propagation or Markov chain Monte Carlo (Koller and Friedman, 2009). In such cases, the least confident strategy is still straightforward since only the “best” prediction needs to be evaluated. However, the margin and entropy heuristics cease to be tractable and exact for these more complex models.
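      As a rough sketch of how these sequence-level scores might be computed (assuming a hypothetical model object with an n_best(x, n) method that returns the n most probable label sequences and their probabilities, e.g., via Viterbi or beam-search decoding):

      def least_confident_score(model, x):
          # 1 - P(y_hat | x), where y_hat is the most likely (Viterbi) sequence.
          (_, p_best), = model.n_best(x, n=1)
          return 1.0 - p_best

      def margin_score(model, x):
          # Negative margin between the two most likely sequences from an
          # N-best decoder; higher values indicate greater uncertainty.
          (_, p1), (_, p2) = model.n_best(x, n=2)
          return -(p1 - p2)

      def select_query(model, pool, score=least_confident_score):
          # Return the unlabeled sequence with the highest utility score.
          return max(pool, key=lambda x: score(model, x))

      With n = 1 this costs one Viterbi decoding per candidate sequence; with n = 2 the margin variant roughly doubles that cost, consistent with the complexities quoted above.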

      So far we have only discussed problems with discrete outputs—classification and structured prediction. Uncertainty sampling is also applicable to regression, i.e., learning tasks with continuous output variables. In this setting, the learner can simply query the unlabeled instance for which the model has the highest output variance in its prediction. Under a Gaussian assumption, the entropy of a random variable is a monotonic function of its variance, so this approach is much in the same spirit as entropy-based uncertainty sampling. Another interpretation of variance is the expected squared-loss of the model’s prediction. Closed-form estimates of variance can be computed for a variety of model classes, although they can require complex and expensive computations.
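      As one concrete (and entirely illustrative) example, a Gaussian process regressor exposes a predictive standard deviation directly; the sketch below, written with scikit-learn and made-up data, queries the pool instance with the highest predictive variance.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import RBF

      # A few labeled points from some unknown target function (toy data).
      X_labeled = np.array([[-6.0], [1.5]])
      y_labeled = np.array([0.1, 0.9])

      # Candidate pool of unlabeled inputs in [-10, 10].
      X_pool = np.linspace(-10, 10, 201).reshape(-1, 1)

      model = GaussianProcessRegressor(kernel=RBF(length_scale=2.0))
      model.fit(X_labeled, y_labeled)

      # Predictive standard deviation at each pool point; its square is the
      # output variance, which serves as the utility score.
      _, std = model.predict(X_pool, return_std=True)
      query = X_pool[np.argmax(std)]  # instance with the highest predictive variance
      print("query at x =", query)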

      Figure 2.6 illustrates variance-based uncertainty sampling using an artificial neural network. The target function is a Gaussian in the range [-10, 10], shown by the solid red line in the top row of plots. The network in this example has one hidden layer of three logistic units and one linear output unit. The network is initially trained with two labeled instances drawn at random (upper left plot), and its variance estimate (lower left plot) is used to select the first query. This process repeats for a few iterations, and after three queries the network can approximate the target function fairly well. In general, though, estimating the output variance is nontrivial and will depend on the type of model being used. This is in contrast to the utility measures in Section 2.3 for discrete outputs, which only require that the learner produce probability estimates. In Section 3.4, we will discuss active learning approaches using ensembles as a simpler way to estimate output variance. Active learning for regression has a long history in the statistics literature, generally referred to as optimal experimental design (Federov, 1972). However, the statistics community generally eschews uncertainty sampling in favor of more sophisticated strategies, which we will explore further in Chapter 4.

      Figure 2.6: Variance-based uncertainty sampling with an artificial neural network.