Title: Automatic Text Simplification
Author: Horacio Saggion
Publisher: Ingram
Genre: Programs
Series: Synthesis Lectures on Human Language Technologies
ISBN: 9781681731865
where HW is the percent of "hard" words in the document (a hard word is one with three or more syllables) and PSC is the polysyllable count, i.e., the number of words with three or more syllables in 30 sentences picked from the beginning, middle, and end of the document.
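These definitions correspond to the Gunning FOG index and the SMOG grade; their standard forms, restated here as a reconstruction since the equations themselves fall outside this excerpt, are

$$\mathrm{FOG} = 0.4\,(\mathrm{ASL} + \mathrm{HW}), \qquad \mathrm{SMOG} = 3 + \sqrt{\mathrm{PSC}},$$

where ASL is the average sentence length in words.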
Work on readability assessment has also included the idea of using a vocabulary or word list, which may contain words together with an indication of the age at which each word should be known [Dale and Chall, 1948a]. Such lists are useful for verifying whether a given text deviates from what should be known at a particular age or grade level, constituting a rudimentary form of readability language model.
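To make the idea concrete, a minimal Python sketch of such a word-list check follows; the word list, grade labels, and function name are invented for illustration and do not reproduce Dale and Chall's actual list.

# A rudimentary word-list readability check (toy data): given a mapping
# from words to the grade at which each is assumed known, report the
# fraction of tokens in a text that exceed a target grade level.
import re

def out_of_level_rate(text, word_grades, target_grade):
    # Tokenize crudely into lowercase alphabetic words.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    # Words missing from the list are treated as hard (a toy assumption).
    hard = sum(1 for t in tokens
               if word_grades.get(t, float("inf")) > target_grade)
    return hard / len(tokens)

# Toy usage: only "molecule" is above grade 4, so the rate is 0.2.
grades = {"the": 1, "a": 1, "cat": 1, "saw": 2, "molecule": 7}
print(out_of_level_rate("The cat saw a molecule.", grades, target_grade=4))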
Readability measures have begun to take center stage in assessing the output of text simplification systems; however, their direct applicability is not without controversy. First, a number of recent studies [Wubben et al., 2012, Zhu et al., 2010] have applied classical readability formulas to individual sentences, whereas the formulas themselves were designed for, and validated on, sizeable text samples and need long text pieces to yield good estimates; their applicability at the sentence level still lacks empirical justification and would need to be re-examined. Second, a number of studies suggest using readability formulas to guide the simplification process (e.g., De Belder [2014], Woodsend and Lapata [2011]). However, manipulating a text to match a specific readability score may be problematic, since chopping sentences or blindly replacing words can produce totally ungrammatical texts that nonetheless "cheat" the readability formulas (see, for example, Bruce et al. [1981], Davison et al. [1980]).
2.3 ADVANCED NATURAL LANGUAGE PROCESSING FOR READABILITY ASSESSMENT
Over the last decade, traditional readability assessment formulas have been criticized [Feng et al., 2009]. Advances in natural language processing have made possible a whole new set of studies in the area of readability. Current natural language processing approaches to readability assessment rely on automatic parsing, the availability of psycholinguistic information, and language modeling techniques [Manning et al., 2008] to develop more robust methods. Today it is possible to extract rich syntactic and semantic features from a text in order to analyze and understand how they interact to make it more or less readable.
2.3.1 LANGUAGE MODELS
Various works have considered corpus-based statistical methods for readability assessment. Si and Callan [2001] cast text readability assessment as a text classification or categorization problem where the classes could be grades or text difficulty levels. Instead of considering just surface linguistic features, they argue, quite naturally, that the content of the document is a key factor contributing to its readability. After observing that some surface features such as syllable count were not useful predictors of grade level in the dataset adopted (syllabi of elementary and middle school science courses of various readability levels from the Web), they combined a unigram language model with a sentence-length language model in the following approach:
$$P(g \mid d) = \lambda\, P_a(g \mid d) + (1 - \lambda)\, P_b(g \mid d)$$

where g is a grade level, d is the document, $P_a$ is a unigram language model, $P_b$ is a sentence-length distribution model, and $\lambda$ is a coefficient adjusted to yield optimal performance. Note that the probability parameters in $P_a$ are words, that is, the document should be seen as $d = w_1 \ldots w_n$, with $w_l$ the word at position l in the document, while in $P_b$ the parameters are sentence lengths, so a document with k sentences should be thought of as $d = l_1 \ldots l_k$, with $l_i$ the length of the i-th sentence. The $P_a$ probability distribution is a unigram model computed in the usual way using Bayes's theorem as:

$$P_a(g \mid d) = \frac{P(g) \prod_{i=1}^{n} P(w_i \mid g)}{P(d)}$$
The probabilities are estimates obtained by counting events over a corpus. For $P_b$, a normal distribution model with grade-specific mean and standard deviation is proposed. The combined model of content and sentence length achieves an accuracy of 75% on a blind test set, while the Flesch-Kincaid readability score correctly predicts only 21% of the grades.
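A minimal Python sketch of this combined model follows; the add-alpha smoothing, the dictionary-based model structures, and the uniform grade prior are simplifying assumptions of the sketch rather than details taken from Si and Callan [2001].

# Sketch of the combined content + sentence-length grade model. Each
# grade g has a model with unigram counts, a vocabulary size, and
# Gaussian sentence-length parameters (mu, sigma), assumed to be
# estimated beforehand from graded training corpora.
import math

def log_p_content(words, counts, vocab_size, alpha=1.0):
    # Log-likelihood of the words under a grade's unigram model,
    # with add-alpha smoothing (an assumption of this sketch).
    total = sum(counts.values())
    return sum(math.log((counts.get(w, 0) + alpha) /
                        (total + alpha * vocab_size))
               for w in words)

def log_p_length(lengths, mu, sigma):
    # Log-likelihood of the sentence lengths under the grade's
    # normal sentence-length model.
    return sum(-0.5 * ((l - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2.0 * math.pi))
               for l in lengths)

def posterior(log_likes):
    # Turn per-grade log-likelihoods into a posterior over grades
    # (uniform prior), using a numerically stable softmax.
    top = max(log_likes.values())
    unnorm = {g: math.exp(v - top) for g, v in log_likes.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

def predict_grade(words, lengths, models, lam=0.5):
    # Interpolate the two posteriors, mirroring
    # P(g|d) = lam * Pa(g|d) + (1 - lam) * Pb(g|d), and take the argmax.
    pa = posterior({g: log_p_content(words, m["counts"], m["vocab_size"])
                    for g, m in models.items()})
    pb = posterior({g: log_p_length(lengths, m["mu"], m["sigma"])
                    for g, m in models.items()})
    return max(pa, key=lambda g: lam * pa[g] + (1 - lam) * pb[g])

Working in log space and normalizing over grades avoids numerical underflow on long documents while preserving the interpolation of the two posteriors.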
2.3.2 READABILITY AS CLASSIFICATION
Schwarm and Ostendorf [2005] see readability assessment as classification and propose the use of SVM algorithms for predicting the readability level of a text based on a set of textual features. In order to train a readability model, they rely on several sources: (i) documents collected from the Weekly Reader educational newspaper with 2nd–5th grade levels; (ii) documents from the Encyclopedia Britannica dataset compiled by Barzilay and Elhadad [2003], containing original encyclopedic articles (115) and their corresponding children's versions (115); and (iii) CNN news stories (111) from the LiteracyNet organization, available in original and abridged (or simplified) versions. They borrow the idea of Si and Callan [2001], devising features based on statistical language modeling. More concretely, given a corpus of documents of, say, grade k, they create a language model for that grade. Taking 3-gram sequences as units for modeling the text, the probability p(w) of a word sequence $w = w_1 \ldots w_n$ in the k-grade corpus is computed as:

$$p(w) = \prod_{i=1}^{n} p(w_i \mid w_{i-2}, w_{i-1})$$
where the 3-gram probabilities are estimated using 3-gram frequencies observed in the k-grade documents and smoothing techniques to account for unobserved events. Given the probabilities of a sequence w in the different models (one per grade), a likelihood ratio of sequence w is defined as:

$$LR(w, k) = \frac{p(w \mid k)\, p(k)}{\sum_{k'} p(w \mid k')\, p(k')}$$
where the prior probabilities p(k) can be assumed to be uniform. The LR(w, k) values already give some information on how likely the text is to be of a certain complexity or grade. Additionally, the authors use perplexity as an indicator of the fit of a particular text to a given model, where low perplexity for a text t and model m indicates a better fit of t to m. Worth noting is the reduction of the language model features, by filtering on information gain (IG) values, to the 276 most discriminative words plus 56 part-of-speech tags (used for words not selected by IG).

SVMs are trained on the graded dataset (Weekly Reader), where each text is represented as a set of features including traditional superficial readability features, such as average sentence length, average number of syllables per word, and the Flesch-Kincaid index, together with more sophisticated syntax-based features, vocabulary features, and language model features. Syntax-based features are extracted from parsed sentences [Charniak, 2000] and include average parse tree height, average number of noun phrases, average number of verb phrases, and average number of clauses (SBARs in the Penn Treebank tag set). Vocabulary features account for out-of-vocabulary (OOV) word occurrences in the text; these are computed as percentages of words or word types not found among the most common 100, 200, and 500 words in 2nd-grade texts. As for language model features, there are 12 perplexity values from 12 different language models, one per combination of dataset (Britannica or CNN), version (adult or children's), and n-gram order (unigrams, bigrams, or trigrams, combining discriminative words and POS tags).

The authors obtained better results than traditional readability formulas when their language model features are used.
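To make the language-model side of this pipeline concrete, the following Python sketch computes the likelihood ratios and perplexities described above; the trigram_logprob callback stands in for a real smoothed trigram estimator, which is assumed rather than implemented here.

# Per-grade trigram scoring: likelihood ratio of a word sequence under
# each grade's model, plus perplexity as a model-fit feature.
# `trigram_logprob(model, w, u, v)` must return log p(w | u, v) under
# `model`; its implementation (smoothing included) is assumed.
import math

def sequence_logprob(model, words, trigram_logprob):
    # Sum log p(w_i | w_{i-2}, w_{i-1}) over the sequence, padding the
    # start with sentence-boundary markers.
    padded = ["<s>", "<s>"] + list(words)
    return sum(trigram_logprob(model, padded[i], padded[i - 2], padded[i - 1])
               for i in range(2, len(padded)))

def likelihood_ratios(words, grade_models, trigram_logprob):
    # LR(w, k) = p(w|k) p(k) / sum_k' p(w|k') p(k') with uniform priors,
    # computed stably in log space.
    logs = {k: sequence_logprob(m, words, trigram_logprob)
            for k, m in grade_models.items()}
    top = max(logs.values())
    unnorm = {k: math.exp(v - top) for k, v in logs.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

def perplexity(model, words, trigram_logprob):
    # Perplexity of the text under one grade model; lower values
    # indicate a better fit, as used for the SVM perplexity features.
    lp = sequence_logprob(model, words, trigram_logprob)
    return math.exp(-lp / max(len(words), 1))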