Title: Automatic Text Simplification
Author: Horacio Saggion
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on Human Language Technologies
isbn: 9781681731865
2.5 ARE CLASSIC READABILITY FORMULAS CORRELATED?
Given the proliferation of readability formulas, one may wonder how they differ and which one should be used for assessing the difficulty of a given text. Štajner et al. [2012] study the correlation of a number of classic readability formulas and linguistically motivated features using different corpora to identify which formula or linguistic characteristics may be used to select appropriate text for people with an autism-spectrum disorder.
The corpora included in the study were: 170 texts from Simple Wikipedia, 171 texts from a collection of news texts from the METER corpus, 91 texts from the health section of the British National Corpus, and 120 fiction texts from the FLOB corpus. The readability formulas studied were the Flesch Reading Ease score, the Flesch-Kincaid grade level, the SMOG grading, and the FOG index. According to the authors, the linguistically motivated features were designed to detect possible “linguistic obstacles” in a text that may hinder readability. They include features of structural complexity such as the average number of major POS tags per sentence and the average number of infinitive markers, coordinating and subordinating conjunctions, and prepositions. Features indicative of ambiguity include the average number of senses per word and the average number of pronouns and definite descriptions per sentence. The authors first computed, over each corpus, the average of each readability score to identify which corpora were “easier” according to the formulas. To their surprise, and according to all four formulas, the corpus of fiction texts appears to be the easiest to read, with health-related documents at the same readability level as Simple Wikipedia articles. In another experiment, they study the correlation of each pair of formulas in each corpus; their results show almost perfect correlation, suggesting that the formulas could be used interchangeably. Their last experiment, which studies the correlation between the Flesch-Kincaid formula and the different linguistically motivated features, indicates that although most features are strongly correlated with the readability formula, the strength of the correlation varies from corpus to corpus. The authors suggest that because of the correlation of the readability formula with linguistic indicators of reading difficulty, the Flesch score could be used to assess the difficulty level of texts for their target audience.
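To make the idea of correlated formulas concrete, the following is a minimal Python sketch that computes two of the formulas mentioned above, the Flesch Reading Ease score and the Flesch-Kincaid grade level, and then the Pearson correlation between them over a handful of texts. It is not the setup used by Štajner et al. [2012]: the syllable counter is a crude vowel-group heuristic, the three sample texts are invented for illustration, and all function names are assumptions of this sketch.

```python
import re
from math import sqrt


def count_syllables(word):
    # Crude heuristic (an assumption of this sketch): one syllable per
    # group of consecutive vowels, with a minimum of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    # Returns (number of sentences, number of words, number of syllables).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text):
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text):
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def pearson(xs, ys):
    # Pearson correlation coefficient of two equally long sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented sample texts, ordered roughly from easy to hard.
texts = [
    "The cat sat. The dog ran.",
    "The children walked to the little school. They were happy.",
    "Institutional considerations necessitate comprehensive reorganization.",
]

fre_scores = [flesch_reading_ease(t) for t in texts]
fk_scores = [flesch_kincaid_grade(t) for t in texts]
r = pearson(fre_scores, fk_scores)
print(fre_scores, fk_scores, r)
```

Both formulas are linear combinations of the same two quantities (words per sentence and syllables per word) with opposite signs, so a strong negative correlation between them is to be expected: a higher Reading Ease score means an easier text, whereas a higher grade level means a harder one.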
2.6 SENTENCE-LEVEL READABILITY ASSESSMENT
Most readability studies consider the text as the unit for assessment (although Collins-Thompson et al. [2011] present a study also for text snippets and search queries); however, some authors have recently become interested in assessing the readability of short units such as sentences. Dell’Orletta et al. [2014a,b], in addition to presenting a readability study for Italian where they test the value of different features for classification of texts into easy or difficult, also address the problem of classifying sentences as easy-to-read or difficult-to-read. The problem they face is the unavailability of annotated corpora for the task, so they rely on documents from two different providers: easy-to-read documents are sampled from the easy-to-read newspaper Due Parole, while the difficult-to-read documents are sampled from the newspaper La Repubblica. Features for document classification included in their study are: raw text features such as sentence-length and word-length averages; lexical features such as type/token ratio (i.e., lexical variety) and percentage of words on different Italian word reference lists, etc.; morpho-syntactic features such as probability distributions of POS tags in the text, ratio of the number of content words (nouns, verbs, adjectives, adverbs) to number of words in the text, etc.; and syntactic features such as average depth of syntactic parse trees, etc. For sentence readability classification (easy-to-read vs. difficult-to-read), they prepared four different datasets based on the document classification task. Sentences from Due Parole are considered easy-to-read; however, assuming that all sentences from La Repubblica are difficult would in principle be incorrect.
Therefore, they create four different sentence classification datasets for training models and assess the need for manually annotated data: the first set (s1) is a balanced dataset of easy-to-read and difficult-to-read sentences (1310 sentences of each class); the second dataset (s2) is an unbalanced dataset of easy-to-read (3910 sentences) and assumed difficult-to-read sentences (8452); the third dataset (s3) is a balanced dataset with easy-to-read (3910 sentences) and assumed difficult-to-read sentences (3910); and, finally, the fourth dataset (s4) is also balanced, with easy-to-read sentences (1310) and assumed difficult-to-read sentences (1310). They perform classification experiments with maximum entropy models to discriminate between easy-to-read and difficult-to-read sentences, using held-out manually annotated data. They noted that although using the gold-standard dataset (s1) provides the best results in terms of accuracy, using a balanced dataset of “assumed” difficult-to-read sentences (i.e., s3) for training is close behind, suggesting a trade-off between accuracy and the effort of manually filtering out difficult sentences to create a dataset. They additionally study feature contribution to sentence readability and document readability, noting that local features based on syntax are more relevant for sentence classification while global features such as average sentence and word lengths or type/token ratio are more important for document readability assessment.
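The classification setup described above can be illustrated with a small Python sketch. For two classes, a maximum entropy model is equivalent to logistic regression, so the sketch trains a logistic regression classifier by batch gradient descent on a tiny balanced set of sentences. Everything specific here is an assumption made for illustration: the English example sentences are invented, the two raw-text features (sentence length and average word length) are far simpler than the feature set of Dell’Orletta et al. [2014a,b], and the function names and hyperparameters are hypothetical.

```python
import math


def features(sentence):
    # Two raw-text features plus a bias term (illustrative only).
    words = [w.strip(".,;:!?") for w in sentence.split()]
    n = len(words)
    avg_len = sum(len(w) for w in words) / n
    # Divide by 10 so both features stay on a comparable scale.
    return [1.0, n / 10.0, avg_len / 10.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def train(sentences, labels, epochs=2000, lr=0.5):
    # Batch gradient descent on the logistic (binary maximum entropy) loss.
    samples = [features(s) for s in sentences]
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            p = sigmoid(score(weights, x))
            for j, xi in enumerate(x):
                grad[j] += (p - y) * xi
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


def predict(weights, sentence):
    p = sigmoid(score(weights, features(sentence)))
    return "difficult" if p >= 0.5 else "easy"


# Invented, balanced toy training set (label 0 = easy, 1 = difficult).
EASY = [
    "The cat sat on the mat.",
    "We like to play.",
    "He has a red ball.",
    "The sun is hot.",
]
DIFFICULT = [
    "Notwithstanding considerable institutional resistance, the committee "
    "subsequently implemented comprehensive regulatory reforms.",
    "The unprecedented macroeconomic deterioration necessitated extraordinary "
    "governmental intervention throughout the affected jurisdictions.",
    "Epidemiological investigations demonstrated statistically significant "
    "correlations between socioeconomic disadvantage and cardiovascular mortality.",
    "Contemporary philosophical discourse increasingly emphasizes "
    "interdisciplinary methodological considerations.",
]

model = train(EASY + DIFFICULT, [0] * len(EASY) + [1] * len(DIFFICULT))
print(predict(model, "The dog ran to me."))
```

In the study itself, the interesting comparison is between training sets: replacing the manually verified difficult sentences with sentences merely assumed to be difficult changes only the labels passed to the training step, which is why the four datasets can be compared with the same model and the same held-out gold data.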
Vajjala and Meurers [2014] investigate the issue of readability assessment for English, also focusing on the readability of sentences. Their approach is based on training two different regression algorithms on WeeBit, a corpus of 625 graded documents for age groups 7 to 16 years that they have specifically