Automatic Text Simplification. Horacio Saggion

…could select material based on its reading-level assessment; however, this is limited in that the profile or expertise of the reader (i.e., search behavior) is not taken into consideration when presenting the results. Collins-Thompson et al. [2011] introduced a tripartite approach to personalizing search results by reading level (documents appropriate to the user's reading level should be ranked higher), which combines user profiles (to estimate the user's reading level), document difficulty, and a re-ranking strategy that moves documents more appropriate for the reader toward the top of the result list. They use a language-model readability assessment method that leverages word difficulty computed from a web corpus in which pages have been assigned grade levels by their authors [Collins-Thompson and Callan, 2004]. The method departs from traditional readability formulas in that it is based on a probabilistic estimation that models individual word complexity as a distribution across grade levels; text readability is then estimated from the distribution of the words occurring in the document. The authors argue that traditional formulas, which rely on morphological word complexity and sentence complexity (e.g., length) features and sometimes require passages of a certain size (i.e., at least 100 words) to yield an accurate estimate, are ill-suited to a web context where sentence boundaries are sometimes nonexistent and pages may contain very little textual content (e.g., only images and captions). To estimate the reading proficiency of users, and also to train some of the model parameters and evaluate the approach, they rely on proprietary data on user interactions with a web search engine (queries, search results, and relevance assessments). With this dataset, the authors estimate, from the pages a user has visited and read, the probability that the reader finds the readability level of a given web page appropriate. A re-ranking algorithm, LambdaMART [Wu et al., 2010], is then used to improve the search results and bring results more appropriate for the user to the top of the list. The algorithm is trained using reading levels for pages and snippets (i.e., search-result summaries), the user's reading level, query characteristics (e.g., length), reading-level interactions (e.g., snippet-page, query-page), and confidence values for many of the computed features. Re-ranking experiments across a variety of query types indicate that search results improve by at least one rank for all queries (i.e., the appropriate URL was ranked higher than with the default search-engine ranking algorithm).

      Related to work on web document readability is the question of how the way web pages are parsed (i.e., how the document text is extracted and sentence boundaries are identified) influences the outcome of traditional readability measures. Palotti et al. [2015] study different tools for extracting and sentence-splitting the textual content of web pages, in combination with different traditional readability formulas. They find that the ranking of web search results varies considerably depending on the readability formula and text-processing method used, and that, for a given formula, some text-processing methods produce document rankings that are only marginally correlated.
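
      Returning to the grade-level language model of Collins-Thompson and Callan [2004] mentioned above, the following is a minimal Python sketch of the underlying idea: each grade level is represented by a smoothed unigram distribution estimated from grade-labeled pages, and a document is assigned the grade whose model gives its words the highest likelihood. The data layout (`grade_unigrams` as per-grade word counts) and the add-alpha smoothing are illustrative assumptions, not the authors' exact estimation procedure, which models word complexity as a full distribution over grade levels rather than taking a hard argmax.

```python
from collections import Counter
import math

def grade_log_likelihood(doc_tokens, grade_counts, vocab_size, alpha=0.1):
    """Log-likelihood of a tokenized document under one grade-level
    unigram model, with add-alpha smoothing (illustrative choice)."""
    total = sum(grade_counts.values())
    doc = Counter(doc_tokens)
    return sum(
        freq * math.log((grade_counts.get(word, 0) + alpha)
                        / (total + alpha * vocab_size))
        for word, freq in doc.items()
    )

def predict_grade(doc_tokens, grade_unigrams, vocab_size):
    """Assign the grade level whose unigram model best explains the document.

    grade_unigrams: dict mapping a grade level to a Counter of word counts
    estimated from pages labeled with that grade (hypothetical input).
    """
    return max(
        grade_unigrams,
        key=lambda g: grade_log_likelihood(doc_tokens, grade_unigrams[g], vocab_size),
    )
```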

      Given the proliferation of readability formulas, one may wonder how they differ and which one should be used to assess the difficulty of a given text. Štajner et al. [2012] study the correlation between a number of classic readability formulas and linguistically motivated features across different corpora, in order to identify which formula or linguistic characteristics may be used to select appropriate texts for people with an autism spectrum disorder.

      The corpora included in the study were 170 texts from Simple Wikipedia, 171 news texts from the METER corpus, 91 texts from the health section of the British National Corpus, and 120 fiction texts from the FLOB corpus.4 The readability formulas studied were the Flesch Reading Ease score, the Flesch-Kincaid grade level, the SMOG grade, and the FOG index. According to the authors, the linguistically motivated features were designed to detect possible “linguistic obstacles” in a text that may hinder readability. They include features of structural complexity, such as the average number of major POS tags per sentence and the average numbers of infinitive markers, coordinating and subordinating conjunctions, and prepositions, as well as features indicative of ambiguity, such as the average number of senses per word and the average numbers of pronouns and definite descriptions per sentence. The authors first computed, over each corpus, the average of each readability score to identify which corpora were “easier” according to the formulas. To their surprise, according to all four formulas the corpus of fiction texts appears to be the easiest to read, with health-related documents at the same readability level as Simple Wikipedia articles. In another experiment, they study the correlation of each pair of formulas in each corpus; their results indicate almost perfect correlation, suggesting the formulas could be used interchangeably. Their last experiment, which studies the correlation between the Flesch-Kincaid formula and the different linguistically motivated features, indicates that although most features are strongly correlated with the readability formula, the strength of the correlation varies from corpus to corpus. The authors suggest that, because of this correlation with linguistic indicators of reading difficulty, the Flesch score could be used to assess the difficulty level of texts for their target audience.
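
      For reference, the four formulas compared in the study can be computed from simple surface counts, as in the Python sketch below. The vowel-group syllable counter is a crude stand-in (an assumption for illustration only), and `statistics.correlation` (Python 3.10+) is used to reproduce the kind of pairwise correlation analysis reported above.

```python
import math
import re
from statistics import correlation  # Pearson correlation (Python 3.10+)

def count_syllables(word):
    """Very rough syllable estimate: number of vowel groups (illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    """Classic surface-based readability formulas for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_syll = sum(syllables)
    n_poly = sum(1 for s in syllables if s >= 3)   # polysyllabic ("complex") words
    wps, spw = n_words / n_sent, n_syll / n_words  # words/sentence, syllables/word
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "smog": 1.0430 * math.sqrt(n_poly * 30 / n_sent) + 3.1291,
        "fog": 0.4 * (wps + 100 * n_poly / n_words),
    }

def formula_correlation(texts, formula_a, formula_b):
    """Pearson correlation between two formulas over a corpus of texts."""
    scores = [readability_scores(t) for t in texts]
    return correlation([s[formula_a] for s in scores],
                       [s[formula_b] for s in scores])
```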

      Most readability studies consider the text as the unit of assessment (although Collins-Thompson et al. [2011] also present a study of text snippets and search queries); however, some authors have recently become interested in assessing the readability of shorter units such as sentences. Dell’Orletta et al. [2014a,b], in addition to presenting a readability study for Italian in which they test the value of different features for classifying texts as easy or difficult, also address the problem of classifying sentences as easy-to-read or difficult-to-read. The problem they face is the unavailability of annotated corpora for the task, so they rely on documents from two different providers: easy-to-read documents are sampled from the easy-to-read newspaper Due Parole5 while difficult-to-read documents are sampled from the newspaper La Repubblica.6 The features included in their study for document classification are raw text features, such as sentence-length and word-length averages; lexical features, such as the type/token ratio (i.e., lexical variety) and the percentage of words on different Italian word reference lists; morpho-syntactic features, such as the probability distribution of POS tags in the text and the ratio of content words (nouns, verbs, adjectives, adverbs) to the number of words in the text; and syntactic features, such as the average depth of syntactic parse trees. For sentence readability classification (easy-to-read vs. difficult-to-read), they prepared four different datasets derived from the document collections. Sentences from Due Parole are considered easy-to-read; however, assuming that all sentences from La Repubblica are difficult to read would, in principle, be incorrect. They therefore create four different sentence classification datasets to train models and to assess the need for manually annotated data: the first (s1) is a balanced dataset of easy-to-read and difficult-to-read sentences (1310 sentences of each class); the second (s2) is an unbalanced dataset of easy-to-read (3910) and assumed difficult-to-read sentences (8452); the third (s3) is a balanced dataset of easy-to-read (3910) and assumed difficult-to-read sentences (3910); and, finally, the fourth (s4) is a balanced dataset of easy-to-read (1310) and assumed difficult-to-read sentences (1310). They perform classification experiments with maximum entropy models to discriminate between easy-to-read and difficult-to-read sentences, evaluating on held-out manually annotated data. They note that although the gold-standard dataset (s1) provides the best results in terms of accuracy, training on a balanced dataset of “assumed” difficult-to-read sentences (i.e., s3) comes close behind, suggesting that the accuracy gain may not justify the effort of manually filtering out difficult sentences to create a dataset. They additionally study feature contributions to sentence and document readability, noting that local features based on syntax are more relevant for sentence classification, while global features such as average sentence and word lengths or the type/token ratio are more important for document readability assessment.
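
      As a rough illustration of the sentence-level setup, the sketch below trains a maximum-entropy model (logistic regression, here via scikit-learn) to separate easy-to-read from assumed difficult-to-read sentences and evaluates it on held-out gold sentences. The three surface features are placeholders chosen for brevity; the actual study relies on much richer lexical, morpho-syntactic, and parse-tree features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_features(sentence):
    """Toy surface features (the original study used far richer ones)."""
    tokens = sentence.split()
    n = max(len(tokens), 1)
    return [
        len(tokens),                           # sentence length in tokens
        sum(len(t) for t in tokens) / n,       # average word length
        len({t.lower() for t in tokens}) / n,  # type/token ratio
    ]

def train_sentence_classifier(easy_sentences, hard_sentences):
    """Maximum-entropy (logistic regression) classifier distinguishing
    easy-to-read (label 0) from difficult-to-read (label 1) sentences."""
    X = np.array([sentence_features(s) for s in easy_sentences + hard_sentences])
    y = np.array([0] * len(easy_sentences) + [1] * len(hard_sentences))
    return LogisticRegression(max_iter=1000).fit(X, y)

def accuracy_on_gold(clf, gold_sentences, gold_labels):
    """Evaluate on held-out, manually annotated sentences."""
    X = np.array([sentence_features(s) for s in gold_sentences])
    return float((clf.predict(X) == np.array(gold_labels)).mean())
```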

      Vajjala and Meurers [2014] investigate the issue of readability assessment for English, also focusing on the readability of sentences. Their approach is based on training two different regression algorithms on WeeBit, a corpus of 625 graded documents for age groups 7 to 16 years that they have specifically …