Title: Automatic Text Simplification
Author: Horacio Saggion
Publisher: Ingram
Genre: Programs
Series: Synthesis Lectures on Human Language Technologies
ISBN: 9781681731865
2.3.3 DISCOURSE, SEMANTICS, AND COHESION IN ASSESSING READABILITY
Feng et al. [2009] are especially interested in readability for individuals with mild-level intellectual disabilities (MID) (e.g., an intelligence quotient (IQ) in the 55–70 range) and in how to select appropriate reading material for this population. The authors note that people with MID differ from adults with low literacy in that the former have problems with working memory and with discourse representation, which complicates the processes of recalling information and drawing inferences while reading a text. The authors therefore argue that readability assessment tools should be designed to take the specific issues of these users into account. Since their main research hypothesis is that the number of entity mentions in a text is related to readability issues for people with MID, they design a series of features accounting for entity density. As for data for studying this specific population, they created a small (20 documents in original and simplified versions) but rather unique ID dataset for testing their readability prediction model. The dataset is composed of news documents with aggregated readability scores based on the number of correct answers to multiple-choice questions that 14 MID individuals gave after reading the texts. In order to train a model, they rely on the availability of paired and generic graded corpora. The paired dataset (not graded) is composed of original articles from Encyclopedia Britannica written for adults and their adapted versions for children, as well as CNN news stories from the LiteracyNet organization available in original and abridged (or simplified) versions. The graded dataset is composed of articles for students in grades 2–5.
As for the model's features, although many of the features studied were already available (or had close counterparts) in previous work, novel features take into account the number and density of entity mentions (i.e., nouns and named entities), the number of lexical chains in the text, the average lexical chain length, etc. These features are assessed on the paired datasets so as to identify their discriminative power, with all but two features left out of the model. Three rich readability prediction models (corresponding to basic features, cognitively motivated features, and the union of all features) are then trained on the graded dataset (80% of it) using a linear regression algorithm (unlike the above approach). Evaluation is carried out on the remaining 20% of the dataset and shows a considerable reduction of the error (the difference between predicted and gold grades) of the models when compared with a baseline readability formula (the Flesch-Kincaid index [Kincaid et al.]). The final user-specific evaluation is conducted on the ID corpus, where the model is evaluated by computing the correlation between the system output and the human readability scores associated with the texts.
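As an illustration, the following is a minimal sketch (not the authors' implementation) of how entity-density features might feed a linear regression grade predictor; the exact feature set and the use of spaCy as the entity detector are assumptions made here for concreteness.

    # Hypothetical sketch of entity-density features plus a linear regression
    # readability model in the spirit of Feng et al. [2009]; the feature set
    # and the spaCy pipeline are illustrative assumptions, not the authors' code.
    import spacy
    from sklearn.linear_model import LinearRegression

    nlp = spacy.load("en_core_web_sm")

    def entity_density_features(text):
        doc = nlp(text)
        n_words = max(sum(1 for t in doc if t.is_alpha), 1)
        nouns = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
        ents = doc.ents
        return [len(ents),             # entity mentions in the text
                len(nouns),            # noun mentions in the text
                len(ents) / n_words,   # entity density per word
                len(nouns) / n_words]  # noun density per word

    # texts: graded training documents; grades: gold grade levels (e.g., 2-5)
    def train_grade_model(texts, grades):
        X = [entity_density_features(t) for t in texts]
        return LinearRegression().fit(X, grades)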
Feng et al. [2010] extended the previous work by incorporating additional features (e.g., the language model and out-of-vocabulary features of Schwarm and Ostendorf [2005], and entity coreference and coherence-based features based on those of Barzilay and Lapata [2008] and Pitler and Nenkova [2008]), assessing the performance of each group of features, and comparing their model to state-of-the-art competing approaches (i.e., mainly replicating the models of Schwarm and Ostendorf [2005]). Experimental results using SVM and logistic regression classifiers show that, although accuracy is still limited (around 74% with SVMs and selected features), important gains are obtained from the use of more sophisticated, linguistically motivated features.
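A hedged sketch of such a classifier comparison, assuming a precomputed feature matrix X (combining the feature groups above) and gold grade labels y; scikit-learn is used purely for illustration:

    # Illustrative comparison of SVM and logistic regression grade classifiers;
    # X and y are assumed inputs, not artifacts of the original experiments.
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def compare_classifiers(X, y):
        for name, clf in (("SVM", SVC(kernel="linear")),
                          ("logistic regression", LogisticRegression(max_iter=1000))):
            acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
            print(f"{name}: mean cross-validated accuracy = {acc:.3f}")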
Heilman et al. [2007] are interested in the effect of pedagogically motivated features on the development of readability assessment tools, especially in the case of texts for second language (L2) learners. More specifically, they suggest that since L2 learners acquire the lexicon and grammar of the target language from exposure to material specifically chosen for the acquisition process, both lexicon and grammar should play a role in assessing the reading difficulty of L2 learning material. In terms of lexicon, a unigram language model is built for each grade level so as to assess the likelihood of a given text under a given grade (see Section 2.3.1 for a similar approach). Where syntactic information is concerned, two different sets of features are proposed: (i) a set of 22 grammatical constructions (e.g., passive voice, relative clause) identified in sentences after they are parsed with the Stanford Parser [Klein and Manning, 2003], which produces syntactic constituent structures; and (ii) 12 grammatical features (e.g., sentence length, verb tenses, part-of-speech tags) which can be identified without the need for a syntactic parser. All feature values are numerical, indicating the number of times the particular feature occurs per word in the text (note that other works take averages on a per-sentence basis). Texts represented as vectors of features and values are used in a k-Nearest Neighbor (kNN) algorithm (see Mitchell [1997]) to predict the readability grade of unseen texts: a given text t is compared (using a similarity measure) to all available vectors and the k closest texts are retrieved; the grade level of t is then the most frequent grade among the k retrieved texts. While the lexical model above produces a probability for each text and grade, the confidence of the kNN prediction can be computed as the proportion of the k retrieved texts with the same class as text t. The language-model probability and the kNN confidence can then be interpolated to obtain a joint grade prediction model (see the sketch below). To evaluate the individual models and their combinations, the authors use one dataset for L1 learners (a web corpus [Collins-Thompson and Callan, 2004]) and a second dataset for L2 learners (collected from several sources). Prediction performance is measured using correlation and mean squared error (MSE), since the authors argue that readability assessment is better viewed as a regression problem. Overall, although the lexical model in isolation is superior to the two grammatical models (on both datasets), their combination shows significant advantages. Moreover, although the complex syntactic features have better predictive power than the simple syntactic features, the small difference in performance may justify not using a parser.
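The joint prediction can be sketched as follows; this is a hypothetical rendering in which the similarity function, the per-grade language-model probability lm_prob, and the interpolation weight alpha are assumptions rather than details taken from the paper:

    # Sketch of the joint model described above: kNN confidence over feature
    # vectors interpolated with a per-grade unigram LM probability.
    from collections import Counter

    def knn_confidence(text_vec, train, k, sim):
        """train: list of (feature_vector, grade) pairs; higher sim = closer."""
        nearest = sorted(train, key=lambda tv: sim(text_vec, tv[0]), reverse=True)[:k]
        counts = Counter(grade for _, grade in nearest)
        return {g: c / k for g, c in counts.items()}  # confidence per grade

    def joint_grade(text, text_vec, train, grades, k, sim, lm_prob, alpha=0.5):
        conf = knn_confidence(text_vec, train, k, sim)
        scores = {g: alpha * lm_prob(text, g) + (1 - alpha) * conf.get(g, 0.0)
                  for g in grades}
        return max(scores, key=scores.get)  # grade with the highest joint score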
Although these works are interesting because they consider a different user population, they still lack an analysis of the effect that different automatic tools have on readability assessment performance: since parsers, coreference resolution systems, and lexical chainers are imperfect, an important question to ask is how changes in their performance affect the model outcome.
Crossley et al. [2007] investigate three Coh-Metrix variables [Graesser et al., 2004] for assessing the readability of texts from the Bormuth corpus, a dataset in which scores are assigned to texts based on aggregated answers from informants using cloze tests. The number of words per sentence (an estimate of syntactic complexity), argument overlap (the number of sentences sharing an argument, i.e., a noun, pronoun, or noun phrase), and word frequencies from the CELEX database [Celex, 1993] were used in a multiple regression analysis. The correlation between the variables used and the text scores was very high.
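A minimal sketch of such a regression, assuming the three variables have already been extracted for each text (the extraction pipeline itself is not shown and the function names are hypothetical):

    # Illustrative multiple regression of cloze-based text scores on the three
    # variables above; inputs are parallel arrays with one value per text.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_cloze_regression(words_per_sentence, argument_overlap, word_freq, scores):
        X = np.column_stack([words_per_sentence, argument_overlap, word_freq])
        model = LinearRegression().fit(X, scores)
        return model, model.score(X, scores)  # R^2: variance explained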
Flor and Klebanov [2014] carried out one of the few studies (see also Feng et al. [2009]) assessing lexical cohesion [Halliday and Hasan, 1976] for text readability assessment. Since cohesion relates to the way elements in a text are tied together to allow text understanding, a more cohesive text may well be perceived as more readable than a less cohesive one. Flor and Klebanov define lexical tightness, a metric based on a normalized form (NPMI) of the pointwise mutual information of Church and Hanks [1990], which measures the strength of association between words in a given document based on co-occurrence statistics compiled from a large corpus. The lexical tightness of a text is the average of the NPMI values over pairs of content words in the text. It is shown that lexical tightness correlates well with grade levels: simple texts tend to be more lexically cohesive than difficult ones.
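Concretely, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), where PMI(x, y) = log [p(x, y) / (p(x) p(y))], so that values fall in [-1, 1]. A minimal sketch follows, assuming word and word-pair probabilities have been estimated beforehand from a large background corpus:

    # Sketch of lexical tightness: average NPMI over pairs of content words.
    # p_word maps a word to its corpus probability; p_pair maps a word pair
    # to its co-occurrence probability (both assumed precomputed elsewhere).
    import math
    from itertools import combinations

    def npmi(x, y, p_word, p_pair):
        pxy = p_pair.get((x, y)) or p_pair.get((y, x))
        if not pxy:
            return -1.0  # NPMI tends to -1 for words that never co-occur
        pmi = math.log(pxy / (p_word[x] * p_word[y]))
        return pmi / -math.log(pxy)

    def lexical_tightness(content_words, p_word, p_pair):
        pairs = list(combinations(content_words, 2))
        if not pairs:
            return 0.0
        return sum(npmi(x, y, p_word, p_pair) for x, y in pairs) / len(pairs)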
2.4 READABILITY ON THE WEB
There is increasing interest in assessing document readability in the context of web search engines, and in particular for the personalization of web search results: search results that, in addition to matching the user's query, are ranked according to their readability (e.g., from easier to more difficult). One approach is to display search results along with their readability levels (Google Search offered in the past the possibility of filtering search results by reading level).