Title: Automatic Text Simplification
Author: Horacio Saggion
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on Human Language Technologies
ISBN: 9781681731865
CHAPTER 2
Readability and Text Simplification
A key question in text simplification research is how to identify the complexity of a given text so that a decision can be made on whether or not to simplify it. Identifying the complexity of a text or sentence can help assess whether the output produced by a text simplification system matches the reading ability of the target reader. It can also be used to compare different systems in terms of the complexity or simplicity of their output. Several comprehensive surveys exist on the related topic of text readability, which can be understood as "what makes some texts easier to read than others" [Benjamin, 2012, Collins-Thompson, 2014, DuBay, 2004]. Text readability, which has been investigated in academic circles for a long time, is very close to the "to simplify or not to simplify" question in automatic text simplification. Text readability research has often attempted to devise mechanical methods to assess the reading difficulty of a text so that it can be measured objectively. Classical mechanical readability formulas combine a number of proxies to obtain a numerical score indicative of the difficulty of a text. These scores can be used to place texts at an appropriate grade level or to sort texts by difficulty.
2.1 INTRODUCTION
Collins-Thompson [2014], citing Dale and Chall [1948b], defines text readability as the sum of all elements in textual material that affect a reader's understanding, reading speed, and level of interest in the material. The ability to quantify the readability of a text has long been a topic of research, but current technology and the availability of massive amounts of text in electronic form have changed research in computational readability assessment considerably. Today's algorithms take advantage of advances in natural language processing, cognition, education, psycholinguistics, and linguistics ("all elements in textual material") to model a text in such a way that a machine learning algorithm can be trained to compute readability scores for texts.
Traditional readability measures were based on the semantic familiarity of words and the syntactic complexity of sentences. Proxies to measure such elements are, for example, the number of syllables in a word or the average number of words per sentence. Most traditional approaches used averages over the set of basic elements (words or sentences) in the text, disregarding order and therefore discourse phenomena. The limitations of early approaches were always clear: words with many syllables are not necessarily complex (children, for example, are probably able to read or understand complicated dinosaur names or names of Star Wars characters before they acquire more common words), and short sentences are not necessarily easy to understand (verses of poetry, for example). Also, traditional formulas were usually designed for texts that were well formatted (not web data) and relatively long.
Most methods depend on the availability of graded corpora in which documents are annotated with grade levels. The grades can be either categorical or ordinal, giving rise to either classification or regression approaches. When classification is applied, precision, recall, f-score, and accuracy can be used to measure performance and compare different approaches. When regression is applied, the Root Mean Squared Error (RMSE) or a correlation coefficient can be used to evaluate performance. In the case of regression, assigning a grade of 4 to a 5th-grade text (a 1-point difference) is a less serious mistake than assigning a grade of 7 to the same text (a 2-point difference), and RMSE reflects the magnitude of such errors. Collins-Thompson [2014] presents an overview of the groups of features that have been accounted for in the readability literature, including:
• lexico-semantic (vocabulary) features: relative word frequencies, type/token ratio, probabilistic language model measures such as text probability, perplexity, etc., and word maturity measures;
• psycholinguistic features: word age-of-acquisition, word concreteness, polysemy, etc.;
• syntactic features (designed to model sentence processing time): sentence length, parse tree height, etc.;
• discourse features (designed to model a text's cohesion and coherence): coreference chains, named entities, lexical tightness, etc.; and
• semantic and pragmatic features: use of idioms, cultural references, text type (opinion, satire, etc.), etc.
Collins-Thompson argues that in readability assessment the model used, that is, the features, seems to matter more than the particular machine learning approach chosen. A well-designed set of features can go a long way in readability assessment.
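To make this concrete, the sketch below (in Python; the function and feature names are illustrative, not taken from any particular system) extracts three shallow proxies drawn from the lexico-semantic and syntactic groups above. A feature vector of this kind, computed over a graded corpus, is what a classifier or regressor would be trained on.

```python
import re

def shallow_readability_features(text):
    """Extract a few shallow readability proxies from raw text.

    A minimal sketch: tokenization is deliberately naive (regex-based);
    a real system would use an NLP pipeline to obtain sentence splits,
    parse trees, coreference chains, etc.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        # syntactic proxy: longer sentences take longer to process
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # lexico-semantic proxy: vocabulary variety
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        # lexical proxy: longer words tend to be rarer and less familiar
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
    }
```

Richer feature groups (psycholinguistic, discourse, pragmatic) would require external resources such as age-of-acquisition norms or a full NLP pipeline, but they feed into the same kind of feature vector.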
2.2 READABILITY FORMULAS
DuBay [2004] points out that over 200 readability formulas existed by the 1980s. Many of them have been empirically tested to assess their predictive power, usually by correlating their outputs with the grade levels associated with sets of texts.
Two of the most widely used readability formulas are the Flesch Reading Ease Score [Flesch, 1949] and the Flesch-Kincaid readability formula [Kincaid et al., 1975]. The Flesch Reading Ease Score uses two text characteristics as proxies: the average sentence length (ASL) and the average number of syllables per word (ASW), which are combined in Formula (2.1):
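FRES = 206.835 − (1.015 × ASL) − (84.6 × ASW)    (2.1)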
On a given text the score produces a value roughly between 0 and 100, where the higher the value, the easier the text. Documents scoring around 30 are very difficult to read, while those scoring 70 should be easy to read.
The Flesch-Kincaid readability formula (2.2) simplifies the Flesch score to produce a "grade level," which is easily interpretable (i.e., a text with a grade level of eight according to the formula could be considered appropriate for an eighth grader):
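FKGL = (0.39 × ASL) + (11.8 × ASW) − 15.59    (2.2)

Both formulas are straightforward to implement. The following Python sketch computes them over raw text; it assumes naive regex-based sentence and word splitting and approximates syllables by counting vowel groups, so its outputs will deviate somewhat from implementations that use a pronunciation dictionary.

```python
import re

def count_syllables(word):
    # Rough heuristic: one syllable per maximal vowel group
    # ("readable" -> "ea", "a", "e" -> 3). Dictionary-based counters
    # (e.g., CMUdict) are more accurate.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_scores(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / max(len(sentences), 1)  # average sentence length
    asw = sum(count_syllables(w) for w in words) / max(len(words), 1)
    fres = 206.835 - 1.015 * asl - 84.6 * asw  # Reading Ease, Formula (2.1)
    fkgl = 0.39 * asl + 11.8 * asw - 15.59     # grade level, Formula (2.2)
    return fres, fkgl
```

Note that the two scores move in opposite directions: a higher FRES indicates an easier text, while a higher FKGL indicates a harder one (a higher grade level).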