Natural Language Processing for the Semantic Web. Diana Maynard


task rather than sequentially.

      Sentence splitters generally make use of already tokenized text. GATE’s ANNIE sentence splitter uses a rule-based approach based on GATE’s JAPE pattern-action rule-writing language [7]. The rules are based entirely on information produced by the tokenizer and some lists of common abbreviations, and can easily be modified as necessary. Several variants are provided, as mentioned above.
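The flavor of such a rule-based approach can be sketched in a few lines of Python. This is a simplified illustration, not ANNIE's actual JAPE rules, and the abbreviation list is a hypothetical stand-in for GATE's much fuller lists:

```python
# Hypothetical abbreviation list; a real splitter ships far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

def split_sentences(tokens):
    """Group already-tokenized text into sentences: a token ending in
    '.', '!' or '?' closes a sentence unless it is a known abbreviation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok[-1] in ".!?" and tok.lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = []
    if current:  # trailing material with no final punctuation
        sentences.append(current)
    return sentences

# e.g., split_sentences(["Dr.", "Smith", "arrived.", "He", "sat", "down."])
# keeps "Dr." inside the first sentence rather than splitting after it.
```

Because the rules and lists are explicit data rather than a trained model, this kind of splitter can be modified for a new domain simply by editing the abbreviation list, which is the flexibility the ANNIE splitter offers.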

      Unlike ANNIE, the OpenNLP sentence splitter is typically run before the tokenization module. It uses a machine learning approach, with the supplied models trained on untokenized text, although it is also possible to perform tokenization first and let the sentence splitter process the already tokenized text. One flaw in the OpenNLP splitter is that, because it cannot use the contents of a sentence to decide where its boundaries lie, it may make errors on articles with titles: the title is mistakenly treated as part of the first sentence.

      NLTK uses the Punkt sentence segmenter [8]. This uses a language-independent, unsupervised approach to sentence boundary detection, based on identifying abbreviations, initials, and ordinal numbers. Its abbreviation detection, unlike most sentence splitters, does not rely on precompiled lists, but is instead based on methods for collocation detection such as log-likelihood.
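The core of this abbreviation detection — testing whether a word type co-occurs with a following period more often than chance would predict — can be sketched with a Dunning-style log-likelihood ratio. This is an illustrative simplification of the statistic used in [8], not NLTK's actual implementation:

```python
import math

def _log_binom(k, n, p):
    # Log of the binomial likelihood, omitting the constant C(n, k) term,
    # which cancels when we take the ratio. Clamp p away from 0 and 1.
    p = min(max(p, 1e-10), 1 - 1e-10)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def abbreviation_llr(count_with_period, count_total, period_count, n_tokens):
    """Log-likelihood ratio comparing two hypotheses for a word type:
    null  -- a period follows it at the corpus-wide base rate;
    alt   -- a period follows it at its own observed rate.
    A large value suggests the word and the period form a collocation,
    i.e., the word is likely an abbreviation."""
    p_null = period_count / n_tokens
    p_alt = count_with_period / count_total
    null = _log_binom(count_with_period, count_total, p_null)
    alt = _log_binom(count_with_period, count_total, p_alt)
    return 2 * (alt - null)
```

A word like "etc" that is followed by a period in 50 of its 50 occurrences scores very high, while a word whose period rate matches the corpus base rate scores near zero — no precompiled abbreviation list required.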

      Stanford CoreNLP makes use of tokenized text and a set of binary decision trees to decide where sentence boundaries should go. As with the ANNIE sentence splitter, the main problem it tries to resolve is deciding whether a full stop denotes the end of a sentence or not.

      In some studies, the Stanford splitter scored the highest accuracy out of common sentence splitters, although performance will of course vary depending on the nature of the text. State-of-the-art sentence splitters such as the ones described score about 95–98% accuracy on well-formed text. As with most linguistic processing tools, each one has strengths and weaknesses which are often linked to specific features of the text; for example, some splitters may perform better on abbreviations but worse on quoted speech than others.

      Part-of-Speech (POS) tagging is concerned with tagging words with their part of speech, e.g., noun, verb, adjective. These basic linguistic categories are typically divided into quite fine-grained tags, distinguishing for instance between singular and plural nouns, and different tenses of verbs. For languages other than English, gender may also be included in the tag. The set of possible tags used is critical and varies between different tools, making interoperability between different systems tricky. One very commonly used tagset for English is the Penn Treebank (PTB) [9]; other popular sets include those derived from the Brown corpus [10] and the LOB (Lancaster-Oslo/Bergen) Corpus [11]. Figure 2.4 shows an example of some POS-tagged text, using the PTB tagset.

      Figure 2.4: Representation of a POS-tagged sentence.

      The POS tag is determined by taking into account not just the word itself, but also the context in which it appears. This is because many words are ambiguous, and reference to a lexicon is insufficient to resolve this. For example, the word love could be a noun or verb depending on the context (I love fish vs. Love is all you need).

      Approaches to POS tagging typically use machine learning, because it is quite difficult to describe all the rules needed for determining the correct tag given a context (although rule-based methods have been used). Some of the most common and successful approaches use Hidden Markov models (HMMs) or maximum entropy. The Brill transformational rule-based tagger [12], which uses the PTB tagset, is one of the most well-known taggers, used in several major NLP toolkits. It uses a default lexicon and ruleset acquired from a large corpus of training data via machine learning. Similarly, the OpenNLP POS tagger also uses a model learned from a training corpus to predict the correct POS tag from the PTB tagset. It can be trained with either a Maximum Entropy or a Perceptron-based model. The Stanford POS tagger is also based on a Maximum Entropy approach [13] and makes use of the PTB tagset. The TNT (Trigrams’n’Tags) tagger [14] is a fast and efficient statistical tagger using an implementation of the Viterbi algorithm for second-order Markov models.
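The HMM approach can be illustrated with a minimal Viterbi decoder over a toy first-order model. The tags are drawn from the PTB tagset, but the probabilities below are invented for illustration, not learned from a real corpus:

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Most likely tag sequence under a first-order HMM.
    start[t]: P(t at sentence start); trans[t1][t2]: P(t2 | t1);
    emit[t][w]: P(w | t). Unseen events get a tiny floor probability."""
    def lp(p):  # log-probability with smoothing floor
        return math.log(p if p > 0 else 1e-12)

    # Lattice: one dict per word, tag -> (best log score, backpointer).
    V = [{t: (lp(start.get(t, 0)) + lp(emit[t].get(words[0], 0)), None)
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prev = max(tags,
                       key=lambda p: V[-1][p][0] + lp(trans[p].get(t, 0)))
            row[t] = (V[-1][prev][0] + lp(trans[prev].get(t, 0))
                      + lp(emit[t].get(w, 0)), prev)
        V.append(row)
    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: V[-1][t][0])]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

tags = ["PRP", "VBP", "NN"]
start = {"PRP": 0.8, "NN": 0.2}
trans = {"PRP": {"VBP": 0.9, "NN": 0.1},
         "VBP": {"NN": 0.7, "PRP": 0.2},
         "NN": {"VBP": 0.4, "NN": 0.3}}
emit = {"PRP": {"I": 0.9},
        "VBP": {"love": 0.6},
        "NN": {"love": 0.2, "fish": 0.5}}
```

Here `viterbi(["I", "love", "fish"], tags, trans, emit, start)` tags the ambiguous word *love* as a verb (VBP), because the transition probabilities favor a verb after a pronoun — exactly the contextual disambiguation described above.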

      In terms of major NLP toolkits, some (such as Stanford CoreNLP) have their own POS taggers, as described above, while others use existing implementations or variants on them. For example, NLTK has Python implementations of the Brill tagger, the Stanford tagger, and the TNT tagger. GATE’s ANNIE English POS tagger [15] is a modified version of the Brill tagger trained on a large corpus taken from the Wall Street Journal. It produces a POS tag as an annotation on each word or symbol. One of the big advantages of this tagger is that the lexicon can easily be modified manually by adding new words or changing the value or order of the possible tags associated with a word. It can also be retrained on a new corpus, although this requires a large pre-tagged corpus of text in the relevant domain/genre, which is not easy to find.

      The accuracy of these general-purpose, reusable taggers is typically excellent (97–98%) on texts similar to those on which the taggers have been trained (mostly news articles). However, the accuracy can fall sharply when presented with new domains, genres, or noisier data such as social media. This can have a serious knock-on effect on other processes further down the pipeline such as Named Entity recognition, ontology learning via lexico-syntactic patterns, relation and event extraction, and even opinion mining, which all need reliable POS tags in order to produce high-quality results.

      Morphological analysis essentially concerns the identification and classification of the linguistic units of a word, typically breaking the word down into its root form and an affix. For example, the verb walked comprises a root form walk and an affix -ed. In English, morphological analysis is typically applied to verbs and nouns, because these may appear in the text as variants created by inflectional morphology. Inflectional morphology refers to the different forms of words reflected by mood, tense, number, and so on, such as the past tense of a verb or the plural of a noun. Inflection in English is typically expressed by adding a suffix to the root form (e.g., walk, walked, box, boxes) or by an internal modification such as a vowel change (e.g., run, ran, goose, geese). In other languages, prefixes (adding to the beginning of a word), infixes (adding in the middle of a word), and other changes may be used. Some morphological analysis tools treat these internal modifications as alternative representations of the default affix. What we mean by this is that if the plural of a noun is commonly represented by adding -s as a suffix, the output of the tool will show the value of the affix as -s even in the case of plural forms such as geese. Essentially, it treats an irregular vowel change form simply as a kind of surface representational variant of the standard affix. The GATE morphological analyzer, for example, depicts the word geese as having the root goose and affix -s.
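The behavior just described — reducing an inflected form to a root plus a default affix, with irregular forms treated as surface variants of that affix — might be sketched as follows. This is a toy approximation with an invented irregular-form table, not the GATE analyzer's actual rules:

```python
# Hypothetical irregular-form table; a real analyzer uses a large lexicon.
# Irregular vowel changes map to the *default* affix, e.g., geese -> -s.
IRREGULAR = {"geese": ("goose", "-s"),
             "men": ("man", "-s"),
             "ran": ("run", "-ed")}

def analyze(word):
    """Return (root, affix) for a noun or verb form."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies"):                    # ponies -> pony + -s
        return w[:-3] + "y", "-s"
    if w.endswith("es") and w[-3] in "sxz":  # boxes -> box + -s
        return w[:-2], "-s"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1], "-s"                  # walks -> walk + -s
    if w.endswith("ed"):
        return w[:-2], "-ed"                 # walked -> walk + -ed
    return w, ""                             # assume base form
```

Note how `analyze("geese")` yields `("goose", "-s")`, mirroring the GATE analyzer's treatment of irregular plurals as variants of the standard -s affix.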

      Typically, NLP tools which perform morphological analysis deal only with inflectional morphology, as described above, but do not handle derivational morphology. Derivation is the process of adding derivational morphemes, which create a new word from an existing word, usually involving a change in grammatical category (for example, creating the noun worker from the verb work, or the noun loudness from the adjective loud).
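The category-changing nature of derivation can be made concrete with a tiny sketch. The suffix-to-category table below is an illustrative assumption, not drawn from any particular tool:

```python
# Illustrative derivational suffixes and the category of the derived word.
DERIVATIONAL = {"-er": "noun",      # work (verb)  -> worker (noun)
                "-ness": "noun",    # loud (adj)   -> loudness (noun)
                "-ly": "adverb",    # quick (adj)  -> quickly (adverb)
                "-ize": "verb"}     # modern (adj) -> modernize (verb)

def derive(word, suffix):
    """Apply a derivational suffix, returning the new word and its
    (assumed) grammatical category."""
    if suffix not in DERIVATIONAL:
        raise ValueError("unknown suffix: " + suffix)
    return word + suffix[1:], DERIVATIONAL[suffix]
```

Unlike the inflectional case, the output here is a genuinely new lexical item with a new part of speech, which is why most analyzers leave derivation alone.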

      Morphological analyzers for English are often rule-based, since the majority of inflectional variants follow grammatical rules and set patterns (for example, plural nouns are typically created by adding -s or -es to the end of the singular noun). Exceptions can also be handled quite