Title: Natural Language Processing for the Semantic Web
Author: Diana Maynard
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on the Semantic Web: Theory and Technology
ISBN: 9781627056328
OpenNLP is an open-source machine learning-based toolkit for language processing, which uses maximum entropy and perceptron-based classifiers. It is freely available under the Apache license. Like Stanford CoreNLP, it can be run from the command line or called via a simple Java API. As in most other pipelines, the components further down the pipeline mainly rely on tokens and sentences; slightly unusually, however, the sentence splitter can be run either before or after the tokenizer.
NLTK [6] is an open-source Python-based toolkit, available under the Apache license, which is also very popular due to its simplicity and command-line interface, particularly where Java-based tools are not a requirement. It provides a number of different variants for some components, both rule-based and learning-based.
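To illustrate this simplicity, the following is a minimal sketch of NLTK's high-level tokenization and sentence-splitting functions (it assumes the nltk package is installed and its pre-trained English sentence model has been downloaded; the sample text and printed output are illustrative):

```python
import nltk

# One-off download of the pre-trained sentence model (newer NLTK releases
# name this resource punkt_tab rather than punkt).
nltk.download("punkt", quiet=True)

text = "Mr. Smith was charged with minor offences. He denied them all."

print(nltk.sent_tokenize(text))   # sentence splitting
# ['Mr. Smith was charged with minor offences.', 'He denied them all.']

print(nltk.word_tokenize(text)[:6])   # tokenization
# ['Mr.', 'Smith', 'was', 'charged', 'with', 'minor']
```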
In the rest of this chapter, we will describe the individual pipeline components in more detail, using the relevant tools from these pipelines as examples.
2.4 TOKENIZATION
Tokenization is the task of splitting the input text into very simple units, called tokens, which generally correspond to words, numbers, and symbols, and are typically separated by white space in English. Tokenization is a required step in almost any linguistic processing application, since more complex algorithms such as part-of-speech taggers mostly require tokens as their input, rather than using the raw text. Consequently, it is important to use a high-quality tokenizer, as errors are likely to affect the results of all subsequent NLP components in the pipeline. Commonly distinguished types of tokens are numbers, symbols (e.g., $, %), punctuation, and words of different kinds, e.g., uppercase, lowercase, mixed case. A representation of a tokenized sentence is shown in Figure 2.2, where each pink rectangle corresponds to a token.
Figure 2.2: Representation of a tokenized sentence.
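The idea can be made concrete with a small, self-contained sketch, which is not the implementation used by any of the tools discussed in this chapter: it simply splits a sentence into tokens with a regular expression, classifies each one, and records a few descriptive features of the kind discussed below (the example sentence is hypothetical):

```python
import re

# Toy tokenizer: each named alternative corresponds to one kind of token.
TOKEN_PATTERN = re.compile(r"""
      (?P<number>\d+(?:[.,:]\d+)*)   # numbers, including 3.5 or 07:56
    | (?P<word>\w+(?:'\w+)?)         # words, optionally containing an apostrophe
    | (?P<symbol>[$%£€+=&])          # symbols
    | (?P<punctuation>[^\w\s])       # any other non-space character
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        string = match.group()
        orth = ("upperInitial" if string[:1].isupper()
                else "lowercase" if string.islower()
                else "other")
        tokens.append({"string": string, "kind": match.lastgroup,
                       "length": len(string), "orth": orth})
    return tokens

for token in tokenize("The men were charged with minor offences."):
    print(token)
# ... {'string': 'offences', 'kind': 'word', 'length': 8, 'orth': 'lowercase'} ...
```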
Tokenizers may add a number of features describing the token. These include details of orthography (e.g., whether they are capitalized or not), and more information about the kind of token (whether it is a word, number, punctuation, etc.). Other components may also add features to the existing token annotations, such as their syntactic category, details of their morphology, and any cleaning or normalization (such as correcting a mis-spelled word). These will be described in subsequent sections and chapters. Figure 2.3 shows a token for the word offences in the previous example with some features added: the kind of token is a word, it is 8 characters long, and the orthography is lowercase.
Tokenizing well-written text is generally reliable and reusable, since it tends to be domain-independent. However, such general purpose tokenizers typically need to be adapted to work correctly with things like chemical formulae, twitter messages, and other more specific text types. Other non-standard cases are hyphenated words in English, which some tools treat as a single token and some tools treat as three (the two words, plus the hyphen itself). Some systems also perform a more complex tokenization that takes into account number combinations such as dates and times (for example, treating 07:56 as a single token). Other tools leave this to later processing stages, such as a Named Entity Recognition component. Another issue is the apostrophe: for example in cases where an apostrophe is used to denote a missing letter and effectively joins two words without a space between, such as it’s, or in French l’homme. German compound nouns suffer the opposite problem, since many words can be written together without a space. For German tokenizers, an extra module which splits compounds into their constituent parts can therefore be very useful, in particular for retrieval purposes. This extra segmentation module is critical to define word boundaries also for many East Asian languages such as Chinese, which have no notion of white space between words.
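NLTK, for instance, ships a dedicated tokenizer for Twitter messages, and a regular-expression tokenizer makes it straightforward to keep times or hyphenated words as single tokens; the following sketch illustrates both (the sample inputs and printed outputs are illustrative):

```python
from nltk.tokenize import TweetTokenizer, RegexpTokenizer

# A tokenizer tuned for tweets keeps hashtags, user mentions, and emoticons intact.
tweet = "Loving the #SemanticWeb :) @diana check this out!"
print(TweetTokenizer().tokenize(tweet))
# ['Loving', 'the', '#SemanticWeb', ':)', '@diana', 'check', 'this', 'out', '!']

# A custom pattern keeps times such as 07:56 and hyphenated words as single tokens.
pattern = r"\d{1,2}:\d{2}|\w+(?:-\w+)*|[^\w\s]"
print(RegexpTokenizer(pattern).tokenize("The 07:56 train was state-of-the-art."))
# ['The', '07:56', 'train', 'was', 'state-of-the-art', '.']
```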
Figure 2.3: Representation of a token with some features added.
Because tokenization generally follows a rigid set of constraints about what constitutes a token, pattern-based rule matching approaches are frequently used for these tools, although some tools do use other approaches. The OpenNLP TokenizerME, for example, is a trainable maximum entropy tokenizer. It uses a statistical model, based on a training corpus, and can be re-trained on a new corpus.
GATE’s ANNIE Tokenizer relies on a set of regular expression rules which are then compiled into a finite-state machine. It differs somewhat from most other tokenizers in that it maximizes efficiency by doing only very light processing, and enables greater flexibility by placing the burden of deeper processing on other components later in the pipeline, which are more adaptable. The generic version of the ANNIE tokenizer is based on Unicode and can be used for any language which has similar notions of token and white space to English (i.e., most Western languages). The tokenizer can be adapted for different languages either by modifying the existing rules, or by adding some extra post-processing rules. For English, a specialized set of rules is available, dealing mainly with use of apostrophes in words such as don’t.
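The effect of such an apostrophe rule can be illustrated with a generic post-processing sketch; this is not GATE’s rule language, merely the underlying idea of re-splitting contracted forms (the contraction list and token sequence are illustrative):

```python
# Re-split tokens such as "don't" into "do" + "n't" and "it's" into "it" + "'s".
CONTRACTIONS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

def split_clitics(token):
    lower = token.lower()
    for suffix in CONTRACTIONS:
        if lower.endswith(suffix) and len(token) > len(suffix):
            return [token[:-len(suffix)], token[-len(suffix):]]
    return [token]

tokens = ["They", "don't", "know", "if", "it's", "fair"]
print([part for tok in tokens for part in split_clitics(tok)])
# ['They', 'do', "n't", 'know', 'if', 'it', "'s", 'fair']
```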
The PTBTokenizer is an efficient, fast, and deterministic tokenizer, which forms part of the suite of Stanford CoreNLP tools. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name. Like the ANNIE Tokenizer, it works well for English and other Western languages, but works best on formal text. While deterministic, it uses some quite good heuristics, so as with ANNIE, it can usually decide when single quotes are parts of words, when full stops imply sentence boundaries, and so on. It is also quite customizable, in that there are a number of options that can be tweaked.
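NLTK contains its own re-implementation of Penn Treebank tokenization, which gives a feel for this style of output; the sketch below uses NLTK’s TreebankWordTokenizer rather than Stanford’s PTBTokenizer itself, and the printed output is illustrative:

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank-style tokenization: the contraction is split into two tokens,
# the currency symbol is separated, and the final full stop becomes its own token.
print(TreebankWordTokenizer().tokenize("They didn't pay the $5.50 fee."))
# ['They', 'did', "n't", 'pay', 'the', '$', '5.50', 'fee', '.']
```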
NLTK also has several tokenizers similar to ANNIE’s, including one based on regular expressions, written in Python.
2.5 SENTENCE SPLITTING
Sentence detection (or sentence splitting) is the task of separating text into its constituent sentences. This typically involves determining whether punctuation, such as full stops, commas, exclamation marks, and question marks, denote the end of a sentence or something else (quoted speech, abbreviations, etc.). Most sentence splitters use lists of abbreviations to help determine this: a full stop typically denotes the end of a sentence unless it follows an abbreviation such as Mr., or lies within quotation marks. Other issues involve determining sentence structure when line breaks are used, such as in addresses or in bulleted lists. Sentence splitters vary in how such things are handled.
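The role of the abbreviation list can be seen in NLTK’s Punkt sentence splitter, whose set of known abbreviations can be extended by hand (a minimal sketch; the abbreviations and sample text are illustrative):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Declare "mr" and "no" (as in "No. 10") as abbreviations, so the full stops
# that follow them are not treated as sentence boundaries.
params = PunktParameters()
params.abbrev_types = {"mr", "no"}

splitter = PunktSentenceTokenizer(params)
text = "Mr. Jones lives at No. 10 Downing Street. He moved there last year."
print(splitter.tokenize(text))
# ['Mr. Jones lives at No. 10 Downing Street.', 'He moved there last year.']
```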
More complex cases arise when the text contains tables, titles, formulae, or other formatting markup: these are usually the biggest source of error. Some splitters ignore these completely, requiring a punctuation mark as a sentence boundary. Others use two consecutive new lines or carriage returns as an indication of a sentence end, while there are also cases where even a single newline or carriage return character indicates the end of a sentence (e.g., comments in software code or bulleted/numbered lists which have one entry per line). GATE’s ANNIE sentence splitter actually provides several variants in order to let the user decide which is the most appropriate solution for their particular text. HTML formatting tags, Twitter hashtags, wiki syntax, and other such special text types are also somewhat problematic for general-purpose sentence splitters, which have been trained on well-written corpora, typically newspaper texts. Note that tokenization and sentence splitting are sometimes performed as a single combined step rather than sequentially.
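One of the newline-based variants mentioned above can be sketched as a simple wrapper rule: treat a blank line as a hard boundary, and only apply normal punctuation-based splitting within each block (an illustrative sketch rather than the behaviour of any particular tool; it reuses the pre-trained NLTK model from the earlier example):

```python
import re
from nltk.tokenize import sent_tokenize

def split_with_blank_lines(text):
    """Treat blank lines as hard boundaries, then split each block on punctuation."""
    sentences = []
    for block in re.split(r"\n\s*\n", text):    # a blank line ends a block
        block = " ".join(block.split())         # collapse remaining line breaks
        if block:
            sentences.extend(sent_tokenize(block))
    return sentences

text = """Introduction

This is the first sentence. This is the second,
wrapped over two lines."""
print(split_with_blank_lines(text))
# ['Introduction', 'This is the first sentence.',
#  'This is the second, wrapped over two lines.']
```

Note how the title "Introduction" becomes its own unit under this rule, which is often the desired behaviour for titles and list entries that carry no end-of-sentence punctuation.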