Natural Language Processing for the Semantic Web. Diana Maynard

      Figure 2.6 shows a parse tree generated using a dependency grammar, while Figure 2.7 shows one generated using a constituency grammar for the same sentence.

      Figure 2.6: Parse tree showing dependency relation.

      Figure 2.7: Parse tree showing constituency relation.

      The RASP statistical parser [18] is a domain-independent, robust parser for English. It includes its own tokenizer, POS tagger, and morphological analyzer and, like Minipar, requires the text to be already segmented into sentences. RASP is available under the LGPL license and can therefore also be used in commercial applications.

      The Stanford statistical parser [19] is a probabilistic parsing system. It provides either a dependency output or a phrase structure output. The latter can be viewed in its own GUI or through the user interface of GATE Developer. The Stanford parser comes with data files for parsing Arabic, Chinese, English, and German, and is licensed under the GNU GPL.

      The SUPPLE parser is a bottom-up parser that can produce a semantic representation of sentences, called simplified quasilogical form (SQLF). It has the advantage of being very robust, since it can still produce partial syntactic and semantic results for fragments even when a parse of the full sentence cannot be determined. This makes it particularly applicable for deriving semantic features for the machine learning–based extraction of semantic relations from large volumes of real text.

      Parsing algorithms can be computationally expensive and, like many linguistic processing tools, tend to work best on text similar to that on which they have been trained. Because parsing is a much more difficult task than some of the lower-level processing tasks, such as tokenization and sentence splitting, performance is also typically much lower, and this can have knock-on effects on any subsequent processing modules such as Named Entity recognition and relation finding. It is therefore sometimes better to sacrifice the increased knowledge provided by a parser for something more lightweight but reliable, such as a chunker, which performs a shallower kind of analysis. Chunkers, also sometimes called shallow parsers, recognize sequences of syntactically correlated words such as Noun Phrases but, unlike parsers, do not provide details of their internal structure or their role in the sentence.

      Tools for chunking can be subdivided into Noun Phrase (NP) Chunkers and Verb Phrase (VP) Chunkers. They vary less than parsing algorithms because the analysis is at a more coarse-grained level: they identify the relevant “chunks” of text but do not try to analyze them. However, they may differ in what they consider to be relevant for the chunk in question. For example, a simple Noun Phrase might consist of a consecutive string containing an optional determiner, one or more optional adjectives, and one or more nouns, as shown in Figure 2.8. A more complex Noun Phrase might also include a Prepositional Phrase or Relative Clause modifying it. Some chunkers include such things as part of the Noun Phrase, as shown in Figure 2.9, while others do not (Figure 2.10). This kind of decision is highly dependent on what the chunks will be used for later. For example, if they are used as input for a term recognition tool, it should be considered whether a term that contains a Prepositional Phrase is relevant or not. For ontology generation, such a term is probably not required, but for use as a target for sentiment analysis, it might be useful.
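      The simple Noun Phrase pattern described above (optional determiner, optional adjectives, one or more nouns) can be sketched in a few lines of Python. This is only an illustration, not the implementation of any tool discussed here: it assumes the input is already tokenized and tagged with Penn Treebank POS tags, and the function names and the include_pp flag are hypothetical. The flag mimics the design choice of attaching a following Prepositional Phrase to the Noun Phrase or leaving it out.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn Treebank tags are assumed

DETERMINERS = {"DT"}
ADJECTIVES = {"JJ", "JJR", "JJS"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}


def simple_np_spans(tagged: List[Token]) -> List[Tuple[int, int]]:
    """Find (start, end) spans matching: optional DT, any adjectives, 1+ nouns."""
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] in DETERMINERS:
            j += 1
        while j < n and tagged[j][1] in ADJECTIVES:
            j += 1
        noun_start = j
        while j < n and tagged[j][1] in NOUNS:
            j += 1
        if j > noun_start:  # at least one noun was found, so this is a chunk
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans


def np_chunks(tagged: List[Token], include_pp: bool = False) -> List[str]:
    """Return NP chunks; optionally merge 'NP + preposition + NP' into one chunk."""
    spans = simple_np_spans(tagged)
    if include_pp:
        merged: List[Tuple[int, int]] = []
        for start, end in spans:
            # Merge when exactly one token, a preposition (IN), separates two NPs.
            if merged and start - merged[-1][1] == 1 and tagged[merged[-1][1]][1] == "IN":
                merged[-1] = (merged[-1][0], end)
            else:
                merged.append((start, end))
        spans = merged
    return [" ".join(word for word, _ in tagged[s:e]) for s, e in spans]


sentence = [("The", "DT"), ("big", "JJ"), ("dog", "NN"), ("chased", "VBD"),
            ("a", "DT"), ("cat", "NN"), ("in", "IN"), ("the", "DT"), ("garden", "NN")]
print(np_chunks(sentence))                   # ['The big dog', 'a cat', 'the garden']
print(np_chunks(sentence, include_pp=True))  # ['The big dog', 'a cat in the garden']
```

      Note that the merged reading attaches in the garden to a cat purely because the two are adjacent; a chunker of this kind makes no attempt to decide whether that attachment is correct.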

      Figure 2.9: Complex NP chunking excluding PPs.

      Figure 2.10: Complex NP chunking including PPs.

      Verb Phrase chunkers delimit verbs, which may consist of a single word such as bought or a more complex group comprising modals, infinitives and so on (for example might have bought or to buy). They may even include negative elements such as might not have bought or didn’t buy. An example of chunker output combining both noun and verb phrase chunking is shown in Figure 2.11.

      Figure 2.11: Complex VP chunking.

      Some tools also provide additional chunks; for example, the TreeTagger [21] (trained on the Penn Treebank) can also generate chunks for prepositional phrases, adjectival phrases, adverbial phrases, and so on. These can be useful for building up a representation of the whole sentence without the requirement for full parsing.

      As we have already seen, linguistic processing tools are not infallible, even assuming that the components they rely on have generated perfect output. It may seem simple to create an NP chunker based on grammatical rules involving POS tags, but it can easily go wrong. Consider the two sentences I gave the man food and I bought the baby food. In the first case, the man and food are independent NPs which are respectively the indirect and direct objects of the verb gave. We can rephrase this sentence as I gave food to the man without any change in meaning, where it is clear these NPs are independent. In the second example, however, the baby food could be either a single NP which contains the compound noun baby food, or follow the same structure as the previous example (I bought food for the baby). An NP chunker which used the seemingly sensible pattern “Determiner + Noun + Noun” would not be able to distinguish between these two cases. In this case, a learning-based model might do better than a rule-based approach.
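      The problem can be made concrete with a small Python sketch. It assumes Penn Treebank POS tags assigned by hand, and the function name is hypothetical. Both sentences receive exactly the same tag sequence, so a pattern that looks only at tags has no basis for treating them differently.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tags here are assigned by hand


def match_det_noun_noun(tagged: List[Token]) -> List[str]:
    """Return every three-token window whose tags are exactly DT, NN, NN."""
    matches = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if [tag for _, tag in window] == ["DT", "NN", "NN"]:
            matches.append(" ".join(word for word, _ in window))
    return matches


gave = [("I", "PRP"), ("gave", "VBD"), ("the", "DT"), ("man", "NN"), ("food", "NN")]
bought = [("I", "PRP"), ("bought", "VBD"), ("the", "DT"), ("baby", "NN"), ("food", "NN")]

print(match_det_noun_noun(gave))    # ['the man food']  (wrong: two separate NPs)
print(match_det_noun_noun(bought))  # ['the baby food'] (plausible, but ambiguous)

# The tag sequences are identical, so the rule cannot tell the cases apart.
print([t for _, t in gave] == [t for _, t in bought])  # True
```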

      GATE provides both NP and VP chunker implementations. The NP Chunker is a Java implementation of the Ramshaw and Marcus BaseNP chunker [22], which operates on the POS tags of the words and uses transformation-based learning. The output from this version is identical to the output of the original C++/Perl version.

      The GATE VP chunker is written in JAPE, GATE’s rule-writing language, and is based on grammar rules for English [23, 24]. It contains rules for the identification of non-recursive verb groups, covering finite (is investigating), non-finite (to investigate), participles (investigated), and special verb constructs (is going to investigate). All the forms may include adverbials and negatives. One advantage of this tool is that it explicitly marks negation in verbs (e.g., don’t), which is extremely useful for other tasks such as sentiment analysis. The rules make use of POS tags as well as some specific strings (e.g., the word might is used to identify modals).
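      The idea of a rule-based verb group chunker that flags negation can be illustrated with a short Python sketch. This is not the GATE grammar and does not use JAPE; it is a deliberately simplified stand-in that assumes Penn Treebank tags and Penn-style tokenization (didn’t split into did and n't), with hypothetical names throughout. A group starts at a verb, modal, or to, extends over verbs, modals, to, and adverbs, and is trimmed so that it ends on a verb or modal.

```python
from typing import Dict, List, Tuple, Union

Token = Tuple[str, str]  # (word, POS tag); Penn Treebank tags are assumed

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}
LINK_TAGS = {"TO", "RB"}             # may occur inside a verb group
NEGATORS = {"not", "n't", "never"}   # strings treated as negation markers


def verb_groups(tagged: List[Token]) -> List[Dict[str, Union[str, bool]]]:
    """Return non-recursive verb groups, each with an explicit negation flag."""
    groups = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] in VERB_TAGS or tagged[i][1] == "TO":
            j = i
            while j < n and tagged[j][1] in VERB_TAGS | LINK_TAGS:
                j += 1
            end = j
            # Trim trailing adverbs or 'to' so the group ends on a verb or modal.
            while end > i and tagged[end - 1][1] not in VERB_TAGS:
                end -= 1
            if end > i:
                words = [word for word, _ in tagged[i:end]]
                groups.append({
                    "text": " ".join(words),
                    "negated": any(w.lower() in NEGATORS for w in words),
                })
            i = j  # j > i here, because tagged[i] itself matched
        else:
            i += 1
    return groups


s1 = [("She", "PRP"), ("might", "MD"), ("not", "RB"), ("have", "VB"),
      ("bought", "VBN"), ("it", "PRP")]
s2 = [("They", "PRP"), ("bought", "VBD"), ("food", "NN")]
print(verb_groups(s1))  # [{'text': 'might not have bought', 'negated': True}]
print(verb_groups(s2))  # [{'text': 'bought', 'negated': False}]
```

      Recording negation as a separate flag, rather than leaving it buried in the chunk text, is what makes the output convenient for downstream tasks such as sentiment analysis.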

      OpenNLP’s chunker uses a pre-packaged English maximum entropy model. Unlike GATE, whose two chunkers are independent, it analyses the text one sentence at a time and produces both NP and VP chunks in one go, based on the POS tags of the words. The OpenNLP chunker is easily retrainable, making it easy to adapt to new domains and text types if one has a suitable pre-annotated