Natural Language Processing for the Semantic Web. Diana Maynard


Words are assumed to follow default rules. The English morphological analyzer in GATE is rule-based, with the rule language (flex) supporting rules and variables that can be used in regular expressions in the rules. POS tags can be taken into account if desired, depending on a configuration parameter. The analyzer takes as input a tokenized document and, considering one token and its POS tag at a time, identifies its lemma and affix. These values are then added as features of the token.
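
The token-plus-POS-to-lemma-and-affix step can be pictured as below. This is an illustrative sketch only: the rule table and the `analyse` function are invented for this example and do not reproduce GATE's actual flex rules, which handle spelling changes such as e-restoration far more generally.

```python
# Sketch of a rule-based morphological analysis step: given a token and its
# POS tag, produce a root (lemma) and affix, the two features a GATE-style
# analyzer attaches to the token. Rules here are invented for illustration.

RULES = [
    # (POS, suffix to match, affix feature, replacement for the suffix)
    ("VBG", "ving", "ing", "ve"),   # driving -> root "drive", affix -ing
    ("NNS", "s",    "s",   ""),     # drivers -> root "driver", affix -s
]

def analyse(token, pos):
    for rule_pos, suffix, affix, repl in RULES:
        if pos == rule_pos and token.endswith(suffix):
            # strip the matched ending and restore any deleted characters
            return {"root": token[: -len(suffix)] + repl, "affix": affix}
    return {"root": token, "affix": None}   # default rule: word is its own root

print(analyse("driving", "VBG"))   # → {'root': 'drive', 'affix': 'ing'}
print(analyse("drivers", "NNS"))   # → {'root': 'driver', 'affix': 's'}
```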

      The Stanford Morphology tool also uses a rule-based approach, implemented as a finite-state transducer written in flex. Unlike the GATE tool, however, it requires POS tags as well as tokens, and it generates lemmas but not affixes.

      NLTK provides an implementation of morphological analysis based on WordNet’s built-in morphy function. WordNet [16] is a large lexical database of English resembling a thesaurus, in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The morphy function allows users to map an inflected form to a base form listed in WordNet. It uses a rule-based method involving lists of inflectional endings, organized by syntactic category, together with an exception list for each category in which irregular inflected forms are looked up directly. Like the Stanford tool, it returns only the lemma, not the affix. Furthermore, it can only handle words present in WordNet.
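
The morphy strategy described above—an exception list consulted first, then lists of detachable endings per syntactic category, with every candidate checked against the lexicon—can be sketched as follows. The tiny lexicon and rule lists here are invented stand-ins; the real function consults the full WordNet database.

```python
# Sketch of WordNet-style morphy lookup: exception list plus inflectional-
# ending rules per syntactic category, validated against a (tiny, invented)
# lexicon standing in for WordNet.

LEXICON = {"noun": {"dog", "box"}, "verb": {"run", "drive", "be"}}
EXCEPTIONS = {"verb": {"was": "be", "drove": "drive"}}       # irregular forms
DETACHMENTS = {
    "noun": [("ses", "s"), ("es", ""), ("s", "")],
    "verb": [("ing", "e"), ("ing", ""), ("ed", ""), ("s", "")],
}

def morphy(word, pos):
    if word in LEXICON[pos]:                  # already a base form
        return word
    if word in EXCEPTIONS.get(pos, {}):       # irregular inflection
        return EXCEPTIONS[pos][word]
    for suffix, repl in DETACHMENTS[pos]:     # try each ending in turn
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + repl
            if candidate in LEXICON[pos]:     # base form must be in the lexicon
                return candidate
    return None                               # word not covered by the lexicon

print(morphy("driving", "verb"))   # → drive   (detachment rule ing -> e)
print(morphy("drove", "verb"))     # → drive   (exception list)
print(morphy("boxes", "noun"))     # → box
```

The final `None` case mirrors the limitation noted above: words absent from WordNet cannot be lemmatized at all.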

      OpenNLP does not currently provide any tools for morphological analysis.

      Stemmers produce the stem form of each word, e.g., driving and drivers have the stem drive, whereas morphological analysis tends to produce the root/lemma forms of the words and their affixes, e.g., drive and driver for the above examples, with affixes -ing and -s respectively. There is much confusion about the difference between stemming and morphological analysis, due to the fact that stemmers can vary considerably in how they operate and in their output. In general, stemmers do not attempt to perform an analysis of the root or stem and its affix, but simply strip the word down to its stem. The main way in which stemmers vary is in the presence or absence of the constraint that the stem must also be a real word in the given language. Basic stemming algorithms simply strip off the affix, e.g., driving would be stripped to the stem driv- by removing the suffix -ing. The distinction between verbs and nouns is often not maintained, so both driver and driving would be stripped down to the stem driv-. Information retrieval (IR) systems often make use of this kind of suffix stripping, since it can be performed by a simple algorithm and does not require other linguistic pre-processing such as POS tagging. Stemming is useful for IR systems because it brings together lexico-syntactic variants of a word which have a common meaning (so one can use either the singular or plural form of a word in the search query, and it will match against either form in a web page). Note that unlike most morphological analysis tools, stemming tools may also consider variants arising from derivational morphology, since they ignore the syntactic category of the word. A further difference is that typically, stemmers do not refer to the context surrounding the word, but only to the word in isolation, while morphological analyzers may also use the context.
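
A basic suffix stripper of the kind just described can be sketched in a few lines. The suffix list below is invented and far smaller than, e.g., Porter's rule set; the point is only that stripping uses no POS tag, no lexicon check, and no affix analysis, so driving and drivers collapse to the same non-word stem.

```python
# A naive suffix-stripping stemmer: remove the first matching affix with no
# analysis and no requirement that the result be a real word. Illustrative
# suffix list only; real stemmers use much larger, ordered rule sets.

SUFFIXES = ["ing", "ers", "er", "s"]   # tried in order, longest first

def stem(word):
    for suffix in SUFFIXES:
        # crude length guard so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]   # just strip; no POS, no lexicon
    return word

print(stem("driving"), stem("drivers"))   # → driv driv
```

Note how the verb/noun distinction is lost, exactly as described above, whereas the morphological analyzer keeps driving and drivers apart as drive + -ing and driver + -s.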

      Figure 2.5 shows an example of how stemming and morphological analysis may differ. The stemmer in GATE strips off the derivational affix -ness, reducing the noun loudness to the base adjective loud, as shown by the stem feature. The morphological analyzer, on the other hand, is not concerned with derivational morphology, and leaves the word in its entirety, as shown by the root feature loudness and the zero affix.

      Figure 2.5: Comparison of stemming and morphological analysis in GATE.

      Suffix-stripping algorithms may differ in their results for a variety of reasons. One such reason is whether the algorithm requires the output to be a real word in the given language: some approaches do not require the stem to exist in the language’s lexicon (the set of all words in the language).

      The best-known stemming algorithm is the Porter Stemmer [17], which has been re-implemented in many forms. Because of the problems caused by the many divergent variants, Porter later invented the Snowball language, a small string-processing language designed specifically for creating stemming algorithms for use in Information Retrieval. A variety of useful open-source stemmers for many languages have since been created in Snowball. GATE provides a wrapper for a number of these, covering 11 European languages, while NLTK provides an implementation of them for Python. Because the stemmers are rule-based and, following Porter’s original approach, easy to modify, they are straightforward to combine with the other low-level linguistic components described previously in this chapter. OpenNLP and Stanford CoreNLP do not provide any stemmers.

      Syntactic parsing is concerned with analysing sentences to derive their syntactic structure according to a grammar. Essentially, parsing explains how different elements in a sentence are related to each other, such as how the subject and object of a verb are connected. There are many different syntactic theories in computational linguistics, which posit different kinds of syntactic structures. Parsing tools may therefore vary widely not only in performance but in the kind of representation they generate, based on the syntactic theory they make use of.

      Freely available wide-coverage parsers include the Minipar7 dependency parser, the RASP [18] statistical parser, the Stanford [19] statistical parser, and the general-purpose SUPPLE parser [20]. These are all available within GATE, so that the user can try them all and decide which is the most appropriate for their needs.

      Minipar is a dependency parser, i.e., it determines the dependency relationships between the words in a sentence. It processes the text one sentence at a time, and thus only needs a sentence splitter as a prerequisite. It works by identifying linguistic constructions and grammatical functions such as apposition, relative clauses, subjects and objects of verbs, and determiners, and how they relate to each other. Apposition is the construction where two noun phrases next to each other refer to the same thing, e.g., “my brother John,” or “Paris, the capital of France.” Relative clauses typically start with a relative pronoun (such as “who,” “which,” etc.) and modify a preceding noun, e.g., “who was wearing the hat” in the phrase “the man who was wearing the hat.”
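
A dependency analysis of the relative-clause example above can be represented as head–relation–dependent triples, one per word. The triples and relation names below are hand-written for illustration and do not reproduce Minipar's actual output format.

```python
# "the man who was wearing the hat" as hand-written dependency triples
# (head, relation, dependent); invented labels, for illustration only.

deps = [
    ("man", "det", "the"),
    ("man", "rel", "wearing"),      # the relative clause modifies "man"
    ("wearing", "subj", "who"),     # relative pronoun is the clause subject
    ("wearing", "aux", "was"),
    ("wearing", "obj", "hat"),
    ("hat", "det", "the"),
]

# e.g., recover everything the verb "wearing" governs
print(sorted(dep for head, rel, dep in deps if head == "wearing"))
# → ['hat', 'was', 'who']
```

Such triples are what make dependency output convenient for relation extraction: subject and object of a verb can be read off directly.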

      In contrast to dependency relations, constituency parsers are based on the idea of constituency relations, and may involve a number of different Constituency Grammar theories such as Phrase-Structure Grammars, Categorial Grammars and Lexical Functional Grammars, amongst others. The constituency relation is hierarchical and derives from the subject-predicate division of Latin and Greek grammars, where the basic clause structure is divided into the subject (noun phrase) and predicate (verb phrase). Further subdivisions of each are then made at a more fine-grained level.
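
The subject–predicate division can be written down as a nested bracketing: the clause splits into a noun phrase and a verb phrase, each subdivided in turn. The tree below is hand-bracketed for illustration, using a hypothetical toy sentence.

```python
# A constituency tree as nested (label, children...) tuples; hand-bracketed
# illustration of the subject (NP) / predicate (VP) division.

tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),             # subject
        ("VP", ("VB", "saw"),                             # predicate
               ("NP", ("DT", "the"), ("NN", "cat"))))

def words(node):
    """Read the sentence back off the leaves, left to right."""
    if isinstance(node[1], str):                          # pre-terminal leaf
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]

print(" ".join(words(tree)))   # → the dog saw the cat
```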

      A good example of a constituency parser is the Shift-Reduce Constituency Parser which is part of the Stanford CoreNLP Tools.8 Shift-and-reduce operations have long been used for dependency parsing with high speed and accuracy, but only more recently have they been used for constituency parsing. The Shift-Reduce parser aims to improve on older constituency parsers which used chart-based algorithms (dynamic programming) to find the highest scoring parse, which were accurate but very slow. The latest Shift-Reduce Constituency parser is faster than the previous Stanford parsers, while being more accurate than almost all of them.
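
The shift and reduce operations themselves can be sketched with a toy grammar: shift the next word (with its POS tag) onto a stack, then reduce whenever the top of the stack matches the right-hand side of a rule. The grammar, tag set, and greedy reduce-first strategy below are invented for illustration; the Stanford parser instead learns its shift/reduce decisions from treebank data rather than following fixed rules.

```python
# Minimal greedy shift-reduce constituency parser over an invented toy
# grammar; illustrative only, not the Stanford algorithm.

GRAMMAR = [                 # (LHS, RHS) rewrite rules
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
    ("S",  ("NP", "VP")),
]

def parse(tagged):
    stack = []                                # holds (label, content) pairs
    for word, tag in tagged:
        stack.append((tag, word))             # SHIFT the next word
        reduced = True
        while reduced:                        # REDUCE while any rule matches
            reduced = False
            for lhs, rhs in GRAMMAR:
                k = len(stack) - len(rhs)
                if k >= 0 and tuple(lbl for lbl, _ in stack[k:]) == rhs:
                    children = stack[k:]      # pop the matched constituents
                    del stack[k:]
                    stack.append((lhs, children))
                    reduced = True
                    break
    return stack                              # one item labeled S if parsed

result = parse([("the", "DT"), ("dog", "NN"),
                ("saw", "VB"), ("the", "DT"), ("cat", "NN")])
print(result[0][0])   # → S
```

Because each word is shifted once and reductions are local stack operations, parsing is linear-time in practice, which is the speed advantage over chart-based dynamic programming mentioned above.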
