Title: Natural Language Processing for the Semantic Web
Author: Diana Maynard
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on the Semantic Web: Theory and Technology
ISBN: 9781627056328
The concluding chapter summarizes the main concepts described in the book, discusses the current state of the art and the major problems still to be overcome, and offers an outlook on the future.
CHAPTER 2
Linguistic Processing
2.1 INTRODUCTION
There are a number of low-level linguistic tasks which form the basis of more complex language processing algorithms. In this chapter, we first explain the main approaches used for NLP tasks, and the concept of an NLP processing pipeline, giving examples of some of the major open-source toolkits. We then describe in more detail the various linguistic processing components that are typically used in such a pipeline, and explain the role and significance of this pre-processing for Semantic Web applications. For each component in the pipeline, we describe its function and show how it connects with and builds on the previous components. At each stage, we provide examples of tools and describe their typical performance, along with some of the challenges and pitfalls associated with each component. Specific adaptations of these tools to non-standard text such as social media, and in particular Twitter, will be discussed in Chapter 8.
2.2 APPROACHES TO LINGUISTIC PROCESSING
There are two main kinds of approach to linguistic processing tasks: a knowledge-based approach and a learning approach, though the two may also be combined. There are advantages and disadvantages to each approach, summarized in Table 2.1.
Knowledge-based or rule-based approaches are largely the more traditional methods, and in many cases have been superseded by machine learning approaches now that processing vast quantities of data quickly and efficiently is less of a problem than in the past. Knowledge-based approaches are based on hand-written rules typically written by NLP specialists, and require knowledge of the grammar of the language and linguistic skills, as well as some human intuition. These approaches are most useful when the task can easily be defined by rules (for example: “a proper noun always starts with a capital letter”). Typically, exceptions to such rules can be easily encoded too. When the task cannot so easily be defined in this way (for example, on Twitter, people often do not use capital letters for proper nouns), then this method becomes more problematic. One big advantage of knowledge-based approaches is that it is quite easy to understand the results. When the system incorrectly identifies something, the developer can check the rules and find out why the error has occurred, and potentially then correct the rules or write additional rules to resolve the problem. Writing rules can, however, be quite time-consuming, and if specifications for the task change, the developer may have to rewrite many rules.
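To make this concrete, the following minimal sketch (our own illustration, not a rule taken from any particular toolkit) encodes the capitalization heuristic above as a pattern match over tokens, together with one encoded exception:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ProperNounRule {
    // Rule: a token consisting of a capital letter followed by
    // lower-case letters is a candidate proper noun.
    private static final Pattern PROPER_NOUN = Pattern.compile("\\p{Lu}\\p{Ll}+");

    // Exception: sentence-initial tokens are capitalized anyway,
    // so the first token of each sentence is skipped.
    public static List<String> candidates(List<String> sentenceTokens) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i < sentenceTokens.size(); i++) {
            if (PROPER_NOUN.matcher(sentenceTokens.get(i)).matches()) {
                result.add(sentenceTokens.get(i));
            }
        }
        return result;
    }
}

Because the rule is completely explicit, a developer can see at a glance why a given token was or was not matched, and refine the pattern accordingly; this transparency is exactly the advantage discussed above.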
Machine learning approaches have become more popular recently with the advent of powerful machines, and because no domain expertise or linguistic knowledge is required. One can set up a supervised system very quickly if sufficient training data is available, and get reasonable results with very little effort. However, acquiring or creating sufficient training data is often extremely problematic and time-consuming, especially if it has to be done manually. This dependency on training data also means that adaptation to new types of text, domain, or language is likely to be expensive, as it requires a substantial amount of new training data. Human-readable rules therefore tend to be easier to adapt to new languages and text types than statistical models. The problem of acquiring sufficient training data can be handled by incorporating unsupervised or semi-supervised methods for machine learning: these will be discussed further in Chapters 3 and 4. However, these typically produce less accurate results than supervised learning.
Table 2.1: Summary of knowledge-based vs. machine learning approaches to NLP
Knowledge-Based Systems                  | Machine Learning Systems
-----------------------------------------|-------------------------------------------------------
Based on hand-coded rules                | Based on statistics or other machine learning methods
Developed by NLP specialists             | Developers do not need NLP expertise
Make use of human intuition              | Require large amounts of training data
Results are easy to understand           | Causes of errors are hard to understand
Development can be very time-consuming   | Development is quick and easy
Changes may require rewriting rules      | Changes may require re-annotation
2.3 NLP PIPELINES
An NLP pre-processing pipeline, as shown in Figure 2.1, typically consists of the following components:
• Tokenization.
• Sentence splitting.
• Part-of-speech tagging.
• Morphological analysis.
• Parsing and chunking.
Figure 2.1: A typical linguistic pre-processing pipeline.
The first task is typically tokenization, followed by sentence splitting, to chop the text into tokens (typically words, numbers, punctuation, and spaces) and sentences respectively. Part-of-speech (POS) tagging assigns a syntactic category to each token. When dealing with multilingual text such as tweets, an additional step of language identification may first be added before these take place, as discussed in Chapter 8. Morphological analysis is not compulsory, but is frequently used in a pipeline, and essentially consists of finding the root form of each word (a slightly more sophisticated form of stemming or lemmatization). Finally, parsing and/or chunking tools may be used to analyze the text syntactically, identifying things like noun and verb phrases in the case of chunking, or performing a more detailed analysis of grammatical structure in the case of parsing.
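To make the first two stages concrete, here is a deliberately naive sketch of tokenization and sentence splitting (again our own illustration; production tokenizers handle abbreviations, URLs, contractions, and many other edge cases far more carefully):

import java.util.Arrays;
import java.util.List;

public class NaiveSplitter {
    // Tokenization: pad punctuation with spaces, then split on whitespace.
    public static List<String> tokenize(String text) {
        String spaced = text.replaceAll("([.,;:!?()])", " $1 ");
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    // Sentence splitting: break after sentence-final punctuation that is
    // followed by whitespace and a capital letter. Fails on "Dr. Smith".
    public static List<String> splitSentences(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        String text = "The cat sat on the mat. It purred.";
        System.out.println(splitSentences(text)); // [The cat sat on the mat., It purred.]
        System.out.println(tokenize(text));       // [The, cat, sat, on, the, mat, ., It, purred, .]
    }
}

The failure case noted in the comment is typical of the pitfalls each component must deal with: a period does not always end a sentence, and a capital letter does not always start one.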
Concerning toolkits, GATE [4] provides a number of open-source linguistic preprocessing components under the LGPL license. It contains a ready-made pipeline for Information Extraction, called ANNIE, and also a large number of additional linguistic processing tools such as a selection of different parsers. While GATE does provide functionality for machine learning-based components, ANNIE is mostly knowledge-based, making for easy adaptation. Additional resources can be added via the plugin mechanism, including components from other pipelines such as the Stanford CoreNLP Tools. GATE components are all Java-based, which makes for easy integration and platform independence.
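As an illustration of how ANNIE can be run from Java via the GATE Embedded API, the sketch below follows the classic loading pattern used in older (pre-8.5) GATE releases; plugin loading moved to Maven-based resolution in later versions, so the details should be treated as indicative rather than definitive:

import java.io.File;
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;

public class AnnieExample {
    public static void main(String[] args) throws Exception {
        Gate.init(); // requires a local GATE installation to be configured

        // Load the saved state of the ready-made ANNIE application.
        File gapp = new File(new File(Gate.getPluginsHome(), "ANNIE"),
                "ANNIE_with_defaults.gapp");
        SerialAnalyserController annie =
                (SerialAnalyserController) PersistenceManager.loadObjectFromFile(gapp);

        // Wrap the input text in a corpus and run the pipeline over it.
        Corpus corpus = Factory.newCorpus("example corpus");
        Document doc = Factory.newDocument("Diana Maynard works in Sheffield.");
        corpus.add(doc);
        annie.setCorpus(corpus);
        annie.execute();

        // Results are available as annotations on the document.
        System.out.println(doc.getAnnotations().get("Person"));
    }
}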
Stanford CoreNLP [5] is another open-source annotation pipeline framework, available under the GPL license, which can perform all the core linguistic processing described in this section via a simple Java API. One of the main advantages is that it can be used on the command line.
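As a brief sketch of that Java API (the choice of annotators here is ours; CoreNLP offers many more), a minimal pipeline covering tokenization, sentence splitting, POS tagging, and lemmatization can be set up as follows:

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class CoreNlpExample {
    public static void main(String[] args) {
        // Declare which pre-processing steps the pipeline should run, in order.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate a document, then walk through its sentences and tokens.
        Annotation document = new Annotation("The cats sat on the mat.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.printf("%s\t%s\t%s%n",
                        token.word(),                                             // surface form
                        token.get(CoreAnnotations.PartOfSpeechAnnotation.class),  // POS tag
                        token.get(CoreAnnotations.LemmaAnnotation.class));        // lemma
            }
        }
    }
}

Running this prints each token with its POS tag and lemma, one per line, e.g., "cats NNS cat".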