Natural Language Processing for the Semantic Web. Diana Maynard

      NLTK and Stanford CoreNLP do not provide any chunkers, although they could be created using rules and/or machine learning from the other components (such as POS tags) in the relevant toolkit.
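To illustrate how a chunker can be layered on top of a POS tagger's output, the following is a minimal rule-based noun phrase chunker sketched in plain Python. The grammar (an optional determiner, any adjectives, then one or more nouns) and the tag-to-character mapping are deliberately simplistic assumptions for illustration; tag names follow the Penn Treebank convention used by these toolkits.

```python
import re

def np_chunks(tagged):
    """Rule-based NP chunker over (token, POS tag) pairs.
    Grammar: DT? JJ* NN+ (optional determiner, adjectives, nouns)."""
    # Map each POS tag to a single character so a regex can express the grammar.
    def code(tag):
        if tag == "DT":
            return "d"
        if tag.startswith("JJ"):
            return "j"
        if tag.startswith("NN"):
            return "n"
        return "x"
    codes = "".join(code(tag) for _, tag in tagged)
    # Each regex match over the tag string corresponds to one NP chunk.
    return [" ".join(w for w, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"d?j*n+", codes)]

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(np_chunks(tagged))  # ['the quick brown fox', 'the lazy dog']
```

A machine-learned chunker would replace the hand-written grammar with a model trained on chunk-annotated data, but the principle of building on POS tagger output is the same.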

      In this chapter we have introduced the idea of an NLP pipeline and described the main components, with reference to some of the widely used open-source toolkits. It is important to note that while performance in these low-level linguistic processing tasks is generally high, the tools do vary in performance, not just in accuracy, but also in the way in which they perform the tasks and their output, due to adhering to different linguistic theories. It is therefore critical when selecting pre-processing tools to understand what is required by other tools downstream in the application. While mixing and matching of some tools is possible (particularly in frameworks such as GATE, which are designed precisely with interoperability in mind), compatibility between different components may be an issue. This is one of the reasons why there are several different toolkits available offering similar but slightly different sets of tools. On the performance side, it is also important to be aware of the effect of changing domain and text type, and whether the tools are easily modifiable or not if this is necessary. In particular, moving from tools trained on standard newswire to processing social media text can be problematic; this is discussed in detail in Chapter 8. Similarly, some tools can be adapted easily to new languages (in particular, the first components in the chain such as tokenizers), while more complex tools such as parsers may be more difficult to adapt. In the following chapter, we introduce the task of Named Entity Recognition and show how the linguistic processing tools described in this chapter can be built on to accomplish this.

       1 http://opennlp.apache.org/index.html

       2 http://incubator.apache.org/opennlp/documentation/manual/opennlp.html

       3 http://gate.ac.uk

 4 A good explanation of Unicode can be found at http://www.unicode.org/standard/WhatIsUnicode.html.

       5 http://nlp.stanford.edu/software/tokenizer.shtml

       6 http://www.nltk.org/

       7 http://www.cs.ualberta.ca/~lindek/minipar.htm

       8 http://nlp.stanford.edu/software/srparser.shtml

      CHAPTER 3

       Named Entity Recognition and Classification

       3.1 INTRODUCTION

      As discussed in Chapter 1, information extraction is the process of extracting information from unstructured text and turning it into structured data. Central to this is the task of named entity recognition and classification (NERC), which involves the identification of proper names in texts (NER), and their classification into a set of predefined categories of interest (NEC). Unlike the pre-processing tools discussed in the previous chapter, which deal with syntactic analysis, NERC is about automatically deriving semantics from textual content. The traditional core set of named entities, developed for the shared NERC task at MUC-6 [25], comprises Person, Organization, Location, and Date and Time expressions, such as Barack Obama, Microsoft, New York, 4th July 2015, etc.

NERC is generally framed as an annotation task, i.e., annotating a text with named entities (NEs), but it can also involve simply producing a list of NEs, which may then be used for other purposes, including creating or extending gazetteers to assist with the NE annotation process in future. It can be subdivided into two tasks: the recognition task, involving identifying the boundaries of an NE (typically referred to as NER); and named entity classification (NEC), involving detecting the class or type of the NE. Slightly confusingly, NER is often used to mean the combination of the two tasks, especially in older work; here we stick to using NERC for the combined task and NER for only the recognition element. For more fine-grained NEC than the standard Person, Organization, and Location classification, classes are often taken from an ontology schema and are subclasses of these [26]. The main challenge for NEC is that NEs can be highly ambiguous (e.g., “May” can be a person’s name or a month of the year; “Mark” can be a person’s name or a common noun). Partly for this reason, the two tasks of NER and NEC are typically solved as a single task.

      A further task regarding named entities is named entity linking (NEL). The NEL task is to recognize if a named entity mention in a text corresponds to any NEs in a reference knowledge base. A named entity mention is an expression in the text referring to a named entity: this may be under different forms, e.g., “Mr. Smith” and “John Smith” are both mentions (textual representations) of the same real-world entity, expressed by slightly different linguistic realizations. The reference knowledge base used is typically Wikipedia. NEL is even more challenging than NEC because distinctions do not only have to be made on the class-level, but also within classes. For example, there are many persons with the name “John Smith.” The more popular the names are, the more difficult the NEL task becomes. A further problem, which all knowledge base–related tasks have, is that knowledge bases are incomplete; for example, they will only contain the most famous people named “John Smith.” This is particularly challenging when working on tasks involving recent events, since there is often a time lag between newly emerging entities appearing in the news or on social media and the updating of knowledge bases with their information. More details on named entity linking, along with relevant reference corpora, are given in Chapter 5.
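The first step of NEL, candidate generation, can be sketched as a lookup from surface forms to knowledge-base entries. The toy knowledge base below is invented for illustration: it shows both the within-class ambiguity problem (two candidates for “John Smith”) and the incompleteness problem (no candidates at all for an entity missing from the KB).

```python
# Invented mini knowledge base: surface form -> candidate KB entries.
KB = {
    "john smith": ["John_Smith_(explorer)", "John_Smith_(economist)"],
    "mr. smith": ["John_Smith_(explorer)"],
    "barack obama": ["Barack_Obama"],
}

def candidates(mention):
    """Return KB entries whose known surface forms match the mention."""
    return KB.get(mention.lower(), [])

print(candidates("John Smith"))  # two candidates: the mention is ambiguous
print(candidates("Jane Doe"))    # []  -- the KB is incomplete
```

A full NEL system would follow candidate generation with a disambiguation step that ranks candidates using context, but even this sketch shows why popular names and newly emerging entities make the task hard.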

      The reason that Person, Organization, Location, Date, and Time have become so popular as standard types of named entity is due largely to the Message Understanding Conference series (MUC) [25], which introduced the Named Entity Recognition and Classification task in 1995 and which drove the initial development of many systems which are still in existence today. Due to the expansion of NERC evaluation efforts (described in more detail in Section 3.3) and the need for using NERC tools in real-life applications, other kinds of proper nouns and expressions gradually also started to be considered as named entities, according to the task, such as newspapers, monetary amounts, and more fine-grained classifications of the above, such as authors, music bands, football teams, TV programs, and so on. NERC is the starting point for many more complex applications and tasks such as ontology building, relation extraction, question answering, information extraction, information retrieval, machine translation, and semantic annotation. With the advent of open information extraction scenarios focusing on the whole of the web, analysis of social media where new entities emerge constantly, and named entity linking tasks, the range of entities extracted has widened dramatically, which has brought many new challenges (see for example Section 4.4, where the role of knowledge bases for Named Entity Linking is discussed). Furthermore, the standard kind of 5- or 7-class entity recognition problem is now often less useful, which in turn means that new paradigms are required. In some cases, such as the recognition of Twitter user names, the distinction between traditional classes, such as Organization and Location, has become blurred even for a human, and is no longer always useful (see Chapter 8).

      Defining what exactly should constitute each entity type is never easy,