
Table 2.3 (continued): The Penn TreeBank tagset

     6  IN    Preposition or subordinating conjunction
     7  JJ    Adjective
     8  JJR   Adjective, comparative
     9  JJS   Adjective, superlative
    10  LS    List item marker
    11  MD    Modal
    12  NN    Noun, singular or mass
    13  NNS   Noun, plural
    14  NNP   Proper noun, singular
    15  NNPS  Proper noun, plural
    16  PDT   Predeterminer
    17  POS   Possessive ending
    18  PRP   Personal pronoun
    19  PRP$  Possessive pronoun
    20  RB    Adverb
    21  RBR   Adverb, comparative
    22  RBS   Adverb, superlative
    23  RP    Particle
    24  SYM   Symbol
    25  TO    To
    26  UH    Interjection
    27  VB    Verb, base form
    28  VBD   Verb, past tense
    29  VBG   Verb, gerund or present participle
    30  VBN   Verb, past participle
    31  VBP   Verb, non-3rd person singular present
    32  VBZ   Verb, 3rd person singular present
    33  WDT   Wh-determiner
    34  WP    Wh-pronoun
    35  WP$   Possessive wh-pronoun
    36  WRB   Wh-adverb
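      These are the tags produced by most off-the-shelf English taggers. As a quick illustration, a minimal sketch using NLTK's default tagger follows; the resource names ("punkt", "averaged_perceptron_tagger") are an assumption and may vary across NLTK versions:

    import nltk

    # One-time downloads of the tokenizer and tagger models
    # (names may differ in newer NLTK releases).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The dog barked loudly.")
    print(nltk.pos_tag(tokens))
    # Typical output, using the Penn TreeBank tags from Table 2.3:
    # [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB'), ('.', '.')]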

       Methods for Part-of-speech Taggers

      Horsmann and Zesch [2016] trained a CRF classifier [Lafferty et al., 2001] using the FlexTag tagger [Zesch and Horsmann, 2016]. Their method involves two adaptations. The first is a general domain adaptation: the researchers applied a domain adaptation strategy, which they proposed as a competitive model, to improve the accuracy of tagging social media texts. To train their model, they used the CMC and Web corpora subsets from the EmpiriST shared task and an additional 100,000 tokens of newswire text from the Tiger corpus. The second adaptation is specific to the EmpiriST shared task. Because some POS tags are too rare to be learned from the training data, the researchers added a post-processing step based on heuristics: regular expressions and word lists drawn from Wikipedia and Wiktionary were used to improve named entity recognition and case-insensitive matching. Because selecting tags from the larger Tiger corpus introduced a bias, they also added extra Boolean features to their model.
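      The general shape of such a CRF tagger can be sketched with the sklearn-crfsuite library (an illustrative substitute, not the authors' FlexTag system; the features shown are minimal assumptions):

    import sklearn_crfsuite

    def token_features(sent, i):
        """Simple per-token features; real systems add many more."""
        word = sent[i]
        return {
            "word.lower": word.lower(),
            "suffix3": word[-3:],           # character suffix, useful for morphology
            "is_upper": word.isupper(),
            "is_title": word.istitle(),
            "is_digit": word.isdigit(),
            "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
            "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
        }

    # Toy training data: token sequences paired with tag sequences.
    train_sents = [["I", "love", "cats"], ["She", "runs", "fast"]]
    train_tags = [["PRP", "VBP", "NNS"], ["PRP", "VBZ", "RB"]]

    X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, train_tags)

    test = ["He", "loves", "dogs"]
    print(crf.predict([[token_features(test, i) for i in range(len(test))]]))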

      Deep learning-based POS taggers have become easy to build: they directly transform sequences of words into sequences of POS tags. For example, Popov [2016] surveys the techniques that can be applied, starting with word embeddings and enhancing them with suffix embeddings.
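      A minimal sketch of such a tagger in PyTorch follows; the architecture (concatenated word and suffix embeddings feeding a BiLSTM), the dimensions, and the vocabulary sizes are all illustrative assumptions, not Popov's actual model:

    import torch
    import torch.nn as nn

    class NeuralTagger(nn.Module):
        def __init__(self, n_words, n_suffixes, n_tags, w_dim=100, s_dim=20, hidden=128):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, w_dim)
            self.suf_emb = nn.Embedding(n_suffixes, s_dim)  # e.g., last-3-character suffixes
            self.lstm = nn.LSTM(w_dim + s_dim, hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_tags)

        def forward(self, word_ids, suffix_ids):
            # word_ids, suffix_ids: (batch, seq_len) integer tensors
            x = torch.cat([self.word_emb(word_ids), self.suf_emb(suffix_ids)], dim=-1)
            h, _ = self.lstm(x)
            return self.out(h)  # (batch, seq_len, n_tags) tag scores

    # Toy usage: one 3-token sentence with made-up vocabulary indices.
    model = NeuralTagger(n_words=5000, n_suffixes=500, n_tags=36)
    scores = model(torch.tensor([[1, 2, 3]]), torch.tensor([[10, 20, 30]]))
    print(scores.argmax(dim=-1))  # predicted tag index per token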

       Evaluation Measures for Part-of-speech Taggers

      The accuracy of the tagging is usually measured as the proportion of correctly assigned tags out of the total number of words/tokens being tagged.
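      In code, this measure is a one-liner; the following is a generic sketch, not tied to any particular tagger:

    def tagging_accuracy(predicted_tags, gold_tags):
        """Fraction of tokens whose predicted tag matches the gold-standard tag."""
        assert len(predicted_tags) == len(gold_tags)
        correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
        return correct / len(gold_tags)

    print(tagging_accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "VBD"]))  # 0.666...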

       Adapting Part-of-speech Taggers

      POS taggers clearly need re-training in order to be usable on social media data. Even the set of POS tags must be extended to adapt to the needs of this kind of text. Ritter et al. [2011] used the Penn TreeBank tagset (Table 2.3) to annotate 800 Twitter messages. They added a few new tags for Twitter-specific phenomena: retweets, @usernames, #hashtags, and URLs. Words in these categories can be tagged with very high accuracy using simple regular expressions (see the sketch below), but they still need to be taken into consideration as features when re-training the taggers (for example, as the tag of the previous word when tagging the current one). In Ritter et al. [2011], POS tagging accuracy drops from about 97% on newspaper text to 80% on the 800 tweets. These numbers are reported for the Stanford POS tagger [Toutanova et al., 2003]. Their POS tagger, T-POS, based on a Conditional Random Field classifier and on the clustering of out-of-vocabulary (OOV) words, also obtained low performance on Twitter data (81%). By re-training the T-POS tagger on the annotated Twitter data (which is rather small), the accuracy increases to 85%. The best accuracy, 88%, is reached when the training data is enlarged by adding to the Twitter data the initial Penn TreeBank training data, plus 40,000 tokens of annotated Internet Relay Chat (IRC) data [Forsyth and Martell, 2007], which is similar in style to Twitter data. Similar numbers are reported by Derczynski et al. [2013b] on a part of the same Twitter dataset.
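      A sketch of such regular expressions follows; the patterns and the tag names (USR, HT, URL, RT) are illustrative assumptions, not Ritter et al.'s exact rules:

    import re

    # Illustrative patterns for Twitter-specific tokens; real rules vary by system.
    PATTERNS = [
        ("USR", re.compile(r"^@\w+$")),                # @username mentions
        ("HT",  re.compile(r"^#\w+$")),                # #hashtags
        ("URL", re.compile(r"^https?://\S+$", re.I)),  # URLs
        ("RT",  re.compile(r"^RT$")),                  # retweet marker
    ]

    def twitter_pretag(token):
        """Return a Twitter-specific tag for a token, or None if no rule applies."""
        for tag, pattern in PATTERNS:
            if pattern.match(token):
                return tag
        return None

    for tok in ["RT", "@alice", "#nlproc", "http://example.com", "hello"]:
        print(tok, "->", twitter_pretag(tok))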

      A key reason for the drop in accuracy on Twitter data is that the data contains far more OOV words than grammatical text. Many of these OOV words come from spelling variation, e.g., the use of the word "n" for "in" in Example 3 from Table 2.1. The tag for proper nouns (NNP) is the most frequently assigned tag for OOV words, while in fact only about one third of them are proper nouns.
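      Measuring this effect is straightforward; a minimal sketch, assuming a vocabulary set built from the tagger's training data:

    def oov_rate(tokens, vocabulary):
        """Fraction of tokens not seen in the training vocabulary, plus the OOV list."""
        oov = [t for t in tokens if t.lower() not in vocabulary]
        return len(oov) / len(tokens), oov

    vocab = {"i", "am", "going", "to", "the", "store", "in", "town"}
    rate, oov = oov_rate("im goin 2 the store n town".split(), vocab)
    print(rate, oov)  # high OOV rate driven by spelling variation: im, goin, 2, n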

      Gimpel et al. [2011] developed a new POS tagset for Twitter (see Table 2.4) that is more coarse-grained and pays particular attention to punctuation, emoticons, and Twitter-specific tags (@usernames, #hashtags, URLs). They manually tagged 1,827 tweets with the new tagset; then they trained a POS tagging model that uses features geared toward Twitter text. The experiments conducted to evaluate the model showed 90% accuracy on the POS tagging task. Owoputi et al. [2013] improved on this model by using word clustering techniques and trained the POS tagger on a better dataset of tweets and chat messages. Some of the expressions used in Twitter messages are formal, and some are informal. Therefore, POS tagging for