Natural Language Processing for Social Media. Diana Inkpen
Чтение книги онлайн.

Читать онлайн книгу Natural Language Processing for Social Media - Diana Inkpen страница 17

СКАЧАТЬ co-reference. A text classifier is also available.15

      Open NLP includes tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution, implemented in Java. It also includes maximum entropy and perceptron-based machine learning algorithms.16

      FreeLing includes tools for English and several other languages: text tokenization, sentence splitting, morphological analysis, phonetic encoding, named entity recognition, POS tagging, chart-based shallow parsing, rule-based dependency parsing, nominal co-reference resolution, etc.17

      NLTK is a suite of text processing libraries in Python for classification, tokenization, stemming, POS tagging, parsing, and semantic reasoning.18

      GATE includes components for diverse language processing tasks, e.g., parsers, morphology, POS tagging. It also contains information retrieval tools, information extraction components for various languages, and many others. The information extraction system (ANNIE) includes a named entity detector.19

      NLPTools is a library for NLP written in PHP, geared toward text classification, clustering, tokenizing, stemming, etc.20

      Some components of these toolkits were re-trained for social media texts, such as the Stanford part-of-speech (POS) tagger by Derczynski et al. [2013b], and the OpenNLP chunker by Ritter et al. [2011], as we noted earlier.

      One toolkit that was fully adapted to social media text is GATE. A new module or plugin called TwitIE21 is available [Derczynski et al., 2013a] for tokenization of Twitter texts, as well as POS tagging, name entities recognition, etc.

      Two new toolkits were built especially for social media texts: the TweetNLP tools developed at CMU and the Twitter NLP tools developed at the University of Washington (UW).

      TweetNLP is a Java-based tokenizer and part-of-speech tagger for Twitter text [Owoputi et al., 2013]. It includes training data of manually labeled POS annotated tweets (that we noted above), a Web-based annotation tool, and hierarchical word clusters from unlabeled tweets.22 It also includes the TweeboParser mentioned above.

      The UW Twitter NLP Tools [Ritter et al., 2011] contain the POS tagger and the annotated Twitter data (mentioned above—see adaptation of POS taggers).23

      A few other tools for English are in development, and a few tools for other languages have been adapted or can be adapted to social media text. The development of the latter is slower, due to the difficulty in producing annotated training data for many languages, but there is progress. For example, a treebank for French social media texts was developed by Seddah et al. [2012].

      Social media messages are available in many languages. Some messages could be mixed, for example part in English and part in another language. This is called “code switching.” If tools for multiple languages are available, a language identification tool needs to be run on the texts before using the right language-specific tools for the next processing steps.

      Language identification can reach very high accuracy for long texts (98–99%), but it needs adaptation to social media texts, especially to short texts such as Twitter messages.

      Derczynski et al. [2013b] showed that language identification accuracy decreases to around 90% on Twitter data, and that re-training can lead to 95–97% accuracy levels. This increase is easily achievable for tools that classify into a small number of languages, while tools that classify into a large number of languages (close to 100 languages) cannot be further improved on short informal texts. Lui and Baldwin [2014] tested six language identification tools and obtained the best results on Twitter data by majority voting over three of them, up to an F-score of 0.89.

      Barman et al. [2014] presented a new dataset containing Facebook posts and comments that exhibit code mixing between Bengali, English, and Hindi. The researchers demonstrated some preliminary word-level language identification experiments using this dataset. The methods surveyed included a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labeling using Conditional Random Fields. The preliminary results demonstrated the superiority of supervised classification and sequence labeling over dictionary-based classification, suggesting that contextual clues are necessary for accurate classifiers. The CRF model achieved the best result with an F-score of 0.95.

      There is a lot of work on language identification in social media. Twitter has been a favorite target, and a number of papers deal with language identification of Twitter messages specifically Bergsma et al. [2012], Carter et al. [2013], Goldszmidt et al. [2013], Mayer [2012], Tromp and Pechenizkiy [2011]. Tromp and Pechenizkiy [2011] proposed a graph-based n-gram approach that works well on tweets. Lui and Baldwin [2014] looked specifically at the problem of adapting existing language identification tools to Twitter messages, including challenges in obtaining data for evaluation, as well as the effectiveness of proposed strategies. They tested several tools on Twitter data (including a newly collected corpus for English, Japanese, and Chinese). The tests were done with off-the-shelf tools, before and after a simple cleaning of the Twitter data, such as removing hashtags, mentions, emoticons, etc. The improvement after the cleaning was small. Bergsma et al. [2012] looked at less common languages, in order to collect language-specific corpora. The nine languages they focused on (Arabic, Farsi, Urdu, Hindi, Nepali, Marathi, Russian, Bulgarian, Ukrainian) use three different non-Latin scripts: Arabic, Devanagari, and Cyrillic. Their method for language identification was based on language models.

      Most of the methods used only the text of the message, but Carter et al. [2013] also looked at the use of metadata, an approach which is unique to social media. They identified five microblog characteristics that can help in language identification: the language profile of the blogger, the content of an attached hyperlink, the language profile of other users mentioned in the post, the language profile of a tag, and the language of the original post, if the post is a reply. Further, they presented methods that combine the prior language class probabilities in a post-dependent and post-independent way. Their test results on 1,000 posts from 5 languages (Dutch, English, French, German, and Spanish) showed improvements in accuracy by 5% over the baseline, and showed that post-dependent combinations of the priors achieved the best performance.

      Taking a broader view of social media, Nguyen and Doğruöz [2013] looked at language identification in a mixed Dutch-Turkish Web forum. Mayer [2012] considered language identification of private messages between eBay users.

      Here are some of the available tools for language identification.

      • langid.py24 [Lui and Baldwin, 2012] works for 97 languages and uses a feature set selected from multiple sources, combined via a multinomial Naïve Bayes classifier.

      • CLD2,25 the language identifier embedded in the Chrome Web browser,26 uses a Naïve Bayes classifier and script-specific tokenization strategies.

      • LangDetect27 is a Naïve Bayes classifier, using a representation based on character n-grams without feature selection, with a set of normalization heuristics.

      • whatlang [Brown, 2013] uses a vector-space model with per-feature weighting over character n-grams.

      • YALI28 computes a per-language score using the relative frequency of a set of byte n-grams selected by term frequency.

      • TextCat29 is an implementation of СКАЧАТЬ