Title: Social Monitoring for Public Health
Author: Michael J. Paul
Publisher: Ingram
Genre: Computers: other
Series: Synthesis Lectures on Information Concepts, Retrieval, and Services
ISBN: 9781681736105
Suppose you are interested in using social monitoring to learn how people are changing their behavior in response to the virus, so you decide to focus on topics related to pregnancy and travel. To narrow down to tweets on these topics, you could construct a list of additional keywords for filtering, maybe using the word associations learned by the topic model, or using your own ideas about relevant words, perhaps gained by manually reading a sample of tweets. Finally, if you need to identify tweets that can’t be captured with a simple keyword list (for example, you want to identify when someone mentions that they are personally changing travel plans, as opposed to more general discussion of travel advisories), then you should label some of the filtered tweets for relevance to your task and train a classifier to identify more such tweets.
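As a rough sketch of this pipeline, assuming tweets are available as plain strings and a small hand-labeled set has been prepared (the keyword list, example tweets, and labels below are all invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical keyword list for a first-pass topical filter.
KEYWORDS = {"travel", "trip", "flight", "pregnant", "pregnancy"}

def keyword_filter(tweets):
    """Keep tweets containing at least one topic keyword."""
    return [t for t in tweets if KEYWORDS & set(t.lower().split())]

# Hand-labeled examples: 1 = author is personally changing plans,
# 0 = general discussion (e.g., of travel advisories).
labeled_texts = [
    "just canceled my trip because of the virus",
    "officials issued a new travel advisory today",
]
labels = [1, 0]

# Train a simple text classifier to catch relevant tweets
# that a keyword list alone would miss.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(labeled_texts, labels)

candidates = keyword_filter(["I'm pregnant, so we're changing our travel plans"])
predictions = classifier.predict(candidates)
```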
Tools and Resources
A number of free tools exist for the machine learning tasks described above, although most require some programming experience. For a guide aimed at a public health audience rather than computer scientists, see Yoon et al. [2013]. For computationally oriented researchers, we recommend the following machine learning tools.
• scikit-learn (http://scikit-learn.org) is a Python library for a variety of general-purpose machine learning tasks, including classification and validation.
• MALLET (http://mallet.cs.umass.edu) is a Java library for machine learning on text data, supporting document classification and topic modeling.
• NLTK (http://www.nltk.org) is a Python library for text processing, supporting tokenization and classification.
• Stanford Core NLP (https://stanfordnlp.github.io/CoreNLP/) is a set of natural language processing tools, including named entity recognition and dependency parsing.
• HLTCOE Concrete (http://hltcoe.github.io/) is a data serialization standard for NLP data, accompanied by a variety of “concrete compliant” NLP tools.
• Twitter NLP (https://github.com/aritter/twitter_nlp) is a Python toolkit that implements core NLP tools with models trained specifically on Twitter data.
• TweetNLP (http://www.cs.cmu.edu/~ark/TweetNLP/) is a collection of text processing tools for Twitter, implemented in Java and Python.
• Weka (http://www.cs.waikato.ac.nz/ml/weka/) is a machine learning software package that supports tasks such as classification and clustering. It has a graphical interface, making it more user-friendly than the other tools.
4.1.2 TREND INFERENCE
We will now describe methods for extracting trends—levels of interest or activity across time intervals or geographic locations—from social media. First, we discuss how raw volumes of filtered content can be converted to trends by normalizing the counts. Second, we describe how filtered content can be used as predictors in more sophisticated statistical models to produce trend estimates. Examples of these two approaches, as applied to influenza surveillance, are contrasted in Figure 4.3.
Counting and Normalization
A simple method for extracting trends is to compute the volume of data filtered for relevance (Section 4.1.1) at each point (e.g., time period or location), for example, the number of flu tweets per week [Chew and Eysenbach, 2010, Lamb et al., 2013, Lampos and Cristianini, 2010].
It is important to normalize the volume counts to adjust for variation over time and location. For example, the system of Lamb et al. [2013] normalizes influenza counts by dividing the volumes by the counts of a random sample of public tweets for the same location and time period. Normalization is especially important for comparing locations, as volumes are affected by regional differences in population and social media usage, but normalization is also important for comparing values across long time intervals, as usage of a social media platform inevitably changes over time.
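A minimal sketch of this kind of normalization, assuming weekly counts of flu tweets and of a random background sample have already been collected (all numbers and names below are illustrative):

```python
# flu_counts[week]:    number of flu-related tweets observed that week
# sample_counts[week]: size of a random sample of public tweets from
#                      the same week, serving as a background rate
flu_counts = {"2013-W01": 480, "2013-W02": 620}
sample_counts = {"2013-W01": 91_000, "2013-W02": 88_000}

# Normalized trend: the fraction of sampled tweets that are flu related,
# comparable across weeks despite changing platform usage.
trend = {week: flu_counts[week] / sample_counts[week] for week in flu_counts}
```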
Note that the search volume counts provided by Google Trends are already normalized, although the normalization is plot dependent, and values cannot be compared between plots without establishing baselines for comparison. See Ayers et al. [2011b] for details.
Statistical Modeling and Regression
A more sophisticated approach to trend inference is to represent trends with statistical models. When a model is used to predict numeric values, it is called regression. Regression models are used to fit data, such as social media volume, to “gold standard” values from an existing surveillance system, such as the influenza-like illness surveillance network (ILINet) from the Centers for Disease Control and Prevention (CDC).
Figure 4.3: Estimates of influenza prevalence derived from Twitter (blue) alongside the gold standard CDC rate (black). The dashed Twitter trend is the normalized count of influenza-related tweets, estimated with the method of Lamb et al. [2013]. The solid Twitter trend uses the normalized counts in a regression model to predict the CDC’s rates. The regression approach is based on research by Paul et al. [2014], in which an autoregressive model is trained on the Twitter counts as well as the previous three weeks of CDC data. Predictions for each season (segmented with vertical lines) are based on models trained on the remaining two seasons. The regression predictions, which incorporate lagged CDC data, are a closer fit to the gold standard curve than the counts alone.
The simplest type of regression model is a univariate (one-predictor) linear model, which has the form $y_i = b + \beta x_i$ for each point $i$, where a point is a time period such as a week. For example, $y_i$ could be the CDC's influenza prevalence at week $i$ and $x_i$ could be the volume of flu-related social media activity in the same week [Culotta, 2010, Ginsberg et al., 2009]. The $\beta$ value is the regression coefficient, interpreted as the slope of the line in a linear model, while $b$ is the intercept. By plugging social media counts into a fitted regression model, one can estimate the CDC's values.
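Fitting such a model is a few lines in scikit-learn; the following sketch assumes aligned weekly arrays of normalized tweet rates and CDC prevalence values (the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x: normalized weekly flu-tweet rates; y: CDC influenza prevalence (%).
x = np.array([0.0051, 0.0063, 0.0082, 0.0100]).reshape(-1, 1)
y = np.array([1.2, 1.5, 2.1, 2.6])

model = LinearRegression()  # learns y_i = b + beta * x_i
model.fit(x, y)
estimate = model.predict([[0.0090]])  # estimated CDC value for a new week
```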
Other predictors can be included in regression models besides social media volume. A useful predictor is the trend itself: the previous week's value is a good predictor of the current week's, for example. A $k$th-order autoregressive (AR) model is a regression model whose predictors are the previous $k$ values. For example, a second-order autoregressive model has the form $y_i = \beta_1 y_{i-1} + \beta_2 y_{i-2}$. If predictors are included in addition to the time series itself, such as the social media estimate $x_i$, the model is called an autoregressive exogenous (ARX) model. ARX models have been shown to outperform basic regression models for influenza prediction from social media [Achrekar et al., 2012, Paul et al., 2014].
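One way to set up a second-order ARX model is to build a design matrix whose columns are the two lagged values and the exogenous social media rate; a sketch with invented data follows (a real evaluation would hold out whole seasons, as in Figure 4.3):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([1.2, 1.5, 2.1, 2.6, 3.0, 2.8])              # CDC series
x = np.array([0.005, 0.006, 0.008, 0.010, 0.012, 0.011])  # tweet rates

# Predictors for week i: y[i-1], y[i-2], and the exogenous x[i].
X = np.column_stack([y[1:-1], y[:-2], x[2:]])
target = y[2:]

arx = LinearRegression().fit(X, target)
next_week = arx.predict([[y[-1], y[-2], 0.013]])  # forecast the next value
```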
A commonly used extension to the linear autoregressive model is the autoregressive integrated moving average (ARIMA) model, which assumes an underlying smooth behavior in the time series. These models have also been used for predicting influenza prevalence [Broniatowski et al., 2015, Dugas et al., 2013, Preis and Moat, 2014].
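With statsmodels, an ARIMA model with the social media series as an exogenous regressor can be fit as sketched below (the order (2, 1, 0) and the data are illustrative choices, not values from the cited studies):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative weekly series: CDC prevalence (y) and normalized
# flu-tweet rates (x) used as an exogenous regressor.
y = np.array([1.2, 1.5, 2.1, 2.6, 3.0, 2.8, 2.3, 1.9])
x = np.array([0.005, 0.006, 0.008, 0.010, 0.012, 0.011, 0.009, 0.007])

model = ARIMA(y, exog=x, order=(2, 1, 0))  # AR order 2, one difference, no MA term
result = model.fit()
forecast = result.forecast(steps=1, exog=np.array([[0.008]]))
```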