Probabilistic Inference of Twitter Users' Age based on What They Follow
Twitter provides an open and rich source of data for studying human behaviour
at scale and is widely used in social and network sciences. However, a major
criticism of Twitter data is that demographic information is largely absent.
Enhancing Twitter data with user ages would advance our ability to study social
network structures, information flows and the spread of contagions. Approaches
toward age detection of Twitter users typically focus on specific properties of
tweets, e.g., linguistic features, which are language dependent. In this paper,
we devise a language-independent methodology for determining the age of Twitter
users from data that is native to the Twitter ecosystem. The key idea is to use
a Bayesian framework to generalise ground-truth age information from a few
Twitter users to the entire network based on what/whom they follow. Our
approach scales to inferring the age of 700 million Twitter accounts with high
accuracy. Comment: 9 pages, 9 figures
An analysis of the user occupational class through Twitter content
Social media content can be used as a complementary source to the traditional
methods for extracting and studying collective social attributes. This study focuses on the prediction of the occupational class for a public user profile. Our analysis is conducted on a new annotated corpus of Twitter users, their respective job titles, posted textual content and platform-related attributes. We frame our task as classification using latent feature representations such as word clusters and embeddings. The employed linear and, especially, non-linear methods can predict a user's occupational class with strong accuracy for the coarsest level of a standard occupation taxonomy which includes nine classes. Combined with a qualitative assessment, the derived results confirm the feasibility of our approach in inferring a new user attribute that can be embedded in a multitude of downstream applications.
Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media. Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web
Conference (WWW '19)
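The post-stratification step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' multilevel regression method: it shows only the simplest cell-weighting form of post-stratification, in which each demographic cell's sample mean is reweighted by that cell's share of a known population. The function name, the cell labels, and the counts are all hypothetical, and every cell is assumed to contain at least one observation.

```python
def poststratified_mean(sample_cells, population_counts):
    """Reweight per-cell sample means by ground-truth population shares.

    sample_cells: dict mapping a cell label to a non-empty list of observed values
    population_counts: dict mapping the same cell labels to population counts
    (Hypothetical interface, chosen for illustration only.)
    """
    # Normalise over the cells that actually appear in the sample.
    total_population = sum(population_counts[cell] for cell in sample_cells)
    estimate = 0.0
    for cell, values in sample_cells.items():
        cell_mean = sum(values) / len(values)
        population_share = population_counts[cell] / total_population
        estimate += population_share * cell_mean
    return estimate


# Made-up example: younger users are over-represented in the sample.
sample = {"18-29": [1, 1, 1, 0], "30+": [0]}
population = {"18-29": 200, "30+": 800}

raw_mean = sum(v for vals in sample.values() for v in vals) / 5
adjusted_mean = poststratified_mean(sample, population)
```

In this toy data the unadjusted mean is dominated by the over-sampled cell, while the adjusted mean gives each cell the weight it has in the population.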
Mining the Demographics of Political Sentiment from Twitter Using Learning from Label Proportions
Opinion mining and demographic attribute inference have many applications in
social science. In this paper, we propose models to infer daily joint
probabilities of multiple latent attributes from Twitter data, such as
political sentiment and demographic attributes. Since it is costly and
time-consuming to annotate data for traditional supervised classification, we
instead propose scalable Learning from Label Proportions (LLP) models for
demographic and opinion inference using U.S. Census, national and state
political polls, and Cook partisan voting index as population level data. In
LLP classification settings, the training data is divided into a set of
unlabeled bags, where only the label distribution of each bag is known,
removing the requirement of instance-level annotations. Our proposed LLP model,
Weighted Label Regularization (WLR), provides a scalable generalization of
prior work on label regularization to support weights for samples inside bags,
which is applicable in this setting where bags are arranged hierarchically
(e.g., county-level bags are nested inside of state-level bags). We apply our
model to Twitter data collected in the year leading up to the 2016 U.S.
presidential election, producing estimates of the relationships among political
sentiment and demographics over time and place. We find that our approach
closely tracks traditional polling data stratified by demographic category,
resulting in error reductions of 28-44% over baseline approaches. We also
provide descriptive evaluations showing how the model may be used to estimate
interactions among many variables and to identify linguistic temporal
variation, capabilities which are typically not feasible using traditional
polling methods.
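The bag-level training signal described in this abstract can be sketched as follows. This is an illustrative loss in the spirit of label regularization, not the paper's WLR model: it compares a bag's known label proportions with the weighted average of a classifier's predicted probabilities for the instances in that bag, using KL divergence. The function name, the `weights` argument, and the example numbers are assumptions made for illustration.

```python
import math


def bag_proportion_loss(probs, target, weights=None, eps=1e-12):
    """KL divergence between known bag proportions and averaged predictions.

    probs: list of per-instance predicted class distributions (lists of floats)
    target: known label proportions for the whole bag
    weights: optional per-instance weights (hypothetical; uniform if omitted)
    """
    if weights is None:
        weights = [1.0] * len(probs)
    total_weight = sum(weights)
    # Weighted average of the instance-level predictions, one value per class.
    predicted = [
        sum(w * p[j] for w, p in zip(weights, probs)) / total_weight
        for j in range(len(target))
    ]
    # KL(target || predicted); eps guards against log(0).
    return sum(
        t * (math.log(t + eps) - math.log(q + eps))
        for t, q in zip(target, predicted)
    )


# Two instances, two classes; the bag is known to be split 50/50.
predictions = [[0.9, 0.1], [0.1, 0.9]]
uniform_loss = bag_proportion_loss(predictions, [0.5, 0.5])
weighted_loss = bag_proportion_loss(predictions, [0.5, 0.5], weights=[3.0, 1.0])
```

With uniform weights the averaged prediction matches the bag proportions, so the loss is essentially zero; weighting the first instance more heavily shifts the average away from the target and the loss becomes positive. No instance-level labels are used at any point.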