The massive popularity of online social media provides a unique opportunity
for researchers to study the linguistic characteristics and patterns of users'
interactions. In this paper, we provide an in-depth characterization of
language usage across demographic groups on Twitter. In particular, we extract
the gender and race of Twitter users located in the U.S. using advanced image
processing algorithms from Face++. Then, we investigate how demographic groups
(i.e., male/female, Asian/Black/White) differ in linguistic style and in
their interests. We extract linguistic features from six categories
(affective attributes, cognitive attributes, lexical density and awareness,
temporal references, social and personal concerns, and interpersonal focus) in
order to identify similarities and differences in particular sets of writing
attributes. In addition, we compute the absolute ranking difference of top
phrases between demographic groups. As a further dimension of diversity, we also
use the topics of interest that we retrieve for each user. Our analysis unveils
clear differences in the writing styles (and the topics of interest) of
different demographic groups, with variation seen across both gender and race
lines. We hope our effort can stimulate the development of new studies related
to demographic information in the online space.

Comment: Proceedings of the 28th ACM Conference on Hypertext and Social Media
2017 (HT '17)