3,334 research outputs found
Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of language diatopic variation using
geotagged microblogging datasets. By collecting all Twitter messages written in
Spanish over more than two years, we build a corpus from which a carefully
selected list of concepts allows us to characterize Spanish varieties on a
global scale. A cluster analysis proves the existence of well-defined
macroregions sharing common lexical properties. Remarkably enough, we find that
the Spanish language is split into two superdialects, namely, an urban speech used
across major American and Spanish cities and a diverse form that encompasses
rural areas and small towns. The latter can be further clustered into smaller
varieties with a stronger regional character. Comment: 10 pages, 5 figures
The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection
This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting (task 2). The submitted systems obtained state-of-the-art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.
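As a rough illustration of the majority-voting idea mentioned in this abstract, the sketch below combines the labels predicted by several classifiers for one text. It is not the GW/LT3 system: the function name `majority_vote`, the example labels, and the tie-breaking rule (the earliest-listed classifier among the tied labels wins) are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    `predictions` is a list with one label per classifier, in a fixed
    classifier order. Ties are broken in favour of the label that appears
    first in that order (an assumption of this sketch, not a documented
    rule of any shared-task system).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in classifier order so ties resolve deterministically.
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers for a single sentence.
votes = ["EGY", "GLF", "EGY"]
winner = majority_vote(votes)
```

With an odd number of classifiers and two candidate labels a tie cannot occur; with more labels the tie-breaking rule matters and should be chosen deliberately.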
Dialectometric analysis of language variation in Twitter
In the last few years, microblogging platforms such as Twitter have given
rise to a deluge of textual data that can be used for the analysis of informal
communication between millions of individuals. In this work, we propose an
information-theoretic approach to geographic language variation using a corpus
based on Twitter. We test our models with tens of concepts and their associated
keywords detected in Spanish tweets geolocated in Spain. We employ
dialectometric measures (cosine similarity and Jensen-Shannon divergence) to
quantify the linguistic distance on the lexical level between cells created in
a uniform grid over the map. This can be done for a single concept or in the
general case taking into account an average of the considered variants. The
latter permits an analysis of the dialects that naturally emerge from the data.
Interestingly, our results reveal the existence of two dialect macrovarieties.
The first group includes a region-specific speech spoken in small towns and
rural areas whereas the second cluster encompasses cities that tend to use a
more uniform variety. Since the results obtained with the two different metrics
qualitatively agree, our work suggests that social media corpora can be
efficiently used for dialectometric analyses. Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
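The two measures named in this abstract can be sketched for a single concept as follows. The sketch assumes each grid cell is summarised by a list of counts, one per lexical variant of the concept, in the same variant order for both cells. The function names, the base-2 logarithm, and the example counts are assumptions of this illustration and are not taken from the paper.

```python
import math


def cosine_similarity(p_counts, q_counts):
    """Cosine of the angle between two variant-count vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(p_counts, q_counts))
    norm_p = math.sqrt(sum(a * a for a in p_counts))
    norm_q = math.sqrt(sum(b * b for b in q_counts))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention chosen here for cells with no observations
    return dot / (norm_p * norm_q)


def _normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no observations for this concept")
    return [c / total for c in counts]


def _kl_divergence(p, m):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint usage)."""
    p = _normalize(p_counts)
    q = _normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)


# Hypothetical counts of three variants of one concept in two grid cells.
cell_a = [30, 10, 0]
cell_b = [5, 5, 40]
similarity = cosine_similarity(cell_a, cell_b)
distance = jensen_shannon_divergence(cell_a, cell_b)
```

Averaging such per-concept values over all considered concepts would give one overall distance between two cells, which is one plausible reading of the "general case" described in the abstract.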
Mapping the Americanization of English in Space and Time
As global political preeminence gradually shifted from the United Kingdom to
the United States, so did the capacity to culturally influence the rest of the
world. In this work, we analyze how the world-wide varieties of written English
are evolving. We study both the spatial and temporal variations of vocabulary
and spelling of English using a large corpus of geolocated tweets and the
Google Books datasets corresponding to books published in the US and the UK.
The advantage of our approach is that we can address both standard written
language (Google Books) and the more colloquial forms of microblogging messages
(Twitter). We find that American English is the dominant form of English
outside the UK and that its influence is felt even within the UK borders.
Finally, we analyze how this trend has evolved over time and the impact that
some cultural events have had in shaping it. Comment: 16 pages, 6 figures, 2 tables. Published version
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201