Dialectometric analysis of language variation in Twitter
In the last few years, microblogging platforms such as Twitter have given
rise to a deluge of textual data that can be used for the analysis of informal
communication between millions of individuals. In this work, we propose an
information-theoretic approach to geographic language variation using a corpus
based on Twitter. We test our models with tens of concepts and their associated
keywords detected in Spanish tweets geolocated in Spain. We employ
dialectometric measures (cosine similarity and Jensen-Shannon divergence) to
quantify the linguistic distance on the lexical level between cells created in
a uniform grid over the map. This can be done for a single concept or in the
general case taking into account an average of the considered variants. The
latter permits an analysis of the dialects that naturally emerge from the data.
Interestingly, our results reveal the existence of two dialect macrovarieties.
The first group includes a region-specific speech spoken in small towns and
rural areas whereas the second cluster encompasses cities that tend to use a
more uniform variety. Since the results obtained with the two different metrics
qualitatively agree, our work suggests that social media corpora can be
efficiently used for dialectometric analyses.
Comment: 10 pages, 7 figures, 1 table. Accepted to VarDial 201
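The two measures named in this abstract can be illustrated with a minimal sketch. It assumes each grid cell is summarized, for one concept, by the relative frequencies of its keyword variants; the variant list, the counts, and all function names below are hypothetical and are not taken from the paper.

```python
import math


def to_distribution(counts, variants):
    """Turn raw variant counts for one grid cell into relative frequencies."""
    total = sum(counts.get(v, 0) for v in variants)
    if total == 0:
        raise ValueError("cell has no observations for this concept")
    return [counts.get(v, 0) / total for v in variants]


def cosine_similarity(p, q):
    """Cosine of the angle between two frequency vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, bounded by 0 and 1 in base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical counts for one concept in two grid cells (illustration only).
VARIANTS = ["coche", "auto", "carro"]
cell_a = to_distribution({"coche": 90, "auto": 10}, VARIANTS)
cell_b = to_distribution({"coche": 50, "carro": 50}, VARIANTS)

print(cosine_similarity(cell_a, cell_b))
print(jensen_shannon_divergence(cell_a, cell_b))
```

For the general case mentioned in the abstract, one plausible reading is to compute such a value per concept and average over the concepts both cells have data for; that averaging step is likewise an assumption of this sketch.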
Mapping the Americanization of English in Space and Time
As global political preeminence gradually shifted from the United Kingdom to
the United States, so did the capacity to culturally influence the rest of the
world. In this work, we analyze how the world-wide varieties of written English
are evolving. We study both the spatial and temporal variations of vocabulary
and spelling of English using a large corpus of geolocated tweets and the
Google Books datasets corresponding to books published in the US and the UK.
The advantage of our approach is that we can address both standard written
language (Google Books) and the more colloquial forms of microblogging messages
(Twitter). We find that American English is the dominant form of English
outside the UK and that its influence is felt even within the UK borders.
Finally, we analyze how this trend has evolved over time and the impact that
some cultural events have had in shaping it.
Comment: 16 pages, 6 figures, 2 tables. Published version
Modeling Global Syntactic Variation in English Using Dialect Classification
This paper evaluates global-scale dialect identification for 14 national
varieties of English as a means for studying syntactic variation. The paper
makes three main contributions: (i) introducing data-driven language mapping as
a method for selecting the inventory of national varieties to include in the
task; (ii) producing a large and dynamic set of syntactic features using
grammar induction rather than focusing on a few hand-selected features such as
function words; and (iii) comparing models across both web corpora and social
media corpora in order to measure the robustness of syntactic variation across
registers.
Regionalized models for Spanish language variations based on Twitter
Spanish is one of the most widely spoken languages in the world, but it is not
necessarily written and spoken in the same way in different countries.
Understanding local language variations can help improve model performance
on regional tasks, both in understanding local structures and in improving the
message's content. For instance, consider a machine learning engineer who
automates a language classification task for a particular region, or a
social scientist trying to understand a regional event with echoes on social
media; both can take advantage of dialect-based language models to understand
what is happening with more contextual information and hence more precision.
This manuscript presents and describes a set of regionalized resources for
the Spanish language built on four years of public Twitter messages geotagged in 26
Spanish-speaking countries. We introduce word embeddings based on FastText,
language models based on BERT, and per-region sample corpora. We also provide a
broad comparison among regions covering lexical and semantic similarities, as
well as examples of using regional resources on message classification tasks.
Métodos de la dialectología cuantitativa
The introduction of quantification into the study of geolinguistic variation has brought a spectacular rise in publications on the subject, indicating a renewed vitality of the discipline. Dialectometry, one of the greatest advances in dialectology of the last century, has become a reality in practically all cultivated languages (Goebl 1992; Nerbonne 2013). The variety of quantitative techniques used in dialectometry offers researchers a wide range of possibilities for analyzing dialect data. However, any quantitative analysis needs a broad database, which moves the dialectologist away from the practices of so-called '(single) feature based' dialectology and makes the analysis sample more objective.
This work presents the steps required to carry out research in quantitative dialectology. It also describes some of the techniques used, such as those for quantifying the distance between varieties, for hierarchical classification, and for analyzing the dialect continuum, along with multivariate methods for identifying patterns of variation, studying variables that show similar geographic patterns, and analyzing the probability of membership in particular dialect groups.
The methodology of quantitative dialectology begins with the choice of a linguistic atlas that supplies the database (which can be phonetic, orthographic and/or labeled). Applying a distance measure to this database yields a distance matrix. The quantitative techniques applied to the distance matrix range from quantifying the distance between dialect varieties (interpunctual dialectometry) and classifying the varieties hierarchically to analyzing the dialect continuum (with multidimensional scaling, MDS), analyzing the correlation between geographic and linguistic distance, and detecting linguistic characteristics. Quantification has become a mandatory step for experts who study linguistic variation.
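The pipeline outlined in this abstract (atlas database, distance measure, distance matrix, hierarchical classification) can be sketched roughly as follows. The sketch assumes a toy atlas in which each site gives one response per item, uses the share of differing responses as the distance, and swaps in plain average-linkage clustering for the classification step. The site labels, item names, and function names are hypothetical; the paper does not prescribe these particular choices.

```python
from itertools import combinations


def site_distance(a, b):
    """Share of shared atlas items on which two sites give different responses."""
    shared = [item for item in a if item in b]
    if not shared:
        raise ValueError("sites share no items")
    return sum(a[item] != b[item] for item in shared) / len(shared)


def distance_matrix(sites):
    """Symmetric site-by-site distance matrix stored as a dict keyed by name pairs."""
    names = sorted(sites)
    dist = {(x, x): 0.0 for x in names}
    for x, y in combinations(names, 2):
        dist[x, y] = dist[y, x] = site_distance(sites[x], sites[y])
    return dist


def average_distance(dist, left, right):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[x, y] for x in left for y in right)
    return total / (len(left) * len(right))


def average_linkage(sites, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    dist = distance_matrix(sites)
    clusters = [[name] for name in sorted(sites)]
    while len(clusters) > n_clusters:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(dist, pair[0], pair[1]),
        )
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical atlas: four sites, three items, responses coded as labels.
EXAMPLE_SITES = {
    "A": {"item1": "x", "item2": "x", "item3": "x"},
    "B": {"item1": "x", "item2": "x", "item3": "y"},
    "C": {"item1": "y", "item2": "y", "item3": "y"},
    "D": {"item1": "y", "item2": "y", "item3": "x"},
}

print(average_linkage(EXAMPLE_SITES, 2))
```

The same distance matrix could also feed multidimensional scaling or a correlation with geographic distance; those steps are left out to keep the sketch short.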
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Peer reviewed
American cultural regions mapped through the lexical analysis of social media
Cultural areas represent a useful concept that cross-fertilizes diverse
fields in social sciences. Knowledge of how humans organize and relate their
ideas and behavior within a society helps to understand their actions and
attitudes towards different issues. However, the selection of common traits
that shape a cultural area is somewhat arbitrary. What is needed is a method
that can leverage the massive amounts of data coming online, especially through
social media, to identify cultural regions without ad-hoc assumptions, biases
or prejudices. This work takes a crucial step in this direction by introducing
a method to infer cultural regions based on the automatic analysis of large
datasets from microblogging posts. The approach presented here is based on the
principle that cultural affiliation can be inferred from the topics that people
discuss among themselves. Specifically, regional variations in written
discourse are measured in American social media. From the frequency
distributions of content words in geotagged Tweets, the regional hotspots of
words' usage are found, and from there, principal components of regional
variation are derived. Through a hierarchical clustering of the data in this
lower-dimensional space, this method yields clear cultural areas and the topics
of discussion that define them. It uncovers a manifest North-South separation,
which is primarily influenced by the African American culture, and further
contiguous (East-West) and non-contiguous divisions that provide a
comprehensive picture of today's cultural areas in the US.
Comment: 13 pages, 5 figures; contains Supplementary Information