271 research outputs found

Crowdsourcing Dialect Characterization through Twitter

Author: Bruno Gonçalves
D Mocanu
David Sánchez
DT Pham
J Borge-Holthoefer
M Salathé
PJ Rousseeuw
Tobias Preis
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 26/07/2014

We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Comment: 10 pages, 5 figures

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

HAL AMU

Directory of Open Access Journals

PubMed Central

Digital.CSIC

Regionalized models for Spanish language variations based on Twitter

Author: Graff Mario
Miranda Sabino
Moctezuma Daniela
Ruiz Guillermo
Tellez Eric S.
Publication date: 22/04/2022

Spanish is one of the most widely spoken languages in the world, but it is not necessarily written and spoken in the same way in different countries. Understanding local language variations can help improve model performance on regional tasks, both in understanding local structures and in improving the message's content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event that echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using regional resources in message classification tasks.

arXiv.org e-Print Archive

Understanding U.S. regional linguistic variation with Twitter data analysis

Author: Alice Kasakoff
Diansheng Guo
Jack Grieve
Yuan Huang
Publication venue: 'Elsevier BV'
Publication date: 01/09/2016

We analyze a large data set of geo-tagged tweets covering one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation and fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.

Crossref

University of Birmingham Research Portal

Aston Publications Explorer

Understanding US regional linguistic variation with Twitter data analysis

Author: Grieve Jack
Guo Diansheng
Huang Yuan
Kasakoff Alice

University of Birmingham Research Portal

La zonificación dialectal del español de América: propuestas clásicas y propuestas actuales

Author: Rodríguez Vázquez Paloma
Publication date: 31/10/2018

Bachelor's thesis (Traballo Fin de Grao) in Spanish Language and Literature. Academic year 2018-2019. One of the classic problems in the history of linguistics in Spanish America throughout the 20th century concerns Spanish American dialectology, specifically the proposals for the dialectal zoning of Spanish in the Americas. Recognizing the Spanish spoken in the Americas as a great mosaic of linguistic varieties, as a heterogeneous entity, raises questions about how this dialect complex can be divided. Since Henríquez Ureña's proposal in the 1920s, various approaches have been put forward, based either on external criteria, such as the indigenous languages spoken in each territory, or on internal criteria, whether phonetic, morphosyntactic or, above all, lexical. The distinction between two macro-zones, which also originated in the 20th century, is likewise firmly established in discussions of dialect division. The purpose of this work is to examine the various zoning proposals that have emerged and to consider their current validity, mainly in comparison with contributions from the 21st century, which arise from new methods of accessing and collecting data. Technological development makes large-scale analysis possible, yielding new conclusions and a renewed picture of the question. The thesis proceeds from a discussion of the proposals and their respective dialect divisions to a consideration of the validity of each one and of the methodology each contributed at different points in time. An overall assessment is offered, with emphasis on the present moment.

Repositorio Institucional da Universidade de Santiago de Compostela