Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of diatopic language variation using
geotagged microblogging datasets. By collecting all Twitter messages written in
Spanish over more than two years, we build a corpus from which a carefully
selected list of concepts allows us to characterize Spanish varieties on a
global scale. A cluster analysis proves the existence of well defined
macroregions sharing common lexical properties. Remarkably enough, we find that
the Spanish language is split into two superdialects, namely, an urban speech used
across major American and Spanish cities and a diverse form that encompasses
rural areas and small towns. The latter can be further clustered into smaller
varieties with a stronger regional character. Comment: 10 pages, 5 figures
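As a rough illustration of the kind of input a cluster analysis over lexical properties could work from, the sketch below picks the most frequent word per concept in each region and measures how often two regions disagree. The function names, the distance measure, and the toy observations are assumptions made for illustration; the abstract does not specify the concept list or the clustering method.

```python
from collections import Counter

def dominant_variants(observations):
    """Map each region to {concept: most frequent word}.

    `observations` is an iterable of (region, concept, word) tuples,
    e.g. one tuple per geotagged message that uses a listed word.
    """
    counts = {}
    for region, concept, word in observations:
        counts.setdefault(region, {}).setdefault(concept, Counter())[word] += 1
    return {
        region: {c: ctr.most_common(1)[0][0] for c, ctr in concepts.items()}
        for region, concepts in counts.items()
    }

def lexical_distance(profile_a, profile_b):
    """Fraction of shared concepts whose dominant word differs."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return 1.0  # no common evidence: treat as maximally distant
    return sum(profile_a[c] != profile_b[c] for c in shared) / len(shared)

# Hypothetical toy data, not taken from the paper's corpus.
observations = [
    ("Madrid", "car", "coche"), ("Madrid", "car", "coche"),
    ("Madrid", "computer", "ordenador"),
    ("Bogota", "car", "carro"), ("Bogota", "computer", "computador"),
    ("Mexico City", "car", "carro"), ("Mexico City", "computer", "computadora"),
]
profiles = dominant_variants(observations)
d_madrid_bogota = lexical_distance(profiles["Madrid"], profiles["Bogota"])
d_bogota_mexico = lexical_distance(profiles["Bogota"], profiles["Mexico City"])
```

A matrix of such pairwise distances is one possible input to a standard clustering routine; the choice of distance strongly affects the resulting regions.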
Regionalized models for Spanish language variations based on Twitter
Spanish is one of the most widely spoken languages in the world, but it is not
necessarily written and spoken in the same way in different countries.
Understanding local language variations can help to improve model performance
on regional tasks, both by capturing local structures and by better interpreting
a message's content. For instance, consider a machine learning engineer who
automates a language classification task for a particular region, or a
social scientist trying to understand a regional event with echoes on social
media; both can take advantage of dialect-based language models to understand
what is happening with more contextual information and hence more precision.
This manuscript presents and describes a set of regionalized resources for
the Spanish language built on four years of public Twitter messages geotagged in 26
Spanish-speaking countries. We introduce word embeddings based on FastText,
language models based on BERT, and per-region sample corpora. We also provide a
broad comparison among regions covering lexical and semantic similarities, as
well as examples of using regional resources on message classification tasks.
Understanding U.S. regional linguistic variation with Twitter data analysis
We analyze a large dataset of geo-tagged tweets spanning one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation, fine spatiotemporal resolution, and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
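The first two steps described above (per-user alternation frequencies, then county-level aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions: the alternation pairs, the whitespace tokenization, and the plain per-county mean are hypothetical choices, and the spatial smoothing, PCA, and regionalization steps are not shown.

```python
from collections import defaultdict

# Hypothetical alternation pairs; the abstract does not list the ones used.
ALTERNATIONS = {
    "soda_pop": ("soda", "pop"),
    "couch_sofa": ("couch", "sofa"),
}

def user_profile(tweets):
    """Share of the first variant, per alternation, in one user's tweets."""
    counts = {name: [0, 0] for name in ALTERNATIONS}
    for text in tweets:
        tokens = text.lower().split()  # naive whitespace tokenization
        for name, (first, second) in ALTERNATIONS.items():
            counts[name][0] += tokens.count(first)
            counts[name][1] += tokens.count(second)
    # Leave out alternations the user never produced.
    return {n: a / (a + b) for n, (a, b) in counts.items() if a + b > 0}

def county_variables(users):
    """Average user profiles per county.

    `users` is a list of (county, tweets) pairs, one pair per user.
    """
    sums = defaultdict(lambda: defaultdict(float))
    seen = defaultdict(lambda: defaultdict(int))
    for county, tweets in users:
        for name, share in user_profile(tweets).items():
            sums[county][name] += share
            seen[county][name] += 1
    return {
        county: {n: sums[county][n] / seen[county][n] for n in sums[county]}
        for county in sums
    }

# Hypothetical toy input: two users in county "A", one in county "B".
users = [
    ("A", ["I want a soda", "soda please"]),
    ("A", ["pop is great"]),
    ("B", ["grab a pop", "sit on the couch"]),
]
variables = county_variables(users)
```

Averaging per-user shares rather than pooling raw counts keeps a few very active accounts from dominating a county's value, which is one reason to summarize at the user level first.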
La zonificación dialectal del español de América: propuestas clásicas y propuestas actuales [The dialect zoning of Spanish in the Americas: classical and current proposals]
Bachelor's thesis (Traballo Fin de Grao) in Spanish Language and Literature. Academic year 2018-2019. One of the classic problems in the history of linguistics in Spanish America throughout the 20th century concerns Spanish American dialectology, specifically the proposals for the dialect zoning of Spanish in the Americas. Recognizing the Spanish spoken in the Americas as a large mosaic of linguistic varieties, a heterogeneous entity, raises questions about how this dialect complex can be divided. Since Henríquez Ureña's proposal in the 1920s, various approaches have been put forward, based either on external criteria, such as the indigenous languages spoken in each territory, or on internal criteria, whether phonetic, morphosyntactic or, above all, lexical. The distinction between two macrozones, itself a product of the 20th century, is also firmly rooted in the discussion of dialect division.
The purpose of this work is to examine the various zoning proposals that have emerged and to consider their current validity, chiefly in comparison with the contributions that have appeared in the 21st century, which arise from new methods of accessing and collecting data. Technological development makes large-scale analysis possible, yielding new conclusions and a renewed picture of the question. The thesis begins by discussing the proposals and their respective dialect divisions in order to assess the validity of each, as well as the methodology each contributed at different points in time. It then offers an overall assessment, with emphasis on the present moment.