271 research outputs found

    Crowdsourcing Dialect Characterization through Twitter

    We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis shows the existence of well-defined macroregions sharing common lexical properties. Remarkably, we find that the Spanish language is split into two superdialects: an urban speech used across major American and Spanish cities, and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.
    Comment: 10 pages, 5 figures

    Regionalized models for Spanish language variations based on Twitter

    Spanish is one of the most widely spoken languages in the world, but it is not necessarily written and spoken in the same way in different countries. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by better interpreting a message's content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event with echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using the regional resources in message classification tasks.

    Understanding U.S. regional linguistic variation with Twitter data analysis

    We analyze a Big Data set of geo-tagged tweets covering one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation of fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
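    The first two steps described in this abstract (per-user alternation frequencies, then county-level aggregation) can be illustrated with a minimal sketch. It is not the authors' implementation: the alternation pairs, the names `ALTERNATIONS`, `user_profile` and `county_variables`, and the sample tweets are all hypothetical, and the spatial smoothing, PCA and regionalization steps are left out.

```python
import re
from collections import Counter, defaultdict

# Hypothetical alternation pairs, chosen for illustration only
# (the abstract does not list the alternations actually used).
ALTERNATIONS = {"soda/pop": ("soda", "pop"), "couch/sofa": ("couch", "sofa")}

def user_profile(tweets):
    """Per alternation, the share of variant A among a user's uses of A or B."""
    words = Counter(
        w for t in tweets for w in re.findall(r"[a-z']+", t.lower())
    )
    profile = {}
    for name, (a, b) in ALTERNATIONS.items():
        total = words[a] + words[b]
        if total:  # skip alternations the user never used
            profile[name] = words[a] / total
    return profile

def county_variables(users):
    """Average user profiles per county; `users` yields (county, tweets) pairs."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for county, tweets in users:
        for name, share in user_profile(tweets).items():
            sums[county][name] += share
            counts[county][name] += 1
    return {
        county: {n: sums[county][n] / counts[county][n] for n in sums[county]}
        for county in sums
    }

# Made-up sample data: one entry per user.
sample = [
    ("County A", ["I want a soda", "soda on the couch"]),
    ("County A", ["pop and a sofa"]),
    ("County B", ["pop pop pop", "new sofa"]),
]
result = county_variables(sample)
```

    Averaging per user before averaging per county keeps a single prolific account from dominating a county's value, which is one plausible reading of "lexical characteristics for Twitter users"; the paper may weight users differently.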

    La zonificación dialectal del español de América: propuestas clásicas y propuestas actuales

    Bachelor's thesis (Traballo Fin de Grao) in Spanish Language and Literature. Academic year 2018-2019.
    One of the classic problems in the history of linguistics in Spanish America throughout the 20th century concerns Spanish American dialectology, specifically the proposals for the dialectal zoning of Spanish in the Americas. Recognizing the Spanish spoken in the Americas as a large mosaic of linguistic varieties, a heterogeneous entity, raises questions about how this dialect complex can be partitioned. Since Henríquez Ureña's proposal in the 1920s, various approaches have been put forward, based either on external criteria, such as the indigenous languages spoken in each territory, or on internal criteria, whether phonetic, morphosyntactic or, above all, lexical. The distinction between two macrozones, which also arose in the 20th century, is moreover firmly rooted in discussions of dialect division. The purpose of this work is to focus on the various zoning proposals that have emerged and to consider their current validity, mainly in comparison with the contributions of the 21st century, which stem from new methods of accessing and collecting data. Technological development makes large-scale analysis possible, yielding new conclusions and a renewed view of the question. The thesis begins by discussing the proposals and their respective dialect divisions, in order to consider the validity of each one as well as the methodology each contributed at its time. An assessment of all this is then offered, with emphasis on the present moment.