Detection of Sociolinguistic Features in Digital Social Networks for the Detection of Communities
The emergence of digital social networks has transformed society, social groups, and institutions in terms of the communication and expression of their opinions. Determining how language variations allow the detection of communities, together with the relevance of specific vocabulary (proposed by the National Council of Accreditation of Colombia (Consejo Nacional de Acreditación - CNA) to determine the quality evaluation parameters for universities in Colombia) in digital assemblages could lead to a better understanding of their dynamics and social foundations, thus resulting in better communication policies and intervention where necessary. The approach presented in this paper aims to determine the semantic spaces (sociolinguistic features) shared by social groups in digital social networks. It includes five layers based on Design Science Research, which are integrated with Natural Language Processing (NLP) techniques, Computational Linguistics (CL), and Artificial Intelligence (AI). The approach is validated through a case study wherein the semantic values of a series of “Twitter” institutional accounts belonging to Colombian universities are analyzed in terms of the 12 quality factors established by CNA. In addition, the topics and the sociolect used by different actors in the university communities are also analyzed. The current approach allows determining the sociolinguistic features of social groups in digital social networks. Its application allows detecting the words or concepts to which each actor of a social group (university) gives more importance in terms of vocabular
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Semantic Variation in Online Communities of Practice
We introduce a framework for quantifying semantic variation of common words
in Communities of Practice and in sets of topic-related communities. We show
that while some meaning shifts are shared across related communities, others
are community-specific, and therefore independent from the discussed topic. We
propose such findings as evidence in favour of sociolinguistic theories of
socially-driven semantic variation. Results are evaluated using an independent
language modelling task. Furthermore, we investigate extralinguistic features
and show that factors such as prominence and dissemination of words are related
to semantic variation.
Comment: 13 pages, Proceedings of the 12th International Conference on
Computational Semantics (IWCS 2017)
Language, Twitter and Academic Conferences
Using Twitter during academic conferences is a way of engaging and connecting
an audience inherently multicultural by the nature of scientific collaboration.
English is expected to be the lingua franca bridging the communication and
integration between native speakers of different mother tongues. However,
little research has been done to support this assumption. In this paper we
examine how integrated language communities are by analyzing scholars'
tweets from 26 Computer Science conferences over a time span of five years.
We found that although English is the most popular language used to tweet
during conferences, a significant proportion of people also tweet in other
languages. In addition, people who tweet solely in English interact mostly
within the same group (English monolinguals), while people who speak other
languages tend to interact more diversely with other language groups.
Finally, we also found that the people who interact with other Twitter users
show a more diverse language distribution, while people who do not interact
mostly post tweets in a single language. These results suggest that the number
of languages a user speaks can affect the interaction dynamics of online
communities.
Comment: 4 pages, 3 figures, 4 tables, submitted to ACM Hypertext and Social
Media 201
From White-Box Machine Learning to Fuzzy Logic for Automatic Gender Detection in Spanish Texts from Social Networks
This dissertation, framed in the computational sociolinguistics field, explores the use of sociolinguistic-derived features in Artificial Intelligence-based computational models for automatic gender detection on Spanish texts.
Our interest lies in designing computational models based on white-box machine learning algorithms and fuzzy logic with sociolinguistic-inspired features.
We developed a characterisation of gender based on linguistic levels, drawing on publications from the language and gender field, the computer-mediated communication and gender research area, and computational sociolinguistics. This characterisation serves as the foundation of our experimental analysis.
In the experimental analysis, we implemented the Decision Tree algorithm with orthographic, morphological, lexical, syntactic, digital, and pragmatic-discursive features on the PAN-AP-13 dataset in order to identify sociolinguistic patterns of gender. From this first computational experiment, we extended our analysis to other datasets and algorithms; specifically, we explored, besides the PAN-AP-13 dataset and the Decision Tree algorithm, the PAN-AP-15, PAN-AP-17, PAN-AP-18, and PAN-AP-19 datasets, and the Random Forest and XGBoost algorithms. We designed 63 models from the combinations of the feature sets. The classification accuracy of the resulting models, none of which used more than 160 linguistic features, was around 70%.
We concluded the experimental analysis with a sociolinguistic characterisation of gender based on 39 patterns organised according to their robustness.
Our theoretical proposal presents 64 fuzzy models, of which 57 are ensemble fuzzy models whose final output was calculated using the majority vote scheme. According to the results, the Orthographic, Lexical, Syntactic, Digital, and Pragmatic-Discursive (OLSDP) ensemble model produced the best results.
White-box machine learning algorithms and fuzzy logic, along with sociolinguistic-inspired features, must be incorporated into automatic gender identification in order to elucidate the complex relationship between language and gender.
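The majority vote scheme mentioned in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the fuzzy component models are not reproduced, and the function name `majority_vote`, the label strings, and the tie-breaking rule are all assumptions made for illustration. The five hypothetical predictions stand in for the outputs of five component models, one per feature set in an ensemble such as OLSDP.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of component models.

    Ties are broken in favour of the label that appears first in `labels`;
    this rule is an arbitrary choice made for the sketch.
    """
    if not labels:
        raise ValueError("need at least one component prediction")
    counts = Counter(labels)
    best = max(counts.values())
    # Walk the inputs in order so that tie-breaking is deterministic.
    for label in labels:
        if counts[label] == best:
            return label


# Hypothetical outputs of five component models for a single text.
component_predictions = ["female", "male", "female", "female", "male"]
ensemble_label = majority_vote(component_predictions)
print(ensemble_label)
```

With an odd number of component models and two classes, as in this example, a tie cannot occur, so the tie-breaking rule only matters for other configurations.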
Towards modelling language innovation acceptance in online social networks
Language change and innovation are constant in online and offline communication, and have led to new words entering people’s lexicon and even entering modern day dictionaries, with recent additions of ‘e-cig’ and ‘vape’. However, the manual work required to identify these ‘innovations’ is both time consuming and subjective. In this work we demonstrate how such innovations in language can be identified across two different OSNs (Online Social Networks) through the operationalisation of known language acceptance models that incorporate relatively simplistic statistical tests. By grounding our work in language theory, we identified three statistical tests that can be applied, namely variation in frequency, form, and meaning, each showing different success rates across the two networks (a geo-bound Twitter sample and a sample of Reddit). These tests were also applied to different community levels within the two networks, allowing for different innovations to be identified across different community structures over the two networks, for instance: identifying regional variation across Twitter, and variation across groupings of Subreddits, where identified example innovations included ‘casualidad’ and ‘cym’.
