Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study about the existing CSW data sets (68) across language pairs in terms of
the collection and preparation (e.g. transcription and annotation) stages. This
in-depth analysis reveals that a) most CSW data involves English,
ignoring other language pairs/tuples, and b) there are flaws in terms of
representativeness in data collection and preparation stages due to ignoring
the location-based, socio-demographic and register variation in CSW. In
addition, lack of clarity on the data selection and filtering stages obscures the
representativeness of CSW data sets. We conclude by providing a short
check-list to improve the representativeness for forthcoming studies involving
CSW data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear on EMNLP'23 Proceedings
Self-reported use and perception of the L1 and L2 among maximally proficient bi- and multilinguals: a quantitative and qualitative investigation
This study investigates language preferences and perceptions in the use of the
native language (L1) and second language (L2) by 386 bi- and multilingual
adults. Participants declared that they were maximally proficient in L1 and L2
and used both constantly. A quantitative analysis revealed that despite their
maximal proficiency in the L1 and L2, participants preferred to use the L1 for
communicating feelings or anger, swearing, addressing their children, performing
mental calculations, and using inner speech. They also perceived their
L1 to be emotionally stronger than their L2 and reported lower levels of communicative
anxiety in their L1. An analysis of interview data from 20 participants
confirmed these findings while adding nuance. Indeed, differences in the
use of the L1 and L2 and perceptions of both are often subtle and context-specific.
Participants confirmed the finding that the L1 is usually felt to be more
powerful than the L2, but this did not automatically translate into a preference
for the L1. Longer stretches of time in the L2 culture are linked to a gradual
shift in linguistic practices and perceptions. Participants reported that their
multilingualism and multiculturalism gave them a sense of empowerment and
a feeling of freedom
Assumptions behind grammatical approaches to code-switching: when the blueprint is a red herring
Many of the so-called 'grammars' of code-switching are based on various underlying assumptions, e.g. that informal speech can be adequately or appropriately described in terms of "grammar"; that deep, rather than surface, structures are involved in code-switching; that one 'language' is the 'base' or 'matrix'; and that constraints derived from existing data are universal and predictive. We question these assumptions on several grounds. First, 'grammar' is arguably distinct from the processes driving speech production. Second, the role of grammar is mediated by the variable, poly-idiolectal repertoires of bilingual speakers. Third, in many instances of CS the notion of a 'base' system is either irrelevant, or fails to explain the facts. Fourth, sociolinguistic factors frequently override 'grammatical' factors, as evidence from the same language pairs in different settings has shown. No principles proposed to date account for all the facts, and it seems unlikely that 'grammar', as conventionally conceived, can provide definitive answers. We conclude that rather than seeking universal, predictive grammatical rules, research on CS should focus on the variability of bilingual grammars
The Use of Prepositions in English as Lingua Franca Interactions: Corpus IST-Erasmus
The growth of English into a lingua franca has inevitably created linguistic deviations and innovations in the use of English. These emerging uses, which result from the needs and preferences of speakers whose mother tongues are all different, can be broadly identified as lexico-grammatical and pronunciation features, and they compose one of the main arteries of study in English as lingua franca communication. Efforts to investigate shared and systematized uses of English as a lingua franca (ELF) and their possible codification have formed the focus of considerable research in the field. This paper introduces an ELF corpus, Corpus IST-Erasmus, which was compiled as part of a PhD study to investigate the lexico-grammar of ELF interactions. The corpus consists of 10 hours 47 minutes of recorded speech and 93,913 words of transcribed data. It was compiled from 54 speech events: 29 interviews and 25 focus group meetings. The participants of the study are 79 incoming Erasmus students, representing 24 first languages. These languages are Arabic, Azerbaijan, Basque, Bulgarian, Cantonese, Chinese, Czech, Danish, Dutch, French, Galician, German, Greek, Italian, Korean, Lithuanian, Mandarin Chinese, Polish, Portuguese, Slovak, Spanish, Suriname, Turkish, and Ukrainian. The focus of this paper is to examine whether there are variations from standard English as Native Language (ENL) forms with respect to the use of prepositions in spoken ELF interactions, as have been outlined in ELF research (Seidlhofer, 2004). The paper also aims to present the emerging patterns in the use of prepositions and suggest implications for an ELF-aware pedagogy in English Language Teaching. Although there is an increase in the number of empirical studies, there is still a gap in the description of ELF discourse. In order to fully identify the characteristics of ELF, more corpus studies should be conducted.
These studies will provide data for ELT professionals in designing an ELF-oriented pedagogy and materials. Besides, there is limited research on the English use of international students, and none in the Turkish setting. The present research, therefore, aims to fill this gap in ELF research. Keywords: English as a Lingua Franca, ELF interactions, Corpus IST-Erasmus, ELF lexico-gramma
Explaining Russian-German code-mixing
The study of grammatical variation in language mixing has been at the core of research into bilingual language practices. Although various motivations have been proposed in the literature to account for possible mixing patterns, some of them are either controversial or remain untested. Little is still known about whether and how frequency of use of linguistic elements can contribute to the patterning of bilingual talk. This book is the first to systematically explore usage frequency as a factor in a corpus of bilingual speech. The two aims are (i) to describe and analyze the variation in mixing patterns in the speech of Russia German adolescents and young adults in Germany, and (ii) to propose and test usage-based explanations of variation in mixing patterns in three morphosyntactic contexts: the adjective-modified noun phrase, the prepositional phrase, and the plural marking of German noun insertions in bilingual sentences. In these contexts, German noun insertions combine with either Russian or German words and grammatical markers, thus yielding mixed bilingual and German monolingual constituents in otherwise Russian sentences, the latter also labelled as embedded-language islands. The results suggest that the frequency with which words are used together mediates the distribution of mixing patterns in each of the examined contexts. The differing impacts of co-occurrence frequency are attributed to the distributional and semantic specifics of the analyzed morphosyntactic configurations. Lexical frequency has been found to be another important determinant in this variation. Other factors include recency, or lexical priming, in discourse in the case of prepositional phrases, and phonological and structural similarities and differences in the inflectional systems of the contact languages in the case of plural marking