
    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

    Multilingualism is widespread around the world, and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study of 68 existing CSW data sets across language pairs, focusing on the collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that (a) most CSW data involves English, ignoring other language pairs/tuples, and (b) there are flaws in representativeness in the data collection and preparation stages because location-based, socio-demographic and register variation in CSW is ignored. In addition, a lack of clarity about the data selection and filtering stages obscures the representativeness of CSW data sets. We conclude by providing a short checklist to improve representativeness in forthcoming studies involving CSW data collection and preparation. Comment: Accepted for EMNLP'23 Findings (to appear in EMNLP'23 Proceedings)

    Self-reported use and perception of the L1 and L2 among maximally proficient bi- and multilinguals: a quantitative and qualitative investigation

    This study investigates language preferences and perceptions in the use of the native language (L1) and second language (L2) by 386 bi- and multilingual adults. Participants declared that they were maximally proficient in L1 and L2 and used both constantly. A quantitative analysis revealed that despite their maximal proficiency in the L1 and L2, participants preferred to use the L1 for communicating feelings or anger, swearing, addressing their children, performing mental calculations, and using inner speech. They also perceived their L1 to be emotionally stronger than their L2 and reported lower levels of communicative anxiety in their L1. An analysis of interview data from 20 participants confirmed these findings while adding nuance: differences in the use of the L1 and L2 and in perceptions of both are often subtle and context-specific. Participants confirmed the finding that the L1 is usually felt to be more powerful than the L2, but this did not automatically translate into a preference for the L1. Longer stretches of time in the L2 culture are linked to a gradual shift in linguistic practices and perceptions. Participants reported that their multilingualism and multiculturalism gave them a sense of empowerment and a feeling of freedom.

    Connected languages: Effects of intensifying contact between Turkish and Dutch


    Assumptions behind grammatical approaches to code-switching: when the blueprint is a red herring

    Many of the so-called ‘grammars’ of code-switching (CS) are based on various underlying assumptions, e.g. that informal speech can be adequately or appropriately described in terms of ‘grammar’; that deep, rather than surface, structures are involved in code-switching; that one ‘language’ is the ‘base’ or ‘matrix’; and that constraints derived from existing data are universal and predictive. We question these assumptions on several grounds. First, ‘grammar’ is arguably distinct from the processes driving speech production. Second, the role of grammar is mediated by the variable, poly-idiolectal repertoires of bilingual speakers. Third, in many instances of CS the notion of a ‘base’ system is either irrelevant or fails to explain the facts. Fourth, sociolinguistic factors frequently override ‘grammatical’ factors, as evidence from the same language pairs in different settings has shown. No principles proposed to date account for all the facts, and it seems unlikely that ‘grammar’, as conventionally conceived, can provide definitive answers. We conclude that rather than seeking universal, predictive grammatical rules, research on CS should focus on the variability of bilingual grammars.

    The Use of Prepositions in English as Lingua Franca Interactions: Corpus IST-Erasmus

    The growth of English into a lingua franca has inevitably created linguistic deviations and innovations in the use of English. These emerging uses, which result from the needs and preferences of speakers whose mother tongues all differ, can be broadly identified as lexico-grammatical and pronunciation features, and they constitute one of the main strands of study in English as a lingua franca (ELF) communication. Efforts to investigate shared and systematized uses of ELF and their possible codification have formed the focus of considerable research in the field. This paper introduces an ELF corpus, Corpus IST-Erasmus, which was compiled as part of a PhD study to investigate the lexico-grammar of ELF interactions. The corpus consists of 10 hours 47 minutes of recorded speech and 93,913 words of transcribed data. It was compiled from 54 speech events: 29 interviews and 25 focus group meetings. The participants of the study are 79 incoming Erasmus students, representing 24 first languages. These languages are Arabic, Azerbaijani, Basque, Bulgarian, Cantonese, Chinese, Czech, Danish, Dutch, French, Galician, German, Greek, Italian, Korean, Lithuanian, Mandarin Chinese, Polish, Portuguese, Slovak, Spanish, Suriname, Turkish, and Ukrainian. The focus of this paper is to examine whether there are variations from standard English as Native Language (ENL) forms with respect to the use of prepositions in spoken ELF interactions, as have been outlined in ELF research (Seidlhofer, 2004). The paper also aims to present the emerging patterns in the use of prepositions and suggest implications for an ELF-aware pedagogy in English Language Teaching (ELT). Although the number of empirical studies is increasing, there is still a gap in the description of ELF discourse. In order to fully identify the characteristics of ELF, more corpus studies should be conducted. These studies will provide data for ELT professionals in designing an ELF-oriented pedagogy and materials. Moreover, there is limited research on the English use of international students, and none in the Turkish setting. The present research, therefore, aims to fill this gap in ELF research. Keywords: English as a Lingua Franca, ELF interactions, Corpus IST-Erasmus, ELF lexico-grammar

    Explaining Russian-German code-mixing

    The study of grammatical variation in language mixing has been at the core of research into bilingual language practices. Although various motivations have been proposed in the literature to account for possible mixing patterns, some of them are either controversial or remain untested. Little is yet known about whether and how the frequency of use of linguistic elements can contribute to the patterning of bilingual talk. This book is the first to systematically explore the factor of usage frequency in a corpus of bilingual speech. The two aims are (i) to describe and analyze the variation in mixing patterns in the speech of Russia German adolescents and young adults in Germany, and (ii) to propose and test usage-based explanations of variation in mixing patterns in three morphosyntactic contexts: the adjective-modified noun phrase, the prepositional phrase, and the plural marking of German noun insertions in bilingual sentences. In these contexts, German noun insertions combine with either Russian or German words and grammatical markers, thus yielding mixed bilingual and German monolingual constituents in otherwise Russian sentences, the latter also labelled as embedded-language islands. The results suggest that the frequency with which words are used together mediates the distribution of mixing patterns in each of the examined contexts. The differing impacts of co-occurrence frequency are attributed to the distributional and semantic specifics of the analyzed morphosyntactic configurations. Lexical frequency has been found to be another important determinant in this variation. Other factors include recency, or lexical priming, in discourse in the case of prepositional phrases, and phonological and structural similarities and differences in the inflectional systems of the contact languages in the case of plural marking.

    Responses to Questionnaires of Young Turkish-German Bilinguals in Berlin: Their Thoughts about Language Choice
