11,808 research outputs found

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Full text link
    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 2020
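    The abstract above describes resolving one-to-many simplified-to-traditional mappings by scoring candidate output sequences with language models. The sketch below is not the authors' 2kenize model; it only illustrates that core ambiguity problem, using an invented mapping table and invented bigram scores.

```python
# Minimal sketch of the one-to-many ambiguity described in the abstract: a
# simplified character can map to several traditional characters, and a
# language model over candidate outputs can pick the most plausible one.
# NOT the 2kenize model; mapping table and bigram scores are toy values.

from itertools import product
import math

# One-to-many simplified -> traditional candidates (tiny excerpt).
S2T = {
    "头": ["頭"],
    "发": ["發", "髮"],   # 发 is ambiguous: 發 (emit/develop) vs 髮 (hair)
    "展": ["展"],
}

# Toy bigram log-probabilities over traditional text (hypothetical values).
BIGRAM_LOGP = {
    ("頭", "髮"): math.log(0.9),
    ("頭", "發"): math.log(0.01),
    ("發", "展"): math.log(0.8),
    ("髮", "展"): math.log(0.01),
}
DEFAULT_LOGP = math.log(1e-4)

def convert(simplified: str) -> str:
    """Pick the candidate traditional sequence with the highest bigram score."""
    candidates = product(*(S2T.get(ch, [ch]) for ch in simplified))
    def score(seq):
        return sum(BIGRAM_LOGP.get(pair, DEFAULT_LOGP) for pair in zip(seq, seq[1:]))
    return "".join(max(candidates, key=score))

print(convert("头发"))  # -> 頭髮 (hair)
print(convert("发展"))  # -> 發展 (development)
```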

    The Sociolinguistics of Code-switching in Hong Kong’s Digital Landscape: A Mixed-Methods Exploration of Cantonese-English Alternation Patterns on WhatsApp

    Get PDF
    This paper examines the prevalence of Cantonese-English code-mixing in Hong Kong through an under-researched digital medium. Prior research on this code-alternation practice has often been limited to exploring either the social or linguistic constraints of code-switching in spoken or written communication. Our study takes a holistic approach to analyzing code-switching in a hybrid medium that exhibits features of both spoken and written discourse. We specifically analyze the code-switching patterns of 24 undergraduates from a Hong Kong university on WhatsApp and examine how both social and linguistic factors potentially constrain these patterns. Utilizing a self-compiled sociolinguistic corpus as well as survey data, we discovered that those who identified as male, studied English, and had an English medium-of-instruction (EMI) background tended to avoid intra-clausal code-switching between Cantonese and English. Responses to the open-ended questions revealed that many of our participants used code-switching as a means to fill conceptual gaps, engage in socialization (e.g., to strengthen solidarity or make their speech sound more casual and natural), and construct bilingual and Hongkonger identities. Our findings shed light on at least some of the locally embedded social meanings of this linguistic practice in a digital context.

    Characteristics of Vietnamese lexis of Vietnamese Australian immigrants

    Get PDF
    The Vietnamese of Australian communities (VAC) still maintains many obsolete expressions originating from and related to the political institutions of the pre-1975 South Vietnamese government. In addition, VAC has adopted English loanwords (ELs) through close contact with Australian English and uses them extensively to fill gaps in vocabulary. English loanwords have not only been borrowed in their original forms but have also been nativized through the mechanisms of loanword adaptation and loan translation. Moreover, Vietnamese Australian émigrés have coined hybridised expressions through loan blending, compounding an English item with a Vietnamese one or vice versa.

    Assessing language dominance in bilingual acquisition: a case for mean length utterance differentials

    Get PDF
    The notion of language dominance is often defined in terms of proficiency. We distinguish dominance, as a property of the bilingual mind and a concept of language knowledge, from proficiency, as a concept of language use. We discuss ways in which language dominance may be assessed, with a focus on measures of mean length of utterance (MLU). Comparison of MLU in the child's 2 languages is subject to questions of comparability across languages. Using the Hong Kong Bilingual corpus of Cantonese–English children's development, we show how MLU differentials can be a viable measure of dominance that captures asymmetrical development where there is an imbalance between the child's 2 languages. The directionality of syntactic transfer goes primarily from the language with higher MLU value to the language with lower MLU value, and the MLU differential matches the pervasiveness of transfer effects, as in the case of null objects discussed here: The greater the differential, the more frequent the occurrence of null objects. Cantonese-dominant children with a larger MLU differential use null objects more frequently than those with a lower MLU differential. In our case studies, MLU differentials also matched with language preferences and silent periods but did not predict the directionality of code-mixing.
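    As a rough illustration of the MLU differential the abstract argues for, the sketch below computes a word-count approximation of MLU for each of a bilingual child's languages and takes the difference. Actual MLU is normally counted in morphemes over CHAT-format transcripts, and the sample utterances here are invented.

```python
# Hedged sketch of an MLU differential: mean length of utterance per language,
# with the difference used as a dominance indicator. Toy word-based version;
# real MLU is usually morpheme-based and computed from CHAT transcripts.

from statistics import mean

def mlu(utterances: list[str]) -> float:
    """Mean number of tokens per utterance (word-based approximation of MLU)."""
    return mean(len(u.split()) for u in utterances)

# Hypothetical per-language utterance lists for one child at one recording age.
cantonese_utterances = ["我 想 食 餅", "哥哥 攞 咗 本 書", "媽媽 去 咗 街 市 買 餸"]
english_utterances = ["want cookie", "daddy go", "me want that one"]

mlu_can = mlu(cantonese_utterances)
mlu_eng = mlu(english_utterances)
differential = mlu_can - mlu_eng   # positive -> Cantonese-dominant on this measure

print(f"MLU Cantonese: {mlu_can:.2f}")
print(f"MLU English:   {mlu_eng:.2f}")
print(f"MLU differential: {differential:+.2f}")
```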

    Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

    Full text link
    Multilingualism is widespread around the world, and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study of 68 existing CSW data sets across language pairs, focusing on their collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English, ignoring other language pairs/tuples, and b) the data collection and preparation stages have flaws in representativeness because location-based, socio-demographic, and register variation in CSW is ignored. In addition, a lack of clarity about the data selection and filtering stages obscures the representativeness of CSW data sets. We conclude by providing a short checklist to improve representativeness in forthcoming studies involving CSW data collection and preparation. Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 Proceedings)

    Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

    Get PDF
    Understanding the sentiment of a comment on a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they often mix more than one language, frequently written in non-native scripts. The non-availability of annotated code-mixed data for a low-resourced language like Tamil adds to the difficulty of this problem. To overcome this, we created a gold-standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
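    One step the abstract mentions is reporting inter-annotator agreement on the polarity labels. The sketch below computes Cohen's kappa for two hypothetical annotators from scratch; the paper's actual agreement statistic and number of annotators may differ, and the labels are invented.

```python
# Minimal sketch of measuring inter-annotator agreement on polarity labels with
# Cohen's kappa. The judgments below are invented for illustration only.

from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    expected = sum((dist_a[l] / n) * (dist_b[l] / n)
                   for l in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Hypothetical polarity judgments for ten code-mixed YouTube comments.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "pos", "neu", "pos"]
annotator_2 = ["pos", "neu", "neg", "neu", "pos", "neg", "pos", "pos", "neu", "pos"]

print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.3f}")
```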