2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert a pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
Comment: Accepted to ACL 2020
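The one-to-many mapping problem this abstract describes can be illustrated with a toy sketch. This is not the authors' 2kenize code: the mapping table, the stand-in language model, and its bigram scores below are invented for illustration, but the idea is the same: enumerate candidate traditional-script conversions and pick the one a target-side language model prefers.

```python
# Hedged illustration (not the authors' 2kenize implementation):
# disambiguating one-to-many simplified->traditional character mappings
# by scoring each candidate conversion with a target-side language model.
from itertools import product

# Toy mapping table: a simplified character may map to several traditional ones.
SC_TO_TC = {
    "发": ["發", "髮"],   # "emit/issue" vs. "hair" share one simplified form
    "头": ["頭"],
    "理": ["理"],
}

def lm_score(seq: str) -> float:
    """Stand-in for a real traditional-Chinese language model score."""
    # Hypothetical bigram preferences, chosen for this toy example only.
    preferred = {"理髮": 2.0, "頭髮": 2.0}
    return sum(preferred.get(seq[i:i + 2], 0.0) for i in range(len(seq) - 1))

def convert(simplified: str) -> str:
    """Return the highest-scoring traditional-script candidate."""
    candidates = product(*(SC_TO_TC.get(ch, [ch]) for ch in simplified))
    return max(("".join(c) for c in candidates), key=lm_score)

print(convert("理发"))  # -> 理髮 ("haircut"), not the wrong 理發
```

A character-by-character lookup cannot make this choice; scoring whole candidate sequences is what lets context decide between 發 and 髮.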
Code-Mixing and Mixed Verbs in Cantonese-English Bilingual Children: Input and Innovation
The Sociolinguistics of Code-switching in Hong Kong’s Digital Landscape: A Mixed-Methods Exploration of Cantonese-English Alternation Patterns on WhatsApp
This paper examines the prevalence of Cantonese-English code-mixing in Hong Kong through an under-researched digital medium. Prior research on this code-alternation practice has often been limited to exploring either the social or the linguistic constraints of code-switching in spoken or written communication. Our study takes a holistic approach to analyzing code-switching in a hybrid medium that exhibits features of both spoken and written discourse. We specifically analyze the code-switching patterns of 24 undergraduates from a Hong Kong university on WhatsApp and examine how both social and linguistic factors potentially constrain these patterns. Utilizing a self-compiled sociolinguistic corpus as well as survey data, we discovered that those who identified as male, studied English, and had an English medium-of-instruction (EMI) background tended to avoid intra-clausal code-switching between Cantonese and English. Responses to the open-ended questions revealed that many of our participants used code-switching as a means to fill conceptual gaps, engage in socialization (e.g., to strengthen solidarity or make their speech sound more casual and natural), and construct bilingual and Hongkonger identities. Our findings shed light on at least some of the locally embedded social meanings of this linguistic practice in a digital context.
Characteristics of Vietnamese lexis of Vietnamese Australian immigrants
The Vietnamese of Australian communities (VAC) still maintains many obsolete expressions originating from and related to the political institutions of the pre-1975 Southern Vietnamese government. In addition, VAC has adopted English loanwords (ELs) through close contact with Australian English and uses them extensively to fill gaps in vocabulary. English loanwords have been borrowed not only in their original forms but also in nativised forms, through the mechanisms of loanword adaptation and loan translation. Moreover, hybridised expressions have been coined by Vietnamese Australian émigrés by compounding an English item with a Vietnamese item (or vice versa), a process of loan blending.
Assessing language dominance in bilingual acquisition: a case for mean length utterance differentials
The notion of language dominance is often defined in terms of proficiency. We distinguish dominance, as a property of the bilingual mind and a concept of language knowledge, from proficiency, as a concept of language use. We discuss ways in which language dominance may be assessed, with a focus on measures of mean length of utterance (MLU). Comparison of MLU in the child's 2 languages is subject to questions of comparability across languages. Using the Hong Kong Bilingual corpus of Cantonese–English children's development, we show how MLU differentials can be a viable measure of dominance that captures asymmetrical development where there is an imbalance between the child's 2 languages. The directionality of syntactic transfer goes primarily from the language with higher MLU value to the language with lower MLU value, and the MLU differential matches the pervasiveness of transfer effects, as in the case of null objects discussed here: The greater the differential, the more frequent the occurrence of null objects. Cantonese-dominant children with a larger MLU differential use null objects more frequently than those with a lower MLU differential. In our case studies, MLU differentials also matched with language preferences and silent periods but did not predict the directionality of code-mixing.
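The MLU-differential measure described above can be sketched in a few lines. Whitespace tokenisation is a simplifying assumption here: MLU is typically computed over morphemes, and the toy utterances below are invented, not drawn from the Hong Kong bilingual corpus.

```python
# Hedged sketch of an MLU differential: mean length of utterance in each
# of a bilingual child's languages, then the difference between the two.
def mlu(utterances):
    """Mean length of utterance, in whitespace tokens (proxy for morphemes)."""
    return sum(len(u.split()) for u in utterances) / len(utterances)

# Invented toy samples, pre-segmented so that split() approximates morphemes.
cantonese = ["我 食 咗 飯 喇", "佢 去 咗 學校", "你 飲 唔 飲 茶"]
english = ["eat rice", "go school", "drink tea no"]

differential = mlu(cantonese) - mlu(english)
# A positive differential would indicate Cantonese dominance in this toy sample.
print(round(differential, 2))
```

On real transcripts the same comparison would be run per recording session, so that the differential can be tracked against transfer phenomena such as null objects over developmental time.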
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study of 68 existing CSW data sets across language pairs at the
collection and preparation (e.g. transcription and annotation) stages. This
in-depth analysis reveals that (a) most CSW data involves English,
ignoring other language pairs/tuples, and (b) there are flaws in
representativeness at the data collection and preparation stages because
location-based, socio-demographic, and register variation in CSW is ignored. In
addition, a lack of clarity on the data selection and filtering stages
overshadows the representativeness of CSW data sets. We conclude by providing a
short checklist to improve representativeness for forthcoming studies involving
CSW data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 proceedings)
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
Non-availability of annotated code-mixed data for a low-resourced language like
Tamil also adds difficulty to this problem. To overcome this, we created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
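The abstract reports inter-annotator agreement; one common coefficient for categorical sentiment labels is Cohen's kappa, sketched below. This is an illustration only: the paper may report a different agreement measure, and the label sequences here are invented.

```python
# Hedged sketch: Cohen's kappa for two annotators assigning categorical
# sentiment labels, as agreement beyond what chance label frequencies predict.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented annotations for six comments, three sentiment classes.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neu", "pos", "neg"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # -> 0.739
```

For a corpus with more than two annotators per item, a multi-rater coefficient such as Fleiss' kappa or Krippendorff's alpha would be the usual choice instead.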
- …