Identifying and Modeling Code-Switched Language
Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during written or spoken communication. The importance of developing language technologies that are able to process code-switched language is immense, given the large populations that routinely code-switch. Current NLP and Speech models break down when used on code-switched data, interrupting the language processing pipeline in back-end systems and forcing users to communicate in ways which for them are unnatural.
There are four main challenges that arise in building code-switched models: lack of code-switched data on which to train generative language models; lack of multilingual language annotations on code-switched examples which are needed to train supervised models; little understanding of how to leverage monolingual and parallel resources to build better code-switched models; and finally, how to use these models to learn why and when code-switching happens across language pairs. In this thesis, I look into different aspects of these four challenges.
The first part of this thesis focuses on how to obtain reliable corpora of code-switched language. We collected a large corpus of code-switched language from social media using a combination of sets of anchor words that exist in one language and sentence-level language taggers. The newly obtained corpus is superior to other corpora collected via different strategies when it comes to the amount and type of bilingualism in it. It also helps train better language tagging models. We also proposed a new annotation scheme to obtain part-of-speech tags for code-switched English-Spanish language. The annotation scheme is composed of three different subtasks: automatic labeling, word-specific question labeling, and question-tree word labeling. The part-of-speech labels obtained for the Miami Bangor corpus of English-Spanish conversational speech show very high agreement and accuracy.
The second section of this thesis focuses on the tasks of part-of-speech tagging and language modeling. For the first task, we proposed a state-of-the-art approach to part-of-speech tagging of code-switched English-Spanish data based on recurrent neural networks. Our models were tested on the Miami Bangor corpus on the task of POS tagging alone, for which we achieved 96.34% accuracy, and joint part-of-speech and language ID tagging, which achieved similar POS tagging accuracy (96.39%) and very high language ID accuracy (98.78%).
For the task of language modeling, we first conducted an exhaustive analysis of the relationship between cognate words and code-switching. We then proposed a set of cognate-based features that helped improve language modeling performance by 12% relative points. Furthermore, we showed that these features can also be used across language pairs and still obtain performance improvements.
Finally, we tackled the question of how to use monolingual resources for code-switching models by pre-training state-of-the-art cross-lingual language models on large monolingual corpora and fine-tuning them on the tasks of language modeling and word-level language tagging on code-switched data. We obtained state-of-the-art results on both tasks.
Machine Translation on a parallel Code-Switched Corpus
Code-switching (CS) is the phenomenon that occurs when a speaker alternates between two or more languages within an utterance or discourse. In this work, we investigate the existence of code-switching in formal text, namely proceedings of multilingual institutions. Our study is carried out on the Arabic-English code-mixing in a parallel corpus extracted from official documents of the United Nations. We build a parallel code-switched corpus with two reference translations, one in pure Arabic and the other in pure English. We also carry out a human evaluation of this resource with the aim of using it to evaluate the translation of code-switched documents. To the best of our knowledge, this kind of corpus does not exist. The one we propose is unique. This paper examines several methods to translate a code-switched corpus: conventional statistical machine translation, end-to-end neural machine translation, and multitask learning.
Hansard as an Aid to Statutory Interpretation in Canadian Courts from 1999 to 2010
This thesis employs qualitative and quantitative methods to provide a comprehensive picture of the judicial use of Hansard as an extrinsic aid to statutory interpretation in the courts of Canada from 1999 to 2010. The qualitative portion of the thesis examines all Supreme Court of Canada judgments in 2010 that make reference to Hansard and Hansard-like materials. The findings are compared with the findings of Professor Stéphane Beaulac, who studied the phenomenon in 1999. The quantitative portion of the research examines the prevalence and distribution of judgments that make reference to Hansard in the Courts throughout Canada from 1999 to 2010.
Audiences, referees, and landscapes: Understanding the use of Māori and English in New Zealand dual language picturebooks through a sociolinguistic lens
When non-dominant perspectives are represented in children's literature, it is labelled multicultural, and this form of literature has much potential for altering existing power structures in society. Bishop (1990) first introduced the metaphor of multicultural children's literature offering the possibility of windows - an opportunity to see into others' worlds; mirrors - an opportunity to see your own world being reflected back; and glass sliding doors - an opportunity to step into a world through a book. However, to date, any exploration of the extent to which language diversity contributes to the representation of non-dominant perspectives in multicultural children's literature has been limited, and the use of sociolinguistic theories to frame and theorise such explorations almost non-existent.
The Quest for a User-Friendly Copyright Regime in Hong Kong
The quest for a user-friendly copyright regime began a decade ago when the Hong Kong government launched a public consultation on Copyright Protection in the Digital Environment in December 2006. Although this consultation initially sought to address Internet-related challenges, such as those caused by peer-to-peer file-sharing technology, the reform effort quickly evolved into a more comprehensive digital upgrade of the Hong Kong copyright regime. A decade later, however, Hong Kong still has not amended its Copyright Ordinance. Thus far, three consultation exercises have been launched, in December 2006, April 2008 and July 2013. Two bills have also been introduced, in June 2011 and June 2014. Because the latest bill lapsed at the end of the fifth term of the Legislative Council, which expired in July 2016, the Hong Kong government will have to submit a new bill to the legislature after the September 2016 elections to restart the upgrading effort. In the run-up to this third (and hopefully successful) bill, it will be timely to retrospectively examine the developments surrounding the Copyright (Amendment) Bill 2014, including some of the committee stage amendments moved by legislators. Written for a symposium on International and Comparative User Rights in the Digital Economy, this article recounts the origin and evolution of the Bill. The article also examines three proposals that the author either developed or was heavily involved in defending, in the author's capacity as a pro bono advisor to Internet user groups and, by extension, some pan-Democrat legislators. The first proposal concerned an exception for predominantly noncommercial user-generated content. The second proposal involved the addition of an open-ended, catch-all fair use provision to the new and existing fair dealing provisions. The final proposal called for the creation of a moratorium on lawsuits against individual Internet users based on noncommercial copyright infringement.
What is kept and what is lost without translation? A corpus-assisted discourse study of the European Parliament’s original and translated English
In July 2011, the European Parliament (EP) stopped providing a written translation of its proceedings. Some years later, it seems apposite to look back and ask: What is kept and what is lost without the EP translating? To answer this question, the present paper adopts the first (modern diachronic) corpus-assisted discourse analysis study (MD-CADS) carried out within translation studies by drawing on the discourse-historical approach (DHA) and corpus linguistics (CL) tools. Hence, along DHA lines, the paper proceeds from texture through strategies to content by focusing on CL key keywords and detailed consistency. It performs analysis upon the European Comparable and Parallel Corpus archive, compiled at the Universitat Jaume I (Spain). This study shows that MD-CADS is a potential source of data for triangulation with other, more qualitative, approaches.