Improving natural language processing for under-served languages through increased training data diversity
More and better data is often the most effective way to improve the quality of natural language processing (NLP), with the highest-performing applications requiring terabytes of data. However, most of the world's language varieties do not have anything like this amount of data available, limiting performance. This thesis aims to increase the diversity of training data for under-served language varieties as a means to improving downstream NLP applications.
We take two broad approaches to increasing diversity in this thesis. We look firstly at diverse data augmentation, quantifying different types of induced diversity and how these affect downstream performance. Using neural machine translation as a specific application, we measure the diversity of different methods of generating back translation (BT), a popular data augmentation method. We find that some types of diversity are more important than others for downstream performance and make recommendations about how to make BT more effective.
The second approach towards increasing training data diversity taken in this thesis is to improve language identification (LID), a fundamental part of any data-gathering pipeline. Given that poor LID is a significant impediment to diverse corpus creation, we curate an open dataset covering around 200 language varieties to facilitate further research. We demonstrate the quality of this dataset by using it to train a high-performing LID model and by carrying out further analysis into its capability.
We use our LID dataset and model to explore two challenging problems for LID: identifying code-switched text and improving classification for Arabic dialects. We focus on making these challenges tractable for realistic corpus building, employing metrics which reflect downstream performance more faithfully. Our findings demonstrate the limitations of current LID techniques and lay the groundwork for future research in this area.
A key finding throughout this thesis is that quality matters, particularly for under-served languages. Furthermore, even as corpus sizes grow, it is crucial not to lose sight of the quirks of individual languages. We provide resources and future research directions for increasing the diversity of useful training data for under-served languages, and in so doing facilitate the development of effective NLP applications for a wider variety of users.
Exploring Diversity in Back Translation for Low-Resource Machine Translation
Back translation is one of the most widely used methods for improving the
performance of neural machine translation systems. Recent research has sought
to enhance the effectiveness of this method by increasing the 'diversity' of
the generated translations. We argue that the definitions and metrics used to
quantify 'diversity' in previous work have been insufficient. This work puts
forward a more nuanced framework for understanding diversity in training data,
splitting it into lexical diversity and syntactic diversity. We present novel
metrics for measuring these different aspects of diversity and carry out
empirical analysis into the effect of these types of diversity on final neural
machine translation model performance for low-resource
English↔Turkish and mid-resource
English↔Icelandic. Our findings show that generating back
translation using nucleus sampling results in higher final model performance,
and that this method of generation has high levels of both lexical and
syntactic diversity. We also find evidence that lexical diversity is more
important than syntactic diversity for back translation performance.
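The nucleus (top-p) sampling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy next-token distribution, and the choice of `top_p = 0.9` are all assumptions made for illustration. The idea is to keep the smallest set of most-probable tokens whose cumulative probability reaches `top_p`, renormalise, and sample from that set.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def nucleus_sample(probs, top_p, rng):
    """Draw one token from the renormalised nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution (invented for illustration).
toy_distribution = {"house": 0.5, "home": 0.3, "building": 0.15, "nest": 0.05}
nucleus = nucleus_filter(toy_distribution, top_p=0.9)
sample = nucleus_sample(toy_distribution, top_p=0.9, rng=random.Random(0))
```

With these toy values the least probable token falls outside the nucleus, so sampling stays varied without drawing from the unreliable tail of the distribution.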
The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)
The University of Edinburgh participated in the WMT22 shared task on code-mixed translation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1 we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.
An Open Dataset and Model for Language Identification
Language identification (LID) is a fundamental step in many natural language
processing pipelines. However, current LID systems are far from perfect,
particularly on lower-resource languages. We present a LID model which achieves
a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201
languages, outperforming previous work. We achieve this by training on a
curated dataset of monolingual data, the reliability of which we ensure by
auditing a sample from each source and each language manually. We make both the
model and the dataset available to the research community. Finally, we carry
out detailed analysis into our model's performance, both in comparison to
existing open models and by language class. Comment: To be published in ACL 202
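The two headline metrics in the abstract above, macro-average F1 and false positive rate, can be computed as in the following sketch. The abstract does not say exactly how the false positive rate was aggregated, so averaging it per class is an assumption here. The function names and the six toy labels are likewise invented for illustration.

```python
def per_class_counts(gold, pred, label):
    """Count true/false positives and negatives for one label (one-vs-rest)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    tn = len(gold) - tp - fp - fn
    return tp, fp, fn, tn


def macro_f1_and_fpr(gold, pred):
    """Return the unweighted mean over labels of F1 and of false positive rate."""
    labels = sorted(set(gold) | set(pred))
    f1_scores, fp_rates = [], []
    for label in labels:
        tp, fp, fn, tn = per_class_counts(gold, pred, label)
        f1_denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / f1_denominator if f1_denominator else 0.0)
        fp_rates.append(fp / (fp + tn) if (fp + tn) else 0.0)
    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)


# Toy gold and predicted language labels (invented for illustration).
gold = ["eng", "eng", "tur", "tur", "isl", "isl"]
pred = ["eng", "eng", "tur", "eng", "isl", "isl"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
```

Macro-averaging gives every language equal weight regardless of how many test sentences it has, which is why it suits evaluations that care about lower-resource languages.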
The University of Edinburgh's English-German and English-Hausa Submissions to the WMT21 News Translation Task
This paper presents the University of Edinburgh's constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.
Efficacy and safety of a self-applied carrageenan-based gel to prevent human papillomavirus infection in sexually active young women (CATCH study): an exploratory phase IIB randomised, placebo-controlled trial
Summary: Background: Carrageenan demonstrated potent anti-HPV (human papillomavirus) activity in vitro and in animal models. The Carrageenan-gel Against Transmission of Cervical Human papillomavirus trial's interim analysis (n = 277) demonstrated a 36% protective effect of carrageenan against incident HPV infections. Herein, we report the trial's final results. Methods: In this exploratory phase IIB randomised, placebo-controlled trial, we recruited healthy women aged ≥18 years primarily from health service clinics at two Canadian universities in Montreal. Participants were randomised (1:1) by the study coordinator (using computer-assisted block randomisation with randomly variable block sizes up to a block size of eight) to a carrageenan-based or placebo gel to be self-applied every other day for the first month and before/after intercourse. Participants, study nurses, and laboratory technicians (HPV testing and genotyping) were blinded to group assignment. At each visit (months 0, 0.5, 1, 3, 6, 9, 12), participants provided questionnaire data and a self-collected vaginal sample (tested for 36 HPV types, Linear Array). The primary outcome was type-specific HPV incidence (occurring at any follow-up visit). Intention-to-treat analyses for incidence were conducted using Cox proportional hazards regression models, including participants with ≥2 visits. Safety analyses included all participants randomised. This trial is registered with the ISRCTN registry, ISRCTN96104919. Findings: Between Jan 16, 2013 and Sept 30, 2020, 461 enrolled participants were randomly assigned to the carrageenan (n = 227) or placebo (n = 234) groups. Incidence and safety analyses included 429 and 461 participants, respectively. We found that 51.9% (108/208) of participants in the carrageenan arm and 66.5% (147/221) in the placebo arm acquired ≥1 HPV type (hazard ratio 0.63 [95% CI: 0.49–0.81], p = 0.0003).
Adverse events were reported by 34.8% (79/227) and 39.7% (93/234) of participants in the carrageenan and placebo arms, respectively (p = 0.27). Interpretation: Consistent with the interim analysis, use of a carrageenan-based gel compared to placebo resulted in a 37% reduction in risk of incident genital HPV infections in women with no increase in adverse events. A carrageenan-based gel may complement HPV vaccination. Funding: Canadian Institutes of Health Research, CarraShield Labs Inc.
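The percentages reported in the Findings can be checked against the raw counts, and the 37% figure follows from the hazard ratio if the relative reduction is read as one minus the hazard ratio. That reading is a common convention and an assumption here, not a statement of the trial's analysis code. The short sketch below redoes the arithmetic, and the helper name `percent` is illustrative.

```python
def percent(numerator, denominator):
    """Return a proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts and hazard ratio as reported in the abstract.
carrageenan_incidence = percent(108, 208)
placebo_incidence = percent(147, 221)
carrageenan_adverse = percent(79, 227)
placebo_adverse = percent(93, 234)

hazard_ratio = 0.63
# Relative reduction in hazard, expressed as a whole-number percentage.
risk_reduction = round((1 - hazard_ratio) * 100)
```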
Gender, risk and micro-financial subjectivities
This article analyses the gendered contradictions of microfinance's celebrated 'double bottom line' of social and financial impact. The example of microfinance is used to illustrate the gendered and colonial constructions of 'risk' and 'responsibility' that underpin neoliberalism and its gendered paradoxes. After revisiting the discursive critique of these terms, I draw on how indigenous women participating in a microfinance institution in Bolivia describe their experience to suggest how gendered ideas of risk and responsibility are framing their negotiation of and resistance to the market. While the gendered and colonial construction of risk creates dynamics that perpetuate indigenous women's exclusion from the market, the terms of the resistance and use of the intervention also challenge feminist critiques of neoliberal governmentality developed mostly with reference to advanced modernity and welfare regimes.