20 research outputs found

Improving natural language processing for under-served languages through increased training data diversity

Author: Burchell Laurie Vear
Publication venue: The University of Edinburgh
Publication date: 21/10/2024

More and better data is often the most effective way to improve the quality of natural language processing (NLP), with the highest-performing applications requiring terabytes of data. However, most of the world's language varieties do not have anything like this amount of data available, limiting performance. This thesis aims to increase the diversity of training data for under-served language varieties as a means to improving downstream NLP applications. We take two broad approaches to increasing diversity in this thesis. We look firstly at diverse data augmentation, quantifying different types of induced diversity and how these affect downstream performance. Using neural machine translation as a specific application, we measure the diversity of different methods of generating back translation (BT), a popular data augmentation method. We find that some types of diversity are more important than others for downstream performance and make recommendations about how to make BT more effective. The second approach towards increasing training data diversity taken in this thesis is to improve language identification (LID), a fundamental part of any data-gathering pipeline. Given that poor LID is a significant impediment to diverse corpus creation, we curate an open dataset covering around 200 language varieties to facilitate further research. We demonstrate the quality of this dataset by using it to train a high-performing LID model and by carrying out further analysis into its capability. We use our LID dataset and model to explore two challenging problems for LID: identifying code-switched text and improving classification for Arabic dialects. We focus on making these challenges tractable for realistic corpus building, employing metrics which reflect downstream performance more faithfully. Our findings demonstrate the limitations of current LID techniques and lay the groundwork for future research in this area. 
A key finding throughout this thesis is that quality matters, particularly for under-served languages. Furthermore, even as corpus sizes grow, it is crucial not to lose sight of the quirks of individual languages. We provide resources and future research directions for increasing the diversity of useful training data for under-served languages, and in so doing facilitate the development of effective NLP applications for a wider variety of users.

Edinburgh Research Archive

Exploring Diversity in Back Translation for Low-Resource Machine Translation

Author: Birch-Mayne Alexandra
Burchell Laurie
Heafield Kenneth
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/06/2022

Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. We argue that the definitions and metrics used to quantify 'diversity' in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English↔Turkish and mid-resource English↔Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.
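The abstract above names nucleus sampling as the generation method that worked best for back translation. As a rough illustration only, the sketch below shows the general idea of nucleus (top-p) sampling over a single next-token distribution. The function name `nucleus_sample`, the threshold `p=0.9`, and the toy probabilities are assumptions made for this example and are not taken from the paper.

```python
import random


def nucleus_sample(probs, p=0.9, rng=random):
    """Sample an index from the smallest set of tokens whose
    cumulative probability reaches p (hypothetical helper)."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    nucleus, total = [], 0.0
    for i in ranked:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break

    # Draw one token from the nucleus, weighting by the original
    # probabilities (random.choices normalises the weights itself).
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Toy distribution over four tokens; with p=0.7 only the two most
# probable tokens can ever be drawn.
toy_probs = [0.5, 0.3, 0.15, 0.05]
sample = nucleus_sample(toy_probs, p=0.7)
```

Truncating the tail of the distribution in this way keeps sampled translations varied while excluding very unlikely tokens, which is one plausible reading of why the method yields diverse yet usable back translations.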

arXiv.org e-Print Archive

Edinburgh Research Explorer

The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)

Author: Burchell Laurie
Chen Pinzhen
Iyer Vivek
Kirefu Faheem
Publication date: 01/12/2022

The University of Edinburgh participated in the WMT22 shared task on code-mixed translation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1 we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.

Edinburgh Research Explorer

An Open Dataset and Model for Language Identification

Author: Birch Alexandra
Bogoychev Nikolay
Burchell Laurie
Heafield Kenneth
Publication date: 23/05/2023

Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model's performance, both in comparison to existing open models and by language class. Comment: To be published in ACL 202
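The two headline metrics in this abstract, macro-average F1 and false positive rate, can be illustrated with a small sketch. It assumes single-label classification and a per-language false positive rate averaged over languages; the paper's exact aggregation is not stated in the abstract, so that choice, the function name `macro_f1_and_fpr`, and the toy labels are assumptions.

```python
from collections import Counter


def macro_f1_and_fpr(gold, pred):
    """Return (macro-averaged F1, macro-averaged false positive rate)
    for two equal-length, non-empty label lists (hypothetical helper)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted language p wrongly
            fn[g] += 1  # missed the gold language g

    n = len(gold)
    f1s, fprs = [], []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        f1s.append(2 * tp[lab] / denom if denom else 0.0)
        # Negatives are the examples whose gold label is not `lab`.
        negatives = n - tp[lab] - fn[lab]
        fprs.append(fp[lab] / negatives if negatives else 0.0)

    return sum(f1s) / len(f1s), sum(fprs) / len(fprs)


# Toy example with two languages and one misclassified sentence.
gold = ["en", "en", "fr", "fr"]
pred = ["en", "fr", "fr", "fr"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
```

Macro-averaging gives every language equal weight regardless of how many test sentences it has, which is why it suits evaluations that care about lower-resource languages.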

arXiv.org e-Print Archive

Edinburgh Research Explorer

Querent Intent in Multi-Sentence Questions

Author: Burchell Laurie
Chi Jie
Hosking Tom
Markl Nina
Webber Bonnie
Publication date: 01/12/2020

Edinburgh Research Explorer

The University of Edinburgh's English-German and English-Hausa Submissions to the WMT21 News Translation Task

Author: Birch Alexandra
Bogoychev Nikolay
Burchell Laurie
Chen Pinzhen
Germann Ulrich
Heafield Kenneth
Helcl Jindřich
Miceli Barone Antonio Valerio
Waldendorf Jonas
Publication date: 01/11/2021

This paper presents the University of Edinburgh's constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.

Edinburgh Research Explorer

Efficacy and safety of a self-applied carrageenan-based gel to prevent human papillomavirus infection in sexually active young women (CATCH study): an exploratory phase IIB randomised, placebo-controlled trial

Author: Ann N. Burchell
Cassandra Laurie
Eduardo L. Franco
François Coutlée
Joseph E. Tota
Mariam El-Zein
Pierre-Paul Tellier
Sarah Botting-Provost
Publication venue: Elsevier BV
Publication date: 01/06/2023

Summary: Background: Carrageenan demonstrated potent anti-HPV (human papillomavirus) activity in vitro and in animal models. The Carrageenan-gel Against Transmission of Cervical Human papillomavirus trial’s interim analysis (n = 277) demonstrated a 36% protective effect of carrageenan against incident HPV infections. Herein, we report the trial’s final results. Methods: In this exploratory phase IIB randomised, placebo-controlled trial, we recruited healthy women aged ≥18 years primarily from health service clinics at two Canadian Universities in Montreal. Participants were randomised (1:1) by the study coordinator (using computer-assisted block randomisation with randomly variable block sizes up to a block size of eight) to a carrageenan-based or placebo gel to be self-applied every other day for the first month and before/after intercourse. Participants, study nurses, and laboratory technicians (HPV testing and genotyping) were blinded to group assignment. At each visit (months 0, 0.5, 1, 3, 6, 9, 12), participants provided questionnaire data and a self-collected vaginal sample (tested for 36 HPV types, Linear Array). The primary outcome was type-specific HPV incidence (occurring at any follow-up visit). Intention-to-treat analyses for incidence were conducted using Cox proportional hazards regression models, including participants with ≥2 visits. Safety analyses included all participants randomised. This trial is registered with the ISRCTN registry, ISRCTN96104919. Findings: Between Jan 16, 2013 and Sept 30, 2020, 461 participants (enrolled) were randomly assigned to the carrageenan (n = 227) or placebo (n = 234) groups. Incidence and safety analyses included 429 and 461 participants, respectively. We found 51.9% (108/208) of participants in carrageenan and 66.5% (147/221) in placebo arm acquired ≥1 HPV type (hazard ratio 0.63 [95% CI: 0.49–0.81], p = 0.0003). 
Adverse events were reported by 34.8% (79/227) and 39.7% (93/234) of participants in carrageenan and placebo arm (p = 0.27), respectively. Interpretation: Consistent with the interim analysis, use of a carrageenan-based gel compared to placebo resulted in a 37% reduction in risk of incident genital HPV infections in women with no increase in adverse events. A carrageenan-based gel may complement HPV vaccination. Funding: Canadian Institutes of Health Research, CarraShield Labs Inc.
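The percentages quoted in this abstract follow directly from the reported counts and hazard ratio. The short check below recomputes them; it reads the 37% figure as one minus the hazard ratio, which is an interpretation assumed here, and the helper name `percent` is hypothetical.

```python
def percent(numerator, denominator):
    """Percentage rounded to one decimal place (hypothetical helper)."""
    return round(100 * numerator / denominator, 1)


# Participants acquiring at least one HPV type, per arm.
carrageenan_incidence = percent(108, 208)
placebo_incidence = percent(147, 221)

# Participants reporting adverse events, per arm.
carrageenan_adverse = percent(79, 227)
placebo_adverse = percent(93, 234)

# Relative reduction in hazard implied by the reported hazard ratio.
hazard_ratio = 0.63
relative_reduction = round((1 - hazard_ratio) * 100)
```

Under these assumptions the values match the abstract's figures of 51.9%, 66.5%, 34.8%, 39.7%, and 37%.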

Directory of Open Access Journals

Gender, risk and micro-financial subjectivities

Author: Abu-Lughod
Brott
Bruni
Brush
Burchell
Burchell
Chase
Condon
Connell
Crandon-Malamud
De Beauvoir
De Goede
De Goede
de la Cadena
Douglas
Eversole
Foucault
Garland
Hale
Harris
Hart
Herbert Cheshire
Isbell
Kessler Harris
Koch
Kohl
Lagos
Larner
Larner
Larson
Larson
Lash
Laurie
Laurie
Lazar
Lemke
Lind
Lupton
Maclean
Maclean
Maclean
Marshall
Mayoux
McDowell
McDowell
Molyneux
Mosley
Murra
O’Malley
O’Malley
Paulson
Perreault
Postero
Rankin
Rhyne
Robinson
Rose
Rose
Roy
Townsend
Tulloch
Velasco
Weber
Weir
Westley
Young
Young
Yuval-Davis
Zaloom
Zoomers
Publication venue: Wiley
Publication date: 01/03/2012

This article analyses the gendered contradictions of microfinance's celebrated “double bottom line” of social and financial impact. The example of microfinance is used to illustrate the gendered and colonial constructions of “risk” and “responsibility” that underpin neoliberalism and its gendered paradoxes. After revisiting the discursive critique of these terms, I draw on how indigenous women participating in a microfinance institution in Bolivia describe their experience to suggest how gendered ideas of risk and responsibility are framing their negotiation of and resistance to the market. While the gendered and colonial construction of risk creates dynamics that perpetuate indigenous women's exclusion from the market, the terms of the resistance and use of the intervention also challenge feminist critiques of neoliberal governmentality developed mostly with reference to advanced modernity and welfare regimes.

Crossref

Birkbeck Institutional Research Online

King's Research Portal

A potential link between lateral semicircular canal orientation, head posture, and dietary habits in extant rhinos (Perissodactyla, Rhinocerotidae)

Author: Alexander
Antoine
Antoine
Antoine
Araújo
Bales
Becker
Becker
Beer
Benoit
Berlin
Brown
Burchell
Coutier
Desmarest
Dinerstein
Duijm
Ekdale
Fischer
Geist
Geraads
Geraads
Groves
Groves
Groves
Groves
Groves
Guérin
Harley
Heissig
Heissig
Heissig
Heissig
Hernesniemi
Hieronymus
Highstein
Hillman-Smith
Hoffmann
Hooijer
Hullar
Hyrtl
Kaiser
Kurtén
Laurie
Linnaeus
Marugán-Lobón
Matthew
Mendoza
Orlando
Osborn
Owen-Smith
Owen-Smith
Palmqvist
Rabbitt
Rookmaaker
Schellhorn
Schellhorn
Schellhorn
Schenkel
Sereno
Sody
Taylor
Tougard
von den Driesch
Willerslev
Witmer
Witmer
Zeuner
Zeuner
Publication venue: Wiley

Crossref