8 research outputs found
LUX-ASR: Building an ASR system for the Luxembourgish language
We present a first system for automatic speech recognition
(ASR) for the low-resource language Luxembourgish. By
applying transfer learning, we were able to fine-tune Meta's
wav2vec2-xls-r-300m checkpoint with 35 hours of labeled
Luxembourgish speech data. The best word error rate obtained is 14.47
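For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the LUX-ASR evaluation code, and the function name and example strings are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion over four reference words.
print(word_error_rate("a b c d", "a x c"))
```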
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented in the form of a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published during my work on several consecutive
research projects, in which I participated
as a member of the research team as well as the investigator or a
co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two mentioned identification tasks are in particular investigated
under the demanding and less explored frame-wise scenario,
which is the only one suitable for processing of on-line data streams
Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings
Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
* We release "language packs" for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
* We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
* We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
* We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations.
We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available
Unmet goals of tracking: within-track heterogeneity of students' expectations for
Educational systems are often characterized by some form(s) of ability grouping, like tracking. Although substantial variation in the implementation of these practices exists, it is always the aim to improve teaching efficiency by creating homogeneous groups of students in terms of capabilities and performances as well as expected pathways. If students' expected pathways (university, graduate school, or working) are in line with the goals of tracking, one might presume that these expectations are rather homogeneous within tracks and heterogeneous between tracks. In Flanders (the northern region of Belgium), the educational system consists of four tracks. Many students start out in the most prestigious, academic track. If they fail to gain the necessary credentials, they move to the less esteemed technical and vocational tracks. Therefore, the educational system has been called a 'cascade system'. We presume that this cascade system creates homogeneous expectations in the academic track, though heterogeneous expectations in the technical and vocational tracks. We use data from the International Study of City Youth (ISCY), gathered during the 2013-2014 school year from 2354 pupils of the tenth grade across 30 secondary schools in the city of Ghent, Flanders. Preliminary results suggest that the technical and vocational tracks show more heterogeneity in students' expectations than the academic track. If tracking does not fulfill the desired goals in some tracks, tracking practices should be questioned as tracking occurs along social and ethnic lines, causing social inequality
Epidemiology of Injury in English Women's Super league Football: A Cohort Study
INTRODUCTION: The epidemiology of injury in male professional football has been well documented (Ekstrand, Hägglund, & Waldén, 2011) and used as a basis to understand injury trends for a number of years. The prevalence and incidence of injuries occurring in women's super league football is unknown. The aim of this study is to estimate the prevalence and incidence of injury in an English Super League Women's Football squad. METHODS: Following ethical approval from Leeds Beckett University, players (n = 25) signed to a Women's Super League Football club provided written informed consent to complete a self-administered injury survey. Measures of exposure, injury and performance over a 12-month period were gathered. Participants were classified as injured if they reported a football injury that required medical attention or withdrawal from participation for one day or more. Injuries were categorised as either traumatic or overuse and whether the injury was a new injury and/or re-injury of the same anatomical site. RESULTS: 43 injuries, including re-injury, were reported by the 25 participants, providing a clinical incidence of 1.72 injuries per player. Total incidence of injury was 10.8/1000 h (95% CI: 7.5 to 14.03). Participants were at higher risk of injury during a match compared with training (32.4 (95% CI: 15.6 to 48.4) vs 8.0 (95% CI: 5.0 to 10.85)/1000 hours, p 28 days) of which there were three non-contact anterior cruciate ligament (ACL) injuries. The epidemiological incidence proportion was 0.80 (95% CI: 0.64 to 0.95) and the average probability that any player on this team will sustain at least one injury was 80.0% (95% CI: 64.3% to 95.6%). CONCLUSION: This is the first report capturing exposure and injury incidence by anatomical site from a cohort of English players and is comparable to that found in Europe (6.3/1000 h (95% CI 5.4 to 7.36) Larruskain et al 2017). The number of ACL injuries highlights a potential injury burden for a squad of this size.
Multi-site prospective investigations into the incidence and prevalence of injury in women's football are required
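The two incidence measures quoted in the abstract above follow simple ratios, sketched here for orientation. Clinical incidence is injuries divided by players, which reproduces the reported 1.72 from 43 injuries and 25 players. The rate per 1000 hours needs the squad's total exposure hours, which the abstract does not state, so the exposure figure in the example is a hypothetical placeholder and the function names are likewise assumptions.

```python
def clinical_incidence(injuries: int, players: int) -> float:
    """Injuries per player over the observation period."""
    return injuries / players


def incidence_per_1000_hours(injuries: int, exposure_hours: float) -> float:
    """Injuries per 1000 hours of exposure (match plus training)."""
    return 1000.0 * injuries / exposure_hours


# Figures taken from the abstract: 43 injuries among 25 players.
print(round(clinical_incidence(43, 25), 2))

# Hypothetical exposure total, NOT reported in the abstract.
assumed_exposure_hours = 4000.0
print(incidence_per_1000_hours(43, assumed_exposure_hours))
```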
Esa 12th Conference: Differences, Inequalities and Sociological Imagination: Abstract Book
Study on media plurality and diversity online
Published online: 16 September 2022
Corporate authors: Centre on Media Pluralism and Media Freedom (CMPF), CiTiP (Centre for Information Technology and Intellectual Property) of KU Leuven, Directorate-General for Communications Networks, Content and Technology (European Commission), Institute for Information Law of the University of Amsterdam (IViR/UvA), Vrije Universiteit Brussel (Studies in Media Innovation and Technology VUB-SMIT)
Personal authors: Parcu, Pier Luigi; Brogi, Elda; Verza, Sofia; Da Costa Leite Borges, Danielle; Carlini, Roberta; Trevisan, Matteo; Tambini, Damian; Mazzoli, Eleonora Maria; Klimkiewicz, Beata; Broughton Micova, Sally; Petković, Brankica; Rossi, Maria Alessandra; Stasi, Maria Luisa; Valcke, Peggy; Lambrecht, Ingrid; Irion, Kristina; Fahy, Ronan; Idiz, Daphne; Meiring, Arlette; Seipp, Theresa; Poort, Joost; Ranaivoson, Heritiana; Afilipoaie, Adelaida; Domazetovikj, Nino
The Study on Media Plurality and Diversity Online investigates the value of safeguarding media pluralism and diversity online, focusing on (i) the prominence and discoverability of general interest content and services, and on (ii) market plurality and the concentration of economic resources. With a focus on Europe, the project is funded by a tender from the European Commission to produce a study on Media Plurality and Diversity Online and involves four partner universities: CMPF (EUI); CiTiP (Centre for Information Technology and Intellectual Property) of KU Leuven; the Institute for Information Law of the University of Amsterdam (IViR/UvA); imec-SMIT-Vrije Universiteit Brussel. The purpose of the assignment was to describe, analyse and evaluate the existing regulatory and business practices in the two areas mentioned above, and finally to elaborate some policy recommendations. Data were collected from the database of the Media Pluralism Monitor (CMPF) and through desk research, online consultations and interviews with stakeholders.
The contractor was able to call on a network of national experts across the Member States to support this work