8 research outputs found

    LUX-ASR: Building an ASR system for the Luxembourgish language

    We present a first system for automatic speech recognition (ASR) for the low-resource language Luxembourgish. By applying transfer learning, we were able to fine-tune Meta’s wav2vec2-xls-r-300m checkpoint with 35 hours of labeled Luxembourgish speech data. The best word error rate obtained is 14.47
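    For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and the toy sentences are illustrative assumptions, not the evaluation code or data used by the LUX-ASR authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy example (hypothetical sentences): one substitution and one
    # deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

    Because insertions are counted, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.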

    Adaptation of speech recognition systems to selected real-world deployment conditions

    This habilitation thesis deals with the adaptation of automatic speech recognition (ASR) systems to selected real-world deployment conditions. It is presented as a collection of twelve articles on this task, of which I am the main author or a co-author. They were published during my work on several consecutive research projects, in which I participated as a member of the research team as well as the investigator or a co-investigator. The articles can be divided into three main groups according to their topics. What they have in common is the effort to adapt a particular ASR system to a specific factor or deployment condition that affects its function or accuracy. The first group of articles focuses on unsupervised speaker adaptation, where the ASR system adapts its parameters to the specific voice characteristics of one particular speaker. The second part deals with a) methods that allow the system to identify non-speech events in the input, and b) the related task of recognizing speech with non-speech events, particularly music, in the background. Finally, the third part is devoted to methods that allow the transcription of an audio signal containing multilingual utterances. It includes a) approaches for adapting an existing recognition system to a new language and b) methods for identifying the language from the audio signal. The two identification tasks are investigated in particular under the demanding and less explored frame-wise scenario, which is the only one suitable for processing on-line data streams.

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.

    Unmet goals of tracking: within-track heterogeneity of students' expectations for

    Educational systems are often characterized by some form(s) of ability grouping, like tracking. Although substantial variation in the implementation of these practices exists, the aim is always to improve teaching efficiency by creating homogeneous groups of students in terms of capabilities and performances as well as expected pathways. If students’ expected pathways (university, graduate school, or working) are in line with the goals of tracking, one might presume that these expectations are rather homogeneous within tracks and heterogeneous between tracks. In Flanders (the northern region of Belgium), the educational system consists of four tracks. Many students start out in the most prestigious, academic track. If they fail to gain the necessary credentials, they move to the less esteemed technical and vocational tracks. Therefore, the educational system has been called a 'cascade system'. We presume that this cascade system creates homogeneous expectations in the academic track, but heterogeneous expectations in the technical and vocational tracks. We use data from the International Study of City Youth (ISCY), gathered during the 2013-2014 school year from 2354 tenth-grade pupils across 30 secondary schools in the city of Ghent, Flanders. Preliminary results suggest that the technical and vocational tracks show more heterogeneity in students’ expectations than the academic track. If tracking does not fulfill the desired goals in some tracks, tracking practices should be questioned, as tracking occurs along social and ethnic lines, causing social inequality.

    Epidemiology of Injury in English Women's Super League Football: A Cohort Study

    INTRODUCTION: The epidemiology of injury in male professional football has been well documented (Ekstrand, Hägglund, & Waldén, 2011) and used as a basis to understand injury trends for a number of years. The prevalence and incidence of injuries occurring in Women's Super League football is unknown. The aim of this study is to estimate the prevalence and incidence of injury in an English Women’s Super League football squad. METHODS: Following ethical approval from Leeds Beckett University, players (n = 25) signed to a Women’s Super League football club provided written informed consent to complete a self-administered injury survey. Measures of exposure, injury and performance over a 12-month period were gathered. Participants were classified as injured if they reported a football injury that required medical attention or withdrawal from participation for one day or more. Injuries were categorised as either traumatic or overuse, and as a new injury and/or a re-injury of the same anatomical site. RESULTS: 43 injuries, including re-injuries, were reported by the 25 participants, providing a clinical incidence of 1.72 injuries per player. Total incidence of injury was 10.8/1000 h (95% CI: 7.5 to 14.03). Participants were at higher risk of injury during a match compared with training (32.4 (95% CI: 15.6 to 48.4) vs 8.0 (95% CI: 5.0 to 10.85)/1000 hours, p 28 days) of which there were three non-contact anterior cruciate ligament (ACL) injuries. The epidemiological incidence proportion was 0.80 (95% CI: 0.64 to 0.95) and the average probability that any player on this team will sustain at least one injury was 80.0% (95% CI: 64.3% to 95.6%). CONCLUSION: This is the first report capturing exposure and injury incidence by anatomical site from a cohort of English players and is comparable to that found in Europe (6.3/1000 h (95% CI 5.4 to 7.36) Larruskain et al 2017). The number of ACL injuries highlights a potential injury burden for a squad of this size. Multi-site prospective investigations into the incidence and prevalence of injury in women’s football are required.
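    The headline figures in the abstract above can be checked with simple arithmetic. The sketch below reproduces the clinical incidence from the reported counts and, as a back-calculation that the abstract itself does not state, estimates the total exposure hours and the number of injured players implied by the reported rates. The function names are hypothetical, and the derived values are approximations based only on the rounded figures in the abstract.

```python
def clinical_incidence(injuries: int, players: int) -> float:
    """Injuries per player over the observation period."""
    return injuries / players


def implied_exposure_hours(injuries: int, rate_per_1000h: float) -> float:
    """Exposure hours implied by an incidence rate given per 1000 hours."""
    return injuries / rate_per_1000h * 1000


def implied_injured_players(incidence_proportion: float, players: int) -> float:
    """Number of players with at least one injury implied by the proportion."""
    return incidence_proportion * players


if __name__ == "__main__":
    # Figures reported in the abstract.
    injuries, players = 43, 25

    # 43 / 25, the reported 1.72 injuries per player.
    print(round(clinical_incidence(injuries, players), 2))

    # Back-calculated (assumption): roughly 3981 hours of total exposure
    # would be consistent with 43 injuries at 10.8 per 1000 h.
    print(round(implied_exposure_hours(injuries, 10.8)))

    # Back-calculated (assumption): an incidence proportion of 0.80 in a
    # squad of 25 corresponds to about 20 players with at least one injury.
    print(round(implied_injured_players(0.80, players)))
```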

    Esa 12th Conference: Differences, Inequalities and Sociological Imagination: Abstract Book


    Study on media plurality and diversity online

    Published online: 16 September 2022
    Corporate authors: Centre on Media Pluralism and Media Freedom (CMPF), CiTiP (Centre for Information Technology and Intellectual Property) of KU Leuven, Directorate-General for Communications Networks, Content and Technology (European Commission), Institute for Information Law of the University of Amsterdam (IViR/UvA), Vrije Universiteit Brussel (Studies in Media Innovation and Technology VUB-SMIT)
    Personal authors: Parcu, Pier Luigi; Brogi, Elda; Verza, Sofia; Da Costa Leite Borges, Danielle; Carlini, Roberta; Trevisan, Matteo; Tambini, Damian; Mazzoli, Eleonora Maria; Klimkiewicz, Beata; Broughton Micova, Sally; Petković, Brankica; Rossi, Maria Alessandra; Stasi, Maria Luisa; Valcke, Peggy; Lambrecht, Ingrid; Irion, Kristina; Fahy, Ronan; Idiz, Daphne; Meiring, Arlette; Seipp, Theresa; Poort, Joost; Ranaivoson, Heritiana; Afilipoaie, Adelaida; Domazetovikj, Nino
    The Study on Media Plurality and Diversity Online investigates the value of safeguarding media pluralism and diversity online, focusing on (i) the prominence and discoverability of general interest content and services, and on (ii) market plurality and the concentration of economic resources. With a focus on Europe, the project is funded by a tender from the European Commission to produce a study on Media Plurality and Diversity Online and involves four partner universities: CMPF (EUI); CiTiP (Centre for Information Technology and Intellectual Property) of KU Leuven; the Institute for Information Law of the University of Amsterdam (IViR/UvA); imec-SMIT-Vrije Universiteit Brussel. The purpose of the assignment was to describe, analyse and evaluate the existing regulatory and business practices in the two areas mentioned above, and finally to elaborate some policy recommendations. Data were collected from the database of the Media Pluralism Monitor (CMPF) and through desk research, online consultations and interviews with stakeholders. The contractor was able to call on a network of national experts across the Member States to support this work.