12 research outputs found

Online Incremental Machine Translation

Author: Rottmann Kay
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2015

In this thesis we investigate the automatic improvement of statistical machine translation systems at runtime based on user feedback. We also propose a framework for using these algorithms in large-scale translation settings.

KITopen

Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model

Authors: Rottmann Kay; Vogel Stephan
Publication venue: University of Skövde
Publication date: 03/01/2024

In this paper we describe a word reordering strategy for statistical machine translation that reorders the source side based on part-of-speech (POS) information. Reordering rules are learned from the word-aligned corpus. Reordering is integrated into the decoding process by constructing a lattice that contains all word reorderings permitted by the reordering rules. Probabilities are assigned to the different reorderings, and monotone decoding is performed on this lattice. This reordering strategy is compared with our previous strategy, which considers all permutations within a sliding window. We extend the reordering rules by adding context information. Phrase translation pairs are learned from the original corpus and from a reordered source corpus to better capture the reordered word sequences at decoding time. Results are presented for English → Spanish and German ↔ English translations, using the European Parliament Plenary Sessions corpus.

KITopen

Tools for Collecting Speech Corpora via Mechanical-Turk

Authors: Eck Matthias; Lane Ian; Rottmann Kay; Waibel Alex
Publication venue: Association for Computational Linguistics
Publication date: 03/01/2024

When rapidly porting speech applications to new languages, one of the most difficult tasks is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typically trained on hundreds of hours of speech data. While pre-existing corpora exist for major languages, a sufficient amount of quality speech data is not available for most of the world's languages. Previous work has focused on the collection of translations and the transcription of audio via Mechanical-Turk mechanisms; in this paper we introduce two tools that enable the remote collection of speech data. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtaining usable data.

KITopen

The Massively Multilingual Natural Language Understanding 2022 (MMNLU-22) Workshop and Competition

Authors: FitzGerald Jack; Hench Christopher; Peris Charith; Rottmann Kay
Publication venue
Publication date: 12/12/2022

Despite recent progress in Natural Language Understanding (NLU), the creation of multilingual NLU systems remains a challenge. NLU systems are commonly limited to a subset of languages due to a lack of available data, and they often vary widely in performance. We launch a three-phase approach to address the limitations in NLU and help propel NLU technology to new heights. We release a 52-language dataset called the Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation, or MASSIVE, in an effort to address parallel data availability for voice assistants. We organize the Massively Multilingual NLU 2022 Challenge to provide a competitive environment and push the state of the art in the transferability of models into other languages. Finally, we host the first Massively Multilingual NLU workshop, which brings these components together. The MMNLU workshop seeks to advance the science behind multilingual NLU by providing a platform for the presentation of new research in the field and connecting teams working on this research direction. This paper summarizes the dataset, the workshop, and the competition, along with the findings of each phase. Comment: 5 page

arXiv.org e-Print Archive

Recent improvements in the CMU large-scale Chinese-English SMT system

Authors: Bach Nguyen; Gao Qin; Hewavitharana Sanjika; Hildebrand Almut Silja; Notari Timothy; Rottmann Kay; Vogel Stephan
Publication venue: Association for Computational Linguistics
Publication date: 03/01/2024

In this paper we describe recent improvements to the components and methods of our Chinese-English statistical machine translation system, as used in the January 2008 GALE evaluation. The main improvements result from consistent data processing, larger statistical models, and a POS-based word reordering approach.

KITopen