12 research outputs found
Online Incremental Machine Translation
In this thesis we investigate the automatic improvements of statistical machine translation systems at runtime based on user feedback. We also propose a framework to use the proposed algorithms in large scale translation settings
Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model
In this paper we describe a word reordering strategy for statistical machine translation that reorders the source side based on Part of Speech (POS) information. Reordering rules are learned from the word aligned corpus. Reordering is integrated into the decoding process by constructing a lattice, which contains all word reorderings according to the reordering rules. Probabilities are assigned to the different reorderings. On this lattice monotone decoding is performed. This reordering strategy is compared with our previous reordering strategy, which looks at all permutations within a sliding window. We extend reordering rules by adding context information. Phrase translation pairs are learned from the original corpus and from a reordered source corpus to better capture the reordered word sequences at decoding time. Results are presented for English → Spanish and German ↔ English translations, using the European Parliament Plenary Sessions corpus
Tools for Collecting Speech Corpora via Mechanical-Turk
To rapidly port speech applications to new languages one of the most difficult tasks is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typical trained on hundreds of hours of speech data. While pre-existing corpora do exist for major languages, a sufficient amount of quality speech data is not available for most world languages. While previous works have focused on the collection of translations and the transcription of audio via Mechanical-Turk mechanisms, in this paper we introduce two tools which enable the collection of speech data remotely. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtain usable data
The Massively Multilingual Natural Language Understanding 2022 (MMNLU-22) Workshop and Competition
Despite recent progress in Natural Language Understanding (NLU), the creation
of multilingual NLU systems remains a challenge. It is common to have NLU
systems limited to a subset of languages due to lack of available data. They
also often vary widely in performance. We launch a three-phase approach to
address the limitations in NLU and help propel NLU technology to new heights.
We release a 52 language dataset called the Multilingual Amazon SLU resource
package (SLURP) for Slot-filling, Intent classification, and Virtual assistant
Evaluation, or MASSIVE, in an effort to address parallel data availability for
voice assistants. We organize the Massively Multilingual NLU 2022 Challenge to
provide a competitive environment and push the state-of-the art in the
transferability of models into other languages. Finally, we host the first
Massively Multilingual NLU workshop which brings these components together. The
MMNLU workshop seeks to advance the science behind multilingual NLU by
providing a platform for the presentation of new research in the field and
connecting teams working on this research direction. This paper summarizes the
dataset, workshop and the competition and the findings of each phase.Comment: 5 page
Recent improvements in the CMU large-scale Chinese-English SMT system
In this paper we describe recent improvements to components and methods used in our statistical machine translation system for Chinese-English used in the January 2008 GALE evaluation. Main improvements are results of consistent data processing, larger statistical models and a POS-based word reordering approach
Recent Improvements in the CMU Large Scale Chinese-English SMT System
In this paper we describe recent improvements to components and methods used in our statistical machine translation system for Chinese-English used in the January 2008 GALE evaluation. Main improvements are results of consistent data processing, larger statistical models and a POS-based word reordering approach