    Cognate-aware morphological segmentation for multilingual neural translation

    This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The system is based on the Transformer model. We focus on improving the consistency of morphological segmentation for words that are similar orthographically, semantically, and distributionally; such words include etymological cognates, loan words, and proper names. For this, we introduce Cognate Morfessor, a multilingual variant of the Morfessor method. We show that our approach improves translation quality, particularly for Estonian, which has fewer resources for training the translation model. Comment: To appear in WMT18.
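
    Cognate Morfessor itself adds cognate-specific machinery to Morfessor; as a loose illustration of the shared-segmentation idea only, the sketch below trains a single standard Morfessor Baseline model (from the morfessor Python package) on a combined Finnish-Estonian word list, so orthographically similar words tend to receive consistent morph boundaries. The file name and example words are hypothetical.

    import morfessor

    io = morfessor.MorfessorIO()
    # Hypothetical word list combining Finnish and Estonian training vocabulary
    train_data = list(io.read_corpus_file('fi_et_combined.txt'))

    model = morfessor.BaselineModel()
    model.load_data(train_data)
    model.train_batch()  # unsupervised training on the joint vocabulary

    # Finnish 'kalassa' and Estonian 'kalas' ('in the fish') share the stem
    # 'kala'; a jointly trained model is more likely to segment them alike.
    for word in ('kalassa', 'kalas'):
        segments, _cost = model.viterbi_segment(word)
        print(word, '->', ' + '.join(segments))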

    Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline

    Morfessor is a family of probabilistic machine learning methods that find morphological segmentations for words of a natural language, based solely on raw text data. Since the release of the public implementations of the Morfessor Baseline and Categories-MAP methods in 2005, they have become popular as automatic tools for processing morphologically complex languages in applications such as speech recognition and machine translation. This report describes a new implementation of the Morfessor Baseline method. The new version not only fixes the main restrictions of the previous software, but also includes recent methodological extensions such as semi-supervised learning, which can make use of small amounts of manually segmented words. Experimental results for the various features of the implementation are reported for English and Finnish segmentation tasks.
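
    For reference, a minimal usage sketch of the library interface described here, assuming the morfessor package from PyPI; the corpus path and test word are placeholders.

    import morfessor

    io = morfessor.MorfessorIO()
    train_data = list(io.read_corpus_file('training_corpus.txt'))  # raw text

    model = morfessor.BaselineModel()
    model.load_data(train_data)
    model.train_batch()  # unsupervised batch training (Morfessor Baseline)

    # Viterbi search can also segment words unseen during training
    segments, cost = model.viterbi_segment('uncharacteristically')
    print(segments, cost)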

    Morfessor 2.0: Toolkit for statistical morphological segmentation

    Morfessor is a family of probabilistic machine learning methods for finding morphological segmentations from raw text data. Recent developments include semi-supervised methods for utilizing annotated data. Morfessor 2.0 is a rewrite of the original, widely used Morfessor 1.0 software, with well-documented command-line tools and a library interface. It includes algorithmic improvements and new features such as semi-supervised learning, online training, and integrated evaluation code. Peer reviewed.
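
    A sketch of two of the new features named above, online training and model persistence, using the library's MorfessorIO helpers; the file names are placeholders and this is illustrative rather than a complete workflow.

    import morfessor

    io = morfessor.MorfessorIO()
    model = morfessor.BaselineModel()
    # Online training consumes the corpus as a stream rather than a batch
    model.train_online(io.read_corpus_file('streaming_corpus.txt'))

    io.write_binary_model_file('model.bin', model)  # save the trained model
    model = io.read_binary_model_file('model.bin')  # reload it later
    print(model.viterbi_segment('morphologies')[0])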

    The MeMAD Submission to the IWSLT 2018 Speech Translation Task

    This paper describes the MeMAD project entry to the IWSLT Speech Translation Shared Task, addressing the translation of English audio into German text. Of the pipeline and end-to-end model tracks, we participated only in the former, with three contrastive systems. We also tried the latter, but were not able to finish our end-to-end model in time. All of our systems start by transcribing the audio into text through an automatic speech recognition model trained on the TED-LIUM English Speech Recognition Corpus. Afterwards, we feed the transcripts into English-German text-based neural machine translation (NMT) models. Our systems employ three different translation models trained on separate training sets compiled from the English-German part of the TED Speech Translation Corpus and the OpenSubtitles2018 section of the OPUS collection. In this paper, we also describe the experiments leading up to our final systems. Our experiments indicate that using OpenSubtitles2018 in training significantly improves translation performance. We also experimented with various pre- and postprocessing routines for the NMT module, but did not have much success with these. Our best-scoring system attains a BLEU score of 16.45 on the test set for this year's task.
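
    The cascade itself is structurally simple; the sketch below shows its shape with placeholder components, since the actual MeMAD models are not reproduced here.

    def transcribe(audio_path: str) -> str:
        """Placeholder for the ASR model trained on TED-LIUM."""
        raise NotImplementedError

    def translate_en_de(text: str) -> str:
        """Placeholder for the English-German NMT model."""
        raise NotImplementedError

    def speech_translate(audio_path: str) -> str:
        # Cascade: English audio -> English transcript -> German text
        return translate_en_de(transcribe(audio_path))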

    Morfessor-enriched features and multilingual training for canonical morphological segmentation

    In our submission to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation, we study whether an unsupervised morphological segmentation method, Morfessor, can help in a supervised setting. Previous research has shown the effectiveness of the approach in semi-supervised settings with small amounts of labeled data. The current tasks vary in data size: the amount of word-level annotated training data is much larger, but the amount of sentence-level annotated training data remains small. Our approach is to pre-segment the input data for a neural sequence-to-sequence model with the unsupervised method. As the unsupervised method can be trained with raw text data, we use Wikipedia to increase the amount of training data. In addition, we train multilingual models for the sentence-level task. The results for the Morfessor-enriched features are mixed, showing a benefit for all three sentence-level tasks but only some of the word-level tasks. The multilingual training yields considerable improvements over the monolingual sentence-level models, but it negates the effect of the enriched features. Peer reviewed.
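
    A sketch of the pre-segmentation step described above, assuming a Morfessor model pre-trained on raw text (e.g. Wikipedia) with the morfessor package; the model path and space-joining convention are illustrative assumptions, not necessarily the paper's exact input format.

    import morfessor

    io = morfessor.MorfessorIO()
    model = io.read_binary_model_file('wiki_morfessor.bin')  # hypothetical path

    def presegment(word: str) -> str:
        """Split a word into Morfessor morphs separated by spaces."""
        segments, _cost = model.viterbi_segment(word)
        return ' '.join(segments)

    # The pre-segmented text is then fed to the neural seq2seq model
    sentence = 'morphological segmentation helps translation'
    print(' '.join(presegment(w) for w in sentence.split()))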

    Silo NLP's Participation at WAT2022

    This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali multimodal translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence. Our submissions top several tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test). Peer reviewed.
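
    A sketch of the multimodal input construction described above: object tags extracted from the image are concatenated with the source sentence before translation. The tag source and plain-space concatenation are illustrative assumptions, not the exact Silo NLP format.

    def build_multimodal_input(source_text: str, object_tags: list[str]) -> str:
        # Visual features as text: append detected object tags to the sentence
        return source_text + ' ' + ' '.join(object_tags)

    print(build_multimodal_input(
        'A man rides a horse on the beach.',
        ['man', 'horse', 'beach'],  # e.g. from an off-the-shelf object detector
    ))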

    Latest Development in the FoTran Project – Scaling Up Language Coverage in Neural Machine Translation Using Distributed Training with Language-Specific Components

    We give an update on the Found in Translation (FoTran) project, focusing on the study of emerging language-agnostic representations from neural machine translation (NMT). We describe our attention-bridge model, a modular NMT model that connects language-specific components through a shared network layer. Our latest implementation supports distributed training over many nodes and GPUs in order to substantially scale up the number of languages that can be included in a modern neural translation architecture. Peer reviewed.
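
    A minimal sketch of an attention-bridge layer of this kind in PyTorch: structured self-attention compresses variable-length, language-specific encoder states into a fixed number of shared vectors that any decoder can attend to. Dimensions and names are illustrative, not the FoTran implementation.

    import torch
    import torch.nn as nn

    class AttentionBridge(nn.Module):
        """Compress encoder states into a fixed number of shared
        'bridge' vectors via structured self-attention."""
        def __init__(self, hidden_dim: int, attn_dim: int, n_heads: int):
            super().__init__()
            self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)
            self.w2 = nn.Linear(attn_dim, n_heads, bias=False)

        def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
            # enc_states: (batch, src_len, hidden_dim) from any encoder
            scores = self.w2(torch.tanh(self.w1(enc_states)))  # (batch, src_len, n_heads)
            attn = torch.softmax(scores, dim=1)                # normalize over positions
            # Weighted sums yield a fixed-size, language-agnostic representation
            return attn.transpose(1, 2) @ enc_states           # (batch, n_heads, hidden_dim)

    bridge = AttentionBridge(hidden_dim=512, attn_dim=256, n_heads=10)
    states = torch.randn(2, 17, 512)  # dummy encoder output
    print(bridge(states).shape)       # torch.Size([2, 10, 512])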