Revisiting NMT for normalization of early English letters

Author: Hämäläinen Mika
Mäkelä Eetu
Rueter Jack
Säily Tanja
Tiedemann Jörg
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2019

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

Author: Alnajjar Khalid
Hämäläinen Mika
Partanen Niko
Publication venue: Association pour le Traitement Automatique des Langues
Publication date: 01/01/2021

Texts written in Old Literary Finnish represent the first literary works ever written in Finnish, starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods on such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy in texts written by Agricola and 87.7% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub. Comment: la 28e Conférence sur le Traitement Automatique des Langues Naturelles (TALN)

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

From plenipotentiary to puddingless: Users and uses of new words in early English letters

Author: Hämäläinen Mika
Mäkelä Eetu
Säily Tanja
Publication venue: University of Helsinki
Publication date: 01/01/2021

Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity

Author: Alnajjar Khalid
Hämäläinen Mika
Partanen Niko
Poibeau Thierry
Rueter Jack
Publication venue: Association for Computational Creativity
Publication date: 01/01/2020

Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

On New Text Corpora For Minority Languages On The Helsinki korp.csc.fi Server

Author: Partanen Niko
Rueter Jack
Publication date: 20/12/2019

The korp.csc.fi server in Finland provides text corpora of multiple varieties for numerous languages large and small. The Korp infrastructure is developed by the Swedish Språkbanken at the University of Gothenburg, and the source code is released under the MIT license. The open nature of the system makes it easy to transfer to new environments, and there are already numerous Korp installations available. The one we discuss is maintained by the Language Bank of Finland. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Wrangling with non-standard data

Author: Hämäläinen Mika
Kaislaniemi Samuli
Lagus Krista
Lahti Leo
Mäkelä Eetu
Nevalainen Terttu
Säily Tanja
Tolonen Mikko
Publication venue: CEUR-WS.org
Publication date: 01/01/2020

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Model-Based Evaluation of Multilinguality

Author: Vamvas Jannis
Publication date: 01/01/2023

ZORA

Machine Translation of Arabic Dialects

Author: Salloum Wael Sameer
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2018

This thesis discusses different approaches to machine translation (MT) from Dialectal Arabic (DA) to English. These approaches handle the varying stages of Arabic dialects in terms of types of available resources and amounts of training data. The overall theme of this work revolves around building dialectal resources and MT systems or enriching existing ones using the currently available resources (dialectal or standard) in order to quickly and cheaply scale to more dialects without the need to spend years and millions of dollars to create such resources for every dialect. Unlike for Modern Standard Arabic (MSA), DA-English parallel corpora are scarce and available for only a few dialects. Dialects differ from each other and from MSA in orthography, morphology, phonology, and, to a lesser degree, syntax. This means that combining all available parallel data, from dialects and MSA, to train DA-to-English statistical machine translation (SMT) systems might not provide the desired results. Similarly, translating dialectal sentences with an SMT system trained on that dialect only is also challenging, because several factors cause the word choices in a sentence to differ from those in the SMT training data. Such factors include the level of dialectness (e.g., code switching to MSA versus dialectal training data), topic (sports versus politics), genre (tweets versus newspaper), script (Arabizi versus Arabic), and the timespan of the test data relative to the training data. The work we present utilizes any available Arabic resource, such as a preprocessing tool or a parallel corpus, whether MSA or DA, to improve DA-to-English translation and expand to more dialects and sub-dialects. The majority of Arabic dialects have no parallel data to English or to any other foreign language. They also have no preprocessing tools such as normalizers, morphological analyzers, or tokenizers.
For such dialects, we present an MSA-pivoting approach where DA sentences are translated to MSA first, then the MSA output is translated to English using the wealth of MSA-English parallel data. Since there is virtually no DA-MSA parallel data to train an SMT system, we build a rule-based DA-to-MSA MT system, ELISSA, that uses morpho-syntactic translation rules along with dialect identification and language modeling components. We also present a rule-based approach to quickly and cheaply build a dialectal morphological analyzer, ADAM, which provides ELISSA with dialectal word analyses. Other Arabic dialects have relatively small amounts of DA-English parallel data, amounting to a few million words on the DA side. Some of these dialects have dialect-dependent preprocessing tools that can be used to prepare the DA data for SMT systems. We present techniques to generate synthetic parallel data from the available DA-English and MSA-English data. We use this synthetic data to build statistical and hybrid versions of ELISSA as well as to improve our rule-based ELISSA-based MSA-pivoting approach. We evaluate our best MSA-pivoting MT pipeline against three direct SMT baselines trained on these three parallel corpora: DA-English data only, MSA-English data only, and the combination of DA-English and MSA-English data. Furthermore, we leverage these four MT systems (the three baselines along with our MSA-pivoting system) in two system combination approaches that benefit from their strengths while avoiding their weaknesses. Finally, we propose an approach to model dialects from monolingual data and limited DA-English parallel data without the need for any language-dependent preprocessing tools. We learn DA preprocessing rules using word embedding and expectation maximization. We test this approach by building a morphological segmentation system and we evaluate its performance on MT against the state-of-the-art dialectal tokenization tool.

Columbia University Academic Commons