Search CORE

9 research outputs found

Towards Effective Use of Training Data in Statistical Machine Translation

Author: Haddow Barry
Koehn Philipp
Publication venue
Publication date: 01/01/2012
Field of study

How much hybridisation does machine translation need?

Author: Ruiz Costa-Jussà Marta
Publication venue: 'Wiley'
Publication date: 01/01/2015
Field of study

This is the peer reviewed version of the following article: [Costa-jussà, M. R. (2015), How much hybridization does machine translation Need?. J Assn Inf Sci Tec, 66: 2160–2165. doi:10.1002/asi.23517], which has been published in final form at [10.1002/asi.23517]. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.Rule-based and corpus-based machine translation (MT)have coexisted for more than 20 years. Recently, bound-aries between the two paradigms have narrowed andhybrid approaches are gaining interest from bothacademia and businesses. However, since hybridapproaches involve the multidisciplinary interaction oflinguists, computer scientists, engineers, and informa-tion specialists, understandably a number of issuesexist.While statistical methods currently dominate researchwork in MT, most commercial MT systems are techni-cally hybrid systems. The research community shouldinvestigate the bene¿ts and questions surrounding thehybridization of MT systems more actively. This paperdiscusses various issues related to hybrid MT includingits origins, architectures, achievements, and frustra-tions experienced in the community. It can be said thatboth rule-based and corpus- based MT systems havebene¿ted from hybridization when effectively integrated.In fact, many of the current rule/corpus-based MTapproaches are already hybridized since they do includestatistics/rules at some point.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Feature decay algorithms for fast deployment of accurate statistical machine translation systems

Author: Bicici Ergun
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 08/08/2013
Field of study

We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation systems taking only about half a day for each translation direction. We develop parallel FDA for solving computational scalability problems caused by the abundance of training data for SMT models and LM models and still achieve SMT performance that is on par with using all of the training data or better. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections later. Parallel FDA can also be used for selecting the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain very accurate translation outputs close to the top performing SMT systems. The relevancy of the selected LM corpus can reach up to 86% reduction in the number of OOV tokens and up to 74% reduction in the perplexity. We perform SMT experiments in all language pairs in the WMT13 translation task and obtain SMT performance close to the top systems using signiﬁcantly less resources for training and development

CiteSeerX

Irish Universities

DCU Online Research Access Service

Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition

Author: Benyettou Abdelkader
Jouvet Denis
Langlois David
Mezzoudj Freha
Publication venue: HAL CCSD
Publication date: 18/10/2015
Field of study

International audienceThe language model is an important module in many applications that produce natural language text, in particular speech recognition. Training of language models requires large amounts of textual data that matches with the target domain. Selection of target domain (or in-domain) data has been investigated in the past. For example [1] has proposed a criterion based on the difference of cross-entropy between models representing in-domain and non-domain-specific data. However evaluations were conducted using only two sources of data, one corresponding to the in-domain, and another one to generic data from which sentences are selected. In the scope of broadcast news and TV shows transcription systems, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models for representing in-domain and non-domain-specific data is critical. Moreover, it is better to apply the data selection only on some selected data sources. This way, the selection process leads to an improvement of 8.3 in terms of perplexity and 0.2% in terms of word-error rate on the French broadcast transcription task

INRIA a CCSD electronic archive server

REVIEW OF THE EVOLUTION OF THE TECHNOLOGY OF AUTOMATIC MACHINE TRANSLATION

Author: Dunđer Ivan
Publication venue
Publication date: 01/01/2021
Field of study

Automatsko strojno prevođenje postalo je nezamjenjiv dio velikog broja organizacija koje posluju u međunarodnom okruženju i koje imaju potrebu generirati velike količine prijevoda za svoju dokumentaciju. Strojno prevođenje danas se smatra jednom od neizostavnih disruptivnih tehnologija koja uvelike doprinose cjelovitoj transformaciji poslovnih procesa u segmentu prevođenja tekstova napisanih na prirodnom jeziku. Ideja iza strojnog prevođenje je omogućiti automatizaciju barem dijela procesa prevođenja, posebno kada je riječ o velikoj količini podataka, ne bi li se ubrzalo cjelokupno poslovanje jedne organizacije i time se ostvarila konkurentska prednost na tržištu koje se brzo mijenja i kojemu se brzo treba prilagoditi. No, razvoj tehnologije automatskog strojnog prevođenja nije tekao tako glatko. Naime, razvoj je popraćen nizom uspona i padova, a upravo je cilj ovog znanstvenog rada dati kritičan i sistematiziran pregled svih ključnih faza razvoja navedene tehnologije, i to u kontekstu svjetskih, ali i domaćih istraživanja u tom području.Automatic machine translation has become a truly irreplaceable part of a large number of organizations that operate in an international environment and in need of generating large amounts of translations for their documentation. Today, machine translation is considered one of the indispensable disruptive technologies that greatly contribute to the complete transformation of business processes in the segment of translating texts written in natural language. The idea behind machine translation is to enable the automation of at least part of the translation process, especially when it comes to a large amount of data, in order to speed up the overall business of an organization and thus gain a competitive advantage in a rapidly changing market, to which one needs to adapt quickly. But the development of automatic machine translation technology did not go so smoothly. Namely, the development is accompanied by a series of ups and downs, and the aim of this very research paper is to give a critical and systematic overview of all key stages of development of this technology, in the context of global and domestic research in this area

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Statistical machine translation system and computational domain adaptation

Author: Dunđer Ivan
Publication venue
Publication date: 01/01/2015
Field of study

Statističko strojno prevođenje temeljeno na frazama jedan je od mogućih pristupa automatskom strojnom prevođenju. U radu su predložene metode za poboljšanje kvalitete strojnog prijevoda prilagodbom određenih parametara u modelu sustava za statističko strojno prevođenje. Ideja rada bila jest izgraditi sustave za statističko strojno prevođenje temeljeno na frazama za hrvatski i engleski jezik. Sustavi su trenirani za dva jezična smjera, na dvije domene, na paralelnim korpusima različitih veličina i obilježja za hrvatsko-engleski i englesko-hrvatski jezični par, nakon čega proveden postupak ugađanja sustava. Istraženi su hibridni sustavi koji objedinjuju značajke obiju domena. Time je ispitan izravan utjecaj adaptacije domene na kvalitetu automatskog strojnog prijevoda hrvatskog jezika, a nova saznanja mogu koristiti pri izgradnji novih sustava. Provedena je automatska i ljudska evaluacija (vrednovanje) strojnih prijevoda, a dobiveni rezultati uspoređeni su s rezultatima strojnih prijevoda dobivenih primjenom postojećih web servisa za statističko strojno prevođenje.Phrase-based statistical machine translation is one of possible automatic machine translation approaches. This work proposes methods for increasing the quality of machine translation by adapting certain parameters in the statistical machine translation model. The idea was to build phrase-based statistical machine translation systems for Croatian and English language. The systems were be trained for two directions, on two domains, on parallel corpora of different sizes and characteristics for Croatian-English and English-Croatian language pair, after which the tuning procedure was conducted. Afterwards, hybrid systems which combine features of both domains were investigated. Thereby the direct impact of domain adaptation on the quality of automatic machine translation of Croatian language was explored, whereas new findings can be utilised for building new systems. Automatic and human evaluation of machine translations were carried out, while obtained results were compared with results obtained from applying existing statistical machine translation web services

Repozitorij Filozofskog fakulteta u Zagrebu' at University of Zagreb

Digitalni arhiv Filozofskog fakulteta u Zagrebu

Towards Effective Use of Training Data in Statistical Machine Translation

Author: Barry Haddow
Publication venue
Publication date
Field of study

We report on findings of exploiting large data sets for translation modeling, language modeling and tuning for the development of competitive machine translation systems for eight language pairs.

CiteSeerX