    How much hybridisation does machine translation need?

    This is the peer reviewed version of the following article: [Costa-jussà, M. R. (2015), How much hybridization does machine translation need? J Assn Inf Sci Tec, 66: 2160–2165. doi:10.1002/asi.23517], which has been published in final form at [10.1002/asi.23517]. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.

    Rule-based and corpus-based machine translation (MT) have coexisted for more than 20 years. Recently, boundaries between the two paradigms have narrowed and hybrid approaches are gaining interest from both academia and businesses. However, since hybrid approaches involve the multidisciplinary interaction of linguists, computer scientists, engineers, and information specialists, understandably a number of issues exist. While statistical methods currently dominate research work in MT, most commercial MT systems are technically hybrid systems. The research community should investigate the benefits and questions surrounding the hybridization of MT systems more actively. This paper discusses various issues related to hybrid MT including its origins, architectures, achievements, and frustrations experienced in the community. It can be said that both rule-based and corpus-based MT systems have benefited from hybridization when effectively integrated. In fact, many of the current rule/corpus-based MT approaches are already hybridized since they do include statistics/rules at some point.

    Peer reviewed. Postprint (author's final draft).

    Feature decay algorithms for fast deployment of accurate statistical machine translation systems

    We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation (SMT) systems, taking only about half a day for each translation direction. We develop parallel FDA to solve the computational scalability problems caused by the abundance of training data for SMT models and language models (LMs), while still achieving SMT performance that is on par with, or better than, using all of the training data. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections later. Parallel FDA can also be used for selecting the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain very accurate translation outputs, close to those of the top-performing SMT systems. The selected LM corpus is relevant enough to yield up to an 86% reduction in the number of out-of-vocabulary (OOV) tokens and up to a 74% reduction in perplexity. We perform SMT experiments in all language pairs of the WMT13 translation task and obtain SMT performance close to the top systems while using significantly fewer resources for training and development.
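
    The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameters, so the geometric decay factor, the n-gram order, the length normalisation, and the way shard selections are merged are all assumptions made here for illustration, and the function names `fda_select` and `parallel_fda_select` are hypothetical. The shards are processed one after another in this sketch, whereas a real deployment would run them as separate jobs.

```python
import random
from collections import Counter


def ngrams(tokens, max_n=2):
    """All n-grams of order 1..max_n in a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(candidates, test_sentences, k, decay=0.5, max_n=2):
    """Greedily pick up to k candidates that cover test-set n-grams.

    Assumption: a feature's value is decay ** (times already selected),
    and a sentence's score is the sum of its feature values divided by
    its length in tokens.
    """
    test_features = set()
    for sent in test_sentences:
        test_features.update(ngrams(sent.split(), max_n))

    times_selected = Counter()
    remaining = list(candidates)
    selected = []

    def features(sentence):
        return set(ngrams(sentence.split(), max_n)) & test_features

    def score(sentence):
        total = sum(decay ** times_selected[f] for f in features(sentence))
        return total / max(len(sentence.split()), 1)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        if score(best) == 0.0:
            break  # nothing left that overlaps with the test set
        remaining.remove(best)
        selected.append(best)
        times_selected.update(features(best))  # decay the covered features
    return selected


def parallel_fda_select(candidates, test_sentences, k, shards=2, seed=0, **kw):
    """Run fda_select on random shards and concatenate the selections."""
    pool = list(candidates)
    random.Random(seed).shuffle(pool)
    parts = [pool[i::shards] for i in range(shards)]
    per_shard = -(-k // shards)  # ceiling division
    merged = []
    for part in parts:  # sequential here; independent jobs in practice
        merged.extend(fda_select(part, test_sentences, per_shard, **kw))
    return merged[:k]
```

    With a candidate pool that contains a duplicated sentence, the decay makes the second copy less attractive than a sentence covering still-unseen test n-grams, which is the behaviour the selection relies on.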

    Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition

    The language model is an important module in many applications that produce natural language text, in particular speech recognition. Training language models requires large amounts of textual data that match the target domain. Selection of target-domain (or in-domain) data has been investigated in the past. For example, [1] proposed a criterion based on the difference of cross-entropy between models representing in-domain and non-domain-specific data. However, evaluations were conducted using only two sources of data, one corresponding to the in-domain data and the other to generic data from which sentences are selected. In the scope of broadcast news and TV show transcription systems, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models for representing in-domain and non-domain-specific data is critical. Moreover, it is better to apply the data selection only to some selected data sources. This way, the selection process leads to an improvement of 8.3 in terms of perplexity and 0.2% in terms of word-error rate on the French broadcast transcription task.
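
    The cross-entropy difference criterion mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses add-one smoothed unigram models rather than the n-gram language models a real system would use, the threshold of 0.0 is arbitrary, and the names `UnigramLM` and `select_in_domain` are hypothetical. A sentence is kept when its per-word cross-entropy under the in-domain model is lower than under the generic model.

```python
import math
from collections import Counter
from itertools import chain


class UnigramLM:
    """Add-one smoothed unigram model (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unknown words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of the sentence, in bits."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return -log_prob / len(tokens)


def select_in_domain(pool, in_domain, generic, threshold=0.0):
    """Keep pool sentences whose score H_in(s) - H_gen(s) is below threshold.

    Returned sentences are ordered from most to least in-domain-like.
    """
    vocab = {tok for s in chain(in_domain, generic, pool) for tok in s.split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_gen = UnigramLM(generic, vocab)
    scored = [(lm_in.cross_entropy(s) - lm_gen.cross_entropy(s), s) for s in pool]
    return [s for diff, s in sorted(scored) if diff < threshold]
```

    In the interpolated setting studied in the paper, a scoring step of this kind would be applied per data source, with the choice of the two models being the critical design decision the results point to.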


    Automatic machine translation has become an irreplaceable part of a large number of organizations that operate in an international environment and need to generate large amounts of translations for their documentation. Today, machine translation is considered one of the indispensable disruptive technologies that greatly contribute to the complete transformation of business processes in the segment of translating texts written in natural language. The idea behind machine translation is to enable the automation of at least part of the translation process, especially when large amounts of data are involved, in order to speed up the overall business of an organization and thus gain a competitive advantage in a rapidly changing market to which one needs to adapt quickly. However, the development of automatic machine translation technology has not been smooth. It has been accompanied by a series of ups and downs, and the aim of this research paper is to give a critical and systematic overview of all key stages in the development of this technology, in the context of both global and domestic research in this area.

    Statistical machine translation system and computational domain adaptation

    Phrase-based statistical machine translation is one of the possible approaches to automatic machine translation. This work proposes methods for increasing the quality of machine translation by adapting certain parameters in the statistical machine translation model. The idea was to build phrase-based statistical machine translation systems for the Croatian and English languages. The systems were trained for two directions, on two domains, on parallel corpora of different sizes and characteristics for the Croatian-English and English-Croatian language pairs, after which the tuning procedure was conducted. Afterwards, hybrid systems that combine features of both domains were investigated. In this way, the direct impact of domain adaptation on the quality of automatic machine translation of the Croatian language was explored, and the new findings can be used for building new systems. Automatic and human evaluation of the machine translations was carried out, and the results were compared with those obtained from existing statistical machine translation web services.

    Towards Effective Use of Training Data in Statistical Machine Translation

    We report on findings of exploiting large data sets for translation modeling, language modeling and tuning for the development of competitive machine translation systems for eight language pairs.