11 research outputs found

    Veröffentlichungen und VortrĂ€ge 2009 der Mitglieder der FakultĂ€t fĂŒr Informatik

    Get PDF

    Multilingual Neural Translation

    Get PDF
    Machine translation (MT) refers to the technology that can automatically translate contents in one language into other languages. Being an important research area in the field of natural language processing, machine translation has typically been considered one of most challenging yet exciting problems. Thanks to research progress in the data-driven statistical machine translation (SMT), MT is recently capable of providing adequate translation services in many language directions and it has been widely deployed in various practical applications and scenarios. Nevertheless, there exist several drawbacks in the SMT framework. The major drawbacks of SMT lie in its dependency in separate components, its simple modeling approach, and the ignorance of global context in the translation process. Those inherent drawbacks prevent the over-tuned SMT models to gain any noticeable improvements over its horizon. Furthermore, SMT is unable to formulate a multilingual approach in which more than two languages are involved. The typical workaround is to develop multiple pair-wise SMT systems and connect them in a complex bundle to perform multilingual translation. Those limitations have called out for innovative approaches to address them effectively. On the other hand, it is noticeable how research on artificial neural networks has progressed rapidly since the beginning of the last decade, thanks to the improvement in computation, i.e faster hardware. Among other machine learning approaches, neural networks are known to be able to capture complex dependencies and learn latent representations. Naturally, it is tempting to apply neural networks in machine translation. First attempts revolve around replacing SMT sub-components by the neural counterparts. Later attempts are more revolutionary by fundamentally changing the whole core of SMT with neural networks, which is now popularly known as neural machine translation (NMT). NMT is an end-to-end system which directly estimate the translation model between the source and target sentences. Furthermore, it is later discovered to capture the inherent hierarchical structure of natural language. This is the key property of NMT that enables a new training paradigm and a less complex approach for multilingual machine translation using neural models. This thesis plays an important role in the evolutional course of machine translation by contributing to the transition of using neural components in SMT to the completely end-to-end NMT and most importantly being the first of the pioneers in building a neural multilingual translation system. First, we proposed an advanced neural-based component: the neural network discriminative word lexicon, which provides a global coverage for the source sentence during the translation process. We aim to alleviate the problems of phrase-based SMT models that are caused by the way how phrase-pair likelihoods are estimated. Such models are unable to gather information from beyond the phrase boundaries. In contrast, our discriminative word lexicon facilitates both the local and global contexts of the source sentences and models the translation using deep neural architectures. Our model has improved the translation quality greatly when being applied in different translation tasks. Moreover, our proposed model has motivated the development of end-to-end NMT architectures later, where both of the source and target sentences are represented with deep neural networks. The second and also the most significant contribution of this thesis is the idea of extending an NMT system to a multilingual neural translation framework without modifying its architecture. Based on the ability of deep neural networks to modeling complex relationships and structures, we utilize NMT to learn and share the cross-lingual information to benefit all translation directions. In order to achieve that purpose, we present two steps: first in incorporating language information into training corpora so that the NMT learns a common semantic space across languages and then force the NMT to translate into the desired target languages. The compelling aspect of the approach compared to other multilingual methods, however, lies in the fact that our multilingual extension is conducted in the preprocessing phase, thus, no change needs to be done inside the NMT architecture. Our proposed method, a universal approach for multilingual MT, enables a seamless coupling with any NMT architecture, thus makes the multilingual expansion to the NMT systems effortlessly. Our experiments and the studies from others have successfully employed our approach with numerous different NMT architectures and show the universality of the approach. Our multilingual neural machine translation accommodates cross-lingual information in a learned common semantic space to improve altogether every translation direction. It is then effectively applied and evaluated in various scenarios. We develop a multilingual translation system that relies on both source and target data to boost up the quality of a single translation direction. Another system could be deployed as a multilingual translation system that only requires being trained once using a multilingual corpus but is able to translate between many languages simultaneously and the delivered quality is more favorable than many translation systems trained separately. Such a system able to learn from large corpora of well-resourced languages, such as English → German or English → French, has proved to enhance other translation direction of low-resourced language pairs like English → Lithuania or German → Romanian. Even more, we show that kind of approach can be applied to the extreme case of zero-resourced translation where no parallel data is available for training without the need of pivot techniques. The research topics of this thesis are not limited to broadening application scopes of our multilingual approach but we also focus on improving its efficiency in practice. Our multilingual models have been further improved to adequately address the multilingual systems whose number of languages is large. The proposed strategies demonstrate that they are effective at achieving better performance in multi-way translation scenarios with greatly reduced training time. Beyond academic evaluations, we could deploy the multilingual ideas in the lecture-themed spontaneous speech translation service (Lecture Translator) at KIT. Interestingly, a derivative product of our systems, the multilingual word embedding corpus available in a dozen of languages, can serve as a useful resource for cross-lingual applications such as cross-lingual document classification, information retrieval, textual entailment or question answering. Detailed analysis shows excellent performance with regard to semantic similarity metrics when using the embeddings on standard cross-lingual classification tasks

    European Language Grid

    Get PDF
    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030 – to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects

    European Language Grid

    Get PDF
    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030 – to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects

    Jahresbericht 2009 der FakultĂ€t fĂŒr Informatik

    Get PDF

    Bibliographie annuelle: recherche suisse sur le plurilinguisme 2014

    Get PDF
    Notre Bibliographie annuelle de la recherche suisse sur le plurilinguisme contient une sĂ©lection de publications consacrĂ©es au plurilinguisme. Ce numĂ©ro de la bibliographie comprend des publications parues en 2015. Il contient des informations sur des articles de revues, des chapitres de livres, des monographies, des volumes collectifs et des documents en ligne publiĂ©s par des chercheurs et chercheuses d’institutions suisses ainsi que des travaux de chercheurs internationaux parus dans certaines revues spĂ©cialisĂ©es. La bibliographie recense des publications dans les langues nationales suisses ainsi qu’en anglais.The Annual Bibliography of Swiss Research on Multilingualism contains a selection of scholarly publications in the disciplines of linguistics, sociology, pedagogy and other fields related to multilingualism. This issue contains bibliographic information on publications from the year 2015. Includes articles from journals, book chapters, monographs, anthologies and online documents by researchers at Swiss institutions. We have also included publications which international researchers have contributed to Swiss journals. The bibliography catalogues publications in Switzerland’s official languages and in English.Unsere Jahresbibliographie Schweizer Mehrsprachigkeitsforschung enthĂ€lt eine Auswahl der linguistischen, soziologischen, erziehungswissenschaftlichen und anderweitig dem Themenkomplex Mehrsprachigkeit gewidmeten wissenschaftlichen Literatur. Diese Ausgabe enthĂ€lt bibliographische Angaben zu Veröffentlichungen aus dem Jahr 2015. In die Bibliographie werden ZeitschriftenaufsĂ€tze, Buchkapitel, Monographien, Sammelwerke und Online-Dokumente von Forscherinnen und Forschern an Schweizer Institutionen sowie Publikationen internationaler Forscher/innen in einigen Schweizer Fachzeitschriften aufgenommen. BerĂŒcksichtigt werden Veröffentlichungen in den Landessprachen der Schweiz sowie in englischer Sprache.La nostra Bibliografia annuale della ricerca svizzera sul plurilinguismo contiene una selezione delle pubblicazioni consacrate al plurilinguismo apparse in linguistica, sociologia, scienze dell’educazione o in altre discipline. Questo numero della bibliografia contiene le pubblicazioni apparse nel 2015. Esso include articoli di riviste, capitoli di libri, monografie, opere collettive e documenti digitali pubblicati da ricercatrici e ricercatori d’istituzioni svizzere, oltre a lavori di ricercatori internazionali apparsi in alcune riviste specializzate. La bibliografia censisce pubblicazioni nelle lingue nazionali svizzere e in inglese

    Untersuchung zur Wechselwirkung ausgewĂ€hlter Regeln der Kontrollierten Sprache mit verschiedenen AnsĂ€tzen der Maschinellen Übersetzung

    Get PDF
    Examining the general impact of the Controlled Languages rules in the context of Machine Translation has been an area of research for many years. The present study focuses on the following question: How do the Controlled Language (CL) rules impact the Machine Translation (MT) output individually? Analyzing a German corpus-based test suite of technical texts that have been translated into English by different MT systems, the study endeavors to answer this question at different levels: the general impact of CL rules (rule- and system-independent), their impact at rule level (system-independent), their impact at system level (rule-independent), and at rule and system level. The results of five MT systems (a rule-based system, a statistical system, two differently constructed hybrid systems, and a neural system) are analyzed and contrasted. For this, a mixed-methods triangulation approach that includes error annotation, human evaluation, and automatic evaluation was applied. The data were analyzed both qualitatively and quantitatively based on the following parameters: number and type of MT errors, style and content quality, and scores from two automatic evaluation metrics. In line with many studies, the results show a general positive impact of the applied CL rules on the MT output. However, at rule level, only four rules proved to have positive effects on all parameters; three rules had negative effects on the parameters; and two rules did not show any significant impact. At rule and system level, the rules affected the MT systems differently, as expected. Some rules that had a positive impact on earlier MT approaches did not show the same impact on the neural MT approach. Furthermore, the neural MT delivered distinctly better results than earlier MT approaches, namely the highest error-free, style and content quality rates both before and after the rules application, which indicates that the neural MT offers a promising solution that no longer requires CL rules for improving the MT output, what in turn allows for a more natural style

    The UniversitÀt Karlsruhe Translation System for the EACL-WMT 2009

    No full text
    In this paper we describe the statistical machine translation system of the UniversitÀt Karlsruhe developed for the translation task of the Fourth Workshop on Statistical Machine Translation. The state-ofthe-art phrase-based SMT system is augmented with alternative word reordering and alignment mechanisms as well as optional phrase table modifications. We participate in the constrained condition of German-English and English-German as well as in the constrained condition of French-English and English-French.

    Incorporating pronoun function into statistical machine translation

    Get PDF
    Pronouns are used frequently in language, and perform a range of functions. Some pronouns are used to express coreference, and others are not. Languages and genres differ in how and when they use pronouns and this poses a problem for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014; Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric) pronouns with NP antecedents, which when translated from English into a language with grammatical gender, must agree with the translation of the head of the antecedent. Despite growing attention to this problem, little progress has been made, and little attention has been given to other pronouns. The central claim of this thesis is that pronouns performing different functions in text should be handled differently by SMT systems and when evaluating pronoun translation. This motivates the introduction of a new framework to categorise pronouns according to their function: Anaphoric/cataphoric reference, event reference, extra-textual reference, pleonastic, addressee reference, speaker reference, generic reference, or other function. Labelling pronouns according to their function also helps to resolve instances of functional ambiguity arising from the same pronoun in the source language having multiple functions, each with different translation requirements in the target language. The categorisation framework is used in corpus annotation, corpus analysis, SMT system development and evaluation. I have directed the annotation and conducted analyses of a parallel corpus of English-German texts called ParCor (Guillou et al., 2014), in which pronouns are manually annotated according to their function. This provides a first step toward understanding the problems that SMT systems face when translating pronouns. In the thesis, I show how analysis of manual translation can prove useful in identifying and understanding systematic differences in pronoun use between two languages and can help inform the design of SMT systems. In particular, the analysis revealed that the German translations in ParCor contain more anaphoric and pleonastic pronouns than their English originals, reflecting differences in pronoun use. This raises a particular problem for the evaluation of pronoun translation. Automatic evaluation methods that rely on reference translations to assess pronoun translation, will not be able to provide an adequate evaluation when the reference translation departs from the original source-language text. I also show how analysis of the output of state-of-the-art SMT systems can reveal how well current systems perform in translating different types of pronouns and indicate where future efforts would be best directed. The analysis revealed that biases in the training data, for example arising from the use of “it” and “es” as both anaphoric and pleonastic pronouns in both English and German, is a problem that SMT systems must overcome. SMT systems also need to disambiguate the function of those pronouns with ambiguous surface forms so that each pronoun may be translated in an appropriate way. To demonstrate the value of this work, I have developed an automated post-editing system in which automated tools are used to construct ParCor-style annotations over the source-language pronouns. The annotations are then used to resolve functional ambiguity for the pronoun “it” with separate rules applied to the output of a baseline SMT system for anaphoric vs. non-anaphoric instances. The system was submitted to the DiscoMT 2015 shared task on pronoun translation for English-French. As with all other participating systems, the automatic post-editing system failed to beat a simple phrase-based baseline. A detailed analysis, including an oracle experiment in which manual annotation replaces the automated tools, was conducted to discover the causes of poor system performance. The analysis revealed that the design of the rules and their strict application to the SMT output are the biggest factors in the failure of the system. The lack of automatic evaluation metrics for pronoun translation is a limiting factor in SMT system development. To alleviate this problem, Christian Hardmeier and I have developed a testing regimen called PROTEST comprising (1) a hand-selected set of pronoun tokens categorised according to the different problems that SMT systems face and (2) an automated evaluation script. Pronoun translations can then be automatically compared against a reference translation, with mismatches referred for manual evaluation. The automatic evaluation was applied to the output of systems submitted to the DiscoMT 2015 shared task on pronoun translation. This again highlighted the weakness of the post-editing system, which performs poorly due to its focus on producing gendered pronoun translations, and its inability to distinguish between pleonastic and event reference pronouns
    corecore