10 research outputs found

    The ITC-irst statistical machine translation system for IWSLT-2004

    Get PDF
    Focus of this paper is the system for statistical machine translation developed at ITC-irst. It has been employed in the evaluation campaign of the International Workshop on Spoken Language Translation 2004 in all the three data set conditions of the Chinese-English track. Both the statistical model underlying the system and the system architecture are presented. Moreover, details are given on how the submitted runs have been produced. 1

    Empirical machine translation and its evaluation

    Get PDF
    Aquesta tesi estudia l'aplicació de les tecnologies del Processament del Llenguatge Natural disponibles actualment al problema de la Traducció Automàtica basada en Mètodes Empírics i la seva Avaluació.D'una banda, tractem el problema de l'avaluació automàtica. Hem analitzat les principals deficiències dels mètodes d'avaluació actuals, les quals es deuen, al nostre parer, als principis de qualitat superficials en els que es basen. En comptes de limitar-nos al nivell lèxic, proposem una nova direcció cap a avaluacions més heterogènies. El nostre enfocament es basa en el disseny d'un ric conjunt de mesures automàtiques destinades a capturar un ampli ventall d'aspectes de qualitat a diferents nivells lingüístics (lèxic, sintàctic i semàntic). Aquestes mesures lingüístiques han estat avaluades sobre diferents escenaris. El resultat més notable ha estat la constatació de que les mètriques basades en un coneixement lingüístic més profund (sintàctic i semàntic) produeixen avaluacions a nivell de sistema més fiables que les mètriques que es limiten a la dimensió lèxica, especialment quan els sistemes avaluats pertanyen a paradigmes de traducció diferents. Tanmateix, a nivell de frase, el comportament d'algunes d'aquestes mètriques lingüístiques empitjora lleugerament en comparació al comportament de les mètriques lèxiques. Aquest fet és principalment atribuïble als errors comesos pels processadors lingüístics. A fi i efecte de millorar l'avaluació a nivell de frase, a més de recòrrer a la similitud lèxica en absència d'anàlisi lingüística, hem estudiat la possibiliat de combinar les puntuacions atorgades per mètriques a diferents nivells lingüístics en una sola mesura de qualitat. S'han presentat dues estratègies no paramètriques de combinació de mètriques, essent el seu principal avantatge no haver d'ajustar la contribució relativa de cadascuna de les mètriques a la puntuació global. A més, el nostre treball mostra com fer servir el conjunt de mètriques heterogènies per tal d'obtenir detallats informes d'anàlisi d'errors automàticament.D'altra banda, hem estudiat el problema de la selecció lèxica en Traducció Automàtica Estadística. Amb aquesta finalitat, hem construit un sistema de Traducció Automàtica Estadística Castellà-Anglès basat en -phrases', i hem iterat en el seu cicle de desenvolupament, analitzant diferents maneres de millorar la seva qualitat mitjançant la incorporació de coneixement lingüístic. En primer lloc, hem extès el sistema a partir de la combinació de models de traducció basats en anàlisi sintàctica superficial, obtenint una millora significativa. En segon lloc, hem aplicat models de traducció discriminatius basats en tècniques d'Aprenentatge Automàtic. Aquests models permeten una millor representació del contexte de traducció en el que les -phrases' ocorren, efectivament conduint a una millor selecció lèxica. No obstant, a partir d'avaluacions automàtiques heterogènies i avaluacions manuals, hem observat que les millores en selecció lèxica no comporten necessàriament una millor estructura sintàctica o semàntica. Així doncs, la incorporació d'aquest tipus de prediccions en el marc estadístic requereix, per tant, un estudi més profund.Com a qüestió complementària, hem estudiat una de les principals crítiques en contra dels sistemes de traducció basats en mètodes empírics, la seva forta dependència del domini, i com els seus efectes negatius poden ésser mitigats combinant adequadament fonts de coneixement externes. En aquest sentit, hem adaptat amb èxit un sistema de traducció estadística Anglès-Castellà entrenat en el domini polític, al domini de definicions de diccionari.Les dues parts d'aquesta tesi estan íntimament relacionades, donat que el desenvolupament d'un sistema real de Traducció Automàtica ens ha permès viure en primer terme l'important paper dels mètodes d'avaluació en el cicle de desenvolupament dels sistemes de Traducció Automàtica.In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation.On the one side, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on the design of a rich set of automatic metrics devoted to capture a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). Linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, specially when the systems under evaluation are different in nature. However, at the sentence level, some of these metrics suffer a significant decrease, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two valid non-parametric strategies for metric combination have been presented. These offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports.On the other side, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to ameliorate its performance through the incorporation of linguistic knowledge. First, we have extended the system by combining shallow-syntactic translation models based on linguistic data views. A significant improvement is reported. This system is further enhanced using dedicated discriminative phrase translation models. These models allow for a better representation of the translation context in which phrases occur, effectively yielding an improved lexical choice. However, based on the proposed heterogeneous evaluation methods and manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure. The incorporation of dedicated predictions into the statistical framework requires, therefore, further study.As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining outer knowledge sources when porting a system into a new domain. We have successfully ported an English-to-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions.The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience in first person the role of the evaluation methodology in the development cycle of MT systems

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004

    No full text
    International audienceIn an ever expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building har- monised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 90's, most efforts in scaling up these resources remain the responsibility of the local authorities, usually, with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and in other workshops and conferences dedicated to similar topics proves that dealing with multilingual linguistic ressources has become a very hot problem in the Natural Language Processing community. To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants

    A System for Simultaneous Translation of Lectures and Speeches

    Get PDF
    This thesis realizes the first existing automatic system for simultaneous speech-to-speech translation. The focus of this system is the automatic translation of (technical oriented) lectures and speeches from English to Spanish, but the different aspects described in this thesis will also be helpful for developing simultaneous translation systems for other domains or languages

    Robust automatic transcription of lectures

    Get PDF
    Automatic transcription of lectures is becoming an important task. Possible applications can be found in the fields of automatic translation or summarization, information retrieval, digital libraries, education and communication research. Ideally those systems would operate on distant recordings, freeing the presenter from wearing body-mounted microphones. This task, however, is surpassingly difficult, given that the speech signal is severely degraded by background noise and reverberation

    Robust Automatic Transcription of Lectures

    Get PDF
    Die automatische Transkription von Vorträgen, Vorlesungen und Präsentationen wird immer wichtiger und ermöglicht erst die Anwendungen der automatischen Übersetzung von Sprache, der automatischen Zusammenfassung von Sprache, der gezielten Informationssuche in Audiodaten und somit die leichtere Zugänglichkeit in digitalen Bibliotheken. Im Idealfall arbeitet ein solches System mit einem Mikrofon das den Vortragenden vom Tragen eines Mikrofons befreit was der Fokus dieser Arbeit ist

    Meaning refinement to improve cross-lingual information retrieval

    Get PDF
    Magdeburg, Univ., Fak. für Informatik, Diss., 2012von Farag Ahme
    corecore