100 research outputs found
A machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and produces a language
model oriented to resolving POS ambiguities. This model consists
of a set of statistical decision trees expressing the distribution of
tags and words in relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
Along these lines, we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated hand-written constraints, etc.), and in
particular to incorporate the machine-learned decision trees.
Simultaneously, we address the problem of tagging when only
small training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
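As a toy illustration of the underlying idea (not the authors' actual system, which is trained on the WSJ corpus), the sketch below learns a one-level decision "tree" for a single ambiguous word, branching on the previous tag and keeping the majority tag per branch; all words, tags and counts are invented:

```python
from collections import Counter, defaultdict

# Toy annotated corpus (invented data; the paper's system is trained
# on the WSJ corpus): lists of (word, tag) pairs.
CORPUS = [
    [("the", "DT"), ("plant", "NN"), ("grows", "VBZ")],
    [("a", "DT"), ("plant", "NN"), ("died", "VBD")],
    [("they", "PRP"), ("plant", "VB"), ("trees", "NNS")],
    [("we", "PRP"), ("plant", "VB"), ("corn", "NN")],
]

def learn_stump(word):
    """Learn a one-level decision tree for one ambiguous word:
    branch on the previous tag, keep the majority tag per branch."""
    branches = defaultdict(Counter)
    for sent in CORPUS:
        for i, (w, t) in enumerate(sent):
            if w == word:
                prev = sent[i - 1][1] if i > 0 else "<S>"
                branches[prev][t] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in branches.items()}

def disambiguate(tree, prev_tag, default="NN"):
    return tree.get(prev_tag, default)

tree = learn_stump("plant")
print(disambiguate(tree, "DT"))   # -> NN (noun reading after a determiner)
print(disambiguate(tree, "PRP"))  # -> VB (verb reading after a pronoun)
```

A real tree would of course recurse over many contextual features (surrounding tags and words), as the abstract describes.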
Robust Estimation of Feature Weights in Statistical Machine Translation
The weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training (MERT). This finds their optimal values on a development set, with the expectation that these weights generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more robust weights to be used on out-of-domain corpora, without the need for specialised data. For an Arabic-to-English translation system, the generalised weights yield an improvement of more than 2 BLEU points over the MERT baseline using the same information.
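The perceptron-based weight learning can be sketched as a mistake-driven update over n-best lists: move the weights towards the features of the best (oracle) hypothesis and away from the model's current choice. Feature names, values and the learning rate below are invented for illustration, not taken from the paper:

```python
def score(weights, feats):
    # linear model score: dot product of weights and feature values
    return sum(weights[k] * v for k, v in feats.items())

def perceptron_update(weights, hyps, oracle_idx, lr=0.1):
    """hyps: list of feature dicts for the n-best translations."""
    best = max(range(len(hyps)), key=lambda i: score(weights, hyps[i]))
    if best != oracle_idx:  # mistake-driven: update only on errors
        for k in weights:
            weights[k] += lr * (hyps[oracle_idx].get(k, 0.0)
                                - hyps[best].get(k, 0.0))
    return weights

# Invented features: language model, translation model, word penalty.
weights = {"lm": 0.5, "tm": 0.5, "wp": 0.0}
nbest = [
    {"lm": -2.0, "tm": -1.0, "wp": -0.5},  # model's current favourite
    {"lm": -1.5, "tm": -2.0, "wp": -0.4},  # closer to the reference (oracle)
]
weights = perceptron_update(weights, nbest, oracle_idx=1)
```

Iterating such updates over a development corpus tends to produce flatter, more conservative weights than error-surface optimisation, which is one intuition for the improved out-of-domain behaviour.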
A graph-based strategy to streamline translation quality assessments
We present a detailed analysis of a graph-based annotation strategy that we employed to annotate a corpus of 11,292 real-world English to Spanish automatic translations with relative (ranking) and absolute (adequate/non-adequate) quality assessments. The proposed approach, inspired by previous work in Interactive Evolutionary Computation and Interactive Genetic Algorithms, results in a simpler and faster annotation process. We empirically compare the method against a traditional, explicit ranking approach, and show that the graph-based strategy: 1) is considerably faster, and 2) produces consistently more reliable annotations.
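The gain of a graph-based strategy can be illustrated with a minimal sketch: explicit pairwise judgements form a directed graph, and its transitive closure supplies the remaining rankings for free, so annotators need fewer explicit comparisons. The translation names and judgements below are invented:

```python
def transitive_closure(edges):
    """Edges are (better, worse) pairs; derive all implied pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # a > b and b > d imply a > d
                    changed = True
    return closure

# Three explicit judgements over four translations...
judged = {("t1", "t2"), ("t2", "t3"), ("t3", "t4")}
ranking = transitive_closure(judged)
# ...yield all six pairwise relations, a total order t1 > t2 > t3 > t4.
```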
Real-life translation quality estimation for MT system selection
Research on translation quality annotation and estimation usually makes use of standard language, sometimes related to a specific language genre or domain. However, real-life machine translation (MT), performed for instance by on-line translation services, has to cope with some extra difficulties related to the usage of open, non-standard and noisy language. In this paper we study the learning of quality estimation (QE) models able to rank translations from real-life input according to their goodness, without the need for translation references. For that, we work with a corpus collected from the 24/7 Reverso.net MT service, translated by 5 different MT systems, and manually annotated with quality scores. We define several families of features and train QE predictors in the form of regressors or direct rankers. The predictors show a remarkable correlation with gold standard rankings and prove to be useful in a system combination scenario, obtaining better results than any individual translation system.
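A minimal sketch of the regressor variant of such a QE predictor, assuming a single invented feature (language-model log-probability per word) and invented quality scores; the paper's actual models use several feature families:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented training data: feature value vs. annotated quality score.
feats  = [-4.0, -3.0, -2.5, -2.0]
scores = [ 1.0,  2.0,  3.0,  4.0]
a, b = fit_linear(feats, scores)

# Rank unseen candidate translations by predicted quality,
# with no need for a reference translation.
candidates = {"sysA": -3.5, "sysB": -2.2, "sysC": -2.9}
ranked = sorted(candidates, key=lambda s: a * candidates[s] + b,
                reverse=True)
```

The induced ranking over systems is exactly what a system combination scenario needs: pick the top-ranked translation per input.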
A graphical interface for MT evaluation and error analysis
Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA
toolkit, a rich repository of evaluation
measures working at different linguistic levels. The current implementation of the interface shows constituency and dependency trees as well as shallow syntactic and semantic annotations, and word alignments. The intelligent visualization of the linguistic structures used by the metrics, as well as a set of navigational functionalities, may lead towards advanced methods for automatic error analysis.
The UPC submission to the WMT 2012 shared task on quality estimation
In this paper, we describe the UPC system that
participated in the WMT 2012 shared task on
Quality Estimation for Machine Translation.
Based on the empirical evidence that fluency-related
features have a very high correlation
with post-editing effort, we present a set of
features for machine translation quality estimation
designed around
different kinds of n-gram language models,
plus another set of features that model the
quality of dependency parses automatically
projected from source sentences to translations.
We report the results obtained on
the shared task dataset by combining
the features that we designed with the baseline
features provided by the task organizers.
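One fluency-related feature of the kind described can be sketched as the normalised log-probability of a translation under a simple n-gram language model; the bigram model, training corpus and smoothing choice (add-one) below are invented for illustration:

```python
import math
from collections import Counter

# Tiny invented training corpus for the bigram language model.
train = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)  # vocabulary size, for add-one smoothing

def bigram_logprob(sentence):
    """Average log p(w_i | w_{i-1}) over the sentence's bigrams:
    higher values suggest a more fluent translation."""
    toks = sentence.split()
    lp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return lp / max(len(toks) - 1, 1)

fluent = bigram_logprob("the cat sat on the mat .")
disfluent = bigram_logprob("mat the on cat the sat .")
```

A scrambled translation scores lower than a fluent one, which is the signal such features feed to the quality estimator.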
Projective dependency parsing with perceptron
We describe an online learning dependency parser for the CoNLL-X Shared Task, based on the bottom-up projective algorithm of Eisner (2000). We experiment with a large feature set that models: the tokens involved in dependencies and their immediate context, the surface text distance between tokens, and the syntactic context dominated by each dependency. In our experiments, the treatment of multilingual information was totally blind.
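A compact sketch of the chart inner loop of Eisner's (2000) bottom-up projective algorithm, which underlies the parser described above; it returns only the score of the best projective tree, and the arc-score matrix in the usage example is hand-made rather than perceptron-trained:

```python
NEG = float("-inf")

def eisner(scores):
    """scores[h][m]: score of an arc head h -> modifier m; index 0 is
    the artificial root. Returns the best projective tree score."""
    n = len(scores)
    # C[s][t][d][c]: best score of span s..t; d=1 right-headed,
    # d=0 left-headed; c=1 complete, c=0 incomplete.
    C = [[[[NEG, NEG], [NEG, NEG]] for _ in range(n)] for _ in range(n)]
    for s in range(n):
        for d in (0, 1):
            C[s][s][d][1] = 0.0
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # incomplete spans: add an arc between the endpoints
            for r in range(s, t):
                merged = C[s][r][1][1] + C[r + 1][t][0][1]
                C[s][t][0][0] = max(C[s][t][0][0], merged + scores[t][s])
                C[s][t][1][0] = max(C[s][t][1][0], merged + scores[s][t])
            # complete spans: absorb an adjacent incomplete span
            for r in range(s, t):
                C[s][t][0][1] = max(C[s][t][0][1],
                                    C[s][r][0][1] + C[r][t][0][0])
            for r in range(s + 1, t + 1):
                C[s][t][1][1] = max(C[s][t][1][1],
                                    C[s][r][1][0] + C[r][t][1][1])
    return C[0][n - 1][1][1]

# Hand-made scores for root + two words: the tree root->1, 1->2
# (score 2 + 3) beats the alternatives.
best = eisner([[0, 2, 1],
               [0, 0, 3],
               [0, 1, 0]])
```

A full parser also stores backpointers to recover the head of each token; this sketch keeps only the dynamic program.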
Towards learning a constraint grammar from annotated corpora using decision trees
Inside the framework of robust parsers for the syntactic analysis of
unrestricted text, the aim of this work is the construction of a system
capable of automatically learning Constraint Grammar rules from a POS
annotated corpus. The system presented is currently able to acquire constraint
rules for POS tagging, and we plan to extend it to cover syntactic rules.
The learning process uses a supervised learning algorithm based on
building a discrimination forest, with a decision tree attached to each
case of POS ambiguity. The system has been applied to four representative
cases of ambiguity in a Spanish corpus. The results obtained
in these experiments, together with some discussion of the appropriateness of the
proposed learning technique, are presented in this paper.
Document-level machine translation as a re-translation process
Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherencies in the final translations. In this paper we present some document-level post-processes to improve the coherence and consistency of translations. Incoherencies are detected and new partial translations are proposed. The work focuses on two phenomena: words with inconsistent translations throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, an automatic evaluation does not reflect significant variations in the translations. However, improvements are observed through a manual evaluation.
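The first phenomenon (inconsistent translations of a word across a document) admits a simple majority-vote sketch: re-translate minority occurrences with the document-level majority choice. The word pairs below are invented, and a real system would work on aligned full documents:

```python
from collections import Counter, defaultdict

def enforce_consistency(doc):
    """doc: (source_word, current_translation) pairs for one document.
    Replace each translation with the document-level majority choice."""
    counts = defaultdict(Counter)
    for src, tgt in doc:
        counts[src][tgt] += 1
    majority = {src: c.most_common(1)[0][0] for src, c in counts.items()}
    return [(src, majority[src]) for src, _ in doc]

doc = [("bank", "banco"), ("bank", "banco"), ("bank", "orilla"),
       ("river", "rio")]
fixed = enforce_consistency(doc)
# the minority choice "orilla" is replaced by the majority "banco"
```

Gender and number agreement, the second phenomenon, would instead require morphological analysis of the target words, which this sketch does not attempt.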
Traducción automática a nivel de documento como proceso de retraducción
Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherencies in the final translations. In this paper we present some document-level post-processes to improve the coherence and consistency of translations. Incoherencies are detected and new partial translations are proposed. The work focuses on two phenomena: words with inconsistent translations throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, an automatic evaluation does not reflect significant variations in the translations. However, improvements are observed through a manual evaluation. Supported by an FPI grant within the OpenMT2 project (TIN2009-14675-C03) from the Spanish Ministry of Science and Innovation (MICINN) and by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC).
- …