100 research outputs found
A machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and produces a language
model oriented to resolving POS ambiguities. This model consists
of a set of statistical decision trees expressing the distribution of
tags and words in relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
Along these lines, we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated hand-written constraints, etc.), and in
particular to incorporate the machine-learned decision trees.
Simultaneously, we address the problem of tagging when only
small training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
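As a toy illustration of the underlying idea (not the authors' actual system, which is trained on the WSJ corpus), the sketch below learns a one-level decision "tree" for a single ambiguous word, branching on the previous tag and keeping the majority tag per branch; all words, tags and counts are invented:

```python
from collections import Counter, defaultdict

# Toy annotated corpus (invented data; the paper's system is trained
# on the WSJ corpus): lists of (word, tag) pairs.
CORPUS = [
    [("the", "DT"), ("plant", "NN"), ("grows", "VBZ")],
    [("a", "DT"), ("plant", "NN"), ("died", "VBD")],
    [("they", "PRP"), ("plant", "VB"), ("trees", "NNS")],
    [("we", "PRP"), ("plant", "VB"), ("corn", "NN")],
]

def learn_stump(word):
    """Learn a one-level decision tree for one ambiguous word:
    branch on the previous tag, keep the majority tag per branch."""
    branches = defaultdict(Counter)
    for sent in CORPUS:
        for i, (w, t) in enumerate(sent):
            if w == word:
                prev = sent[i - 1][1] if i > 0 else "<S>"
                branches[prev][t] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in branches.items()}

def disambiguate(tree, prev_tag, default="NN"):
    return tree.get(prev_tag, default)

tree = learn_stump("plant")
print(disambiguate(tree, "DT"))   # -> NN (noun reading after a determiner)
print(disambiguate(tree, "PRP"))  # -> VB (verb reading after a pronoun)
```

A real tree would of course recurse over many contextual features (surrounding tags and words), as the abstract describes.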
Robust Estimation of Feature Weights in Statistical Machine Translation
The weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training (MERT). This finds their optimal values on a development set, with the expectation that these weights generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more robust weights to be used on out-of-domain corpora, without the need for specialised data. For an Arabic-to-English translation system, the generalised weights yield an improvement of more than 2 BLEU points over the MERT baseline using the same information.
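The perceptron-based weight learning can be sketched as a mistake-driven update over n-best lists: move the weights towards the features of the best (oracle) hypothesis and away from the model's current choice. Feature names, values and the learning rate below are invented for illustration, not taken from the paper:

```python
def score(weights, feats):
    # linear model score: dot product of weights and feature values
    return sum(weights[k] * v for k, v in feats.items())

def perceptron_update(weights, hyps, oracle_idx, lr=0.1):
    """hyps: list of feature dicts for the n-best translations."""
    best = max(range(len(hyps)), key=lambda i: score(weights, hyps[i]))
    if best != oracle_idx:  # mistake-driven: update only on errors
        for k in weights:
            weights[k] += lr * (hyps[oracle_idx].get(k, 0.0)
                                - hyps[best].get(k, 0.0))
    return weights

# Invented features: language model, translation model, word penalty.
weights = {"lm": 0.5, "tm": 0.5, "wp": 0.0}
nbest = [
    {"lm": -2.0, "tm": -1.0, "wp": -0.5},  # model's current favourite
    {"lm": -1.5, "tm": -2.0, "wp": -0.4},  # closer to the reference (oracle)
]
weights = perceptron_update(weights, nbest, oracle_idx=1)
```

Iterating such updates over a development corpus tends to produce flatter, more conservative weights than error-surface optimisation, which is one intuition for the improved out-of-domain behaviour.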
A graph-based strategy to streamline translation quality assessments
We present a detailed analysis of a graph-based annotation strategy that we employed to annotate a corpus of 11,292 real-world English to Spanish automatic translations with relative (ranking) and absolute (adequate/non-adequate) quality assessments. The proposed approach, inspired by previous work in Interactive Evolutionary Computation and Interactive Genetic Algorithms, results in a simpler and faster annotation process. We empirically compare the method against a traditional, explicit ranking approach, and show that the graph-based strategy: 1) is considerably faster, and 2) produces consistently more reliable annotations.
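The gain of a graph-based strategy can be illustrated with a minimal sketch: explicit pairwise judgements form a directed graph, and its transitive closure supplies the remaining rankings for free, so annotators need fewer explicit comparisons. The translation names and judgements below are invented:

```python
def transitive_closure(edges):
    """Edges are (better, worse) pairs; derive all implied pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # a > b and b > d imply a > d
                    changed = True
    return closure

# Three explicit judgements over four translations...
judged = {("t1", "t2"), ("t2", "t3"), ("t3", "t4")}
ranking = transitive_closure(judged)
# ...yield all six pairwise relations, a total order t1 > t2 > t3 > t4.
```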
Real-life translation quality estimation for MT system selection
Research on translation quality annotation and estimation usually makes use of standard language, sometimes related to a specific language genre or domain. However, real-life machine translation (MT), performed for instance by on-line translation services, has to cope with some extra difficulties related to the usage of open, non-standard and noisy language. In this paper we study the learning of quality estimation (QE) models able to rank translations from real-life input according to their goodness, without the need for translation references. For that, we work with a corpus collected from the 24/7 Reverso.net MT service, translated by 5 different MT systems, and manually annotated with quality scores. We define several families of features and train QE predictors in the form of regressors or direct rankers. The predictors show a remarkable correlation with gold standard rankings and prove to be useful in a system combination scenario, obtaining better results than any individual translation system.
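A minimal sketch of the regressor variant of such a QE predictor, assuming a single invented feature (language-model log-probability per word) and invented quality scores; the paper's actual models use several feature families:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented training data: feature value vs. annotated quality score.
feats  = [-4.0, -3.0, -2.5, -2.0]
scores = [ 1.0,  2.0,  3.0,  4.0]
a, b = fit_linear(feats, scores)

# Rank unseen candidate translations by predicted quality,
# with no need for a reference translation.
candidates = {"sysA": -3.5, "sysB": -2.2, "sysC": -2.9}
ranked = sorted(candidates, key=lambda s: a * candidates[s] + b,
                reverse=True)
```

The induced ranking over systems is exactly what a system combination scenario needs: pick the top-ranked translation per input.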
A graphical interface for MT evaluation and error analysis
Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA
toolkit, a rich repository of evaluation
measures working at different linguistic levels. The current implementation of the interface shows constituency and dependency trees as well as shallow syntactic and semantic annotations, and word alignments. The intelligent visualization of the linguistic structures used by the metrics, as well as a set of navigational functionalities, may lead towards advanced methods for automatic error analysis.
The UPC submission to the WMT 2012 shared task on quality estimation
In this paper, we describe the UPC system that
participated in the WMT 2012 shared task on
Quality Estimation for Machine Translation.
Based on the empirical evidence that fluency-related
features have a very high correlation
with post-editing effort, we present a set of
features for machine translation quality estimation
designed around
different kinds of n-gram language models,
plus another set of features that model the
quality of dependency parses automatically
projected from source sentences to translations.
We report the results obtained on
the shared task dataset by combining
the features that we designed with the baseline
features provided by the task organizers.
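One fluency-related feature of the kind described can be sketched as the normalised log-probability of a translation under a simple n-gram language model; the bigram model, training corpus and smoothing choice (add-one) below are invented for illustration:

```python
import math
from collections import Counter

# Tiny invented training corpus for the bigram language model.
train = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)  # vocabulary size, for add-one smoothing

def bigram_logprob(sentence):
    """Average log p(w_i | w_{i-1}) over the sentence's bigrams:
    higher values suggest a more fluent translation."""
    toks = sentence.split()
    lp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return lp / max(len(toks) - 1, 1)

fluent = bigram_logprob("the cat sat on the mat .")
disfluent = bigram_logprob("mat the on cat the sat .")
```

A scrambled translation scores lower than a fluent one, which is the signal such features feed to the quality estimator.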
Projective dependency parsing with perceptron
We describe an online learning dependency parser for the CoNLL-X Shared Task, based on the bottom-up projective algorithm of Eisner (2000). We experiment with a large feature set that models: the tokens involved in dependencies and their immediate context, the surface text distance between tokens, and the syntactic context dominated by each dependency. In our experiments, the treatment of multilingual information was totally blind.
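A compact sketch of the chart inner loop of Eisner's (2000) bottom-up projective algorithm, which underlies the parser described above; it returns only the score of the best projective tree, and the arc-score matrix in the usage example is hand-made rather than perceptron-trained:

```python
NEG = float("-inf")

def eisner(scores):
    """scores[h][m]: score of an arc head h -> modifier m; index 0 is
    the artificial root. Returns the best projective tree score."""
    n = len(scores)
    # C[s][t][d][c]: best score of span s..t; d=1 right-headed,
    # d=0 left-headed; c=1 complete, c=0 incomplete.
    C = [[[[NEG, NEG], [NEG, NEG]] for _ in range(n)] for _ in range(n)]
    for s in range(n):
        for d in (0, 1):
            C[s][s][d][1] = 0.0
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # incomplete spans: add an arc between the endpoints
            for r in range(s, t):
                merged = C[s][r][1][1] + C[r + 1][t][0][1]
                C[s][t][0][0] = max(C[s][t][0][0], merged + scores[t][s])
                C[s][t][1][0] = max(C[s][t][1][0], merged + scores[s][t])
            # complete spans: absorb an adjacent incomplete span
            for r in range(s, t):
                C[s][t][0][1] = max(C[s][t][0][1],
                                    C[s][r][0][1] + C[r][t][0][0])
            for r in range(s + 1, t + 1):
                C[s][t][1][1] = max(C[s][t][1][1],
                                    C[s][r][1][0] + C[r][t][1][1])
    return C[0][n - 1][1][1]

# Hand-made scores for root + two words: the tree root->1, 1->2
# (score 2 + 3) beats the alternatives.
best = eisner([[0, 2, 1],
               [0, 0, 3],
               [0, 1, 0]])
```

A full parser also stores backpointers to recover the head of each token; this sketch keeps only the dynamic program.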
Towards learning a constraint grammar from annotated corpora using decision trees
Inside the framework of robust parsers for the syntactic analysis of
unrestricted text, the aim of this work is the construction of a system
capable of automatically learning Constraint Grammar rules from a POS
annotated corpus. The system presented is currently able to acquire constraint
rules for POS tagging, and we plan to extend it to cover syntactic rules.
The learning process uses a supervised learning algorithm based on
building a discrimination forest, with a decision tree attached to each
case of POS ambiguity. The system has been applied to four representative
cases of ambiguity in a Spanish corpus. The results obtained
in these experiments, together with some discussion of the appropriateness of the
proposed learning technique, are presented in this paper.
Document-level machine translation as a re-translation process
Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherencies in the final translations. In this paper we present some document-level post-processes to improve the coherence and consistency of translations. Incoherencies are detected and new partial translations are proposed. The work focuses on two phenomena: words with inconsistent translations throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, an automatic evaluation does not reflect significant variations in the translations. However, improvements are observed through a manual evaluation.
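The first phenomenon (inconsistent translations of a word across a document) admits a simple majority-vote sketch: re-translate minority occurrences with the document-level majority choice. The word pairs below are invented, and a real system would work on aligned full documents:

```python
from collections import Counter, defaultdict

def enforce_consistency(doc):
    """doc: (source_word, current_translation) pairs for one document.
    Replace each translation with the document-level majority choice."""
    counts = defaultdict(Counter)
    for src, tgt in doc:
        counts[src][tgt] += 1
    majority = {src: c.most_common(1)[0][0] for src, c in counts.items()}
    return [(src, majority[src]) for src, _ in doc]

doc = [("bank", "banco"), ("bank", "banco"), ("bank", "orilla"),
       ("river", "rio")]
fixed = enforce_consistency(doc)
# the minority choice "orilla" is replaced by the majority "banco"
```

Gender and number agreement, the second phenomenon, would instead require morphological analysis of the target words, which this sketch does not attempt.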
Traducción automática a nivel de documento como proceso de retraducción
Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherencies in the final translations. In this paper we present some document-level post-processes to improve the coherence and consistency of translations. Incoherencies are detected and new partial translations are proposed. The work focuses on two phenomena: words with inconsistent translations throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, an automatic evaluation does not reflect significant variations in the translations. However, improvements are observed through a manual evaluation. Supported by an FPI grant within the OpenMT2 project (TIN2009-14675-C03) from the Spanish Ministry of Science and Innovation (MICINN) and by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC).
- …