100 research outputs found

    A Machine learning approach to POS tagging

    We have applied inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part-of-Speech tagging). The learning process is supervised and obtains a language model oriented towards resolving POS ambiguities. This model consists of a set of statistical decision trees expressing the distribution of tags and words in relevant contexts. The acquired language models are complete enough to be used directly as sets of POS disambiguation rules, and they include more complex contextual information than the simple collections of n-grams usually used in statistical taggers. We have implemented a simple and fast tagger that has been tested and evaluated on the Wall Street Journal (WSJ) corpus with remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible tagger based on relaxation labelling. In this direction, we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated hand-written constraints, etc.), and in particular to incorporate the machine-learned decision trees. We also address the problem of tagging when only a small amount of training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that our system achieves quite high accuracy in this situation.
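    The abstract gives no implementation details, so the following is only a minimal illustrative sketch of the core idea: learning a decision tree over local context to resolve one POS ambiguity class. It uses scikit-learn's DecisionTreeClassifier as a stand-in for the paper's statistical decision trees, with a tiny hypothetical tagged corpus and hypothetical features; the rule dump via export_text merely hints at how such trees can be read as disambiguation rules.

```python
# Illustrative sketch (not the paper's system): disambiguating POS tags with a
# decision tree over local context, using scikit-learn as a stand-in for the
# statistical decision trees described above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tagged sentences: (word, gold_tag) pairs.
corpus = [
    [("the", "DT"), ("lead", "NN"), ("pipe", "NN"), ("broke", "VBD")],
    [("they", "PRP"), ("lead", "VBP"), ("the", "DT"), ("team", "NN")],
    [("a", "DT"), ("clear", "JJ"), ("lead", "NN"), ("emerged", "VBD")],
]

def context_features(sent, i):
    """Features of the ambiguous position: neighbouring words and tags."""
    return {
        "word": sent[i][0],
        "prev_tag": sent[i - 1][1] if i > 0 else "BOS",
        "next_word": sent[i + 1][0] if i + 1 < len(sent) else "EOS",
    }

# Train one tree for the ambiguity class of the word "lead" (NN vs VBP).
X, y = [], []
for sent in corpus:
    for i, (word, tag) in enumerate(sent):
        if word == "lead":
            X.append(context_features(sent, i))
            y.append(tag)

vec = DictVectorizer()
tree = DecisionTreeClassifier().fit(vec.fit_transform(X), y)
# The learned tree can be dumped as human-readable rules.
print(export_text(tree, feature_names=vec.get_feature_names_out().tolist()))
```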

    Robust Estimation of Feature Weights in Statistical Machine Translation

    The weights of the various components in a standard Statistical Machine Translation model are usually estimated via Minimum Error Rate Training (MERT). This finds their optimal values on a development set, with the expectation that these weights will generalise well to other test sets. However, this is not always the case when domains differ. This work uses a perceptron algorithm to learn more robust weights that can be used on out-of-domain corpora without the need for specialised data. For an Arabic-to-English translation system, the generalised weights yield an improvement of more than 2 BLEU points over the MERT baseline using the same information.
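    As a rough, hypothetical sketch of the general idea (not the paper's exact algorithm or features), the snippet below applies perceptron-style updates to SMT component weights over n-best lists: whenever the model-best candidate differs from the candidate with the best sentence-level quality score, the weights are nudged towards the latter's feature vector.

```python
# Rough sketch of perceptron-style tuning of SMT feature weights (not the
# paper's exact algorithm): for each source sentence, if the model-best
# candidate in the n-best list differs from the best-quality candidate,
# move the weights towards the latter's features.
import numpy as np

def perceptron_tune(nbest_lists, n_features, epochs=10, lr=0.1):
    """nbest_lists: list of lists of (feature_vector, quality_score) per sentence.
    feature_vector: np.array of SMT component scores (LM, TM, distortion, ...)."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for candidates in nbest_lists:
            feats = np.array([f for f, _ in candidates])
            quality = np.array([q for _, q in candidates])
            model_best = int(np.argmax(feats @ w))   # what the model would pick
            oracle = int(np.argmax(quality))         # best by sentence-level quality
            if model_best != oracle:
                w += lr * (feats[oracle] - feats[model_best])
    return w

# Toy usage with two hypothetical candidates for one sentence.
nbest = [[(np.array([0.2, 1.5]), 0.30), (np.array([0.9, 0.7]), 0.55)]]
print(perceptron_tune(nbest, n_features=2))
```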

    A graph-based strategy to streamline translation quality assessments

    We present a detailed analysis of a graph-based annotation strategy that we employed to annotate a corpus of 11,292 real-world English-to-Spanish automatic translations with relative (ranking) and absolute (adequate/non-adequate) quality assessments. The proposed approach, inspired by previous work in Interactive Evolutionary Computation and Interactive Genetic Algorithms, results in a simpler and faster annotation process. We empirically compare the method against a traditional, explicit ranking approach, and show that the graph-based strategy: 1) is considerably faster, and 2) produces consistently more reliable annotations.
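    The paper's graph-based procedure cannot be reconstructed from the abstract alone; the sketch below only illustrates the final step of turning pairwise "A is better than B" judgments into a global ranking, here with simple Copeland-style win counts rather than the paper's method.

```python
# Minimal sketch: turning pairwise "A is better than B" judgments into a global
# ranking by counting wins (a Copeland-style score); the paper's graph-based
# procedure is richer, this only illustrates the idea.
from collections import defaultdict

def rank_from_pairwise(preferences):
    """preferences: iterable of (winner_id, loser_id) pairs."""
    score = defaultdict(int)
    for winner, loser in preferences:
        score[winner] += 1
        score[loser] -= 1
    return sorted(score, key=score.get, reverse=True)

# Hypothetical judgments over three candidate translations.
judgments = [("t1", "t2"), ("t1", "t3"), ("t2", "t3")]
print(rank_from_pairwise(judgments))   # ['t1', 't2', 't3']
```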

    Real-life translation quality estimation for MT system selection

    Research on translation quality annotation and estimation usually makes use of standard language, sometimes tied to a specific genre or domain. However, real-life machine translation (MT), performed for instance by on-line translation services, has to cope with extra difficulties related to the use of open, non-standard and noisy language. In this paper we study the learning of quality estimation (QE) models able to rank translations of real-life input according to their quality without the need for translation references. To that end, we work with a corpus collected from the 24/7 Reverso.net MT service, translated by 5 different MT systems and manually annotated with quality scores. We define several families of features and train QE predictors in the form of regressors or direct rankers. The predictors show a remarkable correlation with gold-standard rankings and prove useful in a system combination scenario, obtaining better results than any individual translation system.
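    A minimal sketch of reference-free QE for system selection, under assumed features and an assumed model (a random forest regressor); the paper's feature families and annotation scheme are not reproduced. The idea: train a regressor on feature vectors labelled with quality scores, then choose the system whose output receives the highest predicted score.

```python
# Sketch of reference-free QE for system selection (features and model are
# placeholders, not the paper's feature families): train a regressor on
# feature vectors with human quality scores, then pick the system whose
# output gets the highest predicted score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: each row is a feature vector for one translation
# (e.g. source/target length ratio, LM perplexity, OOV rate), with a quality label.
X_train = np.array([[1.0, 120.0, 0.00], [0.6, 340.0, 0.20], [0.9, 150.0, 0.05]])
y_train = np.array([0.9, 0.2, 0.7])

qe = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# At run time: features of the same sentence translated by several MT systems.
candidates = {"system_A": [0.95, 130.0, 0.02], "system_B": [0.7, 300.0, 0.15]}
best = max(candidates, key=lambda s: qe.predict([candidates[s]])[0])
print("selected:", best)
```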

    A graphical interface for MT evaluation and error analysis

    Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA toolkit, a rich repository of evaluation measures working at different linguistic levels. The current implementation of the interface shows constituency and dependency trees as well as shallow syntactic and semantic annotations, and word alignments. The intelligent visualization of the linguistic structures used by the metrics, as well as a set of navigational functionalities, may lead towards advanced methods for automatic error analysis.

    The UPC submission to the WMT 2012 shared task on quality estimation

    In this paper, we describe the UPC system that participated in the WMT 2012 shared task on Quality Estimation for Machine Translation. Based on the empirical evidence that fluency-related features have a very high correlation with post-editing effort, we present a set of quality estimation features designed around different kinds of n-gram language models, plus another set of features that model the quality of dependency parses automatically projected from source sentences to translations. We report the results obtained on the shared task dataset by combining the features we designed with the baseline features provided by the task organizers.
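    As a toy illustration of fluency-style language-model features (the submission used richer language models and additional parse-projection features not shown here), the snippet below computes log-probability and perplexity features from an add-one-smoothed bigram model.

```python
# Sketch of fluency-style features from a simple n-gram language model
# (add-one-smoothed bigrams here; the actual submission used richer LMs and
# parse-based features not reproduced in this toy example).
import math
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def fluency_features(sentence, unigrams, bigrams):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    V = len(unigrams)
    logprob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams keep a non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        logprob += math.log(p)
    n = len(tokens) - 1
    return {"logprob": logprob, "perplexity": math.exp(-logprob / n)}

uni, bi = train_bigram_lm(["the cat sat on the mat", "the dog sat on the rug"])
print(fluency_features("the cat sat on the rug", uni, bi))
```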

    Projective dependency parsing with perceptron

    We describe an online learning dependency parser for the CoNLL-X Shared Task, based on the bottom-up projective algorithm of Eisner (2000). We experiment with a large feature set that models: the tokens involved in dependencies and their immediate context, the surface-text distance between tokens, and the syntactic context dominated by each dependency. In our experiments, the treatment of multilingual information was completely blind.
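    The parser relies on Eisner's projective dynamic program, which is not reproduced here; the following is a deliberately simplified sketch of the online perceptron component only, with hypothetical features and a brute-force decoder over all head assignments of a very short sentence (ignoring projectivity) purely to keep the example small.

```python
# Highly simplified sketch of the online perceptron idea for arc-factored
# dependency parsing. The paper uses Eisner's projective dynamic program as
# the decoder; here, for brevity, the decoder brute-forces every head
# assignment of a very short sentence and keeps the highest-scoring valid tree.
import itertools
from collections import defaultdict

def arc_features(words, head, mod):
    hw = "ROOT" if head == 0 else words[head - 1]
    return [f"hw={hw}_mw={words[mod - 1]}", f"dist={head - mod}"]

def score_tree(words, heads, w):
    return sum(w[f] for m, h in enumerate(heads, start=1) for f in arc_features(words, h, m))

def is_tree(heads):
    # Every token must reach ROOT (index 0) without cycles.
    for m in range(1, len(heads) + 1):
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h - 1]
    return True

def decode(words, w):
    n = len(words)
    best, best_score = None, float("-inf")
    for heads in itertools.product(range(n + 1), repeat=n):
        if any(h == m for m, h in enumerate(heads, start=1)) or not is_tree(heads):
            continue
        s = score_tree(words, heads, w)
        if s > best_score:
            best, best_score = heads, s
    return best

def perceptron_train(treebank, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in treebank:
            pred = decode(words, w)
            if pred != gold:
                for m in range(1, len(words) + 1):
                    for f in arc_features(words, gold[m - 1], m):
                        w[f] += 1.0
                    for f in arc_features(words, pred[m - 1], m):
                        w[f] -= 1.0
    return w

# Tiny hypothetical treebank: heads[i] is the head of token i+1 (0 = ROOT).
treebank = [(["she", "reads", "books"], (2, 0, 2))]
weights = perceptron_train(treebank)
print(decode(["she", "reads", "books"], weights))
```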

    Towards learning a constraint grammar from annotated corpora using decision trees

    Within the framework of robust parsing of unrestricted text, the aim of this work is to construct a system capable of automatically learning Constraint Grammar rules from a POS-annotated corpus. The system presented here currently acquires constraint rules for POS tagging, and we plan to extend it to cover syntactic rules. The learning process uses a supervised algorithm that builds a discrimination forest, with a decision tree attached to each case of POS ambiguity. The system has been applied to four representative cases of ambiguity in a Spanish corpus. This paper presents the results obtained in these experiments and discusses the appropriateness of the proposed learning technique.
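    A hypothetical sketch of the "discrimination forest" idea: one decision tree per POS ambiguity class, trained only on occurrences of words showing that ambiguity. scikit-learn trees and the toy features stand in for the paper's statistical decision trees and real corpus data.

```python
# Sketch of a "discrimination forest": one decision tree per POS ambiguity
# class (e.g. {NN, VB}), each trained only on occurrences of words showing
# that ambiguity. Features and data are hypothetical.
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def build_forest(instances):
    """instances: list of (ambiguity_class, feature_dict, gold_tag),
    where ambiguity_class is a frozenset of candidate tags."""
    grouped = defaultdict(list)
    for amb, feats, tag in instances:
        grouped[amb].append((feats, tag))
    forest = {}
    for amb, data in grouped.items():
        X, y = zip(*data)
        vec = DictVectorizer()
        forest[amb] = (vec, DecisionTreeClassifier().fit(vec.fit_transform(X), y))
    return forest

def disambiguate(forest, ambiguity_class, feats):
    vec, tree = forest[ambiguity_class]
    return tree.predict(vec.transform([feats]))[0]

# Hypothetical training instances for the {NN, VB} ambiguity class.
NN_VB = frozenset({"NN", "VB"})
train = [
    (NN_VB, {"prev_tag": "DT"}, "NN"),
    (NN_VB, {"prev_tag": "TO"}, "VB"),
    (NN_VB, {"prev_tag": "DT"}, "NN"),
]
forest = build_forest(train)
print(disambiguate(forest, NN_VB, {"prev_tag": "TO"}))   # likely 'VB'
```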

    Document-level machine translation as a re-translation process

    Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherences in the final translations. In this paper we present several document-level post-processes to improve the coherence and consistency of translations: incoherences are detected and new partial translations are proposed. The work focuses on two phenomena: words translated inconsistently throughout a text, and gender and number agreement among words. Since we deal with specific phenomena, an automatic evaluation does not reflect significant variations in the translations; however, improvements are observed through a manual evaluation.
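    A minimal sketch of the lexical-consistency check described above, under an assumed alignment format and threshold: collect, via word alignments, the translations chosen for each source word across the document and flag words whose dominant translation covers too few of their occurrences.

```python
# Sketch of a lexical-consistency check: given word alignments for every
# sentence of a document, collect the translations chosen for each source
# word and flag those whose dominant translation covers too few occurrences.
# The alignment format and the threshold are illustrative, not the paper's.
from collections import Counter, defaultdict

def inconsistent_words(doc, min_count=2, dominance=0.8):
    """doc: list of (source_tokens, target_tokens, alignment) triples,
    alignment being (src_idx, tgt_idx) pairs."""
    translations = defaultdict(Counter)
    for src, tgt, align in doc:
        for i, j in align:
            translations[src[i].lower()][tgt[j].lower()] += 1
    flagged = {}
    for word, counts in translations.items():
        total = sum(counts.values())
        if total >= min_count and counts.most_common(1)[0][1] / total < dominance:
            flagged[word] = dict(counts)
    return flagged

doc = [
    (["the", "plant", "closed"], ["la", "planta", "cerró"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "plant", "reopened"], ["la", "fábrica", "reabrió"], [(0, 0), (1, 1), (2, 2)]),
]
print(inconsistent_words(doc))   # flags 'plant' translated as both 'planta' and 'fábrica'
```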

    Document-level machine translation as a re-translation process (Traducción automática a nivel de documento como proceso de retraducción)

    Most current Machine Translation systems are designed to translate a document sentence by sentence, ignoring discourse information and producing incoherences in the final translations. In this paper we present several systems that detect document-level incoherences and propose new partial translations to improve the overall cohesion and coherence of the output. The study focuses on two cases: words translated inconsistently throughout a text, and gender and number agreement among words. Since these are specific phenomena, the changes are not reflected in a global automatic evaluation, but a manual evaluation shows improvements in the translations. Supported by an FPI grant within the OpenMT2 project (TIN2009-14675-C03) from the Spanish Ministry of Science and Innovation (MICINN) and by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC).
    • …