372 research outputs found

    Learning labelled dependencies in machine translation evaluation

    Get PDF
    Recently novel MT evaluation metrics have been presented which go beyond pure string matching, and which correlate better than other existing metrics with human judgements. Other research in this area has presented machine learning methods which learn directly from human judgements. In this paper, we present a novel combination of dependency- and machine learning-based approaches to automatic MT evaluation, and demonstrate greater correlations with human judgement than the existing state-of-the-art methods. In addition, we examine the extent to which our novel method can be generalised across different tasks and domains

    Using F-structures in machine translation evaluation

    Get PDF
    Despite a growing interest in automatic evaluation methods for Machine Translation (MT) quality, most existing automatic metrics are still limited to surface comparison of translation and reference strings. In this paper we show how Lexical-Functional Grammar (LFG) labelled dependencies obtained from an automatic parse can be used to assess the quality of MT on a deeper linguistic level, giving as a result higher correlations with human judgements

    An analysis of question processing of English and Chinese for the NTCIR 5 cross-language question answering task

    Get PDF
    An important element in question answering systems is the analysis and interpretation of questions. Using the NTCIR 5 Cross-Language Question Answering (CLQA) question test set we demonstrate that the accuracy of deep question analysis is dependent on the quantity and suitability of the available linguistic resources. We further demonstrate that applying question analysis tools developed on monolingual training materials to questions translated Chinese-English and English-Chinese using machine translation produces much reduced effectiveness in interpretation of the question. This latter result indicates that question analysis for CLQA should primarily be conducted in the question language prior to translation

    Wide-coverage deep statistical parsing using automatic dependency structure annotation

    Get PDF
    A number of researchers (Lin 1995; Carroll, Briscoe, and Sanfilippo 1998; Carroll et al. 2002; Clark and Hockenmaier 2002; King et al. 2003; Preiss 2003; Kaplan et al. 2004;Miyao and Tsujii 2004) have convincingly argued for the use of dependency (rather than CFG-tree) representations for parser evaluation. Preiss (2003) and Kaplan et al. (2004) conducted a number of experiments comparing “deep” hand-crafted wide-coverage with “shallow” treebank- and machine-learning based parsers at the level of dependencies, using simple and automatic methods to convert tree output generated by the shallow parsers into dependencies. In this article, we revisit the experiments in Preiss (2003) and Kaplan et al. (2004), this time using the sophisticated automatic LFG f-structure annotation methodologies of Cahill et al. (2002b, 2004) and Burke (2006), with surprising results. We compare various PCFG and history-based parsers (based on Collins, 1999; Charniak, 2000; Bikel, 2002) to find a baseline parsing system that fits best into our automatic dependency structure annotation technique. This combined system of syntactic parser and dependency structure annotation is compared to two hand-crafted, deep constraint-based parsers (Carroll and Briscoe 2002; Riezler et al. 2002). We evaluate using dependency-based gold standards (DCU 105, PARC 700, CBS 500 and dependencies for WSJ Section 22) and use the Approximate Randomization Test (Noreen 1989) to test the statistical significance of the results. Our experiments show that machine-learning-based shallow grammars augmented with sophisticated automatic dependency annotation technology outperform hand-crafted, deep, widecoverage constraint grammars. Currently our best system achieves an f-score of 82.73% against the PARC 700 Dependency Bank (King et al. 2003), a statistically significant improvement of 2.18%over the most recent results of 80.55%for the hand-crafted LFG grammar and XLE parsing system of Riezler et al. (2002), and an f-score of 80.23% against the CBS 500 Dependency Bank (Carroll, Briscoe, and Sanfilippo 1998), a statistically significant 3.66% improvement over the 76.57% achieved by the hand-crafted RASP grammar and parsing system of Carroll and Briscoe (2002)

    Using machine-learning to assign function labels to parser output for Spanish

    Get PDF
    Data-driven grammatical function tag assignment has been studied for English using the Penn-II Treebank data. In this paper we address the question of whether such methods can be applied successfully to other languages and treebank resources. In addition to tag assignment accuracy and f-scores we also present results of a task-based evaluation. We use three machine-learning methods to assign Cast3LB function tags to sentences parsed with Bikel’s parser trained on the Cast3LB treebank. The best performing method, SVM, achieves an f-score of 86.87% on gold-standard trees and 66.67% on parser output - a statistically significant improvement of 6.74% over the baseline. In a task-based evaluation we generate LFG functional-structures from the function tag-enriched trees. On this task we achive an f-score of 75.67%, a statistically significant 3.4% improvement over the baseline

    A novel dependency-based evaluation metric for machine translation

    Get PDF
    Automatic evaluation measures such as BLEU (Papineni et al. (2002)) and NIST (Doddington (2002)) are indispensable in the development of Machine Translation (MT) systems, because they allow MT developers to conduct frequent, fast, and cost-effective evaluations of their evolving translation models. However, most of the automatic evaluation metrics rely on a comparison of word strings, measuring only the surface similarity of the candidate and reference translations, and will penalize any divergence. In effect,a candidate translation expressing the source meaning accurately and fluently will be given a low score if the lexical and syntactic choices it contains, even though perfectly legitimate, are not present in at least one of the references. Necessarily, this score would differ from a much more favourable human judgment that such a translation would receive. This thesis presents a method that automatically evaluates the quality of translation based on the labelled dependency structure of the sentence, rather than on its surface form. Dependencies abstract away from the some of the particulars of the surface string realization and provide a more "normalized" representation of (some) syntactic variants of a given sentence. The translation and reference files are analyzed by a treebank-based, probabilistic Lexical-Functional Grammar (LFG) parser (Cahill et al. (2004)) for English, which produces a set of dependency triples for each input. The translation set is compared to the reference set, and the number of matches is calculated, giving the precision, recall, and f-score for that particular translation. The use of WordNet synonyms and partial matching during the evaluation process allows for adequate treatment of lexical variation, while employing a number of best parses helps neutralize the noise introduced during the parsing stage. The dependency-based method is compared against a number of other popular MT evaluation metrics, including BLEU, NIST, GTM (Turian et al. (2003)), TER (Snover et al. (2006)), and METEOR (Banerjee and Lavie (2005)), in terms of segment- and system-level correlations with human judgments of fluency and adequacy. We also examine whether it shows bias towards statistical MT models. The comparison of the dependency-based method with other evaluation metrics is then extended to languages other than English: French, German, Spanish, and Japanese, where we apply our method to dependencies generated by Microsoft's NLPWin analyzer (Corston-Oliver and Dolan (1999); Heidorn (2000)) as well as, in the case of the Spanish data, those produced by the treebank-based, probabilistic LFG parser of Chrupa la and van Genabith (2006a,b)

    F-structure transfer-based statistical machine translation

    Get PDF
    In this paper, we describe a statistical deep syntactic transfer decoder that is trained fully automatically on parsed bilingual corpora. Deep syntactic transfer rules are induced automatically from the f-structures of a LFG parsed bitext corpus by automatically aligning local f-structures, and inducing all rules consistent with the node alignment. The transfer decoder outputs the n-best TL f-structures given a SL f-structure as input by applying large numbers of transfer rules and searching for the best output using a log-linear model to combine feature scores. The decoder includes a fully integrated dependency-based tri-gram language model. We include an experimental evaluation of the decoder using different parsing disambiguation resources for the German data to provide a comparison of how the system performs with different German training and test parses

    Comparative evaluation of research vs. Online MT systems

    Get PDF
    This paper reports MT evaluation experiments that were conducted at the end of year 1 of the EU-funded CoSyne 1 project for three language combinations, considering translations from German, Italian and Dutch into English. We present a comparative evaluation of the MT software developed within the project against four of the leading free webbased MT systems across a range of state-of-the-art automatic evaluation metrics. The data sets from the news domain that were created and used for training purposes and also for this evaluation exercise, which are available to the research community, are also described. The evaluation results for the news domain are very encouraging: the CoSyne MT software consistently beats the rule-based MT systems, and for translations from Italian and Dutch into English in particular the scores given by some of the standard automatic evaluation metrics are not too distant from those obtained by wellestablished statistical online MT systems

    Lost in translation: the problems of using mainstream MT evaluation metrics for sign language translation

    Get PDF
    In this paper we consider the problems of applying corpus-based techniques to minority languages that are neither politically recognised nor have a formally accepted writing system, namely sign languages. We discuss the adoption of an annotated form of sign language data as a suitable corpus for the development of a data-driven machine translation (MT) system, and deal with issues that arise from its use. Useful software tools that facilitate easy annotation of video data are also discussed. Furthermore, we address the problems of using traditional MT evaluation metrics for sign language translation. Based on the candidate translations produced from our example-based machine translation system, we discuss why standard metrics fall short of providing an accurate evaluation and suggest more suitable evaluation methods

    Irish treebanking and parsing: a preliminary evaluation

    Get PDF
    Language resources are essential for linguistic research and the development of NLP applications. Low- density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish – namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage upon the few existing resources. We discuss language specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development
    corecore