7 research outputs found

    Rich Linguistic Structure from Large-Scale Web Data

    The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view, that Web-scale data not only serves to improve the performance of simple models, but also can allow the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.
    Engineering and Applied Science

    Resourcing machine translation with parallel treebanks

    The benefits of syntax-based approaches to data-driven machine translation (MT) are clear: given the right model, a combination of hierarchical structure, constituent labels and morphological information can be exploited to produce more fluent, grammatical translation output. This has been demonstrated by the recent shift in research focus towards such linguistically motivated approaches. However, one issue facing developers of such models that is not encountered in the development of state-of-the-art string-based statistical MT (SMT) systems is the lack of available syntactically annotated training data for many languages. In this thesis, we propose a solution to the problem of limited resources for syntax-based MT by introducing a novel sub-sentential alignment algorithm for the induction of translational equivalence links between pairs of phrase structure trees. This algorithm, which operates on a language pair-independent basis, allows for the automatic generation of large-scale parallel treebanks which are useful not only for machine translation, but also across a variety of natural language processing tasks. We demonstrate the viability of our automatically generated parallel treebanks by means of a thorough evaluation process during which they are compared to a manually annotated gold standard parallel treebank both intrinsically and in an MT task. Following this, we hypothesise that these parallel treebanks are not only useful in syntax-based MT, but also have the potential to be exploited in other paradigms of MT. To this end, we carry out a large number of experiments across a variety of data sets and language pairs, in which we exploit the information encoded within the parallel treebanks in various components of phrase-based statistical MT systems. 
    We demonstrate that improvements in translation accuracy can be achieved by enhancing SMT phrase tables with linguistically motivated phrase pairs extracted from a parallel treebank, while showing that a number of other features in SMT can also be supplemented with varying degrees of effectiveness. Finally, we examine ways in which synchronous grammars extracted from parallel treebanks can improve the quality of translation output, focussing on real translation examples from a syntax-based MT system.

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies