8 research outputs found

    Open Challenges in Treebanking: Some Thoughts Based on the Copenhagen Dependency Treebanks

    Get PDF
    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 1-13. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Cross-Lingual Adaptation for Type Inference

    Full text link
    Deep learning-based techniques have been widely applied to the program analysis tasks, in fields such as type inference, fault localization, and code summarization. Hitherto deep learning-based software engineering systems rely thoroughly on supervised learning approaches, which require laborious manual effort to collect and label a prohibitively large amount of data. However, most Turing-complete imperative languages share similar control- and data-flow structures, which make it possible to transfer knowledge learned from one language to another. In this paper, we propose cross-lingual adaptation of program analysis, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to the others. Specifically, we implemented a cross-lingual adaptation framework, PLATO, to transfer a deep learning-based type inference procedure across weakly typed languages, e.g., Python to JavaScript and vice versa. PLATO incorporates a novel joint graph kernelized attention based on abstract syntax tree and control flow graph, and applies anchor word augmentation across different languages. Besides, by leveraging data from strongly typed languages, PLATO improves the perplexity of the backbone cross-programming-language model and the performance of downstream cross-lingual transfer for type inference. Experimental results illustrate that our framework significantly improves the transferability over the baseline method by a large margin

    Ambiguity-aware ensemble training for semi-supervised dependency parsing

    Get PDF
    Abstract This paper proposes a simple yet effective framework for semi-supervised dependency parsing at entire tree level, referred to as ambiguity-aware ensemble training. Instead of only using 1-best parse trees in previous work, our core idea is to utilize parse forest (ambiguous labelings) to combine multiple 1-best parse trees generated from diverse parsers on unlabeled data. With a conditional random field based probabilistic dependency parser, our training objective is to maximize mixed likelihood of labeled data and auto-parsed unlabeled data with ambiguous labelings. This framework offers two promising advantages. 1) ambiguity encoded in parse forests compromises noise in 1-best parse trees. During training, the parser is aware of these ambiguous structures, and has the flexibility to distribute probability mass to its preferred parse trees as long as the likelihood improves. 2) diverse syntactic structures produced by different parsers can be naturally compiled into forest, offering complementary strength to our single-view parser. Experimental results on benchmark data show that our method significantly outperforms the baseline supervised parser and other entire-tree based semi-supervised methods, such as self-training, co-training and tri-training

    Revisiting tri-training of dependency parsers

    Get PDF
    We compare two orthogonal semi-supervised learning techniques, namely tri-training and pretrained word embeddings, in the task of dependency parsing. We explore language-specific FastText and ELMo embeddings and multilingual BERT embeddings. We focus on a low resource scenario as semi-supervised learning can be expected to have the most impact here. Based on treebank size and available ELMo models, we select Hungarian, Uyghur (a zero-shot language for mBERT) and Vietnamese. Furthermore, we include English in a simulated low-resource setting. We find that pretrained word embeddings make more effective use of unlabelled data than tri-training but that the two approaches can be successfully combined

    Semi-supervised dependency parsing using generalized tri-training

    No full text
    Martins et al. (2008) presented what to the best of our knowledge still ranks as the best overall result on the CONLL-X Shared Task datasets. The paper shows how triads of stacked dependency parsers described in Martins et al. (2008) can label unlabeled data for each other in a way similar to co-training and produce end parsers that are significantly better than any of the stacked input parsers. We evaluate our system on five datasets from the CONLL-X Shared Task and obtain 10–20 % error reductions, incl. the best reported results on four of them. We compare our approach to other semisupervised learning algorithms.

    Proceedings

    Get PDF
    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893
    corecore