Search CORE

3 research outputs found

Dependency Grammar Induction with Neural Lexicalization and Big Training Data

Author: Han Wenjuan
Jiang Yong
Tu Kewei
Publication venue
Publication date: 01/01/2017
Field of study

We study the impact of big models (in terms of the degree of lexicalization) and big data (in terms of the training corpus size) on dependency grammar induction. We experimented with L-DMV, a lexicalized version of Dependency Model with Valence and L-NDMV, our lexicalized extension of the Neural Dependency Model with Valence. We find that L-DMV only benefits from very small degrees of lexicalization and moderate sizes of training corpora. L-NDMV can benefit from big training data and lexicalization of greater degrees, especially when enhanced with good model initialization, and it achieves a result that is competitive with the current state-of-the-art.Comment: EMNLP 201

arXiv.org e-Print Archive

Crossref

文に隠れた構文構造を発見する統計モデル

Author: Hiroshi Noji
能地宏
Publication venue: 統計数理研究所
Publication date: 01/12/2016
Field of study

要旨あり統計的言語研究の現在研究詳

RISM (Repository of the Institute of Statistical Mathematics) / 統計数理研究所学術研究リポジトリ

Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

Author: Wang Dingquan
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 06/02/2020
Field of study

This thesis focuses on unsupervised dependency parsing—parsing sentences of a language into dependency trees without accessing the training data of that language. Different from most prior work that uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: Synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is reliant upon the general syntax of the target language. The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which only assumes an unparsed corpus to be available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works’ interpretable typological features that require parsed corpora or expert categorization of the language

Johns Hopkins University

JScholarship