
4 research outputs found

Statistical Approaches to Language Change and Phylogeny (言語変化と系統への統計的アプローチ)

Author: Yugo Murawaki (村脇有吾)
Publication venue: The Institute of Statistical Mathematics (統計数理研究所)
Publication date: 01/12/2016

Abstract available. Statistical Language Research Today (統計的言語研究の現在). 研究詳

RISM (Repository of the Institute of Statistical Mathematics) / 統計数理研究所学術研究リポジトリ

Phylogenetics of Indo-European Language Families via an Algebro-Geometric Analysis of Their Syntactic Structures

Author: Berwick Robert C.
Marcolli Matilde
Ortegaray Andrew
Shu Kevin
Publication venue: Springer Science and Business Media LLC
Publication date: 15/04/2021

Using Phylogenetic Algebraic Geometry, we computationally analyze the phylogenetic tree of subfamilies of the Indo-European language family, using data on syntactic structures. The two main sources of syntactic data are the SSWL database and Longobardi’s recent data on syntactic parameters. We compute phylogenetic invariants and estimates of the Euclidean distance functions for two sets of Germanic languages, a set of Romance languages, a set of Slavic languages, and a set of early Indo-European languages, and we compare the results with what is known through historical linguistics.

DSpace@MIT

Caltech Authors

Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

Author: Wang Dingquan
Publication venue: The Busan Gyeongnam Mathematical Society
Publication date: 06/02/2020

This thesis focuses on unsupervised dependency parsing: parsing sentences of a language into dependency trees without access to training data for that language. Unlike most prior work, which uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: synthetic language generation provides a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector that serves as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token depends on the general syntax of the target language. The fundamental question we are trying to answer is whether useful information about the syntax of a language can be inferred from its surface-form evidence (an unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which assume that only an unparsed corpus is available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works’ interpretable typological features that require parsed corpora or expert categorization of the language.

Johns Hopkins University

JScholarship