Unsupervised spectral learning of WCFG as low-rank matrix completion
We derive a spectral method for unsupervised
learning of Weighted Context-Free Grammars (WCFGs).
We frame WCFG induction as finding a Hankel
matrix that has low rank and is linearly
constrained to represent a function computed
by inside-outside recursions. The proposed algorithm picks the grammar that agrees with a sample and is the simplest with respect to the nuclear norm of the Hankel matrix.
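The low-rank recovery step can be illustrated with a generic singular-value-thresholding routine, a standard heuristic for nuclear-norm-regularized matrix completion. This is a sketch of the general technique, not the paper's spectral algorithm, and the matrix here is an arbitrary low-rank matrix rather than a Hankel matrix of a WCFG:

```python
import numpy as np

def svt_complete(M, observed, tau=0.1, step=1.0, iters=200):
    """Complete a partially observed matrix by singular value
    thresholding (SVT), a standard heuristic for nuclear-norm
    minimization. `observed` is a boolean mask of known entries."""
    X = np.zeros_like(M, dtype=float)
    Y = np.zeros_like(M, dtype=float)
    for _ in range(iters):
        # Shrink singular values: the prox operator of the nuclear norm.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Gradient step on the observed entries only.
        Y = Y + step * observed * (M - X)
    return X

# Toy example: a rank-1 matrix with roughly half its entries hidden.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(5, 1)), rng.normal(size=(1, 5))
M = u @ v
mask = rng.random(M.shape) < 0.5
X = svt_complete(M, mask)
```

The shrinkage parameter `tau` plays the role of the nuclear-norm weight: larger values bias the completion toward lower rank at the cost of a worse fit to the observed entries.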
A statistical model for discovering the syntactic structure hidden in sentences
A cascaded unsupervised model for PoS tagging
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in March 2021.
Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing (NLP): it assigns a syntactic category (such as noun, verb, or adjective) to each word in a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g. dependency parsing) and thereby extract the meaning of the sentence (e.g. semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora.
One of the widely used methods for the tagging problem is the log-linear model. Initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another, fully unsupervised Bayesian model to initialize its parameters in a cascaded framework.
We thus transfer knowledge between two different unsupervised models to improve the PoS tagging results: the log-linear model benefits from the Bayesian model's expertise. We present results for Turkish, as a morphologically rich language, and for English, as a comparatively morphologically poor language, in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
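The cascaded initialization idea can be sketched as follows: take (word, tag) statistics produced by a first-stage unsupervised model and turn them into initial weights for the word-tag features of a second model. This is a minimal illustration under assumed details; the smoothing constant `alpha` and the toy first-stage output are this sketch's inventions, not values from the paper:

```python
import math
from collections import Counter

def init_from_first_stage(tagged_corpus, alpha=0.1):
    """Convert (word, tag) counts from a first-stage unsupervised
    tagger into initial weights for word-tag features of a
    log-linear model, using Dirichlet-style smoothing with
    pseudo-count `alpha` (an assumption of this sketch)."""
    counts = Counter(tagged_corpus)
    tag_totals = Counter(tag for _, tag in tagged_corpus)
    vocab = {w for w, _ in tagged_corpus}
    weights = {}
    for (word, tag), c in counts.items():
        p = (c + alpha) / (tag_totals[tag] + alpha * len(vocab))
        weights[(word, tag)] = math.log(p)  # weight starts at log-probability
    return weights

# Hypothetical output of a first-stage Bayesian tagger.
stage1 = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("the", "DET")]
w = init_from_first_stage(stage1)
```

Initializing feature weights at smoothed log-probabilities places the second model's starting point near the first model's distribution, which is the kind of knowledge transfer the cascade relies on.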
Generative Non-Markov Models for Information Extraction
Learning from unlabeled data is a long-standing challenge in machine learning. A
principled solution involves modeling the full joint distribution over inputs
and the latent structure of interest, and imputing the missing data via
marginalization. Unfortunately, such marginalization is expensive for most
non-trivial problems, which places practical limits on the expressiveness of
generative models. As a result, joint models often encode strict assumptions
about the underlying process such as fixed-order Markovian assumptions and
employ simple count-based features of the inputs. In contrast, conditional
models, which do not directly model the observed data, are free to incorporate
rich overlapping features of the input in order to predict the latent structure
of interest. It would be desirable to develop expressive generative models that
retain tractable inference. This is the topic of this thesis. In particular, we
explore joint models which relax fixed-order Markov assumptions, and investigate
the use of recurrent neural networks for automatic feature induction in the
generative process.
We focus on two structured prediction problems: (1) imputing labeled segmentations
of input character sequences, and (2) imputing directed spanning trees relating
strings in text corpora. These problems arise in many applications of practical
interest, but we are primarily concerned with named-entity recognition and
cross-document coreference resolution in this work.
For named-entity recognition, we propose a generative model in which the
observed characters originate from a latent non-Markov process over words, and
where the characters are themselves produced via a non-Markov process: a
recurrent neural network (RNN). We propose a sampler for this model in
which sequential Monte Carlo is used as a transition kernel within a Gibbs sampler.
The kernel is amenable to a fast parallel implementation, and results in fast
mixing in practice.
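The sequential Monte Carlo machinery can be illustrated with a bootstrap particle filter on a toy discrete HMM. This is only a sketch of the generic SMC step (propagate, weight, resample); the thesis's actual model is a non-Markov RNN over characters, and its kernel is used inside a Gibbs sweep, neither of which this toy reproduces:

```python
import numpy as np

def bootstrap_filter(obs, trans, emit, init, n_particles=500, rng=None):
    """Bootstrap particle filter for a toy discrete HMM.
    `trans[s]` is the transition distribution from state s,
    `emit[s, y]` the probability of emitting symbol y from state s."""
    rng = rng or np.random.default_rng(0)
    n_states = len(init)
    particles = rng.choice(n_states, size=n_particles, p=init)
    loglik = 0.0
    for y in obs:
        # Propagate each particle through the transition distribution.
        particles = np.array(
            [rng.choice(n_states, p=trans[s]) for s in particles])
        # Weight particles by the emission likelihood of the observation.
        w = emit[particles, y]
        loglik += np.log(w.mean())
        # Resample to focus the particle set on likely states.
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = particles[idx]
    return particles, loglik

# A sticky two-state HMM observed for three steps.
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.8, 0.2], [0.2, 0.8]])
init = np.array([0.5, 0.5])
particles, loglik = bootstrap_filter([0, 0, 1], trans, emit, init)
```

In a particle Gibbs construction, a run like this (with one reference trajectory held fixed) serves as the transition kernel; the resampling step is also the part that parallelizes well across particles.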
For cross-document coreference resolution, we move beyond sequence modeling to
consider string-to-string transduction. We stipulate a generative process for a
corpus of documents in which entity names arise from copying---and optionally
transforming---previous names of the same entity. Our proposed model is
sensitive to both the context in which the names occur as well as their
spelling. The string-to-string transformations correspond to systematic
linguistic processes such as abbreviation, typos, and nicknaming, and by analogy
to biology, we think of them as mutations along the edges of a phylogeny. We
propose a novel block Gibbs sampler for this problem that alternates between
sampling an ordering of the mentions and a spanning tree relating all mentions
in the corpus.
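The spanning-tree view of name mentions can be illustrated with a deterministic simplification: score candidate parent-child edges by plain edit distance (a crude stand-in for the learned, context-sensitive mutation costs) and extract a minimum spanning tree with Prim's algorithm, rather than sampling trees as the thesis does:

```python
def edit_distance(a, b):
    """Levenshtein distance: a crude proxy for the cost of mutating
    one name into another (the thesis learns richer transformations
    such as abbreviation and nicknaming)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ca != cb))
    return d[len(b)]

def mst_over_mentions(mentions):
    """Prim's algorithm: repeatedly attach the out-of-tree mention
    with the cheapest edit-distance edge to the tree.  Returns the
    tree as (parent_index, child_index) pairs rooted at mention 0."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(mentions):
        _, i, j = min((edit_distance(mentions[i], mentions[j]), i, j)
                      for i in in_tree for j in range(len(mentions))
                      if j not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical mentions of one entity across documents.
names = ["Barack Obama", "Obama", "B. Obama", "Barak Obama"]
tree = mst_over_mentions(names)
```

Cheap edges in this graph correspond to small mutations (a typo, a dropped first name), so even this simplified tree tends to link variants of the same name, which is the intuition behind the phylogenetic model.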
Graphical Models with Structured Factors, Neural Factors, and Approximation-aware Training
This thesis broadens the space of rich yet practical models for structured prediction. We introduce a general framework for modeling with four ingredients: (1) latent variables, (2) structural constraints, (3) learned (neural) feature representations of the inputs, and (4) training that takes the approximations made during inference into account. The thesis builds up to this framework through an empirical study of three NLP tasks: semantic role labeling, relation extraction, and dependency parsing -- obtaining state-of-the-art results on the former two. We apply the resulting graphical models with structured and neural factors, and approximation-aware learning, to jointly model part-of-speech tags, a syntactic dependency parse, and semantic roles in a low-resource setting where the syntax is unobserved. We present an alternative view of these models as neural networks with a topology inspired by inference on graphical models that encode our intuitions about the data.
Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing
This thesis focuses on unsupervised dependency parsing: parsing sentences of a language into dependency trees without access to annotated training data for that language. Unlike most prior work, which uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector that serves as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token depends on the general syntax of the target language.
The fundamental question we are trying to answer is whether useful information about the syntax of a language can be inferred from its surface-form evidence (an unparsed corpus). This is the same question implicitly asked by previous papers on unsupervised parsing, which assume only that an unparsed corpus is available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which here consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system trained end-to-end on a variety of other languages, so it learns a feature extractor that works well.
This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works' interpretable typological features that require parsed corpora or expert categorization of the language.
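The surface-form feature extraction step can be sketched with a much cruder signature than the learned features in the thesis: a fixed-length vector of normalized POS-tag bigram frequencies. The toy tag inventory and example corpus below are assumptions of this sketch, but the input is the same kind of surface-form evidence (gold POS sequences, no parses):

```python
from itertools import product

TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET"]  # toy tag inventory

def syntactic_signature(corpus):
    """Map an unparsed corpus (a list of POS-tag sequences) to a
    fixed-length vector of normalized tag-bigram frequencies.
    The vector length is fixed by the tag inventory, independent
    of corpus size, so it can condition a language-agnostic parser."""
    bigrams = list(product(TAGS, TAGS))
    counts = dict.fromkeys(bigrams, 0)
    total = 0
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            if (a, b) in counts:
                counts[(a, b)] += 1
                total += 1
    return [counts[bg] / max(total, 1) for bg in bigrams]

# A toy 'language' where determiners precede nouns and verbs follow them.
corpus = [["DET", "NOUN", "VERB"],
          ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]]
sig = syntactic_signature(corpus)
```

Even a signature this simple encodes word-order tendencies (e.g. how often DET immediately precedes NOUN), which is the sort of typological evidence the thesis shows can be extracted automatically and fed to the parser.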