Two knowledge-based methods for High-Performance Sense Distribution Learning
Knowing the correct distribution of senses within a corpus can potentially boost the performance of Word Sense Disambiguation (WSD) systems by many points. We present two fully automatic and language-independent methods for computing the distribution of senses given a raw corpus of sentences. Intrinsic and extrinsic evaluations show that our methods outperform the current state of the art in sense distribution learning and the strongest baselines for the most frequent sense in multiple languages and on domain-specific test sets. Our sense distributions are available at http://trainomatic.org.
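As a rough illustration of the task (not the paper's actual knowledge-based methods), the sketch below estimates a per-lemma sense distribution from a raw corpus by crediting, for each occurrence, the candidate sense that best fits its sentence; candidate_senses and score_sense are hypothetical hooks standing in for a sense inventory and a context-relatedness score.

    from collections import Counter, defaultdict

    def sense_distribution(sentences, candidate_senses, score_sense):
        """Estimate a per-lemma sense distribution from a raw corpus.

        candidate_senses(lemma) -> list of sense IDs (e.g. from WordNet);
        score_sense(lemma, sense, sentence) -> float fit of the sense to
        the sentence context. Both are assumed hooks, not the paper's API.
        """
        counts = defaultdict(Counter)
        for sentence in sentences:          # sentence: list of lemmas
            for lemma in sentence:
                senses = candidate_senses(lemma)
                if not senses:
                    continue
                # Credit the sense that best fits this sentence's context.
                best = max(senses, key=lambda s: score_sense(lemma, s, sentence))
                counts[lemma][best] += 1
        # Normalise counts into per-lemma probability distributions.
        return {lemma: {s: c / sum(ctr.values()) for s, c in ctr.items()}
                for lemma, ctr in counts.items()}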
Unsupervised Formal Grammar Induction with Confidence
I present a novel algorithm for minimally supervised formal grammar induction using a linguistically motivated grammar formalism. This algorithm, called the Missing Link algorithm (ML), builds on classic chart parsing methods but uses a probabilistic confidence measure to keep track of potentially ambiguous lexical items. Because ML uses a structured grammar formalism, each step of the algorithm can be easily understood by linguists, making it ideal for studying the learnability of different linguistic phenomena. The algorithm requires minimal annotation in its training data, yet is capable of learning nuanced patterns from relatively small training sets and can be applied to a variety of grammar formalisms. Though evaluating an unsupervised syntactic model is difficult, I present an evaluation using the Corpus of Linguistic Acceptability and show state-of-the-art performance.
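The abstract gives no implementation detail, but the two ingredients it names, chart parsing and a probabilistic confidence measure over ambiguous lexical items, can be sketched as a confidence-weighted CKY pass; the lexicon, rule table, and product combination below are illustrative assumptions, not the Missing Link algorithm itself.

    from collections import defaultdict
    from itertools import product

    def cky_with_confidence(words, lexicon, rules):
        """Confidence-weighted CKY chart parsing (a loose sketch).

        lexicon: word -> {category: confidence} for ambiguous items.
        rules:   (left_cat, right_cat) -> parent_cat binary rules.
        Returns chart[(i, j)] = {category: best confidence for span i..j}.
        """
        n = len(words)
        chart = defaultdict(dict)
        for i, w in enumerate(words):
            chart[(i, i + 1)] = dict(lexicon.get(w, {}))
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):
                    for (lc, lconf), (rc, rconf) in product(
                            chart[(i, k)].items(), chart[(k, j)].items()):
                        parent = rules.get((lc, rc))
                        if parent is None:
                            continue
                        conf = lconf * rconf  # parse confidence as a product
                        if conf > chart[(i, j)].get(parent, 0.0):
                            chart[(i, j)][parent] = conf
        return chart

    # Example with an ambiguous lexical item: "dogs" could be N or V.
    lexicon = {"dogs": {"N": 0.7, "V": 0.3}, "bark": {"V": 0.6, "N": 0.4}}
    rules = {("N", "V"): "S"}
    print(cky_with_confidence(["dogs", "bark"], lexicon, rules)[(0, 2)])
    # -> {'S': 0.42}: the only rule-licensed reading, with its confidence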
Producing power-law distributions and damping word frequencies with two-stage language models
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes, the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process, that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.
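A minimal sketch of the adaptor stage: the two-parameter (Pitman-Yor) Chinese restaurant process below reuses previously generated types with probability proportional to their discounted counts, which is what produces the power-law token frequencies the abstract describes; base_draw stands in for the first-stage generator.

    import random
    from collections import Counter

    def pitman_yor_sample(n_tokens, d, theta, base_draw):
        """Two-stage sampler: base_draw() is the generator (any model
        over word types); the Pitman-Yor Chinese restaurant process is
        the adaptor reshaping its frequencies. d: discount in [0, 1);
        theta: concentration, theta > -d.
        """
        tables, labels, tokens = [], [], []
        for n in range(n_tokens):
            # New table with prob. (theta + d*K) / (n + theta), where K
            # is the current number of tables; otherwise join table k
            # with prob. (count_k - d) / (n + theta).
            r = random.uniform(0, n + theta)
            if r < theta + d * len(tables):
                tables.append(1)
                labels.append(base_draw())   # new table: fresh draw
                tokens.append(labels[-1])
            else:
                r -= theta + d * len(tables)
                for k, c in enumerate(tables):
                    r -= c - d
                    if r < 0:
                        tables[k] += 1       # reuse an earlier type
                        tokens.append(labels[k])
                        break
                else:  # guard against float rounding at the boundary
                    tables[-1] += 1
                    tokens.append(labels[-1])
        return tokens

    # A uniform generator over eight types comes out heavily skewed:
    words = pitman_yor_sample(10000, d=0.8, theta=1.0,
                              base_draw=lambda: random.choice("abcdefgh"))
    print(Counter(words).most_common(3))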
Minimal supervision for language learning: bootstrapping global patterns from local knowledge
A fundamental step in sentence comprehension involves assigning semantic roles to sentence constituents. To accomplish this, the listener must parse the sentence, find constituents that are candidate arguments, and assign semantic roles to those constituents. Each step depends on prior lexical and syntactic knowledge. Where do children begin in solving this problem when learning their first languages? To experiment with different representations that children may use to begin understanding language, we have built a computational model for this early point in language acquisition. This system, BabySRL, learns from transcriptions of natural child-directed speech and makes use of psycholinguistically plausible background knowledge and realistically noisy semantic feedback to begin to classify sentences at the level of "who does what to whom."

Starting with simple, psycholinguistically motivated representations of sentence structure, the BabySRL is able to learn from full semantic feedback, as well as from a supervision signal derived from partial semantic background knowledge. In addition, we combine the BabySRL with an unsupervised Hidden Markov Model part-of-speech tagger, linking clusters with syntactic categories using background noun knowledge so that they can be used to parse input for the SRL system. The results show that the proposed shallow representations of sentence structure are robust to reductions in parsing accuracy, and that the contribution of alternative representations of sentence structure to successful semantic role labeling varies with the integrity of the parsing and argument-identification stages. Finally, we enable the BabySRL to improve both an intermediate syntactic representation and its final semantic role classification. Using this system we show that it is possible for a simple learner in a plausible (noisy) setup to begin comprehending simple semantics when initialized with a small amount of concrete noun knowledge and some simple syntax-semantics mapping biases, before acquiring any specific verb knowledge.
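To make the "shallow representation" idea concrete, here is a hedged sketch of a noun-pattern feature of the kind such learners can use (the exact feature templates are assumptions): the position of a candidate argument among the known nouns of the sentence, e.g. first-of-two, which a linear classifier can come to associate with the agent role.

    def noun_pattern_features(tokens, is_noun, target_index):
        """Shallow 'noun pattern' feature for one candidate argument:
        the target noun's position among all known nouns in the
        sentence (e.g. first of two). is_noun is an assumed hook, e.g.
        membership in a small set of concrete nouns the learner knows.
        """
        nouns = [i for i, t in enumerate(tokens) if is_noun(t)]
        pos = nouns.index(target_index)
        return {f"noun_{pos + 1}_of_{len(nouns)}": 1.0}

    # "Who does what to whom": the first of two nouns is usually the
    # agent, so even this single feature is informative.
    tokens = ["the", "dog", "chases", "the", "cat"]
    known_nouns = {"dog", "cat"}
    print(noun_pattern_features(tokens, known_nouns.__contains__, 1))
    # -> {'noun_1_of_2': 1.0}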
Effective weakly supervised semantic frame induction using expression sharing in hierarchical hidden Markov models
We present a framework for the induction of semantic frames from utterances in the context of an adaptive command-and-control interface. The system is trained on an individual user's utterances and the corresponding semantic frames representing controls. During training, no prior information on the alignment between utterance segments and frame slots and values is available. In addition, semantic frames in the training data can contain information that is not expressed in the utterances. To tackle this weakly supervised classification task, we propose a framework based on Hidden Markov Models (HMMs). Structural modifications, resulting in a hierarchical HMM, and an extension called expression sharing are introduced to minimize the amount of training time and effort required from the user.

The dataset used for the present study is PATCOR, which contains commands uttered in the context of a vocally guided card game, Patience. Experiments were carried out on orthographic and phonetic transcriptions of commands, segmented at different levels of n-gram granularity. The experimental results show positive effects of all the studied system extensions, with some differences in effect between the input representations. Moreover, evaluation experiments on held-out data with the optimal system configuration show that the extended system is able to achieve high accuracies with relatively small amounts of training data.
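The hierarchical and expression-sharing extensions are specific to this work, but the underlying machinery is a standard HMM. As a sketch, the forward pass below computes the likelihood an HMM assigns to an utterance, with states imagined as frame slot/value pairs and observations as utterance segments; expression sharing would amount to tying emission parameters across states that can be voiced by the same wording, which is not shown here.

    import numpy as np

    def forward_loglik(obs, log_init, log_trans, log_emit):
        """Forward pass of an HMM: log-likelihood of one observation
        sequence. In the frame-induction setting, states would be frame
        slot/value pairs and observations the utterance's segments.

        obs: sequence of observation indices into the vocabulary.
        log_init: (S,), log_trans: (S, S), log_emit: (S, V) log-probs.
        """
        alpha = log_init + log_emit[:, obs[0]]
        for o in obs[1:]:
            # Sum over predecessor states in log space, then emit.
            alpha = np.logaddexp.reduce(alpha[:, None] + log_trans,
                                        axis=0) + log_emit[:, o]
        return np.logaddexp.reduce(alpha)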
Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation
We describe an implemented system for robust domain-independent syntactic parsing of English, using a unification-based grammar of part-of-speech and punctuation labels coupled with a probabilistic LR parser. We present evaluations of the system's performance along several different dimensions; these enable us to assess the contribution that each individual part makes to the success of the system as a whole, and thus to prioritise the effort to be devoted to its further enhancement. Currently, the system is able to parse around 80% of sentences in a substantial corpus of general text containing a number of distinct genres. On a random sample of 250 such sentences the system has a mean crossing bracket rate of 0.71 and recall and precision of 83% and 84% respectively when evaluated against manually disambiguated analyses.
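For reference, figures like those quoted above can be computed with a PARSEVAL-style evaluation. The sketch below treats parses as sets of unlabelled (start, end) spans, one common convention among several, and returns the mean number of crossing brackets per sentence together with bracket recall and precision against the gold analyses.

    def bracket_scores(gold, test):
        """PARSEVAL-style bracket evaluation. gold and test are
        per-sentence lists of unlabelled (start, end) spans; span and
        label conventions vary between implementations, so treat this
        as one plausible reading of the quoted metrics.
        """
        crossing = matched = n_gold = n_test = 0
        for g, t in zip(gold, test):
            n_gold += len(g)
            n_test += len(t)
            matched += len(set(g) & set(t))
            for ts, te in t:
                # A test bracket crosses a gold bracket if the two
                # overlap without either containing the other.
                if any(ts < gs < te < ge or gs < ts < ge < te
                       for gs, ge in g):
                    crossing += 1
        return crossing / len(test), matched / n_gold, matched / n_test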