Modeling Global Syntactic Variation in English Using Dialect Classification
This paper evaluates global-scale dialect identification for 14 national
varieties of English as a means for studying syntactic variation. The paper
makes three main contributions: (i) introducing data-driven language mapping as
a method for selecting the inventory of national varieties to include in the
task; (ii) producing a large and dynamic set of syntactic features using
grammar induction rather than focusing on a few hand-selected features such as
function words; and (iii) comparing models across both web corpora and social
media corpora in order to measure the robustness of syntactic variation across
registers.
Learning unification-based grammars using the Spoken English Corpus
This paper describes a grammar learning system that combines model-based and
data-driven learning within a single framework. Our results from learning
grammars using the Spoken English Corpus (SEC) suggest that combined
model-based and data-driven learning can produce a more plausible grammar than
is the case when using either learning style in isolation. Comment: 10 page
Unsupervised Extraction of Representative Concepts from Scientific Literature
This paper studies the automated categorization and extraction of scientific
concepts from titles of scientific articles, in order to gain a deeper
understanding of their key contributions and facilitate the construction of a
generic academic knowledgebase. Towards this goal, we propose an unsupervised,
domain-independent, and scalable two-phase algorithm to type and extract key
concept mentions into aspects of interest (e.g., Techniques, Applications,
etc.). In the first phase of our algorithm we propose PhraseType, a
probabilistic generative model which exploits textual features and limited POS
tags to broadly segment text snippets into aspect-typed phrases. We extend this
model to simultaneously learn aspect-specific features and identify academic
domains in multi-domain corpora, since the two tasks mutually enhance each
other. In the second phase, we propose an approach based on adaptor grammars to
extract fine-grained concept mentions from the aspect-typed phrases without the
need for any external resources or human effort, in a purely data-driven
manner. We apply our technique to study literature from diverse scientific
domains and show significant gains over state-of-the-art concept extraction
techniques. We also present a qualitative analysis of the results obtained. Comment: Published as a conference paper at CIKM 201
Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing
In a lexicalized grammar formalism such as Lexicalized Tree-Adjoining Grammar
(LTAG), each lexical item is associated with at least one elementary structure
(supertag) that localizes syntactic and semantic dependencies. Thus a parser
for a lexicalized grammar must search a large set of supertags to choose the
right ones to combine for the parse of the sentence. We present techniques for
disambiguating supertags using local information such as lexical preference and
local lexical dependencies. The similarity between LTAG and Dependency grammars
is exploited in the dependency model of supertag disambiguation. The
performance results for various models of supertag disambiguation such as
unigram, trigram and dependency-based models are presented. Comment: ps file. 8 page
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency. Comment: 27 page technical repor