Adapting a general parser to a sublanguage
In this paper, we propose a method to adapt a general parser (Link Parser) to
sublanguages, focusing on the parsing of texts in biology. Our main proposal is
the use of terminology (identification and analysis of terms) in order to reduce
the complexity of the text to be parsed. Several other strategies are explored
and finally combined, among them text normalization, extensions to the lexicon
and the morpho-guessing module, and adaptation of the grammar rules. We compare
the parsing results before and after these adaptations.
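The idea of using terminology to simplify the parser's input can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a term list is already available, and the function name `collapse_terms`, the underscore-joining convention, and the example terms and sentence are all hypothetical.

```python
import re

def collapse_terms(sentence, terms):
    """Replace each known multiword term with a single underscore-joined token."""
    # Process longer terms first so a long term wins over any term it contains.
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        sentence = re.sub(pattern, term.replace(" ", "_"), sentence)
    return sentence

# Hypothetical term list and sentence, for illustration only.
terms = ["sigma factor", "RNA polymerase"]
text = "The sigma factor binds RNA polymerase in vitro."
collapsed = collapse_terms(text, terms)
print(collapsed)
```

After collapsing, the parser sees fewer tokens, and each term can be treated as a single unit rather than as several words to be linked.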
Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches
We study the adaptation of Link Grammar Parser to the biomedical sublanguage
with a focus on domain terms not found in a general parser lexicon. Using two
biomedical corpora, we implement and evaluate three approaches to addressing
unknown words: automatic lexicon expansion, the use of morphological clues, and
disambiguation using a part-of-speech tagger. We evaluate each approach
separately for its effect on parsing performance and consider combinations of
these approaches. In addition to a 45% increase in parsing efficiency, we find
that the best approach, incorporating information from a domain part-of-speech
tagger, offers a statistically significant 10% relative decrease in error. The
adapted parser is available under an open-source license at
http://www.it.utu.fi/biolg
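As a rough illustration of what using morphological clues for unknown words might look like, the sketch below guesses a coarse word category from a suffix when a word is missing from the lexicon. It is a minimal sketch under stated assumptions: the suffix table, the category labels, and the function name `guess_category` are hypothetical and are not taken from the paper or from Link Grammar Parser.

```python
# Hypothetical suffix rules, for illustration only.
SUFFIX_RULES = [
    ("ation", "noun"),
    ("ated", "verb"),
    ("ase", "noun"),
    ("ic", "adjective"),
]

def guess_category(word, lexicon):
    """Return the lexicon category if the word is known, else guess from its suffix."""
    lowered = word.lower()
    if lowered in lexicon:
        return lexicon[lowered]
    # Try longer suffixes first so the more specific rule applies.
    for suffix, category in sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True):
        if lowered.endswith(suffix):
            return category
    return "unknown"

# Hypothetical general-purpose lexicon with no domain terms.
lexicon = {"binds": "verb", "the": "determiner"}
for word in ["kinase", "phosphorylated", "cytoplasmic", "binds"]:
    print(word, guess_category(word, lexicon))
```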
Comparing and evaluating extended Lambek calculi
Lambek's Syntactic Calculus, commonly referred to as the Lambek calculus, was
innovative in many ways, notably as a precursor of linear logic. But it also
showed that we could treat our grammatical framework as a logic (as opposed to
a logical theory). However, though it was successful in giving at least a basic
treatment of many linguistic phenomena, it was also clear that a slightly more
expressive logical calculus was needed for many other cases. Therefore, many
extensions and variants of the Lambek calculus have been proposed, since the
eighties and up until the present day. As a result, there is now a large class
of calculi, each with its own empirical successes and theoretical results, but
also each with its own logical primitives. This raises the question: how do we
compare and evaluate these different logical formalisms? To answer this
question, I present two unifying frameworks for these extended Lambek calculi.
Both are proof net calculi with graph contraction criteria. The first calculus
is a very general system: you specify the structure of your sequents and it
gives you the connectives and contractions which correspond to it. The calculus
can be extended with structural rules, which translate directly into graph
rewrite rules. The second calculus is first-order (multiplicative
intuitionistic) linear logic, which turns out to have several other,
independently proposed extensions of the Lambek calculus as fragments. I will
illustrate the use of each calculus in building bridges between analyses
proposed in different frameworks, in highlighting differences and in helping to
identify problems.
Comment: Empirical advances in categorial grammars, Aug 2015, Barcelona, Spain. 201
Unsupervised Extraction of Representative Concepts from Scientific Literature
This paper studies the automated categorization and extraction of scientific
concepts from titles of scientific articles, in order to gain a deeper
understanding of their key contributions and facilitate the construction of a
generic academic knowledgebase. Towards this goal, we propose an unsupervised,
domain-independent, and scalable two-phase algorithm to type and extract key
concept mentions into aspects of interest (e.g., Techniques, Applications,
etc.). In the first phase of our algorithm we propose PhraseType, a
probabilistic generative model which exploits textual features and limited POS
tags to broadly segment text snippets into aspect-typed phrases. We extend this
model to simultaneously learn aspect-specific features and identify academic
domains in multi-domain corpora, since the two tasks mutually enhance each
other. In the second phase, we propose an approach based on adaptor grammars to
extract fine grained concept mentions from the aspect-typed phrases without the
need for any external resources or human effort, in a purely data-driven
manner. We apply our technique to study literature from diverse scientific
domains and show significant gains over state-of-the-art concept extraction
techniques. We also present a qualitative analysis of the results obtained.
Comment: Published as a conference paper at CIKM 201