35,919 research outputs found

    Adapting a general parser to a sublanguage

    In this paper, we propose a method to adapt a general parser (Link Parser) to sublanguages, focusing on the parsing of texts in biology. Our main proposal is the use of terminology (identification and analysis of terms) to reduce the complexity of the text to be parsed. Several other strategies are explored and finally combined, among which are text normalization, lexicon and morpho-guessing module extensions, and grammar rule adaptation. We compare the parsing results before and after these adaptations.
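    The abstract does not spell out how terms are used to simplify the input, so the following is only a minimal sketch of the general idea, with an invented term list and placeholder scheme: multi-word domain terms are collapsed into single atomic tokens before the text reaches the parser, so the parser no longer has to analyse their internal structure.

        # Minimal sketch (illustrative term list and placeholders, not the paper's
        # resources): collapse multi-word biology terms into single tokens so a
        # general parser can treat them as atomic units.

        DOMAIN_TERMS = {
            "tumor necrosis factor": "TERM_PROTEIN",
            "gene expression": "TERM_PROCESS",
        }

        def collapse_terms(sentence, terms):
            """Replace each known multi-word term with a placeholder token,
            longest terms first so overlapping terms do not clash."""
            for term in sorted(terms, key=len, reverse=True):
                sentence = sentence.replace(term, terms[term])
            return sentence

        if __name__ == "__main__":
            raw = "tumor necrosis factor regulates gene expression in macrophages"
            print(collapse_terms(raw, DOMAIN_TERMS))
            # -> "TERM_PROTEIN regulates TERM_PROCESS in macrophages"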

    Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches

    We study the adaptation of the Link Grammar Parser to the biomedical sublanguage with a focus on domain terms not found in a general parser lexicon. Using two biomedical corpora, we implement and evaluate three approaches to addressing unknown words: automatic lexicon expansion, the use of morphological clues, and disambiguation using a part-of-speech tagger. We evaluate each approach separately for its effect on parsing performance and consider combinations of these approaches. In addition to a 45% increase in parsing efficiency, we find that the best approach, incorporating information from a domain part-of-speech tagger, offers a statistically significant 10% relative decrease in error. The adapted parser is available under an open-source license at http://www.it.utu.fi/biolg
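    As a rough illustration of how a POS tagger and morphological clues can be combined for out-of-lexicon words (the suffix table and fallback policy below are assumptions for the sketch, not the paper's biolg resources):

        # Hedged sketch: guess a coarse category for a word missing from the
        # parser lexicon, preferring a domain POS tagger's output when present
        # and falling back to simple morphological (suffix) clues.

        SUFFIX_GUESSES = {       # illustrative morpho-guessing rules
            "ase": "NOUN",       # kinase, polymerase, ...
            "itis": "NOUN",      # hepatitis, ...
            "ated": "VERB",      # phosphorylated, ...
        }

        def guess_category(word, pos_tag=None):
            """Return a coarse category for an out-of-lexicon word."""
            if pos_tag is not None:           # trust the domain POS tagger first
                return pos_tag
            for suffix, cat in SUFFIX_GUESSES.items():
                if word.endswith(suffix):
                    return cat
            return "NOUN"                     # unknown biomedical tokens are usually nominal

        print(guess_category("phosphorylated"))        # -> VERB (suffix clue)
        print(guess_category("IL-2", pos_tag="NOUN"))  # -> NOUN (tagger wins)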

    Comparing and evaluating extended Lambek calculi

    Lambek's Syntactic Calculus, commonly referred to as the Lambek calculus, was innovative in many ways, notably as a precursor of linear logic. But it also showed that we could treat our grammatical framework as a logic (as opposed to a logical theory). However, though it was successful in giving at least a basic treatment of many linguistic phenomena, it was also clear that a slightly more expressive logical calculus was needed for many other cases. Therefore, many extensions and variants of the Lambek calculus have been proposed, from the eighties up until the present day. As a result, there is now a large class of calculi, each with its own empirical successes and theoretical results, but also each with its own logical primitives. This raises the question: how do we compare and evaluate these different logical formalisms? To answer this question, I present two unifying frameworks for these extended Lambek calculi. Both are proof net calculi with graph contraction criteria. The first calculus is a very general system: you specify the structure of your sequents and it gives you the connectives and contractions which correspond to it. The calculus can be extended with structural rules, which translate directly into graph rewrite rules. The second calculus is first-order (multiplicative intuitionistic) linear logic, which turns out to have several other, independently proposed extensions of the Lambek calculus as fragments. I will illustrate the use of each calculus in building bridges between analyses proposed in different frameworks, in highlighting differences and in helping to identify problems.
    Comment: Empirical advances in categorial grammars, Aug 2015, Barcelona, Spain. 201
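    For readers unfamiliar with the formalism, here is a standard textbook illustration (not taken from the abstract) of the kind of analysis the Lambek calculus supports: words are assigned directional types, which combine by slash elimination to derive a sentence.

        % Standard illustrative Lambek-calculus derivation (textbook example,
        % not from the paper): with lexical type assignments
        %   John : np,   likes : (np\s)/np,   Mary : np
        % the sequent  np, (np\s)/np, np => s  is derivable:
        \[
          \frac{\textit{likes} : (np\backslash s)/np \qquad \textit{Mary} : np}
               {\textit{likes Mary} : np\backslash s}\;[/E]
          \qquad
          \frac{\textit{John} : np \qquad \textit{likes Mary} : np\backslash s}
               {\textit{John likes Mary} : s}\;[\backslash E]
        \]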

    Unsupervised Extraction of Representative Concepts from Scientific Literature

    This paper studies the automated categorization and extraction of scientific concepts from titles of scientific articles, in order to gain a deeper understanding of their key contributions and facilitate the construction of a generic academic knowledge base. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications). In the first phase of our algorithm, we propose PhraseType, a probabilistic generative model which exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine-grained concept mentions from the aspect-typed phrases without the need for any external resources or human effort, in a purely data-driven manner. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained.
    Comment: Published as a conference paper at CIKM 201
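    To make the two-phase interface concrete, here is a toy illustration only: the heuristic segmentation and keyword-based typing below stand in for the paper's PhraseType model and adaptor-grammar extraction, which are not reproduced here.

        # Toy stand-in for the two-phase pipeline's interface (illustrative cue
        # lists and splitting rule, not the paper's models): segment a title into
        # phrase-like chunks and guess a coarse aspect label for each.

        import re

        TECHNIQUE_CUES = {"model", "algorithm", "network", "grammar", "parser"}
        APPLICATION_CUES = {"extraction", "classification", "translation", "parsing"}

        def type_phrases(title):
            """Split a title on common function words and type each chunk."""
            chunks = [c.strip() for c in
                      re.split(r"\b(?:for|of|with|using|from|in)\b", title.lower())
                      if c.strip()]
            typed = []
            for chunk in chunks:
                words = set(chunk.split())
                if words & TECHNIQUE_CUES:
                    typed.append((chunk, "Technique"))
                elif words & APPLICATION_CUES:
                    typed.append((chunk, "Application"))
                else:
                    typed.append((chunk, "Other"))
            return typed

        print(type_phrases("Unsupervised Extraction of Representative Concepts from Scientific Literature"))
        # -> [('unsupervised extraction', 'Application'),
        #     ('representative concepts', 'Other'),
        #     ('scientific literature', 'Other')]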