1,269 research outputs found

    Recovering non-local dependencies for Chinese

    To date, work on Non-Local Dependencies (NLDs) has focused almost exclusively on English, and it is an open research question how well these approaches migrate to other languages. This paper surveys non-local dependency constructions in Chinese as represented in the Penn Chinese Treebank (CTB) and provides an approach for generating proper predicate-argument-modifier structures, including NLDs, from surface context-free phrase structure trees. Our approach recovers non-local dependencies at the level of Lexical-Functional Grammar f-structures, using automatically acquired subcategorisation frames and f-structure paths linking antecedents and traces in NLDs. Currently our algorithm achieves 92.2% f-score for trace insertion and 84.3% for antecedent recovery when evaluating on gold-standard CTB trees, and 64.7% and 54.7%, respectively, on the output trees of a CTB-trained state-of-the-art parser.
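
    The abstract describes a two-stage recovery procedure: traces are inserted where an automatically acquired subcategorisation frame expects a grammatical function that is missing, and antecedents are then found along f-structure paths. A minimal Python sketch of that general idea follows; the frames, path probabilities and f-structure encoding are toy placeholders invented for illustration, not the authors' resources or code.

        # Illustrative two-step non-local dependency recovery:
        # (1) insert a trace when a predicate's subcat frame expects a missing function,
        # (2) choose an antecedent by following candidate f-structure paths.
        # The resources below are toy stand-ins for frames/paths acquired from the CTB.

        SUBCAT_FRAMES = {"write": {"SUBJ", "OBJ"}}  # predicate -> required functions

        # Hypothetical probabilities of paths linking an antecedent to a trace.
        PATH_PROBS = {
            "OBJ": [(("TOPIC",), 0.7), (("ADJUNCT", "SUBJ"), 0.2)],
            "SUBJ": [(("TOPIC",), 0.5), (("XCOMP", "SUBJ"), 0.4)],
        }

        def insert_traces(fstructure):
            """Step 1: functions required by the subcat frame but absent from the f-structure."""
            return SUBCAT_FRAMES.get(fstructure.get("PRED"), set()) - set(fstructure)

        def resolve_antecedent(fstructure, missing_gf):
            """Step 2: try candidate f-structure paths in order of decreasing probability."""
            for path, prob in sorted(PATH_PROBS.get(missing_gf, []), key=lambda p: -p[1]):
                node = fstructure
                for attr in path:
                    node = node.get(attr) if node else None
                if node is not None:
                    return path, node, prob
            return None

        if __name__ == "__main__":
            # A topicalised object ("That book, he wrote"): OBJ trace bound by TOPIC.
            fs = {"PRED": "write", "SUBJ": {"PRED": "he"}, "TOPIC": {"PRED": "book"}}
            for gf in insert_traces(fs):
                print(gf, "->", resolve_antecedent(fs, gf))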

    Treebank-based acquisition of LFG resources for Chinese

    This paper presents a method to automatically acquire wide-coverage, robust, probabilistic Lexical-Functional Grammar resources for Chinese from the Penn Chinese Treebank (CTB). Our starting point is the earlier, proof-of-concept work of Burke et al. (2004) on automatic f-structure annotation, LFG grammar acquisition and parsing for Chinese using the CTB version 2 (CTB2). We substantially extend and improve on this earlier research as regards coverage, robustness, quality and fine-grainedness of the resulting LFG resources. We achieve this through (i) improved LFG analyses for a number of core Chinese phenomena; (ii) a new automatic f-structure annotation architecture which involves an intermediate dependency representation; (iii) scaling the approach from 4.1K trees in CTB2 to 18.8K trees in CTB version 5.1 (CTB5.1); and (iv) developing a novel treebank-based approach to recovering non-local dependencies (NLDs) for Chinese parser output. Against a new 200-sentence gold standard of manually constructed f-structures, the method achieves 96.00% f-score for f-structures automatically generated for the original CTB trees and 80.01% for NLD-recovered f-structures generated for the trees output by Bikel's parser.
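
    The f-scores quoted here (and in several related entries below) are dependency f-scores: f-structures are broken down into dependency triples and compared against a gold standard. A small illustrative sketch of that evaluation, with invented example triples:

        # Dependency-triple evaluation: flatten f-structures to (relation, head, dependent)
        # triples and score precision/recall/f-score against the gold set.
        # The example triples below are invented.

        def prf(gold, test):
            gold, test = set(gold), set(test)
            correct = len(gold & test)
            precision = correct / len(test) if test else 0.0
            recall = correct / len(gold) if gold else 0.0
            if precision + recall == 0:
                return 0.0, 0.0, 0.0
            return precision, recall, 2 * precision * recall / (precision + recall)

        gold = [("subj", "buy", "he"), ("obj", "buy", "book"), ("adjunct", "buy", "yesterday")]
        test = [("subj", "buy", "he"), ("obj", "buy", "book")]

        print(prf(gold, test))  # (1.0, 0.666..., 0.8)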

    Improving Syntactic Parsing of Clinical Text Using Domain Knowledge

    Syntactic parsing is one of the fundamental tasks of Natural Language Processing (NLP). However, few studies have explored syntactic parsing in the medical domain. This dissertation systematically investigated different methods to improve the performance of syntactic parsing of clinical text, including (1) constructing two clinical treebanks of discharge summaries and progress notes by developing annotation guidelines that handle missing elements in clinical sentences; (2) retraining four state-of-the-art parsers, namely the Stanford, Berkeley, Charniak and Bikel parsers, on the clinical treebanks and comparing their performance to identify better parsing approaches; and (3) developing new methods to reduce the syntactic ambiguity caused by Prepositional Phrase (PP) attachment and coordination using semantic information. Our evaluation showed that the clinical treebanks greatly improved the performance of existing parsers. The Berkeley parser achieved the best F-1 score of 86.39% on the MiPACQ treebank. For PP attachment, our proposed methods improved attachment accuracy by 2.35% on the MiPACQ corpus and 1.77% on the i2b2 corpus. For coordination, our method achieved precisions of 94.9% and 90.3% on the MiPACQ and i2b2 corpora, respectively. To further demonstrate the effectiveness of the improved parsing approaches, we applied the outputs of our parsers to two external NLP tasks: semantic role labeling and temporal relation extraction. The experimental results showed that the performance of both tasks was improved by using the parse tree information from our optimized parsers, with an improvement of 3.26% in F-measure for semantic role labeling and 1.5% in F-measure for temporal relation extraction.
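
    The PP-attachment component uses semantic information to choose between noun and verb attachment. The following toy sketch illustrates the general idea of semantically informed attachment decisions; the semantic classes, counts and example phrases are invented placeholders, not the dissertation's actual method or data.

        # Toy sketch of semantically informed PP attachment: map words to coarse
        # semantic classes (e.g. clinical types) and attach the PP to whichever
        # head shows the stronger association in training counts.
        # Lexicon, counts and examples are invented placeholders.

        SEMANTIC_CLASS = {
            "pain": "FINDING",
            "chest": "BODY_PART",
            "treated": "PROCEDURE_VERB",
            "antibiotics": "DRUG",
        }

        # (head class, preposition, PP-object class) -> training count per attachment site
        NOUN_ATTACH = {("FINDING", "in", "BODY_PART"): 40}
        VERB_ATTACH = {("PROCEDURE_VERB", "with", "DRUG"): 55}

        def attach(verb, noun, prep, pp_obj):
            """Return 'noun' or 'verb' depending on which association is stronger."""
            n_score = NOUN_ATTACH.get((SEMANTIC_CLASS.get(noun), prep, SEMANTIC_CLASS.get(pp_obj)), 0)
            v_score = VERB_ATTACH.get((SEMANTIC_CLASS.get(verb), prep, SEMANTIC_CLASS.get(pp_obj)), 0)
            return "noun" if n_score >= v_score else "verb"

        print(attach("treated", "pain", "in", "chest"))          # noun: "pain in (the) chest"
        print(attach("treated", "pain", "with", "antibiotics"))  # verb: "treated ... with antibiotics"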

    Hybrid tag-set for natural language processing.

    Leung Wai Kwong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 90-95). Abstracts in English and Chinese.
    Contents:
    1. Introduction: Motivation; Objective; Organization of the thesis.
    2. Background: Chinese Noun Phrase Parsing; Chinese Noun Phrases; Problems with Syntactic Parsing (Conjunctive Noun Phrases, De-de Noun Phrases, Compound Noun Phrases); Observations (Inadequacy in Part-of-Speech Categorization for Chinese NLP, The Need for Semantics in Noun Phrase Parsing); Summary.
    3. Hybrid Tag-set: Objectives (Resolving Parsing Ambiguities, Investigation of Nominal Compound Noun Phrases); Definition of the Hybrid Tag-set; Introduction to Cilin; Problems with Cilin (Unknown Words, Multiple Semantic Classes); Introduction to Chinese Word Formation (Disyllabic Word Formation, Polysyllabic Word Formation, Observation); Automatic Assignment of Hybrid Tags to Chinese Words; Summary.
    4. Automatic Semantic Assignment: Previous Research on Semantic Tagging; SAUW - Automatic Semantic Assignment of Unknown Words (POS-to-SC Association, Morphology-based Deduction, Di-syllabic Word Analysis, Poly-syllabic Word Analysis); Illustrative Examples; Evaluation and Analysis (Experiments, Error Analysis); Summary.
    5. Word Sense Disambiguation: Introduction to Word Sense Disambiguation; Previous Work on Word Sense Disambiguation (Linguistic-based Approaches, Corpus-based Approaches); Our Approach (Bi-gram Co-occurrence Probabilities, Tri-gram Co-occurrence Probabilities, Design Considerations, Error Analysis); Summary.
    6. Hybrid Tag-set for Chinese Noun Phrase Parsing: Resolving Ambiguous Noun Phrases (Experiment, Results); Summary.
    7. Conclusion: Summary; Difficulties Encountered (Lack of a Training Corpus, Features of Chinese Word Formation, Problems with Linguistic Sources); Contributions (Enrichment of the Cilin, Enhancement of Syntactic Parsing); Further Research (Investigation into Words that Undergo Semantic Changes, Incorporation of More Information into the Hybrid Tag-set).
    Appendices: A. POS Tag-set by Tsinghua University (清華大學); B. Morphological Rules; C. Syntactic Rules for Di-syllabic Word Formation.

    The Influence of Pseudo-relatives on Attachment Preferences in Spanish

    This paper presents the results of an off-line experiment on the extent to which the availability of pseudo-relatives modulates attachment preferences in Spanish. Participants were presented with sentences in which different syntactic and semantic factors had been manipulated to allow for either both a pseudo-relative (PR) and a relative-clause (RC) reading or an RC reading only. All the experimental items included two potential antecedents with which the constituents of interest could be associated. The experimental items can be divided into four groups: group 1 consists of stimuli allowing for a double reading in direct object position, and groups 2, 3 and 4 consist of stimuli containing RCs in prepositional complement position, preverbal subject position, and postverbal subject position, respectively. A stronger preference for the higher antecedent was expected in the first group of experimental items. The results indicate that the availability of pseudo-relatives seems to influence attachment preferences; however, the results of the statistical comparison of groups 3 and 4 require further investigation.

    Neural Combinatory Constituency Parsing

    Doctoral thesis, Tokyo Metropolitan University (東京都立大学), Doctor of Philosophy (Information Science).

    Treebank-based acquisition of Chinese LFG resources for parsing and generation

    This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parse new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs, extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure-annotated treebank, I develop a PCFG-based chart generator and a new n-gram based pure dependency generator to realise Chinese sentences from LFG f-structures. The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and the PCFG- and dependency n-gram-based generation models, are largely language- and formalism-independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG.
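
    One of the generators mentioned is an n-gram based pure dependency generator that realises sentences from f-structures. The sketch below illustrates the core idea of n-gram based realisation, scoring candidate orderings of a head and its dependents with a bigram model and keeping the best; the bigram scores and words are invented, and a real generator would be considerably more involved.

        # N-gram based realisation: enumerate orderings of a head and its dependents
        # and keep the one the bigram model scores highest. Scores are invented;
        # a real generator would use probabilities estimated from training data.

        from itertools import permutations

        BIGRAM_LOGPROB = {
            ("<s>", "he"): -0.5, ("he", "likes"): -0.4,
            ("likes", "books"): -0.3, ("books", "</s>"): -0.6,
        }
        UNSEEN = -5.0  # back-off score for unseen bigrams

        def score(words):
            seq = ("<s>", *words, "</s>")
            return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(seq, seq[1:]))

        def realise(head, dependents):
            """Pick the highest-scoring linear order of the head and its dependents."""
            return max(permutations((head, *dependents)), key=score)

        print(realise("likes", ["he", "books"]))  # ('he', 'likes', 'books')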

    Treebank-based grammar acquisition for German

    Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic representations from labelled data (treebanks) and can be applied to large data sets. My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al., 2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005). While these first results have been promising, a number of important research questions have been raised. The original approach, first presented in Cahill et al. (2002), is strongly tailored to English and the data structures provided by the Penn-II treebank (Marcus et al., 1993). English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards the data structures and encoding schemes underlying the grammar acquisition task. In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition. I present experiments investigating the influence of treebank design on PCFG parsing and show which types of representation are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new test suite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights into the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.
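
    The cross-treebank comparison described here measures the effect of controlled error insertion on treebank trees and parser output. The following rough sketch shows what such an error-insertion step could look like; the tree encoding, the single error type (label substitution) and the error rate are illustrative assumptions, not the thesis's actual experimental setup.

        # Controlled error insertion: relabel a fraction of nonterminal nodes in a
        # tree and later measure how far the evaluation score drops. The node
        # encoding, error type and rate are illustrative only.

        import random

        def make_node(label, children=None):
            return {"label": label, "children": children or []}

        def nonterminals(node):
            if node["children"]:
                yield node
                for child in node["children"]:
                    yield from nonterminals(child)

        def insert_errors(tree, rate, labels=("NP", "VP", "PP"), seed=0):
            """Relabel roughly `rate` of the nonterminal nodes with a random wrong label."""
            rng = random.Random(seed)
            changed = 0
            for node in nonterminals(tree):
                if rng.random() < rate:
                    node["label"] = rng.choice([l for l in labels if l != node["label"]])
                    changed += 1
            return changed

        # Toy tree for "er sieht sie": (S (NP er) (VP sieht (NP sie)))
        tree = make_node("S", [
            make_node("NP", [make_node("er")]),
            make_node("VP", [make_node("sieht"), make_node("NP", [make_node("sie")])]),
        ])

        print(insert_errors(tree, rate=0.5), "node labels corrupted")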

    Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations

    Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate as much as possible the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks. The Penn-II treebank is a large collection of parse-annotated newspaper text. We have designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures. We have designed and implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002) and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation. We have designed and implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures (“proto f-structures”) with long-distance dependencies (LDDs) unresolved (Cahill et al., 2002c). We have designed and implemented a method for automatically resolving LDDs at f-structure level based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b). To date, the best result achieved by our own Penn-II induced grammars is a dependency f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004). The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve a dependency f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a). We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parses) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.
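
    The pipeline model described here parses raw text with a PCFG and then applies automatic annotation to produce proto f-structures, which are subsequently passed to LDD resolution. A highly simplified sketch of that pipeline shape follows; the hard-wired "parser", the single annotation rule and the example sentence are placeholders for the actual annotation algorithm and external parsers.

        # Skeleton of the pipeline architecture: (1) parse raw text to a c-structure,
        # (2) apply annotation rules mapping tree configurations to f-structure
        # features, (3) hand the proto f-structure on to LDD resolution.
        # The parser stub and the single annotation rule are placeholders.

        def parse(sentence):
            """Stand-in for an external PCFG / treebank-trained parser."""
            # Pretend the parser returned (S (NP John) (VP (V resigned)))
            return ("S", [("NP", ["John"]), ("VP", [("V", ["resigned"])])])

        def annotate(tree):
            """Build a (proto) f-structure from a c-structure with toy annotation rules."""
            label, children = tree
            fstr = {}
            for child in children:
                if isinstance(child, str):           # terminal: contributes the PRED value
                    fstr["PRED"] = child
                elif child[0] == "NP" and label == "S":
                    fstr["SUBJ"] = annotate(child)   # rule: NP daughter of S is the SUBJ
                else:
                    fstr.update(annotate(child))     # rule: head daughters pass features up
            return fstr

        def resolve_ldds(fstr):
            """Stand-in for long-distance dependency resolution over the proto f-structure."""
            return fstr

        proto = annotate(parse("John resigned"))
        print(resolve_ldds(proto))  # {'SUBJ': {'PRED': 'John'}, 'PRED': 'resigned'}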