
    Linking named entities to Wikipedia

    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to, or, if the KB does not contain the correct entry, to return NIL. Entity linking systems can be complex, so we present a framework for analysing their components. We use this framework to analyse three seminal systems evaluated on a common dataset, and we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research, and we report on our submissions to its entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities, and model its syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We then generalise from apposition to local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, but it also allows us to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
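
    As a rough illustration of the linking step described above, the Python sketch below resolves a mention by looking up candidates in an alias table and scoring them against the words near the mention, returning NIL when nothing matches. The alias table, descriptions, helper name and threshold are hypothetical placeholders, not the thesis's actual system.

        # Minimal entity-linking sketch: alias-based candidate search plus
        # context-overlap scoring, returning NIL when nothing matches well.
        # The alias table, descriptions and threshold are illustrative only.
        def link_mention(mention, context_tokens, alias_table, descriptions, threshold=0.1):
            """Return the best-matching KB entry for `mention`, or 'NIL'."""
            candidates = alias_table.get(mention.lower(), [])   # precise search step
            if not candidates:
                return "NIL"
            best_entry, best_score = "NIL", threshold
            context = set(context_tokens)
            for entry in candidates:
                # Score a candidate by overlap between its KB description and
                # the words near the mention (a stand-in for local description).
                desc = set(descriptions.get(entry, "").lower().split())
                score = len(desc & context) / (len(desc) + 1)
                if score > best_score:
                    best_entry, best_score = entry, score
            return best_entry

        alias_table = {"paris": ["Paris", "Paris,_Texas"]}
        descriptions = {"Paris": "capital of france", "Paris,_Texas": "city in texas"}
        print(link_mention("Paris", ["the", "capital", "of", "france"],
                           alias_table, descriptions))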

    Computational Approaches to the Syntax–Prosody Interface: Using Prosody to Improve Parsing

    Prosody has strong ties with syntax, since prosody can be used to resolve some syntactic ambiguities. Syntactic ambiguities have been shown to negatively impact automatic syntactic parsing, so there is reason to believe that prosodic information can help improve parsing. This dissertation considers a number of approaches that computationally examine the relationship between prosody and syntax in natural languages, while also addressing the role of syntactic phrase length, with the ultimate goal of using prosody to improve parsing. Chapter 2 examines the effect of syntactic phrase length on prosody in doubly center-embedded sentences in French. Data collected in a previous study were reanalyzed using native speaker judgement and automatic methods (forced alignment). The results demonstrate prosodic splitting behavior similar to that found in English, contradicting the original study's findings. Chapter 3 presents a number of studies examining whether syntactic ambiguity can yield different prosodic patterns, allowing humans and/or computers to resolve the ambiguity. In an experimental study, humans disambiguated sentences with prepositional phrase (PP) attachment ambiguity with 49% accuracy when the sentences were presented as text and 63% when they were presented as audio; machine learning on the same data yielded an accuracy of 63-73%. A corpus study on the Switchboard corpus used both prosodic breaks and phrase lengths to predict the attachment, with an accuracy of 63.5% for PP-attachment sentences and 71.2% for relative clause attachment. Chapter 4 aims to identify aspects of syntax that relate to prosody and to use these, in combination with prosodic cues, to improve parsing. The aspects identified (dependency configurations) are based on dependency structure, reflecting the relative head location of two consecutive words, and are used as syntactic features in an ensemble system based on Recurrent Neural Networks that scores parse hypotheses and selects the most likely parse for a given sentence. Using syntactic features alone, the system achieved an improvement of 1.1% absolute in Unlabelled Attachment Score (UAS) on the test set over the best parser in the ensemble, while combining syntactic features with prosodic features (pauses and normalized duration) led to a further improvement of 0.4% absolute. These results demonstrate the relationship between syntax, syntactic phrase length, and prosody, and indicate the ability and future potential of prosody to resolve ambiguity and improve parsing.
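
    To make the reranking idea concrete, here is a small sketch (using PyTorch, which the dissertation does not necessarily use) in which an LSTM reads per-word features, a dependency-configuration id plus prosodic values such as pause length and normalized duration, and assigns each parse hypothesis a score; the highest-scoring hypothesis would be selected. All dimensions, names and feature layouts are assumptions for illustration, not the dissertation's model.

        # Illustrative hypothesis reranker: an LSTM scores each candidate parse
        # from per-word features (a dependency-configuration id plus prosodic
        # values). The feature layout and sizes are assumptions only.
        import torch
        import torch.nn as nn

        class HypothesisScorer(nn.Module):
            def __init__(self, num_configs=10, emb_dim=16, prosody_dim=2, hidden=32):
                super().__init__()
                self.config_emb = nn.Embedding(num_configs, emb_dim)
                self.rnn = nn.LSTM(emb_dim + prosody_dim, hidden, batch_first=True)
                self.out = nn.Linear(hidden, 1)

            def forward(self, config_ids, prosody):
                # config_ids: (hyps, words); prosody: (hyps, words, prosody_dim)
                x = torch.cat([self.config_emb(config_ids), prosody], dim=-1)
                _, (h, _) = self.rnn(x)
                return self.out(h[-1]).squeeze(-1)   # one score per hypothesis

        scorer = HypothesisScorer()
        config_ids = torch.randint(0, 10, (3, 5))    # 3 hypotheses, 5 words
        prosody = torch.rand(3, 5, 2)                # pause and duration features
        best = scorer(config_ids, prosody).argmax().item()   # index of chosen parse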

    From Discourse Structure To Text Specificity: Studies Of Coherence Preferences

    To communicate successfully through text, a writer needs to organize information into an understandable, well-structured discourse for the targeted audience. This involves deciding when to convey general statements, when to elaborate on details, and how much detail to convey, i.e., the level of specificity. This thesis explores the automatic prediction of text specificity and asks whether the perception of specificity varies across audiences. We characterize text specificity from two aspects: the instantiation discourse relation, and the specificity of sentences and words. We identify characteristics of instantiation that signal a change of specificity between sentences; features derived from these characteristics substantially improve detection of the relation. Using instantiation sentences as the basis for training, we propose a semi-supervised system that predicts sentence specificity with speed and accuracy. Furthermore, we present insights into the effect of underspecified words and phrases on the comprehension of text, and into the prediction of such words. We show distinct preferences in specificity and discourse structure among different audiences, and we investigate these distinctions in both cross-lingual and monolingual contexts. Cross-lingually, we identify discourse factors that significantly affect the quality of text translated from Chinese to English. Notably, a large portion of Chinese sentences are significantly more specific and need to be translated into multiple English sentences; we introduce a system using rich syntactic features to detect such sentences accurately. We also show that simplified text is more general, and that specific sentences are more likely to need simplification. Finally, we present evidence that the perception of sentence specificity differs between male and female readers.
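
    The semi-supervised prediction step might look roughly like the self-training loop sketched below (scikit-learn, with made-up sentences and thresholds): a classifier seeded with sentences labelled general or specific, for instance via the two sides of instantiation relations, repeatedly labels unlabelled sentences and absorbs its most confident predictions. This is an illustration of the general technique, not the thesis's actual system.

        # Self-training sketch for sentence specificity: seed labels (0 = general,
        # 1 = specific) are assumed to come from instantiation relations; the
        # classifier then absorbs its own confident predictions on unlabeled text.
        # Sentences, features and thresholds here are illustrative only.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        labeled = ["The weather was bad.", "It rained 40 mm in two hours on Tuesday."]
        labels = [0, 1]
        unlabeled = ["Sales grew.", "Sales grew 12% to $3.4 million in the second quarter."]

        vec = TfidfVectorizer()
        clf = LogisticRegression()
        for _ in range(3):                        # a few self-training rounds
            if not unlabeled:
                break
            clf.fit(vec.fit_transform(labeled), labels)
            probs = clf.predict_proba(vec.transform(unlabeled))
            confident = [(s, p.argmax()) for s, p in zip(unlabeled, probs) if p.max() > 0.8]
            for sent, label in confident:         # promote confident predictions
                labeled.append(sent)
                labels.append(int(label))
            unlabeled = [s for s in unlabeled if s not in dict(confident)]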

    Larger-first partial parsing

    Larger-first partial parsing is a primarily top-down approach to partial parsing, the opposite of current easy-first, primarily bottom-up, strategies. A rich partial tree structure is captured by an algorithm that assigns a hierarchy of structural tags to each of the input tokens in a sentence. Part-of-speech tags are first assigned to the words in a sentence by a part-of-speech tagger. A cascade of deterministic finite-state automata then uses this part-of-speech information to identify syntactic relations, primarily in descending order of their size. The cascade is divided into four specialized sections: (1) a Comma Network, which identifies syntactic relations associated with commas; (2) a Conjunction Network, which partially disambiguates phrasal conjunctions and fully disambiguates clausal conjunctions; (3) a Clause Network, which identifies non-comma-delimited clauses; and (4) a Phrase Network, which identifies the remaining base phrases in the sentence. Each automaton is capable of adding one or more levels of structural tags to the tokens in a sentence. The larger-first approach is compared against a well-known easy-first approach. The results indicate that the larger-first approach is capable of (1) producing a more detailed partial parse than an easy-first approach; (2) providing better containment of attachment ambiguity; (3) handling overlapping syntactic relations; and (4) achieving higher accuracy than the easy-first approach. The automata of each network were developed through an empirical analysis of several sources and are presented here in detail.
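
    A toy cascade in this larger-first spirit is sketched below in Python: each stage is a pattern over part-of-speech tags that brackets larger units (here a subordinate clause) before smaller ones (base noun phrases) and stacks a structural tag onto every token it covers. The patterns and tag names are invented for illustration and are not the thesis's automata.

        # Toy larger-first cascade: stages run over the POS-tag string from
        # largest to smallest unit, and each match pushes a structural tag
        # onto the tokens it covers, building a hierarchy of tags per token.
        import re

        def tag_span(tokens, start, end, label):
            for i in range(start, end):
                tokens[i]["struct"].append(label)

        def run_cascade(tokens):
            pos_string = " ".join(t["pos"] for t in tokens)
            stages = [
                (r"IN( \w+)+?(?= ,|$)", "SBAR"),   # larger: subordinate clause
                (r"(DT )?(JJ )*NNS?", "NP"),       # smaller: base noun phrase
            ]
            for pattern, label in stages:
                for m in re.finditer(pattern, pos_string):
                    start = pos_string[:m.start()].count(" ")
                    end = start + m.group().count(" ") + 1
                    tag_span(tokens, start, end, label)
            return tokens

        sent = [{"word": w, "pos": p, "struct": []} for w, p in
                [("because", "IN"), ("the", "DT"), ("dog", "NN"), ("barked", "VBD"),
                 (",", ","), ("she", "PRP"), ("left", "VBD")]]
        for tok in run_cascade(sent):
            print(tok["word"], tok["struct"])      # e.g. dog ['SBAR', 'NP']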

    A robust unification-based parser for Chinese natural language processing.

    Chan Shuen-ti Roy. Thesis (M.Phil.), Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 168-175). Abstracts in English and Chinese.
    Chapter 1. Introduction
        1.1. The nature of natural language processing
        1.2. Applications of natural language processing
        1.3. Purpose of study
        1.4. Organization of this thesis
    Chapter 2. Organization and methods in natural language processing
        2.1. Organization of natural language processing system
        2.2. Methods employed
        2.3. Unification-based grammar processing
            2.3.1. Generalized Phrase Structure Grammar (GPSG)
            2.3.2. Head-driven Phrase Structure Grammar (HPSG)
            2.3.3. Common drawbacks of UBGs
        2.4. Corpus-based processing
            2.4.1. Drawback of corpus-based processing
    Chapter 3. Difficulties in Chinese language processing and its related works
        3.1. A glance at the history
        3.2. Difficulties in syntactic analysis of Chinese
            3.2.1. Writing system of Chinese causes segmentation problem
            3.2.2. Words serving multiple grammatical functions without inflection
            3.2.3. Word order of Chinese
            3.2.4. The Chinese grammatical word
        3.3. Related works
            3.3.1. Unification grammar processing approach
            3.3.2. Corpus-based processing approach
        3.4. Restatement of goal
    Chapter 4. SERUP: Statistical-Enhanced Robust Unification Parser
    Chapter 5. Step One: automatic preprocessing
        5.1. Segmentation of lexical tokens
        5.2. Conversion of date, time and numerals
        5.3. Identification of new words
            5.3.1. Proper nouns - Chinese names
            5.3.2. Other proper nouns and multi-syllabic words
        5.4. Defining the smallest parsing unit
            5.4.1. The Chinese sentence
            5.4.2. Breaking down the paragraphs
            5.4.3. Implementation
    Chapter 6. Step Two: grammar construction
        6.1. Criteria in choosing a UBG model
        6.2. The grammar in detail
            6.2.1. The PHON feature
            6.2.2. The SYN feature
            6.2.3. The SEM feature
            6.2.4. Grammar rules and feature principles
            6.2.5. Verb phrases
            6.2.6. Noun phrases
            6.2.7. Prepositional phrases
            6.2.8. "Ba2" and "Bei4" constructions
            6.2.9. The terminal node S
            6.2.10. Summary of phrasal rules
            6.2.11. Morphological rules
    Chapter 7. Step Three: resolving structural ambiguities
        7.1. Sources of ambiguities
        7.2. The traditional practices: an illustration
        7.3. Deficiency of current practices
        7.4. A new point of view: Wu (1999)
        7.5. Improvement over Wu (1999)
        7.6. Conclusion on semantic features
    Chapter 8. Implementation, performance and evaluation
        8.1. Implementation
        8.2. Performance and evaluation
            8.2.1. The test set
            8.2.2. Segmentation of lexical tokens
            8.2.3. New word identification
            8.2.4. Parsing unit segmentation
            8.2.5. The grammar
        8.3. Overall performance of SERUP
    Chapter 9. Conclusion
        9.1. Summary of this thesis
        9.2. Contribution of this thesis
        9.3. Future work
    References
    Appendix I
    Appendix II
    Appendix III
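
    Unification, the operation at the heart of any unification-based grammar, can be illustrated with the following generic Python sketch (it is not SERUP's implementation): two feature structures unify when their shared features are compatible, and the result merges the information from both.

        # Generic feature-structure unification: merge two nested dictionaries
        # of features, failing (None) on any clash of atomic values. This is a
        # textbook-style illustration, not the parser described in the thesis.
        def unify(fs1, fs2):
            result = dict(fs1)
            for key, value in fs2.items():
                if key not in result:
                    result[key] = value
                elif isinstance(result[key], dict) and isinstance(value, dict):
                    sub = unify(result[key], value)
                    if sub is None:
                        return None          # clash inside a nested structure
                    result[key] = sub
                elif result[key] != value:
                    return None              # atomic value clash
            return result

        np = {"SYN": {"cat": "NP", "num": "sg"}}
        print(unify(np, {"SYN": {"cat": "NP"}}))        # succeeds, keeps num=sg
        print(unify(np, {"SYN": {"num": "pl"}}))        # None: number clash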

    Detecting grammatical errors with treebank-induced, probabilistic parsers

    Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem unsuitable for grammar checking: because of their high robustness they massively over-generate and fail to reject ungrammatical input. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse, using grammatical training data that has previously been parsed and annotated with parse probabilities; if the estimated probability of an input sentence whose grammaticality is to be judged exceeds the actual parse probability by a certain amount, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments, in the form of CFG rules, from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences; the ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches: one that uses a hand-crafted discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine-learning-based framework, yielding further improvements.
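
    The second approach lends itself to a compact illustration. The sketch below (scikit-learn, with made-up sentences, log probabilities and a hypothetical margin) fits a simple regressor that estimates the parse log-probability a grammatical sentence of a given shape should receive, and flags a sentence when the parser's actual log-probability falls short of that estimate; it is a schematic stand-in for the method, not the authors' implementation.

        # Sketch of the estimator-based method: predict the expected parse
        # log-probability from surface features fitted on parsed grammatical
        # text, then flag sentences whose actual parse log-probability falls
        # short of the estimate by a margin. All numbers are invented.
        from sklearn.linear_model import LinearRegression

        def features(sentence):
            tokens = sentence.split()
            return [len(tokens), sum(len(t) for t in tokens) / len(tokens)]

        # Grammatical sentences with the log probabilities a treebank-induced
        # parser is assumed to have assigned to their best parses.
        train = [("the cat sat on the mat", -20.5),
                 ("she reads a book every evening", -24.1),
                 ("we walked to the station together", -23.8)]
        reg = LinearRegression().fit([features(s) for s, _ in train],
                                     [p for _, p in train])

        def is_flagged(sentence, actual_logprob, margin=3.0):
            estimated = reg.predict([features(sentence)])[0]
            # Ungrammatical if the actual probability is much lower (in log
            # space) than expected for a grammatical sentence of this shape.
            return actual_logprob < estimated - margin

        print(is_flagged("the the cat sat mat on", -30.2))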

    Proceedings

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893