28 research outputs found

    Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web

    Get PDF
    The amount of Semantic Web data is growing rapidly today. Individual users, academic institutions and businesses have already published and are continuing to publish their data in Semantic Web standards, such as RDF and OWL. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. Furthermore, data published by each individual publisher may not be complete. This situation makes it difficult for end users to consume the available Semantic Web data effectively. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. In the Semantic Web, the owl:sameAs predicate is used to link two equivalent (coreferent) ontology instances. An important question is where these owl:sameAs links come from. Although manual interlinking is possible on small scales, when dealing with large-scale datasets (e.g., millions of ontology instances), automated linking becomes necessary. This dissertation summarizes contributions to several aspects of entity coreference research in the Semantic Web. First of all, by developing the EPWNG algorithm, we advance the performance of the state-of-the-art by 1% to 4%. EPWNG finds coreferent ontology instances from different data sources by comparing every pair of instances and focuses on achieving high precision and recall by appropriately collecting and utilizing instance context information domain-independently. We further propose a sampling and utility function based context pruning technique, which provides a runtime speedup factor of 30 to 75. Furthermore, we develop an on-the-fly candidate selection algorithm, P-EPWNG, that enables the coreference process to run 2 to 18 times faster than the state-of-the-art on up to 1 million instances while only making a small sacrifice in the coreference F1-scores. This is achieved by utilizing the matching histories of the instances to prune instance pairs that are not likely to be coreferent. We also propose Offline, another candidate selection algorithm, that not only provides similar runtime speedup to P-EPWNG but also helps to achieve higher candidate selection and coreference F1-scores due to its more accurate filtering of true negatives. Different from P-EPWNG, Offline pre-selects candidate pairs by only comparing their partial context information that is selected in an unsupervised, automatic and domain-independent manner.In order to be able to handle really heterogeneous datasets, a mechanism for automatically determining predicate comparability is proposed. Combing this property matching approach with EPWNG and Offline, our system outperforms state-of-the-art algorithms on the 2012 Billion Triples Challenge dataset on up to 2 million instances for both coreference F1-score and runtime. An interesting project, where we apply the EPWNG algorithm for assisting cervical cancer screening, is discussed in detail. By applying our algorithm to a combination of different patient clinical test results and biographic information, we achieve higher accuracy compared to its ablations. We end this dissertation with the discussion of promising and challenging future work

    Improved Coreference Resolution Using Cognitive Insights

    Get PDF
    Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of reference expression usage in natural language and the difficulty in encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve the highly competitive benchmark performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves statistically significant improvements and sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our and further cognitive insights into more complex frameworks since this has the potential to both improve the performance of computational models, as well as our understanding of the mechanisms underpinning human reference resolution

    Optimization issues in machine learning of coreference resolution

    Get PDF

    Reasoning-Driven Question-Answering For Natural Language Understanding

    Get PDF
    Natural language understanding (NLU) of text is a fundamental challenge in AI, and it has received significant attention throughout the history of NLP research. This primary goal has been studied under different tasks, such as Question Answering (QA) and Textual Entailment (TE). In this thesis, we investigate the NLU problem through the QA task and focus on the aspects that make it a challenge for the current state-of-the-art technology. This thesis is organized into three main parts: In the first part, we explore multiple formalisms to improve existing machine comprehension systems. We propose a formulation for abductive reasoning in natural language and show its effectiveness, especially in domains with limited training data. Additionally, to help reasoning systems cope with irrelevant or redundant information, we create a supervised approach to learn and detect the essential terms in questions. In the second part, we propose two new challenge datasets. In particular, we create two datasets of natural language questions where (i) the first one requires reasoning over multiple sentences; (ii) the second one requires temporal common sense reasoning. We hope that the two proposed datasets will motivate the field to address more complex problems. In the final part, we present the first formal framework for multi-step reasoning algorithms, in the presence of a few important properties of language use, such as incompleteness, ambiguity, etc. We apply this framework to prove fundamental limitations for reasoning algorithms. These theoretical results provide extra intuition into the existing empirical evidence in the field

    Knowledge acquisition for coreference resolution

    Get PDF
    Diese Arbeit befasst sich mit dem Problem der statistischen Koreferenzauflösung. Theoretische Studien bezeichnen Koreferenz als ein vielseitiges linguistisches PhĂ€nomen, das von verschiedenen Faktoren beeinflusst wird. Moderne statistiche Algorithmen dagegen basieren sich typischerweise auf einfache wissensarme Modelle. Ziel dieser Arbeit ist das Schließen der LĂŒcke zwischen Theorie und Praxis. Ausgehend von den Erkentnissen der theoretischen Studien erfolgt die Bestimmung der linguistischen Faktoren die fuer die Koreferenz besonders relevant erscheinen. Unterschiedliche Informationsquellen werden betrachtet: von der OberflĂ€chenĂŒbereinstimmung bis zu den tieferen syntaktischen, semantischen und pragmatischen Merkmalen. Die PrĂ€zision der untersuchten Faktoren wird mit korpus-basierten Methoden evaluiert. Die Ergebnisse beweisen, dass die Koreferenz mit den linguistischen, in den theoretischen Studien eingebrachten Merkmalen interagiert. Die Arbeit zeigt aber auch, dass die Abdeckung der untersuchten theoretischen Aussagen verbessert werden kann. Die Merkmale stellen die Grundlage fĂŒr den Aufbau eines einerseits linguistisch gesehen reichen andererseits auf dem Machinellen Lerner basierten, d.h. eines flexiblen und robusten Systems zur Koreferenzauflösung. Die aufgestellten Untersuchungen weisen darauf hin dass das wissensreiche Model erfolgversprechende Leistung zeigt und im Vergleich mit den Algorithmen, die sich auf eine einzelne Informationsquelle verlassen, sowie mit anderen existierenden Anwendungen herausragt. Das System erreicht einen F-wert von 65.4% auf dem MUC-7 Korpus. In den bereits veröffentlichen Studien ist kein besseres Ergebnis verzeichnet. Die Lernkurven zeigen keine Konvergenzzeichen. Somit kann der Ansatz eine gute Basis fuer weitere Experimente bilden: eine noch bessere Leistung kann dadurch erreicht werden, dass man entweder mehr Texte annotiert oder die bereits existierende Daten effizienter einsetzt. Diese Arbeit beweist, dass statistiche Algorithmen fuer Koreferenzauflösung stark von den theoretischen linguistischen Studien profitiern können und sollen: auch unvollstĂ€ndige Informationen, die automatische fehleranfĂ€llige Sprachmodule liefern, können die Leistung der Anwendung signifikant verbessern.This thesis addresses the problem of statistical coreference resolution. Theoretical studies describe coreference as a complex linguistic phenomenon, affected by various different factors. State-of-the-art statistical approaches, on the contrary, rely on rather simple knowledge-poor modeling. This thesis aims at bridging the gap between the theory and the practice. We use insights from linguistic theory to identify relevant linguistic parameters of co-referring descriptions. We consider different types of information, from the most shallow name-matching measures to deeper syntactic, semantic, and discourse knowledge. We empirically assess the validity of the investigated theoretic predictions for the corpus data. Our data-driven evaluation experiments confirm that various linguistic parameters, suggested by theoretical studies, interact with coreference and may therefore provide valuable information for resolution systems. At the same time, our study raises several issues concerning the coverage of theoretic claims. It thus brings feedback to linguistic theory. We use the investigated knowledge sources to build a linguistically informed statistical coreference resolution engine. This framework allows us to combine the flexibility and robustness of a machine learning-based approach with wide variety of data from different levels of linguistic description. Our evaluation experiments with different machine learners show that our linguistically informed model, on the one side, outperforms algorithms, based on a single knowledge source and, on the other side, yields the best result on the MUC-7 data, reported in the literature (F-score of 65.4% with the SVM-light learning algorithm). The learning curves for our classifiers show no signs of convergence. This suggests that our approach makes a good basis for further experimentation: one can obtain even better results by annotating more material or by using the existing data more intelligently. Our study proves that statistical approaches to the coreference resolution task may and should benefit from linguistic theories: even imperfect knowledge, extracted from raw text data with off-the-shelf error-prone NLP modules, helps achieve significant improvements

    Syntax at the edge: cross-clausal phenomena and the syntax of passamaquoddy

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 2001.Includes bibliographical references (p. 307-319).by Benjamin Bruening.Ph.D

    Warlpiri : theoretical implications

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 2002.Includes bibliographical references (v. 2, leaves 230-241).The issue of non-configurationality is fundamental in determining the possible range of variation in Universal Grammar. This dissertation investigates this issue in the context of Warlpiri, the prototypical non-configurational language. I argue that positing a macroparameter, a single parameter that distinguishes configurational languages from non-configurational, requires variation on a magnitude not permitted by Universal Grammar. After refuting in detail previous macroparametric approaches, I propose a microparametric analysis: non-configurational languages are fully configurational and analysed through fine-grained parameters with independent motivation. I develop this approach for Warlpiri,partially on the basis of new data collected through work with Warlpiri consultants and analysis of Warlpiri texts. Beginning with A-syntax, I show that Warlpiri exhibits short-distance A-scrambling through binding and WCO data. I present an analysis of split ergativity in Warlpiri (ergative-/absolutive case-marking, nominative/accusative agreement), deriving the split from a dissociation of case and agreement, and the inherent nature of ergative case, rather than from non-configurationality. Extending the analysis to applicative constructions in Warlpiri, I identify both symmetric and asymmetric applicatives. I argue that the principled distinctions between them are explained structurally rather than lexically; therefore the applicative data provide evidence for a hierarchical verb phrase in Warlpiri. The analysis reveals the first reported distinction between unaccusative and unergative verbs in the language.(cont.) Turning to A'-syntax, I argue that word order is not free in Warlpiri; rather Warlpiri displays an articulated left peripheral structure. Thus, word order variations are largely determined by positioning of elements in ordered functional projections based on their status in the discourse. Furthermore, I present evidence from WCO and island effects that elements appear in these projections through movement. Finally, I investigate the wh-scope marking construction, arguing for an indirect dependency approach. In developing the analysis, I argue, contrary to standard assumptions, that the dependent clauses related with verbs of saying in Warlpiri are embedded rather than adjoined. On the basis of a poverty of the stimulus argument, I conclude the construction must follow from independent properties of the language. I propose that it follows from the discontinuous constituent construction, which I equate with split DPs/PPs in Germanic and Slavic languages. The syntactic structure of Warlpiri that emerges from the dissertation strongly supports a configurational analysis of the language, and thereby the microparameter approach to nonconfigurationality.by Julie Anne Legate.Ph.D
    corecore