
    The TALP participation at TAC-KBP 2012

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its first participation at TAC-KBP 2012 in both the Entity Linking and the Slot Filling tasks.

    The TALP participation at TAC-KBP 2013

    This document describes the work performed by the Universitat Politècnica de Catalunya (UPC) in its second participation at TAC-KBP 2013 in both the Entity Linking and the Slot Filling tasks.

    PolyUCOMP in TAC 2011 entity linking and slot filling

    The Text Analysis Conference (TAC) is organized by the U.S. National Institute of Standards and Technology (NIST).

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiting factor in SF system performance. We contribute an analysis of typical SF recall loss, and find that a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify this as a major concern for propagation. While some conflicts are caused by a lack of sufficient disambiguating context (we explore adding contextual features to address this), many are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemes do not specify how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in world knowledge and on their thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
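
    The label propagation framing above can be made concrete with a small sketch. The following is a minimal illustration under stated assumptions, not the thesis's actual formulation: it assumes a hypothetical mention graph given as a weighted adjacency matrix, seed nodes clamped to known slot labels, and iterative blending of neighbour scores in the style of standard semi-supervised label propagation.

        import numpy as np

        def propagate_labels(adjacency, labels, seed_mask, alpha=0.85, iters=50):
            """Semi-supervised label propagation over a mention graph.

            adjacency : (n, n) symmetric edge-weight matrix between nodes
            labels    : (n, k) initial label scores; seed rows hold gold labels
            seed_mask : (n,) boolean array marking the labelled seed nodes
            alpha     : weight of neighbour evidence versus initial labels
            """
            # Row-normalise the adjacency matrix into a transition matrix.
            row_sums = adjacency.sum(axis=1, keepdims=True)
            transition = adjacency / np.maximum(row_sums, 1e-12)

            scores = labels.astype(float)
            for _ in range(iters):
                # Blend neighbour scores with the initial label distribution ...
                scores = alpha * (transition @ scores) + (1 - alpha) * labels
                # ... and clamp the seeds back to their known labels each round.
                scores[seed_mask] = labels[seed_mask]
            return scores.argmax(axis=1)

    Clamping the seeds on every iteration is what lets the labelled nodes keep pulling their neighbourhoods toward the correct slot values; it also illustrates why errors close to training data, as noted above, are so damaging to propagation.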

    Linking named entities to Wikipedia

    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, to return NIL. Entity linking systems can be complex, so we present a framework for analysing their components, which we use to analyse three seminal systems evaluated on a common dataset, showing the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local descriptions to our state-of-the-art linker, using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, it also allows us to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
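
    As a rough illustration of the pipeline this abstract describes (candidate search, context matching, NIL detection), here is a hedged sketch. The candidate_index and article_tokens structures are hypothetical stand-ins rather than the thesis's components, and the scoring is plain cosine similarity over bags of words.

        import math
        from collections import Counter

        def cosine(a: Counter, b: Counter) -> float:
            """Cosine similarity between two bag-of-words vectors."""
            dot = sum(a[t] * b[t] for t in a if t in b)
            norm = (math.sqrt(sum(v * v for v in a.values()))
                    * math.sqrt(sum(v * v for v in b.values())))
            return dot / norm if norm else 0.0

        def link_mention(mention, context_tokens, candidate_index,
                         article_tokens, nil_threshold=0.1):
            """Link a mention to a KB title, or return "NIL".

            candidate_index : name string -> list of possible article titles
                              (hypothetical; e.g. from redirects and anchors)
            article_tokens  : article title -> list of its tokens
            """
            context = Counter(context_tokens)
            best_title, best_score = "NIL", nil_threshold
            for title in candidate_index.get(mention, []):
                score = cosine(context, Counter(article_tokens[title]))
                if score > best_score:  # anything below the threshold stays NIL
                    best_title, best_score = title, score
            return best_title

    Restricting context_tokens to a local description around the mention, as the thesis proposes, makes the match more precise and makes a failure to clear the NIL threshold itself informative.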

    Enhancing knowledge acquisition systems with user generated and crowdsourced resources

    This thesis is about enhancing knowledge acquisition systems with collaborative data and crowdsourced work from the internet. We propose two strategies and apply them to building effective entity linking and question answering (QA) systems. The first strategy is to integrate an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with the corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address those problems, we create a cross-lingual taxonomy and a Chinese knowledge base (KB). We investigate two methods of connecting the query representation with the KB representation. Building on our CLEL system's participation in the TAC-KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy is to create annotation for QA systems with the help of crowdsourcing. Crowdsourcing distributes a task via the internet and recruits many people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms underlying the components of our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for passage ranking in QA.
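
    The abstract mentions statistical methods for enhancing crowdsourced annotation quality without naming them; the simplest such method is majority voting with an agreement score. A minimal sketch follows, with a hypothetical input format.

        from collections import Counter, defaultdict

        def aggregate_votes(judgements):
            """Majority-vote aggregation of crowdsourced labels.

            judgements : iterable of (item_id, worker_id, label) tuples
            Returns {item_id: (label, agreement)}, where agreement is the
            fraction of workers voting for the winning label, so that
            low-agreement items can be filtered or re-posted before training.
            """
            votes = defaultdict(Counter)
            for item_id, _worker, label in judgements:
                votes[item_id][label] += 1
            result = {}
            for item_id, counts in votes.items():
                label, count = counts.most_common(1)[0]
                result[item_id] = (label, count / sum(counts.values()))
            return result

    More elaborate schemes weight workers by their estimated reliability, but the interface is the same: noisy micro-task judgements in, cleaned labels out, which can then feed the learning-to-rank training described above.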

    Entitate izendunen desanbiguazioa ezagutza-base erraldoien arabera (Named entity disambiguation based on very large knowledge bases)

    Search engines have become almost indispensable for navigating the internet today, and the best known of them is Google. A large part of their current success comes from the exploitation of knowledge bases: with semantic search, they can enrich plain queries with information from knowledge bases. For example, when searching for information about a music band, they offer additional links to its discography or members; when searching for information about the president of a country, they offer links to previous presidents or additional information about that country. However, there is a problem that threatens the success of today's thriving semantic search: ambiguous terms compromise the suitability of the information retrieved from knowledge bases. The greatest problems are created by mentions of proper names, or named entities. The main goal of this thesis is to study named entity disambiguation (NED) and to propose new techniques for performing it. NED systems disambiguate name mentions in text and link them to entities in knowledge bases. Because of their ambiguous nature, name mentions may denote several entities; moreover, the same entity may be referred to by several different names, so correctly disambiguating these mentions is the key problem of the thesis. To this end, two disambiguation models at the core of the state of the art are first studied: a global model that exploits the structure of knowledge bases, and a local model that exploits information from the words in the mention's context. The two information sources are then combined in a complementary fashion; the combination surpasses state-of-the-art results on several datasets and matches them on the rest. Secondly, novel ideas for improving any disambiguation system are proposed, analysed and evaluated. On the one hand, the behaviour of entities is studied at the discourse, collection and co-occurrence levels, confirming that entities follow a particular pattern; based on that pattern, the results of the global model, the local model and a further NED system are significantly improved. On the other hand, the local model is fed with knowledge acquired from external corpora; this contribution evaluates the quality of the external knowledge and justifies its contribution to the system, again improving the results of the local model to state-of-the-art levels. The thesis is presented as a collection of articles: after the introduction and the state of the art, the four English-language articles on which the thesis is based are included, followed by general conclusions that draw together the topics addressed in the four articles.
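
    The thesis combines a global (KB-structure) model with a local (context) model; the abstract does not say how the two are combined, so the sketch below simply assumes a linear interpolation of per-candidate scores.

        def combine_scores(global_scores, local_scores, lam=0.5):
            """Linear interpolation of a global (KB-graph) model and a local
            (textual-context) model. Both dicts map candidate entity -> score
            and are assumed normalised to [0, 1]; lam trades off the models.
            """
            candidates = set(global_scores) | set(local_scores)
            if not candidates:
                return None  # no candidate was generated for this mention
            combined = {
                c: lam * global_scores.get(c, 0.0)
                   + (1 - lam) * local_scores.get(c, 0.0)
                for c in candidates
            }
            return max(combined, key=combined.get)

    The complementarity claimed above shows up here as candidates that only one model scores highly: interpolation lets either source rescue a candidate the other model misses.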

    Collective Approaches to Named Entity Disambiguation

    Internet content has become one of the most important sources of information. Much of this information is in the form of natural language text, and one of the important components of natural language text is named entities. Consequently, the automatic recognition and classification of named entities has attracted researchers for many years. Named entities are mentioned in different textual forms in different documents, and the same textual mention may refer to different named entities. This problem is well known in NLP as a disambiguation problem. Named Entity Disambiguation (NED) refers to the task of mapping different named entity mentions in running text to their correct interpretations in a specific knowledge base (KB). NED is important for many applications, such as search engines and software agents that aim to aggregate information on real-world entities from sources such as the Web. The main goal of this research is to develop new methods for named entity disambiguation, emphasising the importance of the interdependency of the named entity candidates of different textual mentions in a document. The thesis focuses on two connected problems related to disambiguation. The first is candidate generation, the process of finding a small set of named entity candidate entries in the knowledge base for a specific textual mention, such that this set contains the correct entry. The second is collective disambiguation, where all named entity textual mentions in the document are disambiguated jointly, using the interdependence and semantic relations between the NE candidates of different textual mentions. Wikipedia is used as the reference knowledge base in this research. An information retrieval framework is used to generate the named entity candidates for a textual mention. A novel document similarity function (NEB-sim) based on NE co-occurrence is introduced to calculate the similarity between two documents given a specific named entity textual mention. NEB-sim is also used in conjunction with the traditional cosine similarity measure to learn a model for ranking the named entity candidates. Naïve Bayes and SVM classifiers are used to re-rank the retrieved documents. Our experiments, carried out on TAC-KBP 2011 data, show that NEB-sim achieves a significant improvement in accuracy compared with a cosine similarity approach. Two novel approaches to collectively disambiguate textual mentions of named entities against Wikipedia are developed and tested using the AIDA dataset. The first represents the conditional dependencies between different named entities across Wikipedia as a Markov network, where named entities are treated as hidden variables and textual mentions as observations. The number of states and observations is huge, and naïvely using the Viterbi algorithm to find the hidden state sequence which emits the query observation sequence is computationally infeasible given a state space of this size. Based on an observation specific to the disambiguation problem, we develop an approach that uses a tailored approximation to reduce the size of the state space, making the Viterbi algorithm feasible. Results show a good improvement in disambiguation accuracy relative to the baseline approach, and to some state-of-the-art approaches. Our approach also shows how, with suitable approximations, HMMs can be used in such large-scale state-space problems.
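
    The thesis's tailored state-space approximation is not spelled out in the abstract; as a stand-in, the sketch below shows the general shape of a pruned Viterbi-style search over entity candidates, using a simple beam (keep the k best partial paths per mention). The scoring callables start_p, trans_p and emit_p are hypothetical.

        def beam_viterbi(mentions, candidates, start_p, trans_p, emit_p, beam=5):
            """Beam-pruned Viterbi-style search over entity candidates.

            mentions   : list of textual mentions, in document order
            candidates : {mention: [entity, ...]} possible KB entries
            start_p, emit_p, trans_p : scoring callables (hypothetical);
                trans_p(prev_entity, entity) scores semantic coherence
            """
            first = mentions[0]
            paths = {(e,): start_p(e) * emit_p(e, first) for e in candidates[first]}
            for mention in mentions[1:]:
                scored = {}
                for path, score in paths.items():
                    for e in candidates[mention]:
                        scored[path + (e,)] = (score * trans_p(path[-1], e)
                                               * emit_p(e, mention))
                # Prune: keep only the `beam` best partial paths.
                paths = dict(sorted(scored.items(), key=lambda kv: kv[1],
                                    reverse=True)[:beam])
            return max(paths.items(), key=lambda kv: kv[1])[0]
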
    The second collective disambiguation approach uses a graph model, where all possible NE candidates are represented as nodes in the graph and associations between different candidates are represented by edges between the nodes. Each node has an initial confidence score, e.g. entity popularity. PageRank is used to rank the nodes, and the final rank is combined with the initial confidence for candidate selection. Experiments show the effectiveness of using PageRank in conjunction with the initial confidence, achieving 87% accuracy and outperforming both the baseline and state-of-the-art approaches.
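
    A minimal sketch of the second approach as described: PageRank over the candidate graph, with teleportation biased by the initial confidence scores and a final mix of rank and confidence. The specific weights and normalisations here are illustrative assumptions.

        import numpy as np

        def rank_candidates(adjacency, confidence, damping=0.85, iters=100, mix=0.5):
            """PageRank over the candidate graph, combined with the
            candidates' initial confidence scores.

            adjacency  : (n, n) non-negative edge weights between candidates
            confidence : (n,) positive initial scores (e.g. entity popularity)
            mix        : weight of the PageRank score in the final ranking
            """
            n = len(confidence)
            out = adjacency.sum(axis=0)
            transition = adjacency / np.maximum(out, 1e-12)  # column-stochastic
            teleport = confidence / confidence.sum()  # confidence-biased jumps
            rank = np.full(n, 1.0 / n)
            for _ in range(iters):
                rank = damping * (transition @ rank) + (1 - damping) * teleport
            # Normalise both signals, then combine them for candidate selection.
            return mix * rank / rank.max() + (1 - mix) * confidence / confidence.max()

    Biasing the teleport vector by confidence keeps popular entities competitive even when they are weakly connected, while the graph walk rewards candidates that cohere with the other mentions' candidates.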

    Joint Discourse-aware Concept Disambiguation and Clustering

    This thesis addresses the tasks of concept disambiguation and clustering. Concept disambiguation is the task of linking common nouns and proper names in a text (henceforth called mentions) to their corresponding concepts in a predefined inventory. Concept clustering is the task of clustering mentions so that all mentions in one cluster denote the same concept. In this thesis, we investigate concept disambiguation and clustering from a discourse perspective and propose a discourse-aware approach for joint concept disambiguation and clustering in the framework of Markov logic. The contributions of this thesis are fourfold:

    Joint Concept Disambiguation and Clustering. In previous approaches, concept disambiguation and concept clustering have been considered as two separate tasks (Schütze, 1998; Ji & Grishman, 2011). We analyse the relationship between concept disambiguation and concept clustering and argue that these two tasks can mutually support each other. We propose what is, to our knowledge, the first joint approach for concept disambiguation and clustering.

    Discourse-Aware Concept Disambiguation. One of the determining factors for concept disambiguation and clustering is the context definition. Most previous approaches use the same context definition for all mentions (Milne & Witten, 2008b; Kulkarni et al., 2009; Ratinov et al., 2011, inter alia). We approach the question of which context is relevant to disambiguate a mention from a discourse perspective and argue that different mentions require different notions of context: the context that is relevant to disambiguate a mention depends on its embedding into discourse. However, how a mention is embedded into discourse depends on its denoted concept. Hence, the identification of the denoted concept and of the relevant context mutually depend on each other. We propose a binwise approach with three different context definitions and model the selection of the context definition and the disambiguation jointly.

    Modeling Interdependencies with Markov Logic. To model the interdependencies between concept disambiguation and concept clustering, as well as those between the context definition and the disambiguation, we use Markov logic (Domingos & Lowd, 2009). Markov logic combines first-order logic with probabilities and allows us to formalize these interdependencies concisely. We investigate how to balance linguistic appropriateness against time efficiency and propose a hybrid approach that combines joint inference with aggregation techniques.

    Concept Disambiguation and Clustering beyond English: Multi- and Cross-linguality. Given the vast amount of text written in languages other than English, the capability to extend an approach to cope with those languages is essential. We therefore analyse how our approach copes with languages other than English and show that it largely scales across languages, even without retraining.

    Our approach is evaluated on multiple datasets originating from different sources (e.g. news, web) and across multiple languages. As the inventory, we use Wikipedia. We compare our approach to other approaches and show that it achieves state-of-the-art results. Furthermore, we show that joint concept disambiguation and clustering, as well as joint context selection and disambiguation, lead to significant improvements ceteris paribus.
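
    For readers unfamiliar with Markov logic: following the definition the thesis builds on (Domingos & Lowd, 2009), each first-order formula i carries a weight w_i, and the probability of a possible world x is determined by the weighted counts n_i(x) of true groundings of each formula:

        P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big)

    Higher-weight formulas act as stronger soft constraints, which is what allows the disambiguation, clustering and context-selection decisions to interact during joint inference rather than being made in isolation.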