
    Fast and Large-scale Unsupervised Relation Extraction

    A common approach to unsupervised relation extraction builds clusters of patterns expressing the same relation. In order to obtain clusters of relational patterns of good quality, we face two major challenges: the semantic representation of relational patterns and scalability to large data. In this paper, we explore various methods for modeling the meaning of a pattern and for computing the similarity of patterns mined from huge data. To achieve this goal, we apply algorithms for approximate frequency counting and efficient dimension reduction to unsupervised relation extraction. The experimental results show that approximate frequency counting and dimension reduction not only speed up similarity computation but also improve the quality of pattern vectors.
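    The paper's exact algorithms are not reproduced here, but the following sketch illustrates the general pipeline the abstract describes: context-count vectors for relational patterns, a simple dimension-reduction step (Gaussian random projection, one possible choice), and cosine similarity between the reduced vectors. All data, names and the choice of projection are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: pattern-context count vectors, random-projection
# dimension reduction, and cosine similarity between patterns.
import numpy as np
from collections import Counter, defaultdict

# Hypothetical (pattern, context word) co-occurrence observations.
observations = [
    ("X was born in Y", "city"), ("X was born in Y", "person"),
    ("X is the birthplace of Y", "city"), ("X is the birthplace of Y", "person"),
    ("X acquired Y", "company"), ("X acquired Y", "startup"),
]

# 1. Count context distributions per pattern (exact counting here;
#    the paper applies approximate frequency counting for scalability).
counts = defaultdict(Counter)
for pattern, ctx in observations:
    counts[pattern][ctx] += 1

contexts = sorted({c for ctr in counts.values() for c in ctr})
ctx_index = {c: i for i, c in enumerate(contexts)}
patterns = sorted(counts)
X = np.zeros((len(patterns), len(contexts)))
for i, p in enumerate(patterns):
    for c, n in counts[p].items():
        X[i, ctx_index[c]] = n

# 2. Dimension reduction by Gaussian random projection (one simple choice;
#    the paper compares several reduction techniques).
rng = np.random.default_rng(0)
k = 2                                  # reduced dimensionality, tiny for the toy example
R = rng.normal(size=(len(contexts), k)) / np.sqrt(k)
Z = X @ R

# 3. Cosine similarity between reduced pattern vectors.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

for i in range(len(patterns)):
    for j in range(i + 1, len(patterns)):
        print(patterns[i], "|", patterns[j], "->", round(cosine(Z[i], Z[j]), 3))
```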

    Modeling semantic compositionality of relational patterns

    Vector representation is a common approach for expressing the meaning of a relational pattern. Most previous work obtained a vector of a relational pattern based on the distribution of its context words (e.g., arguments of the relational pattern), regarding the pattern as a single ‘word’. However, this approach suffers from the data sparseness problem, because relational patterns are productive, i.e., produced by combinations of words. To address this problem, we propose a novel method for computing the meaning of a relational pattern based on the semantic compositionality of its constituent words. We extend the Skip-gram model (Mikolov et al., 2013) to handle semantic composition of relational patterns using recursive neural networks. The experimental results show the superiority of the proposed method for modeling the meanings of relational patterns, and demonstrate the contribution of this work to the task of relation extraction.
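    As a rough illustration of the idea (not the authors' implementation), the sketch below composes pretrained word vectors into a single relational-pattern vector with a recursive-neural-network-style step p = tanh(W [left; right] + b). In the paper the composition parameters would be trained jointly with the Skip-gram objective; here they are random placeholders, and the vocabulary is invented.

```python
# Minimal sketch: recursive composition of word vectors into a
# relational-pattern vector.
import numpy as np

d = 4                                   # toy embedding dimensionality
rng = np.random.default_rng(0)

# Hypothetical pretrained Skip-gram word vectors.
word_vec = {w: rng.normal(size=d) for w in ["be", "cause", "of", "caused", "by"]}

# Composition parameters (random placeholders; in the paper these would be
# learned together with the word vectors).
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)

def compose(words):
    """Fold the pattern's word vectors left-to-right into one vector."""
    vec = word_vec[words[0]]
    for w in words[1:]:
        vec = np.tanh(W @ np.concatenate([vec, word_vec[w]]) + b)
    return vec

pattern_vec = compose(["be", "cause", "of"])   # e.g. the pattern "be cause of"
print(pattern_vec)
```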

    Evaluation and improvement of DBpedia's quality for representing domain knowledge

    In the current state of the Semantic Web, the quantity of available data and the multiplicity of its uses make continuous evaluation of data quality across the various Linked Open Data (LOD) datasets indispensable. These datasets are based on the RDF syntax, i.e. triples of the form (subject, relation, object). As a consequence, the LOD cloud can be represented as a huge graph, where every triple links a "subject" node and an "object" node by a "relation" edge, and each dataset is a sub-graph. In this representation, DBpedia, one of the major datasets, is commonly considered the central hub of the cloud. Indeed, the ultimate purpose of DBpedia is to provide all the information present in Wikipedia, "translated" into RDF, and it therefore covers a very wide range of subjects, allowing linkage with every other LOD dataset, including the most specialized. From this wide coverage arises one of the fundamental notions of this project: the "domain". Informally, a domain is a set of subjects with a common theme; for instance, the domain Mathematics contains subjects such as algebra, function or addition. More formally, a domain is a sub-graph of DBpedia whose nodes represent domain-related concepts. Currently, the automatic extraction methods behind DBpedia are usually far less effective when the target subject is abstract and conceptual than when it is a named entity (such as a person, city or company). Hence our first hypothesis: the domain-related information available in DBpedia is often poor, since domains consist essentially of abstract concepts. In the first part of this research, we confirm this hypothesis by evaluating the quality of domain-related knowledge in DBpedia for 17 semi-randomly chosen domains. The evaluation rests on three numerical aspects of the "quality" of a domain: (1) the number of inbound and outbound links for each concept; (2) the number of links between two domain concepts compared to the number of links between the domain and the rest of DBpedia; (3) the number of typed concepts, i.e. concepts declared as instances of a class (for example, Addition is an instance of the class Mathematical operation; the concept Addition is typed if the corresponding rdf:type triple appears in DBpedia). We reach the conclusion that the domain-related, conceptual information in DBpedia is indeed poor along all three axes. In the second half of this work, we propose two approaches to the quality problem highlighted in the first half. The first proposes potential classes that could be added to DBpedia, addressing the third quality aspect, the number of typed concepts. The second applies an Open Relation Extraction (ORE) system, which detects relations in text, to the abstract (i.e. the first paragraph of the Wikipedia page) of each concept; by classifying the extracted relations according to their semantic meaning, we can (1) propose novel relations between domain concepts and (2) propose additional potential classes, as in the first approach. These two approaches are currently only a first step, but our preliminary results are very encouraging and indicate that they are relevant ways to help correct the issues demonstrated in the first part.
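    The three quality metrics lend themselves to a small illustration. The sketch below computes them over a toy set of triples; the triples, domain membership and relation names are invented for the example and are not DBpedia data.

```python
# Minimal sketch of the three domain-quality metrics: per-concept link
# counts, intra-domain link ratio, and proportion of typed concepts.
RDF_TYPE = "rdf:type"

# Hypothetical triples (subject, relation, object) for a tiny "Mathematics" domain.
triples = [
    ("Addition", RDF_TYPE, "Mathematical_operation"),
    ("Addition", "relatedTo", "Algebra"),
    ("Algebra", "fieldOf", "Mathematics"),
    ("Algebra", "studiedBy", "Emmy_Noether"),     # link leaving the domain
    ("Function", "usedIn", "Algebra"),
]
domain = {"Addition", "Algebra", "Function", "Mathematics"}

# 1. Inbound/outbound links per domain concept.
links = {c: {"in": 0, "out": 0} for c in domain}
for s, _, o in triples:
    if s in domain:
        links[s]["out"] += 1
    if o in domain:
        links[o]["in"] += 1

# 2. Ratio of intra-domain links to links crossing the domain boundary.
intra = sum(1 for s, _, o in triples if s in domain and o in domain)
crossing = sum(1 for s, _, o in triples if (s in domain) != (o in domain))
ratio = intra / max(crossing, 1)

# 3. Proportion of typed concepts (those with an rdf:type triple).
typed = {s for s, r, _ in triples if r == RDF_TYPE and s in domain}
typed_ratio = len(typed) / len(domain)

print(links)
print("intra/crossing ratio:", ratio)
print("typed concepts:", round(typed_ratio, 2))
```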

    Extraction of axioms and logical rules from natural-language Wikipedia definitions

    The Semantic Web relies on the creation of rich knowledge bases that link data on the Web. DBpedia started as a community effort and is today considered the central interlinking hub of the emerging Web of data. However, DBpedia relies on a lightweight ontology, suffers from substantial limitations, and lacks important information that can be found in the text and unstructured data of Wikipedia. In particular, the DBpedia ontology contains mainly taxonomical links and data about instances, and lacks class definitions. The objective of this work is to interpret the natural-language text of Wikipedia in order to enrich DBpedia with class definitions, a richer class hierarchy (taxonomical relations) and new information about instances. For this purpose, we rely on a pattern-based approach: syntactic patterns are implemented as SPARQL queries and executed over RDF graphs representing the syntactic analysis of textual definitions extracted from Wikipedia. This work resulted in the creation of AXIOpedia, an expressive knowledge base containing complex axioms defining classes and rdf:type triples linking instances to their classes.
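    As a hedged illustration of the pattern-as-SPARQL idea (using an invented, highly simplified parse schema rather than AXIOpedia's actual representation), the sketch below stores one parsed definition as RDF with rdflib and runs a SPARQL pattern that proposes a subclass axiom from a copula construction.

```python
# Minimal sketch: a definition's parse stored as RDF, queried with a
# SPARQL pattern that looks for the classic "X is a Y" construction.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/parse/")   # hypothetical parse vocabulary
g = Graph()

# Toy parse of the definition "An integer is a number": the copula pattern
# links a subject token to an attribute token.
sent = EX["sentence1"]
g.add((sent, EX.hasSubject, Literal("integer")))
g.add((sent, EX.hasCopula, Literal("is")))
g.add((sent, EX.hasAttribute, Literal("number")))

# Syntactic pattern expressed as a SPARQL query: whenever a sentence has a
# subject, a copula and an attribute, propose a subclass axiom.
query = """
PREFIX ex: <http://example.org/parse/>
SELECT ?sub ?attr WHERE {
    ?s ex:hasSubject ?sub ;
       ex:hasCopula ?cop ;
       ex:hasAttribute ?attr .
}
"""
for sub, attr in g.query(query):
    print(f"Proposed axiom: {sub} rdfs:subClassOf {attr}")
```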

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text, and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter on SF system performance. We contribute an analysis of typical SF recall loss, and find a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is a major concern for propagation. While there are some conflicts caused by a lack of sufficient disambiguating context (we explore adding additional contextual features to address this), many of these conflicts are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not specify how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in their world knowledge and their thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
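    The label-propagation framing can be illustrated with a minimal sketch: a few seed mentions carry known slot labels, and the remaining mentions iteratively absorb their neighbours' label distributions while the seeds stay clamped. The graph, labels and edge weights below are toy assumptions, not the thesis' graph construction.

```python
# Minimal sketch: iterative label propagation over a mention graph.
import numpy as np

labels = ["per:city_of_birth", "per:employee_of"]
nodes = ["m0", "m1", "m2", "m3"]                   # candidate slot-filler mentions
edges = {(0, 1): 1.0, (1, 2): 0.8, (2, 3): 0.5}    # similarity-weighted edges

# Symmetric weight matrix W and row-normalised transition matrix P.
n = len(nodes)
W = np.zeros((n, n))
for (i, j), w in edges.items():
    W[i, j] = W[j, i] = w
P = W / W.sum(axis=1, keepdims=True)

# Seed labels: m0 is a known city_of_birth filler, m3 a known employee_of filler.
Y = np.zeros((n, len(labels)))
Y[0, 0] = 1.0
Y[3, 1] = 1.0
seeds = [0, 3]

F = Y.copy()
for _ in range(50):                    # propagate until (roughly) stable
    F = P @ F
    F[seeds] = Y[seeds]                # clamp the seeds to their known labels

for i, node in enumerate(nodes):
    print(node, dict(zip(labels, F[i].round(2))))
```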

    Effective distant supervision for end-to-end knowledge base population systems

    The growing amounts of textual data require automatic methods for structuring relevant information so that it can be further processed by computers and systematically accessed by humans. The scenario dealt with in this dissertation is known as Knowledge Base Population (KBP), where relational information about entities is retrieved from a large text collection and stored in a database, structured according to a pre-specified schema. Most of the research in this dissertation is placed in the context of the KBP benchmark of the Text Analysis Conference (TAC KBP), which provides a test-bed to examine all steps in a complex end-to-end relation extraction setting. In this dissertation a new state of the art for the TAC KBP benchmark was achieved by focussing on the following research problems: (1) The KBP task was broken down into a modular pipeline of sub-problems, and the most pressing issues were identified and quantified at all steps. (2) The quality of semi-automatically generated training data was increased by developing noise-reduction methods, decreasing the influence of false-positive training examples. (3) A focus was laid on fine-grained entity type modelling, entity expansion, entity matching and tagging, to maintain as much recall as possible on the relational argument level. (4) A new set of effective methods for generating training data, encoding features and training relational classifiers was developed and compared with previous state-of-the-art methods.
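    A minimal sketch of the distant-supervision step with a crude noise filter is given below; the KB fact, sentences and trigger-word heuristic are illustrative assumptions and stand in for the dissertation's more elaborate noise-reduction methods.

```python
# Minimal sketch: distant supervision pairs KB facts with sentences that
# mention both arguments; a trigger-word filter discards likely
# false-positive training examples.
kb_facts = {("Barack Obama", "Honolulu"): "per:city_of_birth"}

sentences = [
    "Barack Obama was born in Honolulu .",
    "Barack Obama gave a speech in Honolulu yesterday .",   # noisy match
]

# Crude noise filter: keep a matched sentence only if it contains a trigger
# word associated with the relation (one of many possible heuristics).
triggers = {"per:city_of_birth": {"born"}}

training_examples = []
for (subj, obj), relation in kb_facts.items():
    for sent in sentences:
        if subj in sent and obj in sent:
            if triggers[relation] & set(sent.lower().split()):
                training_examples.append((sent, subj, obj, relation))

for ex in training_examples:
    print(ex)
```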
