3 research outputs found

    Slot Filling

    Get PDF
    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text, and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter on SF system performance. We contribute an analysis of typical SF recall loss, and find a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is of major concern for propagation. While there are some conflicts caused by a lack of sufficient disambiguating context—we explore adding additional contextual features to address this—many of these conflicts are caused by subtle annotation problems. We find that lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schema do not detail how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, dependent on differences in annotator world knowledge and thresholds on making probabilistic inference. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks

    Gestion de l'incertitude dans le processus d'extraction de connaissances à partir de textes

    Get PDF
    The increase of textual sources over the Web offers an opportunity for knowledge extraction and knowledge base creation. Recently, several research works on this topic have appeared or intensified. They generally highlight that to extract relevant and precise information from text, it is necessary to define a collaboration between linguistic approaches, e.g., to extract certain concepts regarding named entities, temporal and spatial aspects, and methods originating from the field of semantics' processing. Moreover, successful approaches also need to qualify and quantify the uncertainty present in the text. Finally, in order to be relevant in the context of the Web, the linguistic processing need to be consider several sources in different languages. This PhD thesis tackles this problematic in its entirety since our contributions cover the extraction, representation of uncertain knowledge as well as the visualization of generated graphs and their querying. This research work has been conducted within a CIFRE funding involving the Laboratoire d'Informatique Gaspard Monge (LIGM) of the Université Paris-Est Marne la Vallée and the GEOLSemantics start-up. It was leveraging from years of accumulated experience in natural language processing (GeolSemantics) and semantics processing (LIGM).In this context, our contributions are the following:- the integration of a qualifation of different forms of uncertainty, based on ontology processing, within the knowledge extraction processing,- the quantification of uncertainties based on a set of heuristics,- a representation, using RDF graphs, of the extracted knowledge and their uncertainties,- an evaluation and an analysis of the results obtained using our approachLa multiplication de sources textuelles sur le Web offre un champ pour l'extraction de connaissances depuis des textes et à la création de bases de connaissances. Dernièrement, de nombreux travaux dans ce domaine sont apparus ou se sont intensifiés. De ce fait, il est nécessaire de faire collaborer des approches linguistiques, pour extraire certains concepts relatifs aux entités nommées, aspects temporels et spatiaux, à des méthodes issues des traitements sémantiques afin de faire ressortir la pertinence et la précision de l'information véhiculée. Cependant, les imperfections liées au langage naturel doivent être gérées de manière efficace. Pour ce faire, nous proposons une méthode pour qualifier et quantifier l'incertitude des différentes portions des textes analysés. Enfin, pour présenter un intérêt à l'échelle du Web, les traitements linguistiques doivent être multisources et interlingue. Cette thèse s'inscrit dans la globalité de cette problématique, c'est-à-dire que nos contributions couvrent aussi bien les aspects extraction et représentation de connaissances incertaines que la visualisation des graphes générés et leur interrogation. Les travaux de recherche se sont déroulés dans le cadre d'une bourse CIFRE impliquant le Laboratoire d'Informatique Gaspard Monge (LIGM) de l'Université Paris-Est Marne la Vallée et la société GEOLSemantics. Nous nous appuyons sur une expérience cumulée de plusieurs années dans le monde de la linguistique (GEOLSemantics) et de la sémantique (LIGM).Dans ce contexte, nos contributions sont les suivantes :- participation au développement du système d'extraction de connaissances de GEOLSemantics, en particulier : (1) le développement d'une ontologie expressive pour la représentation des connaissances, (2) le développement d'un module de mise en cohérence, (3) le développement d'un outil visualisation graphique.- l'intégration de la qualification de différentes formes d'incertitude, au sein du processus d'extraction de connaissances à partir d'un texte,- la quantification des différentes formes d'incertitude identifiées ;- une représentation, à l'aide de graphes RDF, des connaissances et des incertitudes associées ;- une méthode d'interrogation SPARQL intégrant les différentes formes d'incertitude ;- une évaluation et une analyse des résultats obtenus avec notre approch
    corecore