
3 research outputs found

Units of measure identification in unstructured scientific documents in microbial risk in food

Author: Berrahou Soumia Lilia
Buche Patrice
Dibie Juliette
Roche Mathieu
Publication venue: CNIEL
Publication date: 01/07/2013

OBJECTIVE(S) A preliminary step in microbial risk assessment in food is to gather and capitalize experimental data. Data capitalization is a key issue in an overall decision support system that predicts microbial behavior [1]. In the framework of the French ANR project MAP'OPT (Equilibrium Gas Composition in Modified Atmosphere Packaging and Food Quality), the predictive modeling platform Sym'Previus (www.symprevius.org) should be able to propose a global approach to establish a scientifically sound method for choosing an appropriate modified atmosphere and associated packaging solution. Our work is part of this overall system and aims at semi-automatically extracting experimental data from unstructured scientific documents. These documents use natural language combined with domain-specific terminology, so extracting data from their free text, and therefore gathering and capitalizing it, is extremely time-consuming and tedious. Our work relies on the MAP'OPT-Onto ontology [4], which has been built as an extension of the ontology used in Sym'Previus by adding concepts about food packaging, quantity concepts and concepts managing units of measure. Experimental data are often expressed with concepts (e.g. packaging, permeability) or with a numerical value often followed by its unit of measure (e.g. 258 amol m-1 s-1 Pa-1). This paper deals with unit recognition, a known scientific challenge.

METHOD(S) Automatically extracting quantitative data is a painstaking process because units are written in different ways within documents. The same unit can appear in several forms: amol m-1 s-1 Pa-1 may be written as amol.m-1 .s-1 .Pa-1 or as amol/m/s/Pa. We focus on extracting and identifying these variant units, seen as synonyms, in order to iteratively enrich an ontology, which represents a predefined vocabulary used to annotate, capitalize and query experimental data extracted from texts [2]. Our work addresses unit extraction and identification from texts to enrich an ontology in a two-step approach. First, we use text-mining methods and supervised learning approaches to predict the relevant parts of the text where synonyms of units or new units occur. The second step consists in extracting specific strings representing units from the text segments found in the previous step. The extracted candidates are compared to units already present in the ontology using a new edit measure based on Damerau-Levenshtein [3].

RESULTS We ran experiments on 115 scientific documents (i.e. around 35,000 sentences) on food packaging. Each unit is recognized from a list of 211 units already defined in MAP'OPT-Onto. Our learning algorithms predict that almost 5,000 sentences contain units. This prediction is correct in 95.5% of cases. In the second step, we successfully extracted 38 terms as either synonyms or new units from the sentences selected in the first step. We can therefore propose an 18% enrichment of the pre-existing MAP'OPT-Onto ontology.
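The comparison step described in the abstract above can be illustrated with a small sketch. It is not the authors' measure: the abstract mentions a new edit measure based on Damerau-Levenshtein without defining it, so the code below uses the standard restricted Damerau-Levenshtein (optimal string alignment) distance instead. The function names and the normalization rules are assumptions made for illustration, and the rules only cover the three spellings quoted in the abstract.

```python
import re


def normalize_unit(text: str) -> str:
    """Map a few spelling variants of a unit to one space-separated form.

    Illustrative rules only (assumption): a symbol after "/" gets the
    exponent -1, and "." used as a separator becomes a space. Symbols
    that already carry an exponent after "/" are not handled.
    """
    parts = text.strip().split("/")
    if len(parts) > 1:
        parts = [parts[0]] + [p.strip() + "-1" for p in parts[1:]]
    s = " ".join(parts).replace(".", " ")
    return re.sub(r"\s+", " ", s).strip()


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def closest_unit(candidate: str, known_units: list[str]) -> tuple[str, int]:
    """Return the known unit nearest to the candidate and its distance."""
    norm = normalize_unit(candidate)
    scored = [(osa_distance(norm, normalize_unit(u)), u) for u in known_units]
    dist, unit = min(scored)
    return unit, dist
```

For instance, `closest_unit("amol/m/s/Pa", ["amol m-1 s-1 Pa-1"])` matches the candidate to the known unit at distance 0, because both strings normalize to the same form. In practice a distance threshold would decide whether a candidate is proposed as a synonym or as a new unit; that threshold is not given in the abstract.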

INRIA a CCSD electronic archive server

HAL Descartes

HAL-CIRAD

Hal-Diderot

An Ontological and Terminological Resource for n-ary Relation Annotation in Web Data Tables

Author: Buche Patrice
Dibie-Barthelemy Juliette
Ibanescu Liliana
Touhami Rim
Publication venue: Springer Science and Business Media LLC
Publication date: 18/10/2011

We propose, in this paper, a model for an Ontological and Terminological Resource (OTR) dedicated to the task of n-ary relation annotation in Web data tables. This task relies on the identification of the symbolic concepts and the quantities, defined in the OTR, which are represented in the tables' columns. We propose to guide the annotation by an OTR because it allows a separation between the terminological and conceptual components and allows dealing with abbreviations and synonyms which could denote the same concept in a multilingual context. The OTR is composed of a generic part, which represents the structure of the ontology dedicated to the task of n-ary relation annotation in data tables for any application, and of a specific part, which represents a particular domain of interest. We present the model of our OTR and its use in an existing method for semantic annotation and querying of Web tables.
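The separation between terminological and conceptual components described in the abstract above can be pictured with a minimal sketch. It is an assumption-laden illustration, not the authors' OTR model: the class names, the `labels` layout and the example terms are all hypothetical. Concepts carry an identifier, and the terms that denote them (synonyms, abbreviations, other languages) live in a separate lookup structure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """Conceptual component: an identifier plus its attached terms."""
    concept_id: str
    # language code -> terms (preferred label, synonyms, abbreviations)
    labels: dict[str, list[str]] = field(default_factory=dict)


class TerminologyIndex:
    """Terminological component: maps (language, term) to a concept id."""

    def __init__(self, concepts: list[Concept]) -> None:
        self._by_term: dict[tuple[str, str], str] = {}
        for concept in concepts:
            for lang, terms in concept.labels.items():
                for term in terms:
                    self._by_term[(lang, term.casefold())] = concept.concept_id

    def lookup(self, term: str, lang: str) -> Optional[str]:
        """Return the concept id denoted by the term, or None if unknown."""
        return self._by_term.get((lang, term.casefold()))


# Made-up example entry, for illustration only.
index = TerminologyIndex([
    Concept("Temperature", {"en": ["temperature", "temp."], "fr": ["température"]}),
])
```

With this layout, a column header such as "Temp." in an English table and "température" in a French one both resolve to the same concept identifier, which is the behaviour the abstract attributes to the terminological/conceptual split.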

INRIA a CCSD electronic archive server

HAL Descartes

HAL-CIRAD

Hal-Diderot

Extraction de relations en domaine de spécialité (Relation extraction in specialized domains)

Author: GRAU Brigitte
MINARD Anne-Lyse
Publication date: 01/01/2012

The amount of available scientific literature is constantly growing. If the experts of a domain are to access this information easily, it must be extracted and structured. To obtain structured data, both the entities and the relations in the texts must be detected. Our research addresses the extraction of complex relations that represent experimental results, and the detection and classification of binary relations between biomedical entities. We are interested in experimental results presented in scientific papers. An experimental result is a quantitative result obtained from an experiment and linked with the information that describes this experiment. These results are important for biology experts, for example for modeling. In the domain of renal physiology, a database was created to centralize these experimental results, but it is populated manually, which takes a long time. We propose a solution to automatically extract from scientific papers the knowledge relevant to the database, that is, experimental results, which we represent as an n-ary relation. The method proceeds in two steps: automatic extraction from documents, then proposal of the extracted information for approval or modification by the expert via an interface.

We also proposed a method based on machine learning for the extraction and classification of binary relations in specialized domains. We focused on the characteristics and the variety of ways in which relations are expressed, and on how to take them into account in a machine learning system. We studied how to take into account the syntactic structure of the sentence and sentence simplification guided by the relation extraction task. In particular, we developed a simplification method based on machine learning, which uses a cascade of several classifiers.
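The two-step method in the abstract above (automatic extraction, then approval or modification by an expert) can be sketched as a small data structure. This is a hypothetical illustration, not the thesis's implementation: the field names, the status values and the `review` helper are assumptions, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ExperimentalResult:
    """An n-ary relation: a quantitative value plus experiment descriptors."""
    value: float
    unit: str
    # descriptors of the experiment (hypothetical argument names)
    arguments: dict = field(default_factory=dict)
    status: str = "proposed"  # set by the automatic extraction step


def review(result: ExperimentalResult, accept: bool,
           corrections: Optional[dict] = None) -> ExperimentalResult:
    """Second step: the expert validates, modifies or rejects a proposal."""
    if not accept:
        return replace(result, status="rejected")
    if corrections:
        # corrections may override value, unit or arguments
        return replace(result, **corrections, status="modified")
    return replace(result, status="validated")


# Placeholder proposal, as the extraction step might emit it.
proposal = ExperimentalResult(1.5, "placeholder-unit", {"species": "placeholder"})
```

Calling `review(proposal, accept=True, corrections={"value": 2.0})` returns a new record with the corrected value and status `"modified"`, while the original proposal stays unchanged, which keeps a trace of what the automatic step produced.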

OpenGrey Repository