16 research outputs found

    Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach

    International audience. Background: Bacteria biotopes cover a wide range of diverse habitats, including animal and plant hosts and natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. It is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.

    Deep analysis in IQA: evaluation on real users dialogues.

    No full text
    Interactive Question Answering (IQA) is a natural and cohesive way for a user to obtain information by interacting with a system using natural language. With the advancement in Natural Language Processing, research in the field of IQA has started to focus on the role of semantics and discourse structure in these systems. The need for a deeper analysis, which examines the syntax and semantics of the questions and the answers, is evident. Using this deeper analysis allows us to model the context of the interaction. I will look at a current closed-domain IQA system which is based on Linear Regression modeling. This system uses superficial and non-semantically motivated features. I propose adding deep analysis and semantic features in order to improve the system and show the need for such analysis. Particular attention will be placed on the so-called follow-up questions (questions that the user poses after having received some answer from the system) and the role of context. I propose that adding the linguistically heavy features will prove beneficial, thereby showing the need for such analysis in IQA systems.

    Analyse prédicative pour l'extraction d'information : application au domaine de la biologie

    No full text
    The abundance of biomedical information expressed in natural language has resulted in the need for methods to process this information automatically. In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on the extraction of relevant information from unstructured data in natural language. Many IE methods today focus on Machine Learning (ML) approaches that rely on deep linguistic processing in order to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE by helping capture how words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It focuses on a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. In order to achieve high-quality syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and the use of complementary linguistic processing (such as the use of semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and how the interface between these two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using already existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.

    Analyse prédicative pour l'extraction d'information : application au domaine de la biologie

    No full text
    The abundance of biomedical information expressed in natural language has resulted in the need for methods to process this information automatically. In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on the extraction of relevant information from unstructured data in natural language. Many IE methods today focus on Machine Learning (ML) approaches that rely on deep linguistic processing in order to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE by helping capture how words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It focuses on a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. In order to achieve high-quality syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and the use of complementary linguistic processing (such as the use of semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and how the interface between these two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using already existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.

    The thesis falls within the context described previously: it explores techniques for acquiring lexical knowledge from texts, for both theoretical and applied purposes. The analysis focuses in particular on the verbal predicate and its nominalizations, since the predicate plays an essential role in NLP applications (event detection, information extraction, etc.). It addresses, for example, the acquisition of subcategorization frames and selectional restrictions in order to identify families of verbs with similar syntactic and semantic behavior. The envisaged strategy is strongly inspired by the work of Z. Harris and his colleagues (Harris 1951, 1988; Harris et al., 1989). Harris showed that technical texts do not use the full complexity of the language but instead make use of "sublanguages". A sublanguage has a specialized vocabulary and a simplified syntax compared with everyday language. Specialized texts therefore exhibit regularities that can be studied through distributional analysis (put simply: elements appearing in similar contexts have similar, or at least close, meanings). However, distributional analysis can only work if the text has been "cleaned" of surface linguistic variation. A pre-analysis of the texts is therefore crucial.

    Deep analysis in IQA: evaluation on real users dialogues.

    Interactive Question Answering (IQA) is a natural and cohesive way for a user to obtain information by interacting with a system using natural language. With the advancement in Natural Language Processing, research in the field of IQA has started to focus on the role of semantics and discourse structure in these systems. The need for a deeper analysis, which examines the syntax and semantics of the questions and the answers, is evident. Using this deeper analysis allows us to model the context of the interaction. I will look at a current closed-domain IQA system which is based on Linear Regression modeling. This system uses superficial and non-semantically motivated features. I propose adding deep analysis and semantic features in order to improve the system and show the need for such analysis. Particular attention will be placed on the so-called follow-up questions (questions that the user poses after having received some answer from the system) and the role of context. I propose that adding the linguistically heavy features will prove beneficial, thereby showing the need for such analysis in IQA systems.

    Improving term extraction with linguistic analysis in the biomedical domain

    No full text
    As of 31/01/2014, this publication is a "draft version". International audience. This paper presents a linguistic-based approach to term extraction in the biomedical domain. The method is based on a linguistic analysis of constraints on terms and their context, focusing on participles and prepositional complements. The purpose of our approach is to obtain terms that are relevant for knowledge acquisition applications, such as the creation and enrichment of terminologies and ontologies. We report on the evaluations conducted following two complementary strategies, using a reference terminology and a manual validation. They were applied to two corpora of differing genre and domain, namely pharmacology patents and animal physiology scientific articles. Our work shows that the linguistic analysis-based developments significantly improve extraction results. The method is especially efficient when dealing with gerunds and "to" prepositional modifiers.

    Improving term extraction with linguistic analysis in the biomedical domain

    No full text
    International audience. This paper presents a linguistic-based approach to term extraction in the biomedical domain. The method is based on a linguistic analysis of constraints on terms and their context, focusing on participles and prepositional complements. The purpose of our approach is to obtain terms that are relevant for knowledge acquisition applications, such as the creation and enrichment of terminologies and ontologies. We report on the evaluations conducted following two complementary strategies, using a reference terminology and a manual validation. They were applied to two corpora of differing genre and domain, namely pharmacology patents and animal physiology scientific articles. Our work shows that the linguistic analysis-based developments significantly improve extraction results. The method is especially efficient when dealing with gerunds and "to" prepositional modifiers.

    Improving term extraction with linguistic analysis in the biomedical domain

    No full text
    This paper presents a linguistic-based approach to term extraction in the biomedical domain. The method is based on a linguistic analysis of constraints on terms and their context, focusing on participles and prepositional complements. The purpose of our approach is to obtain terms that are relevant for knowledge acquisition applications, such as the creation and enrichment of terminologies and ontologies. We report on the evaluations conducted following two complementary strategies, using a reference terminology and a manual validation. They were applied to two corpora of differing genre and domain, namely pharmacology patents and animal physiology scientific articles. Our work shows that the linguistic analysis-based developments significantly improve extraction results. The method is especially efficient when dealing with gerunds and "to" prepositional modifiers.