16 research outputs found
Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach
Background: Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.
Deep analysis in IQA: evaluation on real users dialogues.
Interactive Question Answering (IQA) is a natural and cohesive way for a user to obtain information by interacting with a system using natural language. With the advancement of Natural Language Processing, research in the field of IQA has started to focus on the role of semantics and discourse structure in these systems. The need for a deeper analysis, which examines the syntax and semantics of the questions and the answers, is evident. Using this deeper analysis allows us to model the context of the interaction. I will look at a current closed-domain IQA system which is based on Linear Regression modeling. This system uses superficial and non-semantically motivated features. I propose adding deep analysis and semantic features in order to improve the system and show the need for such analysis. Particular attention will be placed on the so-called follow-up questions (questions that the user poses after having received some answer from the system) and the role of context. I propose that adding the linguistically heavy features will prove beneficial, thereby showing the need for such analysis in IQA systems.
Analyse prédicative pour l'extraction d'information : application au domaine de la biologie
The abundance of biomedical information expressed in natural language has resulted in the need for methods to process this information automatically. In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on the extraction of relevant information from unstructured data in natural language. Many IE methods today focus on Machine Learning (ML) approaches that rely on deep linguistic processing in order to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE, by helping capture how words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It focuses on a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. In order to achieve high-quality syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and the use of complementary linguistic processing (such as the use of semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and how the interface between these two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using already existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.
Analyse prédicative pour l'extraction d'information : application au domaine de la biologie
The thesis is set in the context described above: it explores techniques for acquiring lexical knowledge from texts, for both theoretical and applied purposes. The analysis focuses in particular on the verbal predicate and its nominalizations, since the predicate plays an essential role in NLP applications (event detection, information extraction, etc.). For example, we will look at the acquisition of subcategorization frames and selectional restrictions in order to identify families of verbs with similar syntactic and semantic behavior. The envisaged strategy is strongly inspired by the work of Z. Harris and his colleagues (Harris 1951, 1988; Harris et al., 1989). Harris showed that technical texts do not use the full complexity of the language but instead make use of "sublanguages". A sublanguage has a specialized vocabulary and a simplified syntax compared with general language. Specialized texts therefore exhibit regularities that can be analyzed through distributional analysis (put simply: elements that appear in similar contexts have similar, or at least related, meanings). However, distributional analysis can only work if the text has been "cleaned" of surface linguistic variation. A pre-analysis of the texts is therefore crucial.
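The distributional idea mentioned above (elements appearing in similar contexts have similar meanings) can be illustrated with a minimal sketch. This is not the method used in the thesis: the toy corpus, the whitespace tokenization, the window size and the helper names `context_vectors` and `cosine` are all assumptions made for illustration only, and no cleaning of surface variation is attempted.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization (assumption)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus (hypothetical data, not taken from any corpus in the thesis).
TOY_CORPUS = [
    "the bacterium colonizes the gut",
    "the bacterium infects the gut",
    "the bacterium colonizes the root",
    "the bacterium infects the root",
    "the protein binds the receptor",
]

vectors = context_vectors(TOY_CORPUS)
# Two verbs that occur in identical contexts in the toy corpus.
sim_close = cosine(vectors["colonizes"], vectors["infects"])
# A verb and a noun whose contexts share only the function word "the".
sim_far = cosine(vectors["colonizes"], vectors["receptor"])
print(sim_close > sim_far)
```

In this toy setting the two verbs receive a higher similarity than the verb and the unrelated noun. On real text, surface variation (inflection, passives, nominalizations) would scatter these counts across many forms, which is one way to read the remark that a pre-analysis of the texts is needed first.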
Improving term extraction with linguistic analysis in the biomedical domain
As of 31/01/2014, this publication is a draft version. This paper presents a linguistics-based approach to term extraction in the biomedical domain. The method is based on a linguistic analysis of constraints on terms and their context, focusing on participles and prepositional complements. The purpose of our approach is to obtain terms that are relevant for knowledge acquisition applications, such as the creation and enrichment of terminologies and ontologies. We report on the evaluations conducted following two complementary strategies, using a reference terminology and a manual validation. They were applied to two corpora of differing genre and domain, namely pharmacology patents and animal physiology scientific articles. Our work shows that the linguistic analysis-based developments significantly improve extraction results. The method is especially efficient when dealing with gerunds and "to" prepositional modifiers.