28 research outputs found
Sentence-based information retrieval for biomedical texts
Master's in Computer and Telematics Engineering
Modern advances in experimental methods and high-throughput technology
in the biomedical domain are causing a fast-paced, rising growth of the volume
of published scientific literature in the field. While a myriad of structured
data repositories for biological knowledge have been sprouting over the last
decades, Information Retrieval (IR) systems are increasingly replacing them.
IR systems are easier to use due to their flexibility and ability to interpret user
needs in the form of queries, typically formed by a few words.
Traditional document retrieval systems return entire documents, which may
require a lot of subsequent reading to find the specific information sought, frequently
contained in a small passage of only a few sentences. Additionally, IR
often fails to find what is wanted because the words used in the query are lexically
different from, though semantically aligned with, the words used in relevant
sources.
This thesis focuses on the development of sentence-based information retrieval
approaches that, for a given user query, allow seeking relevant sentences
from the scientific literature that answer the user's information need. The
presented work is two-fold. First, exploratory research experiments were conducted
for the identification of features of informative sentences from biomedical
texts. A supervised machine learning method was used for this purpose.
Second, an information retrieval system for informative sentences was developed.
It supports free-text and concept-based queries; search results are enriched
with relevant concept annotations, and sentences can be ranked using
multiple configurable strategies.
Agile in-litero experiments: how can semi-automated information extraction from neuroscientific literature help neuroscience model building?
In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles in peer-reviewed journals. One challenge for modern neuroinformatics is to design methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and integration into computational models. In this thesis, we introduce novel natural language processing (NLP) models and systems to mine the neuroscientific literature. By analogy with in vivo, in vitro and in silico experiments, we call the NLP methods developed in this thesis in litero experiments; they aim at analyzing and making accessible the extended body of neuroscientific literature. In particular, we focus on two important neuroscientific entities: brain regions and neural cells. An integrated NLP model is designed to automatically extract brain region connectivity statements from very large corpora. This system is applied to a large corpus of 25M PubMed abstracts and 600K full-text articles. Central to this system is the creation of a searchable database of brain region connectivity statements, allowing neuroscientists to gain an overview of all brain regions connected to a given region of interest. More importantly, the database enables researchers to provide feedback on connectivity results and links back to the original article sentence to provide the relevant context. The database is evaluated by neuroanatomists on real connectomics tasks (targets of Nucleus Accumbens) and results in a significant effort reduction in comparison to previous manual methods (from 1 week to 2h). Subsequently, we introduce neuroNER to identify, normalize and compare instances of neurons in the scientific literature. Our method relies on identifying and analyzing each of the domain features used to annotate a specific neuron mention, like the morphological term 'basket' or brain region 'hippocampus'. 
We apply our method to the same corpus of 25M PubMed abstracts and 600K full-text articles and find over 500K unique neuron type mentions. To demonstrate the utility of our approach, we also apply our method towards cross-comparing the NeuroLex and Human Brain Project (HBP) cell type ontologies. By decoupling a neuron mention's identity into its specific compositional features, our method can successfully identify specific neuron types even if they are not explicitly listed within a predefined neuron type lexicon, thus greatly facilitating cross-laboratory studies. In order to build such large databases, several tools and infrastructures for large-scale NLP were developed: a robust pipeline to preprocess full-text PDF articles, as well as bluima, an NLP processing pipeline specialized for neuroscience to perform text mining at PubMed scale. During the development of those two NLP systems, we recognized the need for novel NLP approaches to rapidly develop custom text mining solutions. This led to the formalization of the agile text mining methodology to improve the communication and collaboration between subject matter experts and text miners. Agile text mining is characterized by short development cycles, frequent task redefinition and continuous performance monitoring through integration tests. To support our approach, we developed Sherlok, an NLP framework designed for the development of agile text mining applications.
The BioLexicon: a Large-Scale Domain-Specific Lexical Resource for Biomedical Text Mining
The talk will focus on building a biolexicon by leveraging existing bio-resources, combining them within a common, standardized lexical and terminological framework, and employing advanced NL technologies to discover new terms, concepts, relations and linguistic lexical information from text.
Clinical Natural Language Processing in languages other than English: opportunities and challenges
Background: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
Modelling characters’ mental depth in stories told by children aged 4-10
From age 3-4, children are generally capable of telling stories about a topic of their own choice. Over the years their stories become more sophisticated in content and structure, reflecting various aspects of cognitive development. Here we focus on children’s ability to construe characters with increasing levels of mental depth, arguably reflecting socio-cognitive capacities including Theory of Mind. Within our sample of 51 stories told by children aged 4-10, characters range from flat “actors” performing simple actions, to “agents” having basic perceptive, emotional, and intentional capacities, to full-blown “persons” with complex inner lives. We argue for the underexplored potential of computationally extracted story-internal factors (e.g. lexical/syntactic complexity) in explaining variance in character depth, as opposed to story-external factors (e.g. age, socioeconomic status) on which existing work has focused. We show that lexical richness in particular explains variance in character depth, and that this effect is larger than and not moderated by age.
NWO VI.Veni.191C.051
Computer Systems, Imagery and Medi
Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference
No abstract available