Mining clinical attributes of genomic variants through assisted literature curation in Egas
The veritable deluge of biological data over recent years has led to the establishment of a considerable number of knowledge resources that compile curated information extracted from the literature and store it in structured form, facilitating its use and exploitation. In this article, we focus on the curation of inherited genetic variants and associated clinical attributes, such as zygosity, penetrance or inheritance mode, and describe the use of Egas for this task. Egas is a web-based platform for text-mining assisted literature curation that focuses on usability through modern design solutions and simple user interactions. Egas offers a flexible and customizable tool that allows defining the concept types and relations of interest for a given annotation task, as well as the ontologies used for normalizing each concept type. Further, annotations may be performed on raw documents or on the results of automated concept identification and relation extraction tools. Users can inspect, correct or remove automatic text-mining results, manually add new annotations, and export the results to standard formats. Egas is compatible with the most recent versions of Google Chrome, Mozilla Firefox, Internet Explorer and Safari and is available for use at https://demo.bmd-software.com/egas/
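As a concrete illustration of the kind of record curated here, a variant annotation bundles the variant mention with its clinical attributes (zygosity, penetrance, inheritance mode) and a normalized gene identifier. The sketch below is hypothetical and does not reflect Egas's actual data model or export schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class VariantAnnotation:
    """One curated genomic-variant record with clinical attributes.

    Field names are illustrative only; they are not Egas's export schema.
    """
    variant: str                        # HGVS-style mention as it appears in text
    gene: str                           # normalized gene symbol
    zygosity: str = "unknown"           # e.g. "heterozygous", "homozygous"
    inheritance_mode: str = "unknown"   # e.g. "autosomal dominant"
    penetrance: str = "unknown"         # e.g. "complete", "incomplete"
    pmid: str = ""                      # source article identifier

record = VariantAnnotation(
    variant="c.35G>T", gene="KRAS",
    zygosity="heterozygous", inheritance_mode="autosomal dominant",
    pmid="12345678")

# Serialize to a structured form suitable for export
print(json.dumps(asdict(record), indent=2))
```

Exporting such records as JSON (or another standard format) is what makes the curated information directly reusable by downstream knowledge resources.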
Collaborative platform for biomedical literature annotation
Master's in Computer and Telematics Engineering
With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships in structured resources. Since manual annotation is a very demanding and expensive task, computerized solutions have been developed to perform such tasks automatically. Nevertheless, high-end information extraction techniques are still not widely used by biomedical research communities, mainly due to the lack of standards and to limitations in usability. Interactive annotation tools aim to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks.
This thesis presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic inline annotation of concepts and relations. Furthermore, a comprehensive set of knowledge bases is integrated and indexed to provide straightforward concept normalization features. Curators can also rely on real-time collaboration and conversation functionalities, allowing them to discuss details of the annotation task and to receive instant feedback on each other's interactions. Egas also provides interfaces for on-demand management of annotation task settings and guidelines, and supports standard formats and literature services for importing and exporting documents. Using Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein-protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, Egas obtained very positive scores in terms of usability, reliability and performance. These results, together with its innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases in a more consistent way.
Overview of the interactive task in BioCreative V
Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level of participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level of participation was designed to focus on usability aspects of the interface rather than performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and were then asked whether they were able to achieve each task and how difficult it was to complete. In this manuscript, we describe the development of the interactive task, from planning to execution, and discuss major findings for the systems tested.
A Roadmap for Natural Language Processing Research in Information Systems
Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although many NLP studies have been published, none have comprehensively reviewed or synthesized the tasks most commonly addressed in NLP research. We conduct a thorough review of Information Systems (IS) literature to assess the current state of NLP research, and identify 12 prototypical tasks that are widely researched. Our analysis of 238 articles in IS journals between 2004 and 2015 shows an increasing trend in NLP research, especially since 2011. Based on our analysis, we propose a roadmap for NLP research and detail how it may be useful in guiding future NLP research in IS. In addition, we employ Association Rules (AR) mining for data analysis to investigate the co-occurrence of prototypical tasks and discuss insights from the findings.
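The Association Rules mining used for the co-occurrence analysis rests on two quantities, support and confidence, computed over the set of tasks addressed in each article. A minimal sketch with invented toy data (not the study's 238-article corpus):

```python
# Toy "transactions": the NLP tasks co-addressed within each surveyed article.
# The task sets below are invented for illustration.
articles = [
    {"NER", "classification"},
    {"NER", "relation extraction"},
    {"NER", "classification", "sentiment"},
    {"classification", "sentiment"},
    {"NER", "relation extraction", "classification"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): support of the union over support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"NER"}, articles))                         # 4 of 5 articles -> 0.8
print(confidence({"NER"}, {"classification"}, articles))  # 3 of 4 NER articles -> 0.75
```

A rule such as NER ⇒ classification is reported when both quantities exceed chosen thresholds; full AR miners (e.g. Apriori) simply enumerate candidate itemsets efficiently.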
Biomedical information mining from scientific literature
Joint Doctoral Programme MAP-i
The rapid evolution and proliferation of a world-wide computerized network, the Internet, resulted in an overwhelming and constantly growing amount of publicly available data and information, a fact that was also verified in biomedicine. However, the lack of structure of textual data inhibits its direct processing by computational solutions. Information extraction is the task of text mining that intends to automatically collect information from unstructured text data sources. The goal of the work described in this thesis was to build innovative solutions for biomedical information extraction from scientific literature, through the development of simple software artifacts for developers and biocurators, delivering more accurate, usable and faster results. We started by tackling named entity recognition - a crucial initial task - with the development of Gimli, a machine-learning-based solution that follows an incremental approach to optimize the linguistic characteristics extracted for each concept type. Afterwards, Totum was built to harmonize concept names provided by heterogeneous systems, delivering a robust solution with improved performance results. This approach takes advantage of heterogeneous corpora to deliver cross-corpus harmonization that is not constrained to the characteristics of a single corpus. Since previous solutions do not provide links to knowledge bases, Neji was built to streamline the development of complex and custom solutions for biomedical concept name recognition and normalization. This was achieved through a modular and flexible framework focused on speed and performance, integrating a large number of processing modules optimized for the biomedical domain. To offer on-demand heterogeneous biomedical concept identification, we developed BeCAS, a web application, service and widget. We also tackled relation mining by developing TrigNER, a machine-learning-based solution for biomedical event trigger recognition, which applies an automatic algorithm to obtain the best linguistic features and model parameters for each event type. Finally, in order to assist biocurators, Egas was developed to support rapid, interactive and real-time collaborative curation of biomedical documents, through manual and automatic in-line annotation of concepts and relations. Overall, the research work presented in this thesis contributed to a more accurate update of current biomedical knowledge bases, towards
improved hypothesis generation and knowledge discovery.
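The concept recognition and normalization that Neji and BeCAS provide can be illustrated, in a greatly simplified form, by dictionary matching: locate known surface forms in text and map each hit to an ontology identifier. This sketch is not Neji's implementation, and the lexicon is a toy example:

```python
import re

# Tiny illustrative lexicon mapping surface forms to ontology identifiers
# (a real system would index full ontologies such as NCBI Gene or DO).
LEXICON = {
    "BRCA1": "NCBIGene:672",
    "breast cancer": "DOID:1612",
    "p53": "NCBIGene:7157",
}

def recognize(text, lexicon):
    """Return (start, end, mention, concept_id) tuples for lexicon hits."""
    hits = []
    for surface, concept_id in lexicon.items():
        for m in re.finditer(re.escape(surface), text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), concept_id))
    return sorted(hits)  # order hits by position in the text

text = "Mutations in BRCA1 increase the risk of breast cancer."
for start, end, mention, cid in recognize(text, LEXICON):
    print(f"{start}-{end}\t{mention}\t{cid}")
```

Real recognizers additionally handle tokenization, abbreviations, overlapping and nested mentions, and disambiguation between candidate identifiers, which is where most of the engineering effort lies.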
Automatic and interactive annotation of PDF documents
Master's in Computer and Telematics Engineering
The accelerated increase of the biomedical literature has led to various efforts to extract and store, in a structured way, the information related to the concepts and relations present in those texts, providing investigators and researchers with fast and easy access to knowledge. However, this process of "knowledge curation" is an extremely exhaustive task, increasingly demanding the application of automatic annotation tools that make use of text mining techniques. Even though complete annotation systems already exist and produce high-performance results, they are not widely used by the biomedical community, mainly because of their complexity and some limitations in usability. On the other hand, the PDF has become in recent years one of the most popular formats for publishing and sharing documents, because it can be displayed in exactly the same way independently of the system or platform where it is accessed. The majority of annotation tools were mainly designed to extract information from raw text, although a large part of the biomedical literature is published and distributed in PDF, and thus information extraction from PDF documents should be a focus point for the biomedical text mining community.
The objective of the work described in this document is the extension of the Neji framework, allowing the processing of documents in PDF format, and the integration of these features into the Egas platform, allowing a user to simultaneously visualize the original article in PDF format and its extracted text.
The improved and developed systems present good performance results, both in terms of processing speed and representation of the information, also contributing to a better user experience. Besides that, they present several advantages for the biomedical community and curators, allowing the direct annotation of PDF articles and simplifying the use and configuration of these annotation systems by researchers.
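Annotating extracted PDF text in parallel with the original layout typically relies on standoff annotations keyed to character offsets in the extracted text, which a viewer can then map back to PDF coordinates. A minimal sketch assuming BRAT-style standoff lines (not the actual format used by Neji or Egas):

```python
def to_standoff(text, mention, concept_type):
    """Locate every occurrence of `mention` in the extracted text and
    emit BRAT-style standoff lines: Tn<TAB>TYPE start end<TAB>mention.
    A real system would carry these offsets back to PDF coordinates.
    """
    lines, start, n = [], 0, 0
    while (i := text.find(mention, start)) != -1:
        n += 1
        lines.append(f"T{n}\t{concept_type} {i} {i + len(mention)}\t{mention}")
        start = i + len(mention)
    return lines

extracted = "Egas supports annotation of BRCA1 and other genes."
print("\n".join(to_standoff(extracted, "BRCA1", "Gene")))
```

Because the annotation stores offsets rather than the text itself, the same record can drive highlighting in both the plain-text view and the PDF view, provided the extraction step records a character-to-position mapping.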
Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing
The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected at clinical phases. It is critical to accurately estimate the PK parameters of new drugs before they are tested in humans, since these parameters will determine their efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently. Developing text mining systems that can structure PK literature is critical to improving the drug development pipeline.
This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, the development of novel corpora and suitable natural language processing architectures in the PK domain were addressed. The work presented focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units, and contextual information reported across the literature. Additionally, architectures and training approaches that could efficiently deal with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. Finally, the last chapter of this thesis studied the feasibility of the whole extraction pipeline to accelerate tasks in drug development research.
The results from this thesis exhibited the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis presented contributions to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.
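Recognizing PK parameter mentions together with their numerical measurements and units can be sketched with a simple pattern-based baseline. The thesis itself uses machine-learning models; the regular expression and unit list below are illustrative assumptions, not the actual system:

```python
import re

# Illustrative pattern for parameter-value-unit triples; real PK text is
# far more varied than this sketch covers.
PATTERN = re.compile(
    r"(?P<param>clearance|half-life|AUC|Cmax|volume of distribution)"
    r"\D{0,30}?"                        # tolerate a short non-numeric gap, e.g. "was"
    r"(?P<value>\d+(?:\.\d+)?)\s*"      # the estimated measurement
    r"(?P<unit>mL/min|L/h|h|ng/mL|L/kg)",
    flags=re.IGNORECASE)

sentence = ("After oral dosing, clearance was 5.2 mL/min and the "
            "half-life was approximately 3.1 h in healthy volunteers.")

for m in PATTERN.finditer(sentence):
    print(m.group("param"), m.group("value"), m.group("unit"))
```

A baseline like this quickly breaks on ranges, confidence intervals, and unit variants, which is precisely why the thesis turns to learned models for mention recognition and relation extraction.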
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Development of methods for Omics Network inference and analysis and their application to disease modeling
With the advent of Next Generation Sequencing (NGS) technologies and the emergence of large publicly available genomics data comes an unprecedented opportunity to model biological networks through a holistic lens using a systems-based approach. Networks provide a mathematical framework for representing biological phenomena that go beyond standard one-gene-at-a-time analyses. Networks can model system-level patterns and the molecular rewiring (i.e. changes in connectivity) occurring in response to perturbations or between distinct phenotypic groups or cell types. This in turn supports the identification of putative mechanisms of action of the biological processes under study, and thus has the potential to advance prevention and therapy. However, there are major challenges faced by researchers. Inference of biological network structures is often performed on high-dimensional data, yet is hindered by the limited sample size of high-throughput omics data. Furthermore, modeling biological networks involves complex analyses capable of integrating multiple sources of omics layers and summarizing large amounts of information.
My dissertation aims to address these challenges by presenting new approaches for high-dimensional network inference with limited sample sizes, as well as methods and tools for integrated network analysis applied to multiple research domains in cancer genomics. First, I introduce a novel method for reconstructing gene regulatory networks called SHINE (Structure Learning for Hierarchical Networks) and present an evaluation on simulated and real datasets, including a Pan-Cancer analysis using The Cancer Genome Atlas (TCGA) data. Next, I summarize the challenges of executing and managing data processing workflows for large omics datasets on high-performance computing environments and present multiple strategies for using Nextflow for reproducible scientific workflows, including shine-nf, a collection of Nextflow modules for structure learning. Lastly, I introduce the methods, objects, and tools developed for the analysis of biological networks used throughout my dissertation work. Together, these contributions were used in focused analyses aimed at understanding the molecular mechanisms of tumor maintenance and progression in subtype networks of Breast Cancer and Head and Neck Squamous Cell Carcinoma.
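Network inference from expression data can be illustrated, in its simplest form, by connecting gene pairs whose expression profiles correlate above a threshold. SHINE uses hierarchical structure learning rather than this naive approach; the expression values below are invented toy data:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression matrix: gene -> values across 5 samples (invented data).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],   # closely tracks geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated to both
}

# Keep an edge wherever |correlation| passes the threshold.
threshold = 0.8
edges = [(g1, g2) for g1, g2 in combinations(expr, 2)
         if abs(pearson(expr[g1], expr[g2])) >= threshold]
print(edges)
```

Correlation networks like this cannot distinguish direct from indirect associations, which is one motivation for the structure-learning approaches (estimating conditional independence) that the dissertation develops.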