18 research outputs found

    Mining clinical attributes of genomic variants through assisted literature curation in Egas

    The veritable deluge of biological data in recent years has led to the establishment of a considerable number of knowledge resources that compile curated information extracted from the literature and store it in structured form, facilitating its use and exploitation. In this article, we focus on the curation of inherited genetic variants and associated clinical attributes, such as zygosity, penetrance, or inheritance mode, and describe the use of Egas for this task. Egas is a web-based platform for text-mining-assisted literature curation that focuses on usability through modern design solutions and simple user interactions. Egas offers a flexible and customizable tool that allows users to define the concept types and relations of interest for a given annotation task, as well as the ontologies used for normalizing each concept type. Further, annotations may be performed on raw documents or on the results of automated concept identification and relation extraction tools. Users can inspect, correct, or remove automatic text-mining results, manually add new annotations, and export the results to standard formats. Egas is compatible with the most recent versions of Google Chrome, Mozilla Firefox, Internet Explorer and Safari, and is available for use at https://demo.bmd-software.com/egas/
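As an illustration of exporting a curated annotation to a structured format, the sketch below bundles a variant annotation with its clinical attributes into a JSON record; the schema, field names, and identifiers are invented for this example and do not reflect Egas's actual export formats:

```python
import json

def export_annotation(doc_id, variant, attributes):
    """Bundle a curated variant annotation into a JSON-serializable record.

    Illustrative schema only; a real export would follow a standard
    interchange format rather than these invented field names.
    """
    return {
        "document": doc_id,
        "concept": {"type": "variant", "text": variant},
        # Clinical attributes, e.g. zygosity, penetrance, inheritance mode.
        "attributes": attributes,
    }

record = export_annotation(
    "PMID:12345678",  # hypothetical document identifier
    "NM_000546.6:c.215C>G",
    {"zygosity": "heterozygous", "inheritance": "autosomal dominant"},
)
print(json.dumps(record, indent=2))
```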

    Collaborative platform for biomedical literature annotation

    MSc in Computer and Telematics Engineering. With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract concepts and their relationships and store them in structured resources. Since manual annotation is a very demanding and expensive task, computerized solutions were developed to perform such tasks automatically. Nevertheless, high-end information extraction techniques are still not widely used by biomedical research communities, mainly due to the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This thesis presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic inline annotation of concepts and relations. Furthermore, a comprehensive set of knowledge bases is integrated and indexed to provide straightforward concept normalization features. Additionally, curators can rely on real-time collaboration and conversation functionalities, allowing them to discuss details of the annotation task and receive instant feedback on their interactions. Egas also provides interfaces for on-demand management of annotation task settings and guidelines, and supports standard formats and literature services for importing and exporting documents. Taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein-protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, Egas obtained very positive scores in terms of usability, reliability and performance. These results, together with the innovative features provided, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases in a more consistent way.

    Overview of the interactive task in BioCreative V

    Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review, but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. Full participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time on task, and completion of a user survey. Partial participation was designed to focus on usability aspects of the interface rather than on performance per se; in this case, biocurators navigated the system by performing pre-designed tasks and were then asked whether they were able to achieve each task and how difficult it was to complete. In this manuscript, we describe the development of the interactive task, from planning to execution, and discuss major findings for the systems tested.

    A Roadmap for Natural Language Processing Research in Information Systems

    Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although many NLP studies have been published, none have comprehensively reviewed or synthesized the tasks most commonly addressed in NLP research. We conduct a thorough review of Information Systems (IS) literature to assess the current state of NLP research and identify 12 prototypical tasks that are widely researched. Our analysis of 238 articles in IS journals between 2004 and 2015 shows an increasing trend in NLP research, especially since 2011. Based on our analysis, we propose a roadmap for NLP research and detail how it may be used to guide future NLP research in IS. In addition, we employ Association Rules (AR) mining to investigate the co-occurrence of prototypical tasks and discuss insights from the findings.
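Association Rules mining over task co-occurrence can be illustrated with a minimal sketch: each article is treated as a "transaction" of the tasks it addresses, and rules are kept when their support clears a threshold. The transactions and task names below are invented examples, not the 12 prototypical tasks identified in the study:

```python
from itertools import combinations
from collections import Counter

# Each transaction lists the NLP tasks addressed by one (fictional) article.
articles = [
    {"classification", "information extraction"},
    {"classification", "sentiment analysis"},
    {"classification", "information extraction", "summarization"},
    {"information extraction", "summarization"},
]

def association_rules(transactions, min_support=0.5):
    """Return (A, B, support, confidence) for co-occurring task pairs."""
    n = len(transactions)
    pair_counts = Counter()
    item_counts = Counter()
    for t in transactions:
        for item in t:
            item_counts[item] += 1
        for a, b in combinations(sorted(t), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                    # fraction of articles with both
        if support >= min_support:
            confidence = count / item_counts[a]  # P(B | A)
            rules.append((a, b, support, confidence))
    return rules

for a, b, sup, conf in association_rules(articles):
    print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```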

    Biomedical information mining from scientific literature

    Joint MAP-i doctoral programme. The rapid evolution and proliferation of a world-wide computerized network, the Internet, resulted in an overwhelming and constantly growing amount of publicly available data and information, a fact that has also been verified in biomedicine. However, the lack of structure of textual data inhibits its direct processing by computational solutions. Information extraction is the text mining task that intends to automatically collect information from unstructured text data sources. The goal of the work described in this thesis was to build innovative solutions for biomedical information extraction from scientific literature, through the development of simple software artifacts for developers and biocurators, delivering more accurate, usable and faster results. We started by tackling named entity recognition - a crucial initial task - with the development of Gimli, a machine-learning-based solution that follows an incremental approach to optimize the extracted linguistic characteristics for each concept type. Afterwards, Totum was built to harmonize concept names provided by heterogeneous systems, delivering a robust solution with improved performance results. This approach takes advantage of heterogeneous corpora to deliver cross-corpus harmonization that is not constrained to the characteristics of a single corpus. Since the previous solutions do not provide links to knowledge bases, Neji was built to streamline the development of complex and custom solutions for biomedical concept name recognition and normalization. This was achieved through a modular and flexible framework focused on speed and performance, integrating a large number of processing modules optimized for the biomedical domain. To offer on-demand heterogeneous biomedical concept identification, we developed BeCAS, a web application, service and widget. We also tackled relation mining by developing TrigNER, a machine-learning-based solution for biomedical event trigger recognition, which applies an automatic algorithm to obtain the best linguistic features and model parameters for each event type. Finally, in order to assist biocurators, Egas was developed to support rapid, interactive and real-time collaborative curation of biomedical documents, through manual and automatic in-line annotation of concepts and relations. Overall, the research work presented in this thesis contributed to a more accurate update of current biomedical knowledge bases, towards improved hypothesis generation and knowledge discovery.
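The modular pipeline design behind tools such as Neji can be sketched as a chain of processing modules operating on a shared document state - tokenization, concept recognition, then normalization against a knowledge base. The module interfaces, lexicon, and lookup tables below are illustrative stand-ins, not Neji's actual API:

```python
# Minimal pipeline-of-modules sketch: each module transforms a shared
# document dict and hands it to the next stage.

def tokenize(doc):
    # Whitespace tokenization; real pipelines use biomedical tokenizers.
    doc["tokens"] = doc["text"].split()
    return doc

def recognize_concepts(doc):
    # Toy dictionary lookup standing in for ML/dictionary-based recognition.
    lexicon = {"BRCA1": "GENE", "apoptosis": "PROCESS"}
    doc["concepts"] = [(t, lexicon[t]) for t in doc["tokens"] if t in lexicon]
    return doc

def normalize(doc):
    # Map recognized names to knowledge-base identifiers (shown for
    # illustration); unmapped concepts get None.
    ids = {"BRCA1": "HGNC:1100"}
    doc["normalized"] = [(t, ids.get(t)) for t, _ in doc["concepts"]]
    return doc

def run_pipeline(text, modules):
    doc = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc

doc = run_pipeline("BRCA1 regulates apoptosis",
                   [tokenize, recognize_concepts, normalize])
print(doc["concepts"])  # [('BRCA1', 'GENE'), ('apoptosis', 'PROCESS')]
```

Because each stage shares one simple contract (document in, document out), modules can be swapped or reordered independently, which is the essence of the modular design described above.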

    Automatic and interactive annotation of PDF documents

    MSc in Computer and Telematics Engineering. The accelerated growth of the biomedical literature has led to various efforts to extract and store, in a structured way, the information on the concepts and relations present in those texts, providing researchers and clinicians with fast and easy access to knowledge. However, this process of "knowledge curation" is an extremely exhaustive task, making the use of automatic annotation tools based on text mining techniques increasingly common. Even though complete, high-performing annotation systems already exist, they are not widely used by the biomedical community, mainly because of their complexity and their limitations in usability. On the other hand, PDF has become in recent years one of the most popular formats for publishing and sharing documents, because it can be displayed in exactly the same way regardless of the system or platform on which it is accessed. The majority of annotation tools were mainly designed to extract information from raw text, yet a large part of the biomedical literature is published and distributed in PDF, so information extraction from PDF documents should be a focus point for the biomedical text mining community. The objective of the work described in this document was the extension of the Neji framework to allow the processing of documents in PDF format, and the integration of these features into the Egas platform, allowing a user to simultaneously visualize and annotate the original article in PDF format and the text extracted from it. The extended systems present good performance results, both in terms of processing speed and representation of the information, which also contributes to a better user experience. Besides that, they present several advantages for the text mining community and curators, allowing the direct annotation of articles in PDF format and simplifying the use and configuration of these annotation systems by researchers.

    Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing

    The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected at clinical phases. It is critical to accurately estimate the PK parameters of new drugs before they are tested in humans, since these parameters largely determine efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently, and developing text mining systems that can structure PK literature is critical to improving the drug development pipeline. This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, it addressed the development of novel corpora and suitable natural language processing architectures for the PK domain. The work focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units, and contextual information reported across the literature. Additionally, architectures and training approaches that can efficiently deal with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base, and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. The final chapter studies the feasibility of the whole extraction pipeline for accelerating tasks in drug development research. The results of this thesis demonstrate the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis contributes to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.
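A rule-based toy extractor gives a flavor of the task of linking PK parameter mentions to their measurements and units. The thesis itself uses machine learning models rather than hand-written rules, and the pattern, parameter list, and example sentence below are invented for illustration:

```python
import re

# Toy pattern: a PK parameter name, an optional linking word, a number,
# and a unit. Real PK text is far more varied than this pattern covers.
PK_PATTERN = re.compile(
    r"(?P<param>clearance|half-life|AUC|Cmax)\s*"
    r"(?:was|of|=|:)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mL/min|h|ng\W?h/mL|ng/mL)",
    re.IGNORECASE,
)

def extract_pk(text):
    """Return (parameter, value, unit) triples found in the text."""
    return [(m.group("param"), float(m.group("value")), m.group("unit"))
            for m in PK_PATTERN.finditer(text)]

sentence = "The clearance was 12.5 mL/min and the half-life was 4 h."
print(extract_pk(sentence))  # [('clearance', 12.5, 'mL/min'), ('half-life', 4.0, 'h')]
```

The brittleness of such rules against the diversity of parameter mentions, numeric formats, and units is precisely why the thesis pursues learned models instead.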

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Development of methods for Omics Network inference and analysis and their application to disease modeling

    With the advent of Next Generation Sequencing (NGS) technologies and the emergence of large publicly available genomics datasets comes an unprecedented opportunity to model biological networks through a holistic lens using a systems-based approach. Networks provide a mathematical framework for representing biological phenomena that goes beyond standard one-gene-at-a-time analyses. Networks can model system-level patterns and the molecular rewiring (i.e., changes in connectivity) occurring in response to perturbations or between distinct phenotypic groups or cell types. This in turn supports the identification of putative mechanisms of action of the biological processes under study, and thus has the potential to advance prevention and therapy. However, researchers face major challenges. Inference of biological network structures is often performed on high-dimensional data, yet is hindered by the limited sample size of high-throughput omics data. Furthermore, modeling biological networks involves complex analyses capable of integrating multiple omics layers and summarizing large amounts of information. My dissertation aims to address these challenges by presenting new approaches for high-dimensional network inference with limited sample sizes, as well as methods and tools for integrated network analysis applied to multiple research domains in cancer genomics. First, I introduce a novel method for reconstructing gene regulatory networks called SHINE (Structure Learning for Hierarchical Networks) and present an evaluation on simulated and real datasets, including a Pan-Cancer analysis using The Cancer Genome Atlas (TCGA) data. Next, I summarize the challenges of executing and managing data processing workflows for large omics datasets on high-performance computing environments and present multiple strategies for using Nextflow for reproducible scientific workflows, including shine-nf, a collection of Nextflow modules for structure learning. Lastly, I introduce the methods, objects, and tools developed for the analysis of biological networks used throughout my dissertation work. Together, these contributions were used in focused analyses aimed at understanding the molecular mechanisms of tumor maintenance and progression in subtype networks of Breast Cancer and Head and Neck Squamous Cell Carcinoma.
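A simple correlation-thresholding sketch illustrates the basic idea of network inference from expression data: connect two genes when their expression profiles track each other closely. This is far simpler than the structure-learning approach (SHINE) described above, and the gene names and expression values below are invented:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def infer_network(expression, threshold=0.8):
    """Add an edge between genes whose |correlation| exceeds the threshold."""
    genes = sorted(expression)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expression[g], expression[h])) >= threshold:
                edges.append((g, h))
    return edges

expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],  # unrelated profile
}
print(infer_network(expression))  # [('geneA', 'geneB')]
```

With only a handful of samples per gene, such pairwise estimates are noisy, which is one motivation for the limited-sample-size structure-learning methods the dissertation develops.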