30 research outputs found
Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of these data resulted in a precision of ~50%, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve better performance for event extraction. Database URL: http://www.cellfinder.org
Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology
Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems reflected in these data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level, considering sources of complex environmental data.
The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses in their suitability to link mixture effects of chemical exposures to biological effects, and in their reliability in attributing potentially adverse outcomes to chemical drivers using toxicological databases at the gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of compounds at low concentrations. The applied approaches allowed assessing the hazard of chemicals more systematically with correlation-based compound groups.
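The correlation-based compound grouping mentioned above can be caricatured in a few lines. This is a hedged sketch, not the study's actual analysis (which used weighted gene correlation network analysis and related methods); all compound names, concentration values, and the grouping threshold are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_groups(profiles, threshold=0.9):
    """Greedily group compounds whose concentration profiles correlate
    above `threshold` with a group's first member."""
    groups = []
    for name, prof in profiles.items():
        for group in groups:
            if pearson(profiles[group[0]], prof) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical concentrations of three compounds across four samples
profiles = {
    "diazinon": [1.0, 2.0, 3.0, 4.0],
    "atrazine": [2.1, 4.0, 6.2, 8.1],  # tracks diazinon closely
    "caffeine": [4.0, 1.0, 3.0, 1.5],  # unrelated pattern
}
print(correlation_groups(profiles))  # → [['diazinon', 'atrazine'], ['caffeine']]
```

Compounds that co-vary across samples end up in the same group, which is the intuition behind attributing effects to correlated chemical drivers.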
This dissertation presents another achievement toward data-driven hypothesis generation for molecular exposure effects. The approach combined text mining and deep learning. The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word-embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% on unseen test data from the employed knowledge base.
However, we could not reliably confirm known chemical-gene interactions across the selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources such as literature, databases or omics-based exposure studies. Thus, the deep learning models might allow hypotheses about exposure-related molecular effects to be predicted.
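The described architecture, a word-embedding layer followed by a feed-forward network, can be sketched minimally as below. This is not the thesis's trained model: the tokens, embedding values, and weights are invented, and a real model would learn them from the relational data:

```python
from math import exp

# Toy embedding table: token -> 3-dimensional vector (hypothetical values)
EMBEDDINGS = {
    "bisphenol_a": [0.2, -0.1, 0.4],
    "increases":   [0.1,  0.3, 0.0],
    "ESR1":        [0.5,  0.2, -0.3],
}

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_interaction(tokens, weights, bias):
    """Average the token embeddings, then apply a single feed-forward unit
    with a sigmoid to score the chemical-biomolecule pair in [0, 1]."""
    dim = len(weights)
    avg = [sum(EMBEDDINGS[t][i] for t in tokens) / len(tokens) for i in range(dim)]
    z = sum(w * a for w, a in zip(weights, avg)) + bias
    return sigmoid(z)

score = predict_interaction(["bisphenol_a", "increases", "ESR1"],
                            weights=[1.0, 1.0, 1.0], bias=0.0)
print(round(score, 3))
```

Scores near 1 would be read as predicted interactions; training would adjust the embeddings and weights against the curated knowledge base.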
Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.

Table of Contents ... I
Abstract ... V
Acknowledgements ... VII
Prelude ... IX
1 Introduction
1.1 An overview of environmental toxicology ... 2
1.1.1 Environmental toxicology ... 2
1.1.2 Chemicals in the environment ... 4
1.1.3 Systems biological perspectives in environmental toxicology ... 7
1.2 Computational toxicology ... 11
1.2.1 Omics-based approaches ... 12
1.2.2 Linking chemical exposure to transcriptional effects ... 14
1.2.3 Up-scaling from the gene level to higher biological organisation levels ... 19
1.2.4 Biomedical literature-based discovery ... 24
1.2.5 Deep learning with knowledge representation ... 27
1.3 Research question and approaches ... 29
2 Methods and Data ... 33
2.1 Linking environmental relevant mixture exposures to transcriptional effects ... 34
2.1.1 Exposure and microarray data ... 34
2.1.2 Preprocessing ... 35
2.1.3 Differential gene expression ... 37
2.1.4 Association rule mining ... 38
2.1.5 Weighted gene correlation network analysis ... 39
2.1.6 Method comparison ... 41
2.2 Predicting exposure-related effects on a molecular level ... 44
2.2.1 Input ... 44
2.2.2 Input preparation ... 47
2.2.3 Deep learning models ... 49
2.2.4 Toxicogenomic application ... 54
3 Method comparison to link complex stream water exposures to effects on the transcriptional level ... 57
3.1 Background and motivation ... 58
3.1.1 Workflow ... 61
3.2 Results ... 62
3.2.1 Data preprocessing ... 62
3.2.2 Differential gene expression analysis ... 67
3.2.3 Association rule mining ... 71
3.2.4 Network inference ... 78
3.2.5 Method comparison ... 84
3.2.6 Application case of method integration ... 87
3.3 Discussion ... 91
3.4 Conclusion ... 99
4 Deep learning prediction of chemical-biomolecule interactions ... 101
4.1 Motivation ... 102
4.1.1 Workflow ... 105
4.2 Results ... 107
4.2.1 Input preparation ... 107
4.2.2 Model selection ... 110
4.2.3 Model comparison ... 118
4.2.4 Toxicogenomic application ... 121
4.2.5 Horizontal augmentation without tail-padding ... 123
4.2.6 Four-class problem formulation ... 124
4.2.7 Training with CTD data ... 125
4.3 Discussion ... 129
4.3.1 Transferring biomedical knowledge towards toxicology ... 129
4.3.2 Deep learning with biomedical knowledge representation ... 133
4.3.3 Data integration ... 136
4.4 Conclusion ... 141
5 Conclusion and Future perspectives ... 143
5.1 Conclusion ... 143
5.1.1 Investigating complex mixtures in the environment ... 144
5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions ... 145
5.1.3 Linking chemical exposure to biological effects by integrating CTD ... 146
5.2 Future perspectives ... 147
S1 Supplement Chapter 1 ... 153
S1.1 Example of an estrogen bioassay ... 154
S1.2 Types of mode of action ... 154
S1.3 The dogma of molecular biology ... 157
S1.4 Transcriptomics ... 159
S2 Supplement Chapter 3 ... 161
S3 Supplement Chapter 4 ... 175
S3.1 Hyperparameter tuning results ... 176
S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets ... 179
S3.3 Reduction of learning rate in a model with large word embedding vectors ... 183
S3.4 Horizontal augmentation without tail-padding ... 183
S3.5 Four-relationship classification ... 185
S3.6 Interpreting loss observations for SemMedDB trained models ... 187
List of Abbreviations ... i
List of Figures ... vi
List of Tables ... x
Bibliography ... xii
Curriculum scientiae ... xxxix
Selbständigkeitserklärung ... xlii
UNDERSTANDING CONDITIONAL MODES OF ACTIONS IN CHEMICAL-INDUCED TOXICITY USING RULE MODELS
It is estimated that 115 million animals are used in experimental testing each year. Hence, shifting efforts toward alternative methods for toxicity assessment is essential. However, slow regulatory acceptance of new approaches is governed by knowledge gaps in toxicity modes of action. In this thesis, I describe these challenges and the use of in vitro screening as an alternative to animal testing. I also discuss common data-based methods for deriving hypotheses about toxicity modes of action, and the associated limitations in capturing multiple biological perturbations.
I applied novel data-based workflows, using rule models, to prioritize in vitro assays predictive of toxicity and to detect significant polypharmacology profiles. I explain how constraints were applied to rule-based models to enable meaningful mechanistic interpretation for two toxicity endpoints: rat hepatotoxicity and acute toxicity. I compared the assays selected by rules for predicting hepatotoxicity with the endpoints used in in vitro models from commercial sources. An overlap was observed, including cytochrome activity, mitochondrial toxicity and immunological responses. However, nuclear receptor activity, identified in rules, is not currently covered in commercial setups. I also demonstrate that endocrine disruption endpoints extrapolate better to in vivo toxicity when a set of specific conditions is met, such as physicochemical properties associated with good bioavailability.
Next, I examined synergistic interactions between conditions in rules describing acute toxicity. I gained novel insights into how specific stressors potentiate the perturbation by known key events, such as acetylcholinesterase inhibition and neuro-signalling disruption. I show that examining polypharmacology profiles is particularly important at low bioactive potencies.
Further, the overall predictive performance of rules describing acute toxicity was tested against a benchmark Random Forest model in a conformal prediction framework. Irrespective of the data type used in training, the models were prone to bias related to compound promiscuity, whereby highly promiscuous compounds were more likely to be predicted as toxic.
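A conjunctive rule model of the kind described above can be illustrated in miniature. This is a sketch under invented conditions, not the thesis's actual rules, assays, or endpoints:

```python
# Hypothetical rule model: each rule is a conjunction of assay/property
# conditions, and a compound is flagged if all conditions of any rule hold.
RULES = [
    # (description, set of required conditions)
    ("AChE inhibition + low potency", {"ache_inhibition", "low_potency"}),
    ("Mitochondrial toxicity + high logP", {"mito_toxicity", "high_logp"}),
]

def fired_rules(compound_features):
    """Return the descriptions of all rules whose conditions are all met
    by the compound's feature set."""
    return [desc for desc, conds in RULES if conds <= compound_features]

compound = {"ache_inhibition", "low_potency", "high_logp"}
print(fired_rules(compound))  # → ['AChE inhibition + low potency']
```

Because each rule is an explicit conjunction, the fired rules themselves serve as the mechanistic interpretation of a positive prediction.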
Overall, the studies conducted in this thesis provide novel insights into molecular mechanisms of toxicity, namely hepatotoxicity and acute toxicity, with regard to chemical properties and polypharmacology. This knowledge can be used to improve the utility and design of alternative methods for toxicity assessment and, hence, accelerate their regulatory acceptance.
Generation and Applications of Knowledge Graphs in Systems and Networks Biology
The acceleration in the generation of data in the biomedical domain has necessitated the use of computational approaches to assist in its interpretation. However, these approaches rely on the availability of high-quality, structured, formalized biomedical knowledge. This thesis has two goals: to improve methods for curation and semantic data integration to generate high-granularity biological knowledge graphs, and to develop novel methods for using prior biological knowledge to propose new biological hypotheses. The first two publications describe an ecosystem for handling biological knowledge graphs encoded in the Biological Expression Language throughout the stages of curation, visualization, and analysis. The next two publications describe the reproducible acquisition and integration of high-granularity knowledge with low contextual specificity from structured biological data sources on a massive scale, and support the semi-automated curation of new content at high speed and precision. After building the ecosystem and acquiring content, the last three publications in this thesis demonstrate three different applications of biological knowledge graphs in modeling and simulation. The first demonstrates the use of agent-based modeling for simulation of neurodegenerative disease biomarker trajectories using biological knowledge graphs as priors. The second applies network representation learning to prioritize nodes in biological knowledge graphs based on corresponding experimental measurements to identify novel targets. Finally, the third uses biological knowledge graphs and develops algorithmics to deconvolute the mechanism of action of drugs, which could also serve to identify drug repositioning candidates. Ultimately, this thesis lays the groundwork for production-level applications of drug repositioning algorithms and other knowledge-driven approaches to analyzing biomedical experiments.
Discovering lesser known molecular players and mechanistic patterns in Alzheimer's disease using an integrative disease modelling approach
Convergence of exponentially advancing technologies is driving medical research with life-changing discoveries. On the contrary, repeated failures of high-profile drugs to battle Alzheimer's disease (AD) have made it one of the least successful therapeutic areas. This failure pattern has provoked researchers to re-examine their beliefs about Alzheimer's aetiology. The growing realisation that Amyloid-β and tau are not 'the' but rather 'one of the' factors necessitates the reassessment of pre-existing data to add new perspectives. To enable a holistic view of the disease, integrative modelling approaches are emerging as a powerful technique. Combining data at different scales and modes could considerably increase the predictive power of the integrative model by filling biological knowledge gaps. However, the reliability of the derived hypotheses largely depends on the completeness, quality, consistency, and context-specificity of the data. Thus, there is a need for agile methods and approaches that efficiently interrogate and utilise existing public data. This thesis presents the development of novel approaches and methods that address intrinsic issues of data integration and analysis in AD research. It aims to prioritise lesser-known AD candidates using highly curated and precise knowledge derived from integrated data. Here, much of the emphasis is put on quality, reliability, and context-specificity. This thesis showcases the benefit of integrating well-curated and disease-specific heterogeneous data in a semantic web-based framework for mining actionable knowledge. Furthermore, it introduces the challenges encountered while harvesting information from literature and transcriptomic resources. A state-of-the-art text-mining methodology is developed to extract miRNAs and their regulatory roles in diseases and genes from the biomedical literature.
To enable meta-analysis of biologically related transcriptomic data, a highly curated metadata database has been developed, which explicates annotations specific to human and animal models. Finally, to corroborate common mechanistic patterns, embedded with novel candidates, across large-scale AD transcriptomic data, a new approach to generate gene regulatory networks has been developed. The work presented here has demonstrated its capability in identifying testable mechanistic hypotheses containing previously unknown or emerging knowledge from public data in two major publicly funded projects for Alzheimer's disease, Parkinson's disease and epilepsy.
Systems Analytics and Integration of Big Omics Data
A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
CASSANDRA: drug gene association prediction via text mining and ontologies
The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light on complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce CASSANDRA, a fully corpus-based and unsupervised algorithm which uses MEDLINE-indexed titles and abstracts to infer drug-gene associations and assist drug repositioning. CASSANDRA measures the Pointwise Mutual Information (PMI) between biomedical terms derived from the Gene Ontology (GO) and Medical Subject Headings (MeSH). Based on the PMI scores, drug and gene profiles are generated, and candidate drug-gene associations are inferred by computing the relatedness of their profiles.
Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The algorithm can successfully identify direct drug-gene associations with high precision and prioritize them over indirect drug-gene associations. Validation shows that the statistically derived profiles from the literature perform as well as (and at times better than) the manually curated profiles.
In addition, we examine CASSANDRA's potential for drug repositioning. For all FDA-approved drugs repositioned over the last 5 years, we generate profiles from publications before 2009 and show that the new indications rank high in these profiles. In summary, co-occurrence-based profiles derived from the biomedical literature can accurately predict drug-gene associations and provide insights into potential repositioning cases.
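The PMI-and-profile scheme the abstract describes can be sketched as follows. All counts, terms, and entities are hypothetical, and CASSANDRA's actual profile construction and normalisation differ in detail:

```python
from math import log, sqrt

def pmi(cooc, cx, cy, total):
    """Pointwise mutual information from co-occurrence counts:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    return log((cooc / total) / ((cx / total) * (cy / total)))

def profile(entity, terms, cooc_counts, term_counts, entity_count, total):
    """PMI vector of an entity (drug or gene) over a shared term vocabulary;
    absent pairs and negative PMI are set to 0, as is common practice."""
    vec = []
    for t in terms:
        c = cooc_counts.get((entity, t), 0)
        vec.append(max(pmi(c, entity_count, term_counts[t], total), 0.0) if c else 0.0)
    return vec

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Hypothetical document counts; the terms stand in for MeSH/GO headings
terms = ["apoptosis", "estrogen receptor signaling"]
term_counts = {"apoptosis": 500, "estrogen receptor signaling": 100}
cooc = {("tamoxifen", "apoptosis"): 40,
        ("tamoxifen", "estrogen receptor signaling"): 30,
        ("ESR1", "apoptosis"): 20,
        ("ESR1", "estrogen receptor signaling"): 25}
total = 100_000

drug = profile("tamoxifen", terms, cooc, term_counts, 300, total)
gene = profile("ESR1", terms, cooc, term_counts, 200, total)
print(round(cosine(drug, gene), 3))  # high relatedness suggests an association
```

A drug-gene pair whose PMI profiles point in similar directions is inferred as a candidate association, which is the core of the relatedness computation described above.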
Biomedical information mining from scientific literature
Joint doctoral programme MAP-i. The rapid evolution and proliferation of a world-wide computerized network,
the Internet, resulted in an overwhelming and constantly growing
amount of publicly available data and information, a fact that was also verified
in biomedicine. However, the lack of structure of textual data inhibits
its direct processing by computational solutions. Information extraction is
the task of text mining that intends to automatically collect information
from unstructured text data sources. The goal of the work described in this
thesis was to build innovative solutions for biomedical information extraction
from scientific literature, through the development of simple software
artifacts for developers and biocurators, delivering more accurate, usable
and faster results. We started by tackling named entity recognition - a crucial
initial task - with the development of Gimli, a machine-learning-based
solution that follows an incremental approach to optimize extracted linguistic
characteristics for each concept type. Afterwards, Totum was built to
harmonize concept names provided by heterogeneous systems, delivering a
robust solution with improved performance results. This approach takes
advantage of heterogeneous corpora to deliver cross-corpus harmonization
that is not constrained to specific characteristics. Since previous solutions
do not provide links to knowledge bases, Neji was built to streamline the
development of complex and custom solutions for biomedical concept name
recognition and normalization. This was achieved through a modular and
flexible framework focused on speed and performance, integrating a large
amount of processing modules optimized for the biomedical domain. To
offer on-demand heterogeneous biomedical concept identification, we developed
BeCAS, a web application, service and widget. We also tackled relation
mining by developing TrigNER, a machine-learning-based solution for
biomedical event trigger recognition, which applies an automatic algorithm
to obtain the best linguistic features and model parameters for each event
type. Finally, in order to assist biocurators, Egas was developed to support
rapid, interactive and real-time collaborative curation of biomedical documents,
through manual and automatic in-line annotation of concepts and
relations. Overall, the research work presented in this thesis contributed
to a more accurate update of current biomedical knowledge bases, towards
improved hypothesis generation and knowledge discovery.