12 research outputs found
Computing Network of Diseases and Pharmacological Entities through the Integration of Distributed Literature Mining and Ontology Mapping
The proliferation of -omics fields (such as genomics and proteomics) and -ology fields (such as systems biology, cell biology, and pharmacology) has spawned new frontiers of research in drug discovery and personalized medicine. A vast number (21 million) of published research results are archived in PubMed, and the archive is continually growing. To improve the accessibility and utility of such a large body of literature, it is critical to develop a suite of semantically sensitive technologies capable of discovering knowledge and inferring possible new relationships based on statistical co-occurrences of meaningful terms or concepts. In this context, this thesis presents a unified framework to mine a large body of literature through the integration of latent semantic analysis (LSA) and ontology mapping. In particular, a parameter-optimized, robust, scalable, and distributed LSA (DiLSA) technique was designed and implemented on a carefully selected set of 7.4 million PubMed records related to pharmacology. The DiLSA model was integrated with MeSH to make the model effective and efficient for a specific domain. An optimized multi-gram dictionary was customized by mapping the MeSH to build the DiLSA model. A fully integrated web-based application, called PharmNet, was developed to bridge the gap between biological knowledge and clinical practice. Preliminary analysis using PharmNet shows improved performance over a global LSA model. A limited expert evaluation was performed to validate the retrieved results and network against the biological literature. A thorough performance evaluation and validation of results is in progress.
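As a rough illustration of the latent semantic analysis step named in this abstract, the sketch below applies a truncated singular value decomposition to a tiny term-document count matrix and compares documents by cosine similarity. It is not the DiLSA implementation: the function names, the toy counts, and the choice of rank are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def lsa_document_vectors(term_doc, rank):
    """Project documents into a rank-limited latent space via truncated SVD."""
    matrix = np.asarray(term_doc, dtype=float)  # rows: terms, columns: documents
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the `rank` largest singular values; each row of the result is one document.
    return (np.diag(singular_values[:rank]) @ vt[:rank]).T

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy counts (terms x documents), not data from the thesis.
toy_counts = [
    [1, 1, 0],  # "aspirin"
    [1, 1, 0],  # "pain"
    [0, 0, 1],  # "kinase"
]
doc_vectors = lsa_document_vectors(toy_counts, rank=2)
```

In this toy matrix the first two documents share all their terms and the third shares none, so the first pair should come out as near neighbours in the latent space while the third stays apart.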
Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and their Roles in Complex Disease
Vast amounts of biomedical associations are easily accessible in public resources, spanning gene-disease associations, tissue-specific gene expression, gene function and pathway annotations, and many other data types. Despite this mass of data, information most relevant to the study of a particular disease remains loosely coupled and difficult to incorporate into ongoing research. Current public databases are difficult to navigate and do not interoperate well due to the plethora of interfaces and varying biomedical concept identifiers used. Because no coherent display of data within a specific problem domain is available, finding the latent relationships associated with a disease of interest is impractical.
This research describes a method for extracting the contextual relationships embedded within associations relevant to a disease of interest. After applying the method to a small test data set, a large-scale integrated association network is constructed for application of a network propagation technique that helps uncover more distant latent relationships. Together these methods are adept at uncovering highly relevant relationships without any a priori knowledge of the disease of interest.
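The abstract does not say which network propagation technique is used. As a hedged sketch of the general idea, the example below runs a random walk with restart, one common propagation scheme, over a small undirected association graph. The function name, the node labels, and the restart probability are all hypothetical.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, iterations=50):
    """Spread relevance from seed nodes across an undirected association graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    # All probability mass starts on the seed nodes (assumed to be in the graph).
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        # A fraction `restart` of the mass jumps back to the seeds each step.
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # The remainder is split evenly among the node's neighbours.
            share = (1.0 - restart) * scores[n] / len(neighbors[n])
            for m in neighbors[n]:
                updated[m] += share
        scores = updated
    return scores

# Hypothetical chain: disease - g1 - g2 - g3
toy_edges = [("disease", "g1"), ("g1", "g2"), ("g2", "g3")]
toy_scores = random_walk_with_restart(toy_edges, seeds={"disease"})
```

Nodes that are not directly linked to the disease still receive a nonzero score, which decays with distance from the seed. This is the sense in which propagation can surface more distant latent relationships.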
The combined contextual search and relevance methods power a tool that makes pertinent biomedical associations easier to find, easier to assimilate into ongoing work, and more prominent than they are in currently available databases. Increasing the accessibility of current information is an important component of understanding high-throughput experimental results and surviving the data deluge.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Novel Algorithm Development for 'Next Generation' Sequencing Data Analysis
In recent years, the decreasing cost of 'next generation' sequencing (NGS) has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation.
This thesis focuses on the development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics: computational prioritisation/identification of disease gene variants, and identification of RNA N6-adenosine methylation from sequencing data.
The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives.
Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4).
Chapter 5 discusses N6-adenosine methylation, a recently re-discovered post-transcriptional modification of RNA. The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi's Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyl-adenosine RNA-binding protein and its possible roles in the progression of viral infection.
In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability
Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge
linkages in existing scientific literature to provide impetus to innovation and research
productivity. Despite significant advancements in LBD research, previous studies contain
several open problems and shortcomings that are hindering its progress. The overarching
goal of this thesis is to address these issues, not only to enhance the discovery
component of LBD, but also to shed light on new directions that can further strengthen
the existing understanding of the LBD workflow. In accordance with this goal, the thesis
aims to enhance the LBD workflow with a view to ensuring its widespread applicability.
The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of
the proposed solutions to a diverse range of problem settings. These problem settings
are not necessarily application areas that are closely related to the LBD context, but
could include a wide range of problems beyond the typical scope of LBD, which has traditionally
been applied to scientific literature. Adapting the LBD workflow to problems
outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of
LBD research, which is discovering novel linkages in text corpora, is valid across a vast
range of problem settings.
Secondly, the idea of widespread applicability also denotes the capability of the proposed
solutions to be executed in new environments. These 'new environments' are various
academic disciplines (i.e., cross-domain knowledge discovery) and publication languages
(i.e., cross-lingual knowledge discovery). The application of LBD models to new environments
is timely, since the massive growth of the scientific literature has engendered
huge challenges to academics, irrespective of their domain.
This thesis is divided into five main research objectives that address the following topics:
literature synthesis, the input component, the discovery component, reusability, and
portability. The objective of the literature synthesis is to address the gaps in existing
LBD reviews by conducting the first systematic literature review. The input component
section aims to provide generalised insights on the suitability of various input types in the
LBD workflow, focusing on their role and potential impact on the information retrieval
cycle of LBD.
The discovery component section aims to intermingle two research directions that have
been under-investigated in the LBD literature, 'modern word embedding techniques'
and 'temporal dimension', by proposing diachronic semantic inferences. Their potential
positive influence in knowledge discovery is verified through both direct and indirect
uses. The reusability section aims to present a new, distinct viewpoint on these LBD
models by verifying their reusability in a timely application area using a methodical reuse
plan. The last section, portability, proposes an interdisciplinary LBD framework that
can be applied to new environments. While highly cost-efficient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its
generalisable capabilities.
Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main
research objectives, enhancing the existing understanding of the LBD workflow. The
thesis offers new insights which future LBD research could further explore and expand
to create more efficient, widely applicable LBD models to enable broader community
benefits.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
Bioinformatics
This book is divided into research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.
Semi-automated framework for the analytical use of gene-centric data with biological ontologies
Motivation Translational bioinformatics (TBI) has been defined as "the development
and application of informatics methods that connect molecular entities to clinical entities"
[1], which has emerged as a systems theory approach to bridge the huge wealth of
biomedical data into clinical actions using a combination of innovations and resources
across the entire spectrum of biomedical informatics approaches [2]. The challenge
for TBI is the availability of both comprehensive knowledge based on genes and the
corresponding tools that allow their analysis and exploitation.
Traditionally, biological researchers usually study one or only a few genes at a
time, but in recent years high throughput technologies such as gene expression microarrays,
protein mass-spectrometry and next-generation DNA and RNA sequencing
have emerged that allow the simultaneous measurement of changes on a genome-wide
scale. These technologies usually result in large lists of interesting genes, but meaningful
biological interpretation remains a major challenge. Over the last decade, enrichment
analysis has become standard practice in the analysis of such gene lists, enabling
systematic assessment of the likelihood of differential representation of defined groups
of genes compared to suitably annotated background knowledge. The success of such
analyses is highly dependent on the availability and quality of the gene annotation
data.
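The enrichment analysis described above is often carried out with a one-sided hypergeometric test. The abstract does not name a specific test, so the sketch below is an assumption about one common choice: it computes the probability of seeing at least the observed overlap between a gene list and an annotated gene group by chance. The function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes in the annotated background
    n_annotated:  background genes carrying the term of interest
    n_selected:   genes in the list under study
    n_overlap:    selected genes that carry the term
    """
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    # Sum the probabilities of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 5 of 10 background genes carry the term,
# and all 5 selected genes carry it.
p_full_overlap = enrichment_p_value(10, 5, 5, 5)
```

A small p-value suggests the term is over-represented in the gene list. In practice the test is repeated for every ontology term, so a multiple-testing correction is usually applied afterwards.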
For many years, genes were annotated by different experts using inconsistent, non-standard
terminologies. Large amounts of variation and duplication in these unstructured
annotation sets made them unsuitable for principled quantitative analysis. More
recently, a lot of effort has been put into the development and use of structured, domain-specific
vocabularies to annotate genes. The Gene Ontology is one of the most successful
examples of this, where genes are annotated with terms from three main clades:
biological process, molecular function and cellular component. However, there are
many other established and emerging ontologies that could aid biological data interpretation
but are rarely used. For the same reason, many bioinformatic tools only support
analysis using the Gene Ontology.
The lack of annotation coverage for these ontologies, and of support for them in existing analytical
tools to aid biological interpretation of data, has become a major limitation to their utility
and uptake. Thus, automatic approaches are needed to facilitate the transformation
of unstructured data to unlock the potential of all ontologies, with corresponding bioinformatics
tools to support their interpretation.
Approaches In this thesis, firstly, similar to the approach in [3,4], I propose a series
of computational approaches implemented in a new tool OntoSuite-Miner to address
the ontology based gene association data integration challenge. This approach uses
NLP based text mining methods for ontology based biomedical text mining. What
differentiates my approach from other approaches is that I integrate two of the most
widely used NLP modules into the framework, not only increasing the confidence of
the text mining results, but also providing an annotation score for each mapping, based
on the number of pieces of evidence in the literature and the number of NLP modules
that agreed with the mapping. Since heterogeneous data is important in understanding
human disease, the approach was designed to be generic, thus the ontology
based annotation generation can be applied to different sources and can be repeated
with different ontologies. Secondly, in respect of the second challenge proposed by
TBI, to increase the statistical power of the annotation enrichment analysis, I propose
OntoSuite-Analytics, which integrates a collection of enrichment analysis methods into
a unified open-source software package named topOnto, in the statistical programming
language R. The package supports enrichment analysis across multiple ontologies with
a set of implemented statistical/topological algorithms, allowing the comparison of enrichment
results across multiple ontologies and between different algorithms.
Results The methodologies described above were implemented and a Human Disease
Ontology (HDO) based gene annotation database was generated by mining three
publicly available databases, OMIM, GeneRIF and Ensembl variation. With the availability
of the HDO annotation and the corresponding ontology enrichment analysis
tools in topOnto, I profiled 277 gene classes with human diseases and generated "disease
environments" for 1310 human diseases. The exploration of the disease profiles
and disease environment provides an overview of known disease knowledge and provides
new insights into disease mechanisms. The integration of multiple ontologies
into a disease context demonstrates how "orthogonal" ontologies can lead to biological
insight that would have been missed by more traditional single ontology analysis.
FAIR and bias-free network modules for mechanism-based disease redefinitions
Even though chronic diseases are the cause of 60% of all deaths around the world, the underlying causes for most of them are not fully understood. Hence, diseases are defined based on organs and symptoms, and therapies largely focus on mitigating symptoms rather than cure. This is also reflected in the most commonly used disease classifications. The complex nature of diseases, however, can be better defined in terms of networks of molecular interactions. This research applies the approaches of network medicine (a field that uses network science for identifying and treating diseases) to multiple diseases with highly unmet medical need such as stroke and hypertension. The results show the success of this approach to analyse complex disease networks and predict drug targets for different conditions, which are validated through preclinical experiments and are currently in human clinical trials.
Frameshift mutations at the C-terminus of HIST1H1E result in a specific DNA hypomethylation signature
BACKGROUND: We previously associated HIST1H1E mutations causing Rahman syndrome with a specific genome-wide methylation pattern. RESULTS: Methylome analysis from peripheral blood samples of six affected subjects led us to identify a specific hypomethylated profile. This "episignature" was enriched for genes involved in neuronal system development and function. A computational classifier yielded full sensitivity and specificity in detecting subjects with Rahman syndrome. Applying this model to a cohort of undiagnosed probands allowed us to reach diagnosis in one subject. CONCLUSIONS: We demonstrate an epigenetic signature in subjects with Rahman syndrome that can be used to reach molecular diagnosis
Additional file 2 of Prioritization, clustering and functional annotation of MicroRNAs using latent semantic indexing of MEDLINE abstracts
Tables S1A, S1B, S1C, S2A, S2B, S3, S4A and S4B. Microsoft Excel 2013 workbook "S11-S1.xlsx" contains supplementary tables 1A, 1B, 1C, 2A, 2B, 3, 4A and 4B in separate tabs. (XLSX 32.5 KB)