Machine learning for metagenomics: methods and tools
Owing to the complexity and variability of metagenomic studies, modern
machine learning approaches have seen increased usage to answer a variety of
questions encompassing the full range of metagenomic NGS data analysis. We
review here the contribution of machine learning techniques for the field of
metagenomics, by presenting known successful approaches in a unified framework.
This review focuses on five important metagenomic problems: OTU-clustering,
binning, taxonomic profiling and assignment, comparative metagenomics and gene
prediction. For each of these problems, we identify the most prominent methods,
summarize the machine learning approaches used and put them into perspective of
similar methods. We conclude our review looking further ahead at the challenge
posed by the analysis of interactions within microbial communities and
different environments, in a field one could call "integrative metagenomics".
Reference-Based Sequence Classification
Sequence classification is an important data mining task in many real world
applications. Over the past few decades, many sequence classification methods
have been proposed from different aspects. In particular, the pattern-based
method is one of the most important and widely studied sequence classification
methods in the literature. In this paper, we present a reference-based sequence
classification framework, which can unify existing pattern-based sequence
classification methods under the same umbrella. More importantly, this
framework can be used as a general platform for developing new sequence
classification algorithms. By utilizing this framework as a tool, we propose
new sequence classification algorithms that are quite different from existing
solutions. Experimental results show that new methods developed under the
proposed framework achieve classification accuracy comparable to that of
state-of-the-art sequence classification algorithms.
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel. These methods can be scaled to
handle big data using the distributed and parallel computing technologies.
Usually big data tools perform computation in batch-mode and are not
optimized for iterative processing and high data dependency among operations.
In recent years, parallel, incremental, and multi-view machine learning
algorithms have been proposed. Similarly, graph-based architectures and
in-memory big data tools have been developed to minimize I/O cost and optimize
iterative processing.
However, there is a lack of standard big data architectures and tools for many
important bioinformatics problems, such as fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and the future research opportunities.
Comment: 20 pages survey paper on Big data analytics in Bioinformatics
Dealing with complexity of biological systems: from data to models
Four chapters of the synthesis represent four major areas of my research
interests: 1) data analysis in molecular biology, 2) mathematical modeling of
biological networks, 3) genome evolution, and 4) cancer systems biology. The
first chapter is devoted to my work in developing non-linear methods of
dimension reduction (methods of elastic maps and principal trees) which extend
the classical method of principal components. I also present an application of
matrix factorization techniques to the analysis of cancer data. The second chapter
is devoted to the complexity of mathematical models in molecular biology. I
describe the basic ideas of asymptotology of chemical reaction networks aiming
at dissecting and simplifying complex chemical kinetics models. Two
applications of this approach are presented: to modeling NFkB and apoptosis
pathways, and to modeling mechanisms of miRNA action on protein translation.
The third chapter briefly describes my investigations of the genome structure
in different organisms (from microbes to human cancer genomes). Unsupervised
data analysis approaches are used to investigate the patterns in genomic
sequences shaped by genome evolution and influenced by the basic properties of
the environment. The fourth chapter summarizes my experience in studying cancer
by computational methods (through combining integrative data analysis and
mathematical modeling approaches). In particular, I describe ongoing
research projects such as mathematical modeling of cell fate decisions and
synthetic lethal interactions in the DNA repair network. The synthesis is concluded
by listing major challenges in computational systems biology, connected to the
topics of this text, i.e. dealing with complexity of biological systems.
Comment: HDR mémoire (habilitation thesis) defended on the 04/04/201
Leveraging binding-site structure for drug discovery with point-cloud methods
Computational drug discovery strategies can be broadly placed in two
categories: ligand-based methods which identify novel molecules by similarity
with known ligands, and structure-based methods which predict molecules with
high affinity for a given 3D structure (e.g. a protein). However, ligand-based
methods do not leverage information about the binding site, and structure-based
approaches rely on the knowledge of a finite set of ligands binding the target.
In this work, we introduce TarLig, a novel approach that aims to bridge the gap
between ligand and structure-based approaches. We use the 3D structure of the
binding site as input to a model which predicts the ligand preferences of the
binding site. The resulting predictions could then offer promising seeds and
constraints in the chemical space search, based on the binding site structure.
TarLig outperforms standard models by introducing a data-alignment and
augmentation technique. The recent popularity of Volumetric 3DCNN pipelines in
structural bioinformatics suggests that this extra step could help a wide range
of methods to improve their results with minimal modifications.
Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
New technologies have enabled the investigation of biology and human health
at an unprecedented scale and in multiple dimensions. These dimensions include
a myriad of properties describing genome, epigenome, transcriptome, microbiome,
phenotype, and lifestyle. No single data type, however, can capture the
complexity of all the factors relevant to understanding a phenomenon such as a
disease. Integrative methods that combine data from multiple technologies have
thus emerged as critical statistical and computational approaches. The key
challenge in developing such approaches is the identification of effective
models to provide a comprehensive and relevant systems view. An ideal method
can answer a biological or medical question, identifying important features and
predicting outcomes, by harnessing heterogeneous data across several dimensions
of biological variation. In this Review, we describe the principles of data
integration and discuss current methods and available implementations. We
provide examples of successful data integration in biology and medicine.
Finally, we discuss current challenges in biomedical integrative methods and
our perspective on the future development of the field.
Continuum directions for supervised dimension reduction
Dimension reduction of multivariate data supervised by auxiliary information
is considered. A series of basis vectors for dimension reduction is obtained as
minimizers of a novel criterion. The proposed method is akin to continuum
regression, and the resulting basis vectors are called continuum directions. In
the presence of binary supervision data, these directions continuously bridge the
principal component, mean difference and linear discriminant directions, thus
ranging from unsupervised to fully supervised dimension reduction.
High-dimensional asymptotic studies of continuum directions for binary
supervision reveal several interesting facts. The conditions under which the
sample continuum directions are inconsistent, but their classification
performance is good, are specified. While the proposed method can be directly
used for binary and multi-category classification, its generalizations to
incorporate any form of auxiliary data are also presented. The proposed method
enjoys fast computation, and the performance is better than or on par with more
computer-intensive alternatives.
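To make the three reference directions named in this abstract concrete, here is a minimal Python sketch that computes the principal component, mean difference, and linear discriminant directions for a two-class toy data set in two dimensions. It does not implement the paper's continuum criterion, which the abstract does not spell out. The function names and the toy points are assumptions made purely for illustration.

```python
import math

def mean2(rows):
    """Column means of a list of 2-D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter2(rows, center):
    """2x2 scatter matrix: sum of outer products of centered points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - center[0], r[1] - center[1])
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def unit(v):
    """Scale a 2-D vector to unit length."""
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]

def leading_eigvec(s):
    """Leading eigenvector of a symmetric 2x2 matrix (closed form)."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    if abs(b) < 1e-12:  # already diagonal
        return [1.0, 0.0] if a >= d else [0.0, 1.0]
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    return unit([b, lam - a])

def three_directions(class0, class1):
    """Return unit PC, mean-difference and LDA directions for two classes."""
    m0, m1 = mean2(class0), mean2(class1)
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    pooled = class0 + class1
    total = scatter2(pooled, mean2(pooled))      # ignores class labels
    s0, s1 = scatter2(class0, m0), scatter2(class1, m1)
    within = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = within[0][0] * within[1][1] - within[0][1] * within[1][0]
    # LDA direction: within^{-1} * diff, via the explicit 2x2 inverse.
    lda = [
        (within[1][1] * diff[0] - within[0][1] * diff[1]) / det,
        (within[0][0] * diff[1] - within[1][0] * diff[0]) / det,
    ]
    return {"pc": leading_eigvec(total), "md": unit(diff), "lda": unit(lda)}

# Hypothetical toy data: wide spread along x, class separation mostly along y.
class0 = [(0, 0), (4, 0), (0, 1), (4, 1)]
class1 = [(1, 2), (5, 2), (1, 3), (5, 3)]
directions = three_directions(class0, class1)
for name, vec in directions.items():
    print(name, [round(c, 3) for c in vec])
```

On this toy set the unsupervised principal component points mostly along x, the linear discriminant direction mostly along y, and the mean difference lies between them. This is the kind of range a family of intermediate directions would interpolate across.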
Deep Neural Network for Analysis of DNA Methylation Data
Many studies have demonstrated that DNA methylation, which occurs in the
context of a CpG, correlates strongly with diseases, including cancer. There
is a strong interest in analyzing the DNA methylation data to find how to
distinguish different subtypes of the tumor. However, the conventional
statistical methods are not suitable for analyzing the high-dimensional DNA
methylation data with bounded support. In order to explicitly capture the
properties of the data, we design a deep neural network, which is composed of
several stacked binary restricted Boltzmann machines, to learn the low
dimensional deep features of the DNA methylation data. Experiments show these
features perform best in cluster analysis of breast cancer DNA methylation data
when compared with some state-of-the-art methods.
Comment: Technical Report
Network-based protein structural classification
Experimental determination of protein function is resource-consuming. As an
alternative, computational prediction of protein function has received
attention. In this context, protein structural classification (PSC) can help,
by allowing for determining structural classes of currently unclassified
proteins based on their features, and then relying on the fact that proteins
with similar structures have similar functions. Existing PSC approaches rely on
sequence-based or direct 3-dimensional (3D) structure-based protein features.
In contrast, we first model 3D structures of proteins as protein structure
networks (PSNs). Then, we use network-based features for PSC. We propose the
use of graphlets, state-of-the-art features in many research areas of network
science, in the task of PSC. Moreover, because graphlets can deal only with
unweighted PSNs, and because accounting for edge weights when constructing PSNs
could improve PSC accuracy, we also propose a deep learning framework that
automatically learns network features from weighted PSNs. When evaluated on a
large set of ~9,400 CATH and ~12,800 SCOP protein domains (spanning 36 PSN
sets), our proposed approaches are superior to existing PSC approaches in terms
of accuracy, with comparable running time.
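As a rough illustration of the graphlet features mentioned in this abstract, the Python sketch below counts the two connected 3-node graphlets (open paths and triangles) in a small unweighted network. The toy edge list and the name `count_3node_graphlets` are hypothetical. The paper's actual features are presumably richer than these two counts.

```python
from itertools import combinations

def count_3node_graphlets(edges):
    """Count induced 3-node graphlets in an undirected, unweighted graph.

    Returns a tuple (open_paths, triangles).
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    open_paths = 0
    triangle_corners = 0
    for center, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                triangle_corners += 1  # each triangle is seen once per corner
            else:
                open_paths += 1        # induced path a - center - b
    return open_paths, triangle_corners // 3

# Hypothetical toy "protein structure network": nodes stand for residues,
# edges join residues assumed to be in spatial contact.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(count_3node_graphlets(toy_edges))  # prints (3, 1)
```

Counts like these, collected per network, could serve as a small feature vector for a downstream classifier.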
Persistent-Homology-based Machine Learning and its Applications -- A Survey
A suitable feature representation that can both preserve the intrinsic
information of the data and reduce data complexity and dimensionality is key to the
performance of machine learning models. Deeply rooted in algebraic topology,
persistent homology (PH) provides a delicate balance between data
simplification and intrinsic structure characterization, and has been applied
to various areas successfully. However, the combination of PH and machine
learning has been hindered greatly by three challenges, namely topological
representation of data, PH-based distance measurements or metrics, and PH-based
feature representation. With the development of topological data analysis,
progress has been made on all three of these problems, but it is widely scattered
across the literature. In this paper, we provide a systematic review of PH
and PH-based supervised and unsupervised models from a computational
perspective. Our emphasis is on the recent development of mathematical models
and tools, including PH software and PH-based functions, feature
representations, kernels, and similarity models. Essentially, this paper can
work as a roadmap for the practical application of PH-based machine learning
tools. Further, we consider different topological feature representations in
different machine learning models, and investigate their impacts on the protein
secondary structure classification.
Comment: 42 pages; 6 figures; 9 tables
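To give a concrete sense of what a persistent-homology feature can look like, the Python sketch below computes 0-dimensional persistence for a small point cloud under a Vietoris-Rips filtration. Every point starts as its own connected component at scale 0, and a component "dies" at the edge length that merges it into another. This is only the simplest case, written with a union-find structure. The function name and the toy points are assumptions, and the software covered by the survey handles higher dimensions and far larger inputs.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Death scales of the finite 0-dimensional bars of a Rips filtration.

    All components are born at scale 0. One component never dies and is
    therefore not included in the returned list.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge merges two components
            parent[ri] = rj
            deaths.append(length)
    return deaths

# Hypothetical toy point cloud: two tight pairs that are far apart.
toy_points = [(0, 0), (1, 0), (5, 0), (5, 2)]
print(h0_persistence(toy_points))  # prints [1.0, 2.0, 4.0]
```

The sorted death scales form a simple fixed-order vector for clouds of equal size. That is one elementary way to turn a barcode into input for a machine learning model.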