8,310 research outputs found

    Machine learning for metagenomics: methods and tools

    Full text link
    Owing to the complexity and variability of metagenomic studies, modern machine learning approaches have seen increased usage to answer a variety of question encompassing the full range of metagenomic NGS data analysis. We review here the contribution of machine learning techniques for the field of metagenomics, by presenting known successful approaches in a unified framework. This review focuses on five important metagenomic problems: OTU-clustering, binning, taxonomic profling and assignment, comparative metagenomics and gene prediction. For each of these problems, we identify the most prominent methods, summarize the machine learning approaches used and put them into perspective of similar methods. We conclude our review looking further ahead at the challenge posed by the analysis of interactions within microbial communities and different environments, in a field one could call "integrative metagenomics"

    Reference-Based Sequence Classification

    Full text link
    Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms

    Big Data Analytics in Bioinformatics: A Machine Learning Perspective

    Full text link
    Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using the distributed and parallel computing technologies. Usually big data tools perform computation in batch-mode and are not optimized for iterative processing and high data dependency among operations. In the recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, there lack standard big data architectures and tools for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and the future research opportunities.Comment: 20 pages survey paper on Big data analytics in Bioinformatic

    Dealing with complexity of biological systems: from data to models

    Full text link
    Four chapters of the synthesis represent four major areas of my research interests: 1) data analysis in molecular biology, 2) mathematical modeling of biological networks, 3) genome evolution, and 4) cancer systems biology. The first chapter is devoted to my work in developing non-linear methods of dimension reduction (methods of elastic maps and principal trees) which extends the classical method of principal components. Also I present application of matrix factorization techniques to analysis of cancer data. The second chapter is devoted to the complexity of mathematical models in molecular biology. I describe the basic ideas of asymptotology of chemical reaction networks aiming at dissecting and simplifying complex chemical kinetics models. Two applications of this approach are presented: to modeling NFkB and apoptosis pathways, and to modeling mechanisms of miRNA action on protein translation. The third chapter briefly describes my investigations of the genome structure in different organisms (from microbes to human cancer genomes). Unsupervised data analysis approaches are used to investigate the patterns in genomic sequences shaped by genome evolution and influenced by the basic properties of the environment. The fourth chapter summarizes my experience in studying cancer by computational methods (through combining integrative data analysis and mathematical modeling approaches). In particular, I describe the on-going research projects such as mathematical modeling of cell fate decisions and synthetic lethal interactions in DNA repair network. The synthesis is concluded by listing major challenges in computational systems biology, connected to the topics of this text, i.e. dealing with complexity of biological systems.Comment: HDR m\'emoire (habilitation thesis) defended on the 04/04/201

    Leveraging binding-site structure for drug discovery with point-cloud methods

    Full text link
    Computational drug discovery strategies can be broadly placed in two categories: ligand-based methods which identify novel molecules by similarity with known ligands, and structure-based methods which predict molecules with high-affinity to a given 3D structure (e.g. a protein). However, ligand-based methods do not leverage information about the binding site, and structure-based approaches rely on the knowledge of a finite set of ligands binding the target. In this work, we introduce TarLig, a novel approach that aims to bridge the gap between ligand and structure-based approaches. We use the 3D structure of the binding site as input to a model which predicts the ligand preferences of the binding site. The resulting predictions could then offer promising seeds and constraints in the chemical space search, based on the binding site structure. TarLig outperforms standard models by introducing a data-alignment and augmentation technique. The recent popularity of Volumetric 3DCNN pipelines in structural bioinformatics suggests that this extra step could help a wide range of methods to improve their results with minimal modifications

    Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities

    Full text link
    New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field

    Continuum directions for supervised dimension reduction

    Full text link
    Dimension reduction of multivariate data supervised by auxiliary information is considered. A series of basis for dimension reduction is obtained as minimizers of a novel criterion. The proposed method is akin to continuum regression, and the resulting basis is called continuum directions. With a presence of binary supervision data, these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. High-dimensional asymptotic studies of continuum directions for binary supervision reveal several interesting facts. The conditions under which the sample continuum directions are inconsistent, but their classification performance is good, are specified. While the proposed method can be directly used for binary and multi-category classification, its generalizations to incorporate any form of auxiliary data are also presented. The proposed method enjoys fast computation, and the performance is better or on par with more computer-intensive alternatives

    Deep Neural Network for Analysis of DNA Methylation Data

    Full text link
    Many researches demonstrated that the DNA methylation, which occurs in the context of a CpG, has strong correlation with diseases, including cancer. There is a strong interest in analyzing the DNA methylation data to find how to distinguish different subtypes of the tumor. However, the conventional statistical methods are not suitable for analyzing the highly dimensional DNA methylation data with bounded support. In order to explicitly capture the properties of the data, we design a deep neural network, which composes of several stacked binary restricted Boltzmann machines, to learn the low dimensional deep features of the DNA methylation data. Experiments show these features perform best in breast cancer DNA methylation data cluster analysis, comparing with some state-of-the-art methods.Comment: Techinical Repor

    Network-based protein structural classification

    Full text link
    Experimental determination of protein function is resource-consuming. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help, by allowing for determining structural classes of currently unclassified proteins based on their features, and then relying on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct 3-dimensional (3D) structure-based protein features. In contrast, we first model 3D structures of proteins as protein structure networks (PSNs). Then, we use network-based features for PSC. We propose the use of graphlets, state-of-the-art features in many research areas of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of ~9,400 CATH and ~12,800 SCOP protein domains (spanning 36 PSN sets), our proposed approaches are superior to existing PSC approaches in terms of accuracy, with comparable running time

    Persistent-Homology-based Machine Learning and its Applications -- A Survey

    Full text link
    A suitable feature representation that can both preserve the data intrinsic information and reduce data complexity and dimensionality is key to the performance of machine learning models. Deeply rooted in algebraic topology, persistent homology (PH) provides a delicate balance between data simplification and intrinsic structure characterization, and has been applied to various areas successfully. However, the combination of PH and machine learning has been hindered greatly by three challenges, namely topological representation of data, PH-based distance measurements or metrics, and PH-based feature representation. With the development of topological data analysis, progresses have been made on all these three problems, but widely scattered in different literatures. In this paper, we provide a systematical review of PH and PH-based supervised and unsupervised models from a computational perspective. Our emphasizes are the recent development of mathematical models and tools, including PH softwares and PH-based functions, feature representations, kernels, and similarity models. Essentially, this paper can work as a roadmap for the practical application of PH-based machine learning tools. Further, we consider different topological feature representations in different machine learning models, and investigate their impacts on the protein secondary structure classification.Comment: 42 pages; 6 figures; 9 table
    • …
    corecore