5,097 research outputs found

    Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

    Full text link
    This work introduces a number of algebraic topology approaches, such as multicomponent persistent homology, multi-level persistent homology and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. Multicomponent persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for chemical and biological problems. Extensive numerical experiments involving more than 4,000 protein-ligand complexes from the PDBBind database and near 100,000 ligands and decoys in the DUD database are performed to test respectively the scoring power and the virtual screening power of the proposed topological approaches. It is demonstrated that the present approaches outperform the modern machine learning based methods in protein-ligand binding affinity predictions and ligand-decoy discrimination

    Discrete Elastic Inner Vector Spaces with Application in Time Series and Sequence Mining

    Get PDF
    This paper proposes a framework dedicated to the construction of what we call discrete elastic inner product allowing one to embed sets of non-uniformly sampled multivariate time series or sequences of varying lengths into inner product space structures. This framework is based on a recursive definition that covers the case of multiple embedded time elastic dimensions. We prove that such inner products exist in our general framework and show how a simple instance of this inner product class operates on some prospective applications, while generalizing the Euclidean inner product. Classification experimentations on time series and symbolic sequences datasets demonstrate the benefits that we can expect by embedding time series or sequences into elastic inner spaces rather than into classical Euclidean spaces. These experiments show good accuracy when compared to the euclidean distance or even dynamic programming algorithms while maintaining a linear algorithmic complexity at exploitation stage, although a quadratic indexing phase beforehand is required.Comment: arXiv admin note: substantial text overlap with arXiv:1101.431

    Hidden Markov Models for Gene Sequence Classification: Classifying the VSG genes in the Trypanosoma brucei Genome

    Full text link
    The article presents an application of Hidden Markov Models (HMMs) for pattern recognition on genome sequences. We apply HMM for identifying genes encoding the Variant Surface Glycoprotein (VSG) in the genomes of Trypanosoma brucei (T. brucei) and other African trypanosomes. These are parasitic protozoa causative agents of sleeping sickness and several diseases in domestic and wild animals. These parasites have a peculiar strategy to evade the host's immune system that consists in periodically changing their predominant cellular surface protein (VSG). The motivation for using patterns recognition methods to identify these genes, instead of traditional homology based ones, is that the levels of sequence identity (amino acid and DNA sequence) amongst these genes is often below of what is considered reliable in these methods. Among pattern recognition approaches, HMM are particularly suitable to tackle this problem because they can handle more naturally the determination of gene edges. We evaluate the performance of the model using different number of states in the Markov model, as well as several performance metrics. The model is applied using public genomic data. Our empirical results show that the VSG genes on T. brucei can be safely identified (high sensitivity and low rate of false positives) using HMM.Comment: Accepted article in July, 2015 in Pattern Analysis and Applications, Springer. The article contains 23 pages, 4 figures, 8 tables and 51 reference

    The distance-profile representation and its application to detection of distantly related protein families

    Get PDF
    BACKGROUND: Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space and the computed scores are often noisy and frequently fail to recognize distantly related protein families. RESULTS: We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins measured by the statistical difference between their corresponding probability distributions. CONCLUSION: Using several performance measures we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSIBLAST and FUGUE

    Recovering complete and draft population genomes from metagenome datasets.

    Get PDF
    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improves the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution

    Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Many proposed statistical measures can efficiently compare protein sequence to further infer protein structure, function and evolutionary information. They share the same idea of using <it>k</it>-word frequencies of protein sequences. Given a protein sequence, the information on its related protein sequences hasn't been used for protein sequence comparison until now. This paper proposed a scheme to construct protein 'sequence space' which was associated with protein sequences related to the given protein, and the performances of statistical measures were compared when they explored the information on protein 'sequence space' or not. This paper also presented two statistical measures for protein: <it>gre.k </it>(generalized relative entropy) and <it>gsm.k </it>(gapped similarity measure).</p> <p>Results</p> <p>We tested statistical measures based on protein 'sequence space' or not with three data sets. This not only offers the systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparison of statistical measures based on protein sequence. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second set of the experiments aims at assessing how well our measure does in phylogenetic analysis. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of protein 'sequence space' and statistical measures were obtained.</p> <p>Conclusion</p> <p>Alignment-based measures have a clear advantage when the data is high redundant. The more efficient statistical measure is the novel <it>gsm.k </it>introduced by this article, the <it>cos.k </it>followed. When the data becomes less redundant, <it>gre.k </it>proposed by us achieves a better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures achieve improvement by exploring the information on 'sequence space' as word's length increases, especially for less redundant data. The reasonable results of phylogenetic analysis confirm that <it>Gdis.k </it>based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploring the information on 'sequence space' is a promising way to improve the abilities of statistical measures for protein comparison.</p
    • …
    corecore