44 research outputs found

    Inverse Protein Folding Problem via Quadratic Programming

    This paper presents a method for reconstructing the primary structure of a protein that folds into a given geometrical shape. The method predicts the primary structure, that is, the linear sequence of amino acids in the polypeptide chain, from the tertiary structure of the molecule. Unknown amino acids are determined according to the principle of energy minimization. The study formulates the inverse folding problem as a quadratic optimization problem and uses different relaxation techniques to reduce it to a convex optimization problem. Computational experiments compare the quality of these approaches on real protein structures.

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
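The idea of encoding all-pairs Jaccard similarity as a sparse matrix product can be illustrated with a minimal single-machine sketch. It is not the SimilarityAtScale implementation and ignores distribution and communication entirely. The function name `jaccard_all_pairs` and the toy k-mer sets are assumptions made for illustration. With an indicator matrix A (rows are elements, columns are samples), the product B = AᵀA holds the intersection sizes, and the union size follows as |X_i| + |X_j| - B[i, j].

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def jaccard_all_pairs(samples):
    """Return {(i, j): Jaccard(samples[i], samples[j])} for all i <= j.

    `samples` is a list of sets (hypothetical input format for this sketch).
    """
    # Sparse indicator matrix A stored row-wise: element -> sample indices.
    rows = defaultdict(list)
    for j, sample in enumerate(samples):
        for element in sample:
            rows[element].append(j)

    # Accumulate B = A^T A one sparse row at a time; B[i, j] = |X_i ∩ X_j|.
    intersections = defaultdict(int)
    for cols in rows.values():
        for i, j in combinations_with_replacement(sorted(cols), 2):
            intersections[(i, j)] += 1

    sizes = [len(sample) for sample in samples]
    result = {}
    for i in range(len(samples)):
        for j in range(i, len(samples)):
            shared = intersections.get((i, j), 0)
            union = sizes[i] + sizes[j] - shared
            # Convention assumed here: two empty sets have similarity 1.0.
            result[(i, j)] = shared / union if union else 1.0
    return result


# Toy example: each sample is a set of 3-mers.
toy = [{"ACG", "CGT", "GTA"}, {"CGT", "GTA", "TAC"}, {"AAA"}]
similarities = jaccard_all_pairs(toy)
```

Only elements that actually occur contribute work, which is the reason a sparse representation is attractive when the universe of possible k-mers is far larger than any one sample.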

    Assessment of chemical-crosslink-assisted protein structure modeling in CASP13

    With the advance of experimental procedures, obtaining chemical crosslinking information is becoming a fast and routine practice. Information on crosslinks can greatly enhance the accuracy of protein structure modeling. Here, we review the current state of the art in modeling protein structures with the assistance of experimentally determined chemical crosslinks within the framework of the 13th meeting of Critical Assessment of Structure Prediction approaches. This largest-to-date blind assessment reveals the benefits of data assistance in difficult-to-model protein structure prediction cases. However, in a broader context, it also suggests that, given the unprecedented advance in contact prediction accuracy in recent years, experimental crosslinks will be useful only if their specificity and accuracy are further improved and they are better integrated into computational workflows.

    Biosynthetic potential of the global ocean microbiome

    Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds. However, studying this diversity to identify genomic pathways for the synthesis of such compounds and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments

    Biosynthetic potential of the global ocean microbiome

    8 pages, 4 figures, supplementary information https://doi.org/10.1038/s41586-022-04862-3. This Article is contribution number 130 of Tara Oceans.
    Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds. However, studying this diversity to identify genomic pathways for the synthesis of such compounds and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters (‘Candidatus Eudoremicrobiaceae’) that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.
    This work was supported by funding from the ETH and the Helmut Horten Foundation; the Swiss National Science Foundation (SNSF) through project grants 205321_184955 to S.S., 205320_185077 to J.P. and the NCCR Microbiomes (51NF40_180575) to S.S.; by the Gordon and Betty Moore Foundation (https://doi.org/10.37807/GBMF9204) and the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 101000392 (MARBLES) to J.P.; by an ETH research grant ETH-21 18-2 to J.P.; and by the Peter and Traudl Engelhorn Foundation and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 897571 to C.C.F. S.L.R. was supported by an ETH Zurich postdoctoral fellowship 20-1 FEL-07. M.L., L.M.C. and G.Z. were supported by EMBL Core Funding and the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft, project no. 395357507, SFB 1371 to G.Z.). M.B.S. was supported by the NSF grant OCE#1829831. C.B. was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement Diatomic, no. 835067). S.G.A. was supported by the Spanish Ministry of Economy and Competitiveness (PID2020-116489RB-I00). M.K. and H.M. were funded by the SNSF grant 407540_167331 as part of the Swiss National Research Programme 75 ‘Big Data’. M.K., H.M. and A.K. are also partially funded by ETH core funding (to G. Rätsch). With the institutional support of the ‘Severo Ochoa Centre of Excellence’ accreditation (CEX2019-000928-S). Peer reviewed.

    Scalable Annotated Genome Graphs for Representing Sequence Data

    Technological advances made over the last decades in sequencing technologies have led to continuous improvements of quality and ever-decreasing costs of sequencing. All this resulted in a steady growth of the amount of biological sequences produced by medical institutions and the general scientific community. Yet, the vast majority of this data is stored in data repositories that do not provide means for large-scale analysis and search in this trove. For example, the European Nucleotide Archive (ENA) and NCBI Sequence Read Archive (SRA) currently store over 37 and 72 Petabases of sequences, respectively. However, to answer even such a simple question as 'has this sequence, variant, or pathogen been observed anywhere before?' with a moderately large query would require extensive computations that cost over a thousand US dollars with a typical Cloud Computing provider. In this dissertation, we consider the problem of indexing large collections of biological sequences. We design compressed data structures and apply these to build a tool called MetaGraph, which aggregates large volumes of sequence data and makes it searchable. As a result, life science researchers and other communities get easy access to the sequence data for investigation, which is essential for making discoveries. To demonstrate the capacity of MetaGraph, we have indexed a significant portion of all publicly available sequencing samples from the SRA. We have also indexed a number of other diverse and biologically relevant data sets, from reference genomes to raw metagenomic reads. In total, we processed 4.6 Petabases of sequences, which far exceeds the pivotal figure of one Petabase and, at last, makes this data fully and efficiently searchable by sequence. The resulting indexes form a valuable community resource, as they succinctly summarize large raw-sequence data sets while supporting various queries against them. 
    We provide these indexes as a public resource with a subset of them hosted online as a service for interactive search. The size and the diversity of the data we have processed prove the feasibility of keeping all existing sequence archives indexed in a general manner and making them searchable, similarly to how Google indexes web pages and the information extracted from them.
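The kind of query described above ("has this sequence been observed anywhere before?") can be illustrated with a toy annotated k-mer index. This is a minimal sketch under strong assumptions: it uses plain Python dictionaries rather than the compressed data structures of MetaGraph, and the class name `KmerIndex`, its methods, and the sample labels are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerIndex:
    """Toy index mapping each k-mer to the labels of samples containing it."""

    def __init__(self, k):
        self.k = k
        self.labels = defaultdict(set)

    def add(self, label, seq):
        """Annotate every k-mer of seq with the given sample label."""
        for kmer in kmers(seq, self.k):
            self.labels[kmer].add(label)

    def query(self, seq):
        """Return the labels of samples that contain every k-mer of seq."""
        query_kmers = kmers(seq, self.k)
        if not query_kmers:
            return set()  # query shorter than k: nothing to look up
        hits = None
        for kmer in query_kmers:
            found = self.labels.get(kmer, set())
            hits = set(found) if hits is None else hits & found
            if not hits:
                return set()  # early exit: no sample matches all k-mers so far
        return hits


index = KmerIndex(k=3)
index.add("sample_1", "ACGTAC")
index.add("sample_2", "TTTACG")
matches = index.query("GTAC")
```

Matching on shared k-mers is an approximation of exact sequence containment: a sample can hold all k-mers of a query without holding the query as one contiguous stretch.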

    Smooth orientation-dependent scoring function for coarse-grained protein quality assessment

    Motivation: Protein quality assessment (QA) is a crucial element of protein structure prediction, a fundamental and yet open problem in structural bioinformatics. QA aims at ranking predicted protein models to select the best candidates. The assessment can be performed based either on a single model or on a consensus derived from an ensemble of models. The latter strategy can yield very high performance but substantially depends on the pool of available candidate models, which limits its applicability. Hence, single-model QA methods remain an important research target, also because they can assist the sampling of candidate models. Results: We present a novel single-model QA method called SBROD. The SBROD (Smooth Backbone-Reliant Orientation-Dependent) method uses only the backbone protein conformation, and hence it can be applied to scoring coarse-grained protein models. The proposed method deduces its scoring function from a training set of protein models. The SBROD scoring function is composed of four terms related to different structural features: residue-residue orientations, contacts between backbone atoms, hydrogen bonding, and solvent-solute interactions. It is smooth with respect to atomic coordinates and thus is potentially applicable to continuous gradient-based optimization of protein conformations. Furthermore, it can also be used for coarse-grained protein modeling and computational protein design. SBROD achieved performance similar to that of state-of-the-art single-model QA methods on diverse datasets (CASP11, CASP12, and MOULDER).

    Metannot: A succinct data structure for compression of colors in dynamic de Bruijn graphs

    Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for and novel application of the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche