17 research outputs found

    Sensitive protein alignments at tree-of-life scale using DIAMOND


    Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD

    Publisher Copyright: © 2023, The Author(s). Sequence alignments are the foundations of life science research, but most innovation so far focuses on optimal alignments, while information derived from suboptimal solutions is ignored. We argue that one optimal alignment per pairwise sequence comparison is a reasonable approximation when dealing with very similar sequences but is insufficient when exploring the biodiversity of the protein universe at tree-of-life scale. To overcome this limitation, we introduce pairwise alignment-safety to uncover the amino acid positions robustly shared across all suboptimal solutions. We implement EMERALD, a software library for alignment-safety inference, and apply it to 400k sequences from the SwissProt database. Peer reviewed.

    Annotated bacterial chromosomes from frame-shift-corrected long-read metagenomic data

    Background: Short-read sequencing technologies have long been the work-horse of microbiome analysis. Continuing technological advances are making the application of long-read sequencing to metagenomic samples increasingly feasible. Results: We demonstrate that whole bacterial chromosomes can be obtained from an enriched community, by application of MinION sequencing to a sample from an EBPR bioreactor, producing 6 Gb of sequence that assembles into multiple closed bacterial chromosomes. We provide a simple pipeline for processing such data, which includes a new approach to correcting erroneous frame-shifts. Conclusions: Advances in long-read sequencing technology and corresponding algorithms will allow the routine extraction of whole chromosomes from environmental samples, providing a more detailed picture of individual members of a microbiome.

    Petabase-scale sequence alignment catalyses viral discovery

    Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, now exceeding multiple petabases and growing exponentially [1, 2]. We developed a cloud computing infrastructure, Serratus, to enable ultra-high throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA dependent RNA polymerase, identifying well over 10^5 novel RNA viruses and thereby expanding the number of known species by roughly an order of magnitude. We characterised novel viruses related to coronaviruses and to hepatitis δ virus, respectively, and explored their environmental reservoirs. To catalyse a new era of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.