Deterministic Selection on the Mesh and Hypercube
In this paper we present efficient deterministic algorithms for selection on mesh-connected computers (referred to as the mesh from here on) and the hypercube. Our algorithm on the mesh runs in time $O((n/p) \log\log p + \sqrt{p} \log n)$, where $n$ is the input size and $p$ is the number of processors. The time bound is significantly better than that of the best existing algorithms when $n$ is large. The run time of our algorithm on the hypercube is $O((n/p) \log\log p + T_p^s \log n)$, where $T_p^s$ is the time needed to sort $p$ elements on a $p$-node hypercube. In fact, the same algorithm runs on any network in time $O((n/p) \log\log p + T_p^s \log n)$, where $T_p^s$ is the time needed for sorting $p$ keys using $p$ processors (assuming that broadcast and prefix computations take time less than or equal to $T_p^s$).
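As a point of reference for what "deterministic selection" computes, the following is a minimal sequential sketch of the classic median-of-medians method. It is not the mesh or hypercube algorithm of the paper, which distributes the work over $p$ processors; the function name `select` and the group size of five are illustrative choices, not taken from the abstract.

```python
def select(values, k):
    """Return the k-th smallest element (0-based) of values.

    Sequential median-of-medians selection: deterministic, worst-case
    linear time. Illustrative only; not the parallel algorithm above.
    """
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of at most five elements.
        medians = []
        for i in range(0, len(items), 5):
            group = sorted(items[i:i + 5])
            medians.append(group[len(group) // 2])
        # The median of the medians is a provably good pivot.
        pivot = select(medians, len(medians) // 2)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            items = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            k -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]
```

For example, `select(data, len(data) // 2)` returns an upper median of `data` without fully sorting it.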
GRASP2: Fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data
Background
A crucial task in metagenomic analysis is to annotate the function and taxonomy of the sequencing reads generated from a microbiome sample. In general, the reads can either be assembled into contigs and searched against reference databases, or individually searched without assembly. The first approach may suffer from fragmentary and incomplete assembly, while the second is hampered by the reduced functional signal contained in the short reads. To tackle these issues, we have previously developed GRASP (Guided Reference-based Assembly of Short Peptides), which accepts a reference protein sequence as input and aims to assemble its homologs from a database containing fragmentary protein sequences. In addition to a gene-centric assembly tool, GRASP also serves as a homolog search tool when using the assembled protein sequences as templates to recruit reads. GRASP has a significantly improved recall rate (60–80% vs. 30–40%) compared to other homolog search tools such as BLAST. However, GRASP is both time- and space-consuming. Subsequently, we developed GRASPx, which is 30X faster than GRASP. Here, we present a completely redesigned algorithm, GRASP2, for this computational problem.
Results
GRASP2 utilizes Burrows-Wheeler Transformation (BWT) and FM-index to perform assembly graph generation, and reduces the search space by employing a fast ungapped alignment strategy as a filter. GRASP2 also explicitly generates candidate paths prior to alignment, which effectively uncouples the iterative access of the assembly graph and alignment matrix. This strategy makes the execution of the program more efficient under current computer architecture, and contributes to GRASP2's speedup.
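To illustrate the data structures named above, here is a toy sketch of a BWT and FM-index with backward search for counting exact pattern occurrences. It is a textbook construction under simplifying assumptions (naive suffix sorting, a full occurrence table, a `$` sentinel), not GRASP2's C++ implementation; all function names are hypothetical.

```python
from collections import Counter

def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) of text + '$' using naive suffix sorting."""
    s = text + "$"  # '$' is a sentinel that sorts before letters
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is '$' when i == 0
    return bwt, sa

def build_fm_index(bwt):
    """Build the C table and the occurrence table used by backward search."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of symbols strictly smaller than ch
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return c_table, occ

def count_occurrences(pattern, c_table, occ, n):
    """Count exact matches of pattern by narrowing a suffix-array interval."""
    lo, hi = 0, n  # half-open interval [lo, hi) over the suffix array
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

A production index would sample the occurrence table and build the suffix array in linear time; the full table here only keeps the sketch short.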
GRASP2 is 8-fold faster than GRASPx (and 250-fold faster than GRASP) and uses 8-fold less memory while maintaining the original high recall rate of GRASP. GRASP2 reaches ~80% recall rate compared to that of ~40% generated by BLAST, both at a high precision level (>95%). With such a high performance, GRASP2 is only ~3X slower than BLASTP.
Conclusion
GRASP2 is a high-performance gene-centric and homolog search tool with significant speedup compared to its predecessors, which makes GRASP2 a useful tool for metagenomics data analysis. GRASP2 is implemented in C++ and is freely available from http://www.sourceforge.net/projects/grasp2.
Funding: University of Kansas New Faculty General Research Fund allocation #2302114; National Science Foundation EPSCoR First Awards in Microbiome Researc
GRASPx: efficient homolog-search of short peptide metagenome database through simultaneous alignment and assembly
Background
Metagenomics is a cultivation-independent approach that enables the study of the genomic composition of microbes present in an environment. Metagenomic samples are routinely sequenced using next-generation sequencing technologies that generate short nucleotide reads. Proteins identified from these reads are mostly of partial length. On the other hand, de novo assembly of a large metagenomic dataset is computationally demanding and the assembled contigs are often fragmented, resulting in the identification of protein sequences that are also of partial length and incomplete. Annotation of an incomplete protein sequence often proceeds by identifying its homologs in a database of reference sequences. Identifying the homologs of incomplete sequences is a challenge and can result in substandard annotation of proteins from metagenomic datasets. To address this problem, we recently developed a homology detection algorithm named GRASP (Guided Reference-based Assembly of Short Peptides) that identifies the homologs of a given reference protein sequence in a database of short peptide metagenomic sequences. GRASP was developed to implement a simultaneous alignment and assembly algorithm for annotation of short peptides identified on metagenomic reads. The program achieves significantly improved recall rate at the cost of computational efficiency. In this article, we adopted three techniques to speed up the original version of GRASP, including the pre-construction of extension links, local assembly of individual seeds, and the implementation of query-level parallelism.
Results
The resulting new program, GRASPx, achieves >30X speedup compared to its predecessor GRASP. At the same time, we show that the performance of GRASPx is consistent with that of GRASP, and that both of them significantly outperform other popular homology-search tools including the BLAST and FASTA suites. GRASPx was also applied to a human saliva metagenome dataset and shows superior performance for both recall and precision rates.
Conclusions
In this article we present GRASPx, a fast and accurate homology-search program implementing a simultaneous alignment and assembly framework. GRASPx can be used for more comprehensive and accurate annotation of short peptides. GRASPx is freely available at http://graspx.sourceforge.net/.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1119-1) contains supplementary material, which is available to authorized users
CD-HIT Suite: a web server for clustering and comparing biological sequences
Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. In order to further assist the CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels
Diet and Feeding Pattern Affect the Diurnal Dynamics of the Gut Microbiome
Summary: The gut microbiome and daily feeding/fasting cycle influence host metabolism and contribute to obesity and metabolic diseases. However, fundamental characteristics of this relationship between the feeding/fasting cycle and the gut microbiome are unknown. Our studies show that the gut microbiome is highly dynamic, exhibiting daily cyclical fluctuations in composition. Diet-induced obesity dampens the daily feeding/fasting rhythm and diminishes many of these cyclical fluctuations. Time-restricted feeding (TRF), in which feeding is consolidated to the nocturnal phase, partially restores these cyclical fluctuations. Furthermore, TRF, which protects against obesity and metabolic diseases, affects bacteria shown to influence host metabolism. Cyclical changes in the gut microbiome from feeding/fasting rhythms contribute to the diversity of gut microflora and likely represent a mechanism by which the gut microbiome affects host metabolism. Thus, feeding pattern and time of harvest, in addition to diet, are important parameters when assessing the microbiome's contribution to host metabolism
Microbial diversity in individuals and their household contacts following typical antibiotic courses.
Background
Antibiotics are a mainstay of treatment for bacterial infections worldwide, yet the effects of typical antibiotic prescriptions on human indigenous microbiota have not been thoroughly evaluated. We examined the effects of the two most commonly prescribed antibiotics (amoxicillin and azithromycin) in the USA to discern whether short-term antibiotic courses may have prolonged effects on human microbiota.
Results
We sampled the feces, saliva, and skin specimens from a cohort of unrelated, cohabitating individuals over 6 months. An individual in each household was given an antibiotic, and the other a placebo to discern antibiotic impacts on microbiota, as well as determine whether antibiotic use might reshape the microbiota of each household. We observed household-specific patterns of microbiota on each body surface, which persevered despite antibiotic perturbations. While the gut microbiota within an individual became more dissimilar over time, there was no evidence that the use of antibiotics accelerated this process when compared to household members. There was a significant change in microbiota diversity in the gut and mouth in response to antibiotics, but analogous patterns were not observed on the skin. Those who received 7 days of amoxicillin generally had greater reductions in diversity compared to those who received 3 days, in contrast to those who received azithromycin.
Conclusions
As few as 3 days of treatment with the most commonly prescribed antibiotics can result in sustained reductions in microbiota diversity, which could have implications for the maintenance of human health and resilience to disease
Metagenome and Metatranscriptome Analyses Using Protein Family Profiles
Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hamper downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. 
Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx
Integrated de novo gene prediction and peptide assembly of metagenomic sequencing data
Metagenomics is the study of all genomic content contained in given microbial communities. Metagenomic functional analysis aims to quantify protein families and reconstruct metabolic pathways from the metagenome. It plays a central role in understanding the interaction between the microbial community and its host or environment. De novo functional analysis, which allows the discovery of novel protein families, remains challenging for high-complexity communities. There are currently three main approaches for recovering novel genes or proteins: de novo nucleotide assembly, gene calling and peptide assembly. Unfortunately, their information dependency has been overlooked, and each has been formulated as an independent problem. In this work, we develop a sophisticated workflow called integrated Metagenomic Protein Predictor (iMPP), which leverages the information dependencies for better de novo functional analysis. iMPP contains three novel modules: a hybrid assembly graph generation module, a graph-based gene calling module, and a peptide assembly-based refinement module. iMPP significantly improved the existing gene calling sensitivity on unassembled metagenomic reads, achieving a 92–97% recall rate at a high precision level (>85%). iMPP further allowed for more sensitive and accurate peptide assembly, recovering more reference proteins and delivering more hypothetical protein sequences. The high performance of iMPP can provide a more comprehensive and unbiased view of the microbial communities under investigation. iMPP is freely available from https://github.com/Sirisha-t/iMPP
A versatile palindromic amphipathic repeat coding sequence horizontally distributed among diverse bacterial and eucaryotic microbes
Background
Intragenic tandem repeats occur throughout all domains of life and impart functional and structural variability to diverse translation products. Repeat proteins confer distinctive surface phenotypes to many unicellular organisms, including those with minimal genomes such as the wall-less bacterial monoderms, Mollicutes. One such repeat pattern in this clade is distributed in a manner suggesting its exchange by horizontal gene transfer (HGT). Expanding genome sequence databases reveal the pattern in a widening range of bacteria, and recently among eucaryotic microbes. We examined the genomic flux and consequences of the motif by determining its distribution, predicted structural features and association with membrane-targeted proteins.
Results
Using a refined hidden Markov model, we document a 25-residue protein sequence motif tandemly arrayed in variable-number repeats in ORFs lacking assigned functions. It appears sporadically in unicellular microbes from disparate bacterial and eucaryotic clades, representing diverse lifestyles and ecological niches that include host parasitic, marine and extreme environments. Tracts of the repeats predict a malleable configuration of recurring domains, with conserved hydrophobic residues forming an amphipathic secondary structure in which hydrophilic residues endow extensive sequence variation. Many ORFs with these domains also have membrane-targeting sequences that predict assorted topologies; others may comprise reservoirs of sequence variants. We demonstrate expressed variants among surface lipoproteins that distinguish closely related animal pathogens belonging to a subgroup of the Mollicutes. DNA sequences encoding the tandem domains display dyad symmetry. Moreover, in some taxa the domains occur in ORFs selectively associated with mobile elements. These features, a punctate phylogenetic distribution, and different patterns of dispersal in genomes of related taxa, suggest that the repeat may be disseminated by HGT and intra-genomic shuffling.
Conclusions
We describe novel features of PARCELs (Palindromic Amphipathic Repeat Coding ELements), a set of widely distributed repeat protein domains and coding sequences that were likely acquired through HGT by diverse unicellular microbes, further mobilized and diversified within genomes, and co-opted for expression in the membrane proteome of some taxa. Disseminated by multiple gene-centric vehicles, ORFs harboring these elements enhance accessory gene pools as part of the "mobilome" connecting genomes of various clades, in taxa sharing common niches.