
    ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

    <div><p>The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexity. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at <a href="http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html" target="_blank">http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html</a>.</p></div>

    (a) A toy example of a PBP tree and (b) its corresponding partitioning of a dataset and a space.

    <p>The colors indicate different levels of nodes and their corresponding hyper-spheres. The leaf nodes are omitted in the tree. When searching for the nearest neighbor of a point (large red dot), only a small number of sibling hyper-spheres (filled circles) need to be explored.</p>
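The pruning idea in this caption can be illustrated with a minimal sketch. It is not the ESPRIT-Forest implementation: it assumes a single level of hyper-spheres rather than a full PBP tree, uses Euclidean points instead of sequence distances, and relies on the triangle inequality holding for the chosen distance. The names `build_balls` and `nearest_neighbor` are hypothetical.

```python
import math


def build_balls(points, centers):
    """Assign each point to its closest center and record each ball's radius.

    Assumption: one flat level of hyper-spheres, not the multi-level tree
    described in the paper.
    """
    balls = [{"center": c, "radius": 0.0, "members": []} for c in centers]
    for p in points:
        ball = min(balls, key=lambda b: math.dist(p, b["center"]))
        ball["members"].append(p)
        ball["radius"] = max(ball["radius"], math.dist(p, ball["center"]))
    return balls


def nearest_neighbor(query, balls):
    """Return (nearest point, its distance, number of balls explored)."""
    best, best_d, explored = None, math.inf, 0
    # By the triangle inequality, no member of a ball can be closer to the
    # query than d(query, center) - radius, so that value is a lower bound.
    ranked = sorted(
        (max(0.0, math.dist(query, b["center"]) - b["radius"]), i)
        for i, b in enumerate(balls)
    )
    for bound, i in ranked:
        if bound >= best_d:
            break  # every remaining ball is at least this far away
        explored += 1
        for p in balls[i]["members"]:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, explored


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
balls = build_balls(points, centers=[(0, 0), (10, 10)])
neighbor, distance, explored = nearest_neighbor((0.9, 0.1), balls)
```

In this toy run only the ball around `(0, 0)` has to be scanned, because the lower bound for the other ball already exceeds the best distance found.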

    Execution time of ESPRIT-Forest performed on HMP and ELDERMET datasets.

    <p>The clustering termination criterion was set to 85% sequence similarity. ESPRIT preproc was used to remove low-quality reads before clustering analysis.</p>

    Execution time of four clustering methods performed on a subset of the human gut microbiome data.

    <p>ESPRIT-Forest and HPC-Clust (including the INFERNAL alignment method it used) were executed on 32 CPU cores.</p>

    A toy example illustrating single-point (left) and multiple-point (right) hierarchical clustering by parallelizing uncorrelated operations.

    <p>Each filled box represents a sequence and each circle represents a cluster-merging step. The numbers in the circles denote the order of merging operations. The merging order may change when switching from single-point to multiple-point clustering.</p>

    Comparison of clustering quality of ESPRIT-Forest, ESPRIT-Tree and UPARSE performed on benchmark datasets using the species annotation as ground truth.

    <p>(a) NMI scores calculated on human gut V2 dataset. (b) NMI scores calculated on human gut V6 dataset. (c) NMI scores calculated on ELDERMET dataset. (d) NMI scores calculated on HMP Saliva dataset.</p>
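For readers unfamiliar with the NMI (normalized mutual information) scores reported in these panels, the following is a small sketch of how such a score can be computed from ground-truth species labels and cluster assignments. The normalization used here (geometric mean of the two entropies) is an assumption, since the listing does not say which variant the authors used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items.

    Assumption: normalized by sqrt(H(true) * H(pred)); other variants divide
    by the arithmetic mean, the minimum, or the maximum of the entropies.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum over observed (true, predicted) label pairs.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)


species = ["a", "a", "b", "b"]
perfect = nmi(species, [1, 1, 2, 2])      # clusters match species exactly
unrelated = nmi(species, [1, 2, 1, 2])    # clusters carry no species signal
```

A score near 1 means the clusters reproduce the annotation, and a score near 0 means the two labelings are statistically independent.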

    Execution time of ESPRIT-Forest performed on a human gut microbiome dataset using a varying number of CPU cores ranging from 1 to 128.

    <p>The clustering termination criterion was set to 85% sequence similarity. For comparison, the execution time of ESPRIT-Tree is also reported.</p>

    Intestinal Microbial Ecology and Environmental Factors Affecting Necrotizing Enterocolitis

    <div><p>Necrotizing enterocolitis (NEC) is the most devastating intestinal disease affecting preterm infants. In addition to being associated with short-term mortality and morbidity, survivors are left with significant long-term sequelae. The cost of caring for these infants is high. Epidemiologic evidence suggests that use of antibiotics and type of feeding may cause an intestinal dysbiosis important in the pathogenesis of NEC, but the contribution of specific infectious agents is poorly understood. Fecal samples from preterm infants ≤32 weeks gestation were analyzed using 16S rRNA based methods at 2, 1, and 0 weeks prior to diagnosis of NEC in 18 NEC cases and 35 controls. Environmental factors such as antibiotic usage, feeding type (human milk versus formula) and location of neonatal intensive care unit (NICU) were also evaluated. Microbiota composition differed between the three neonatal units, where we also observed differences in antibiotic usage. In NEC cases we observed a higher proportion of Proteobacteria (61%) two weeks before diagnosis and of Actinobacteria (3%) one week before diagnosis compared to controls (19% and 0.4%, respectively), and lower Bifidobacteria counts and Bacteroidetes proportions in the weeks before NEC diagnosis. In the first fecal samples obtained during week one of life we detected a novel signature sequence, distinct from but matching closest to <i>Klebsiella pneumoniae</i>, that was strongly associated with NEC development later in life. Infants who develop NEC exhibit a different pattern of microbial colonization compared to controls. Antibiotic usage correlated with these differences and, combined with type of feeding, likely plays a critical role in the development of NEC.</p></div>