ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
<div><p>The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric-based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at <a href="http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html" target="_blank">http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html</a>.</p></div>
(a) A toy example of a PBP tree and (b) its corresponding partitioning of a dataset and a space.
<p>The colors indicate different levels of nodes and their corresponding hyper-spheres. The leaf nodes are omitted in the tree. When searching for the nearest neighbor of a point (large red dot), only a small number of sibling hyper-spheres (filled circles) need to be explored.</p>
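<p>The pruning idea in this caption can be illustrated with a minimal sketch. It is not the ESPRIT-Forest implementation: it assumes a single level of hyper-spheres instead of a full PBP tree, uses Euclidean distance on 2-D points as a stand-in for the sequence distance, and all function names (<code>build_balls</code>, <code>nearest_neighbor</code>) are hypothetical. A sphere is skipped whenever the triangle inequality shows that none of its members can beat the best neighbor found so far.</p>

```python
import math


def dist(a, b):
    # Assumption: Euclidean distance stands in for the sequence distance.
    return math.dist(a, b)


def build_balls(points, centers):
    """Assign each point to its closest center and track each ball's radius."""
    balls = [{"center": c, "radius": 0.0, "members": []} for c in centers]
    for p in points:
        ball = min(balls, key=lambda b: dist(p, b["center"]))
        ball["members"].append(p)
        ball["radius"] = max(ball["radius"], dist(p, ball["center"]))
    return balls


def nearest_neighbor(query, balls):
    """Return (neighbor, distance, number of balls actually explored)."""
    best, best_d, explored = None, math.inf, 0
    # Visit the closest balls first so the best distance shrinks early.
    for ball in sorted(balls, key=lambda b: dist(query, b["center"])):
        # Every member x satisfies d(query, x) >= d(query, center) - radius,
        # so a ball whose lower bound is not below best_d cannot help.
        if dist(query, ball["center"]) - ball["radius"] >= best_d:
            continue
        explored += 1
        for p in ball["members"]:
            d = dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, explored


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11),
          (20, 0), (21, 0), (20, 1)]
balls = build_balls(points, centers=[(0, 0), (10, 10), (20, 0)])
neighbor, distance, explored = nearest_neighbor((0.4, 0.3), balls)
```

<p>In this toy dataset only one of the three spheres has to be scanned. The bound relies on the triangle inequality, which a pseudo-metric on sequences may satisfy only approximately, so a real implementation would need to account for that.</p>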
Comparison of the overall frameworks of ESPRIT-Tree and ESPRIT-Forest.
<p>The parallel execution parts are underlined.</p>
Execution time of ESPRIT-Forest performed on HMP and ELDERMET datasets.
<p>The clustering termination criterion was set to 85% sequence similarity. ESPRIT preproc was used to remove low-quality reads before clustering analysis.</p>
Execution time of four clustering methods performed on a subset of the human gut microbiome data.
<p>ESPRIT-Forest and HPC-Clust (including the INFERNAL alignment method it used) were executed on 32 CPU cores.</p>
A toy example illustrating single-point (left) and multiple-point (right) hierarchical clustering by parallelizing uncorrelated operations.
<p>Each filled box represents a sequence and each circle represents a cluster-merging step. The number in each circle denotes the order of the merging operation. The merging order may change when switching from single-point to multiple-point clustering.</p>
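<p>A minimal sketch of merging several uncorrelated pairs per round follows. It does not reproduce the paper's multiple-pair merging criterion. As a swapped-in technique it merges reciprocal nearest neighbors (two clusters that are each other's closest cluster), which is one common way to find merges that cannot interfere with one another. The 1-D points, complete linkage, and all function names are assumptions made for illustration.</p>

```python
def linkage(a, b):
    # Assumption: complete linkage on 1-D points stands in for the
    # sequence-based cluster distance.
    return max(abs(x - y) for x in a for y in b)


def nearest(i, clusters):
    """Index of the cluster closest to cluster i."""
    others = (j for j in range(len(clusters)) if j != i)
    return min(others, key=lambda j: linkage(clusters[i], clusters[j]))


def merge_round(clusters, cutoff):
    """Merge every reciprocal-nearest-neighbor pair within the cutoff."""
    if len(clusters) < 2:
        return clusters, 0
    # Each nearest-neighbor lookup is independent of the others, so this
    # step is the natural candidate for running on multiple threads.
    nn = [nearest(i, clusters) for i in range(len(clusters))]
    merged, used, count = [], set(), 0
    for i, j in enumerate(nn):
        if i in used or j in used:
            continue
        if nn[j] == i and linkage(clusters[i], clusters[j]) <= cutoff:
            merged.append(sorted(clusters[i] + clusters[j]))
            used.update((i, j))
            count += 1
    merged.extend(c for k, c in enumerate(clusters) if k not in used)
    return merged, count


def cluster(points, cutoff):
    """Repeat merge rounds until no pair within the cutoff remains."""
    clusters, rounds = [[p] for p in points], 0
    while True:
        clusters, count = merge_round(clusters, cutoff)
        if count == 0:
            return sorted(clusters), rounds
        rounds += 1


points = [0.0, 1.0, 10.0, 11.0, 20.0]
clusters, rounds = cluster(points, cutoff=2.0)
```

<p>Here the pairs (0, 1) and (10, 11) are merged in the same round, whereas single-point clustering would need two sequential steps to do the same work.</p>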
Comparison of clustering quality of ESPRIT-Tree (red) and ESPRIT-Forest (blue) at various distance cut-offs on the human gut V2 dataset.
<p>The results of the two algorithms agree, with small variations caused by randomness in clustering.</p>
Comparison of clustering quality of ESPRIT-Forest, ESPRIT-Tree and UPARSE performed on benchmark datasets using the species annotation as ground truth.
<p>(a) NMI scores calculated on human gut V2 dataset. (b) NMI scores calculated on human gut V6 dataset. (c) NMI scores calculated on ELDERMET dataset. (d) NMI scores calculated on HMP Saliva dataset.</p>
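<p>For readers unfamiliar with the metric, normalized mutual information (NMI) between a clustering and ground-truth labels can be computed as sketched below. Several normalizations exist and the listing does not say which one the authors used. This sketch assumes normalization by the arithmetic mean of the two entropies, and the example labels are invented for illustration.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())


def nmi(a, b):
    # Assumption: normalize by the arithmetic mean of the entropies.
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 and hb == 0.0:
        return 1.0  # both labelings put everything in a single cluster
    return 2.0 * mutual_information(a, b) / (ha + hb)


species = ["A", "A", "B", "B"]      # hypothetical ground-truth annotation
same_partition = [1, 1, 0, 0]       # same grouping, different label names
unrelated = [0, 1, 0, 1]            # grouping independent of the species
score_same = nmi(species, same_partition)
score_unrelated = nmi(species, unrelated)
```

<p>The score depends only on how items are grouped, not on the label names, so a clustering that reproduces the species partition exactly scores 1 and one that is independent of it scores 0.</p>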
Execution time of ESPRIT-Forest performed on a human gut microbiome dataset using a varying number of CPU cores ranging from 1 to 128.
<p>The clustering termination criterion was set to 85% sequence similarity. For comparison, the execution time of ESPRIT-Tree is also reported.</p>
Intestinal Microbial Ecology and Environmental Factors Affecting Necrotizing Enterocolitis
<div><p>Necrotizing enterocolitis (NEC) is the most devastating intestinal disease affecting preterm infants. The disease is associated with short-term mortality and morbidity, and survivors are left with significant long-term sequelae. The cost of caring for these infants is high. Epidemiologic evidence suggests that use of antibiotics and type of feeding may cause an intestinal dysbiosis important in the pathogenesis of NEC, but the contribution of specific infectious agents is poorly understood. Fecal samples from preterm infants ≤32 weeks gestation were analyzed using 16S rRNA-based methods at 2, 1, and 0 weeks prior to diagnosis of NEC in 18 NEC cases and 35 controls. Environmental factors such as antibiotic usage, feeding type (human milk versus formula) and location of neonatal intensive care unit (NICU) were also evaluated. Microbiota composition differed between the three neonatal units, where we also observed differences in antibiotic usage. In NEC cases we observed a higher proportion of Proteobacteria (61%) two weeks and of Actinobacteria (3%) one week before diagnosis of NEC compared to controls (19% and 0.4%, respectively), and lower Bifidobacteria counts and Bacteroidetes proportions in the weeks before NEC diagnosis. In the first fecal samples obtained during week one of life we detected a novel signature sequence, distinct from but matching closest to <i>Klebsiella pneumoniae</i>, that was strongly associated with NEC development later in life. Infants who develop NEC exhibit a different pattern of microbial colonization compared to controls. Antibiotic usage correlated with these differences and, combined with type of feeding, likely plays a critical role in the development of NEC.</p></div>