7 research outputs found

    Author index


    ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time

    Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed with a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits quasilinear time and space complexity comparable to that of greedy heuristic clustering algorithms, while achieving accuracy similar to that of the standard hierarchical clustering algorithm.
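
    The idea of summarizing a cluster as a probabilistic sequence can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length, and it uses an expected per-column mismatch as the distance; the function names `profile` and `expected_mismatch`, the alphabet, and the distance itself are illustrative assumptions, not the operations defined by ESPRIT-Tree.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol (illustrative choice).
ALPHABET = "ACGT-"


def profile(seqs):
    """Summarize pre-aligned, equal-length sequences as per-column frequencies."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # Symbols outside ALPHABET are ignored in this sketch.
        columns.append({c: counts[c] / len(seqs) for c in ALPHABET})
    return columns


def expected_mismatch(p, q):
    """Average probability that symbols drawn from p and q differ, per column.

    Hypothetical distance for illustration only; both profiles must have
    the same number of columns.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    total = 0.0
    for col_p, col_q in zip(p, q):
        total += 1.0 - sum(col_p[c] * col_q[c] for c in ALPHABET)
    return total / len(p)


cluster = profile(["ACGT", "ACGA"])   # two sequences differing in the last column
single = profile(["ACGT"])
print(expected_mismatch(cluster, single))
```

    Comparing two profiles costs time proportional to the alignment length regardless of how many sequences each cluster holds, which is the point of replacing exhaustive pairwise distances with a cluster summary.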

    Master index


    Robust identification of Parkinson's disease subtypes using radiomics and hybrid machine learning

    OBJECTIVES: It is important to subdivide Parkinson's disease (PD) into subtypes, enabling potentially earlier disease recognition and tailored treatment strategies. We aimed to identify reproducible PD subtypes robust to variations in the number of patients and features. METHODS: We applied multiple feature-reduction and cluster-analysis methods to cross-sectional and timeless data extracted from longitudinal datasets (years 0, 1, 2 and 4; Parkinson's Progression Markers Initiative; 885 PD/163 healthy-control visits; 35 datasets with combinations of non-imaging, conventional-imaging, and radiomics features from DAT-SPECT images). Hybrid machine-learning systems were constructed invoking 16 feature-reduction algorithms, 8 clustering algorithms, and 16 classifiers (C-index clustering evaluation used on each trajectory). We subsequently performed: i) identification of optimal subtypes, ii) multiple independent tests to assess reproducibility, iii) further confirmation by a statistical approach, and iv) a test of robustness to sample size. RESULTS: Without radiomics features, the clusters were not robust to variations in features, whereas radiomics information enabled consistent generation of clusters through ensemble analysis of trajectories. We arrived at 3 distinct subtypes, confirmed using the training and testing process of k-means as well as Hotelling's T² test. The 3 identified PD subtypes were 1) mild, 2) intermediate, and 3) severe, especially in terms of dopaminergic deficit (imaging), with some escalating motor and non-motor manifestations. CONCLUSION: Appropriate hybrid systems and independent statistical tests enable robust identification of 3 distinct PD subtypes. This was assisted by radiomics features from SPECT images (segmented using MRI). The identified PD subtypes were robust to the number of subjects and features.

    Novel concepts for lipid identification from shotgun mass spectra using a customized query language

    Lipids are the main component of semipermeable cell membranes and are linked to several important physiological processes. Shotgun lipidomics relies on the direct infusion of total lipid extracts from cells, tissues or organisms into the mass spectrometer and is a powerful tool to elucidate their molecular composition. Despite the technical advances in modern mass spectrometry, the currently available software underperforms in several aspects of the lipidomics pipeline. This thesis addresses these issues by presenting a new concept for lipid identification using a customized query language for mass spectra in combination with efficient spectra alignment algorithms, which are implemented in the open source kit “LipidXplorer”.

    Optimal algorithms for complete linkage clustering in d dimensions

    It is shown that the complete linkage clustering of n points in R^d, where d ≥ 1 is a constant, can be computed in optimal O(n log n) time and linear space under the L_1 and L_∞ metrics. Furthermore, for every other fixed L_t metric, it is shown that it can be approximated within an arbitrarily small constant factor in O(n log n) time and linear space.
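
    For reference, the object being computed can be pinned down with a naive baseline. The sketch below is a brute-force complete linkage procedure that takes roughly cubic time or worse; it is not the optimal O(n log n) algorithm the abstract describes. The function names and the merge-list output format are assumptions made for illustration.

```python
from itertools import combinations


def l1(p, q):
    """L_1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def linf(p, q):
    """L_infinity (Chebyshev) distance between two points."""
    return max(abs(a - b) for a, b in zip(p, q))


def complete_linkage(points, metric=linf):
    """Brute-force complete linkage clustering.

    Clusters are tuples of point indices. At each step, the two clusters
    whose farthest pair of members is closest are merged. Returns the list
    of merges as (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: cluster distance is the maximum pairwise distance.
            d = max(metric(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


pts = [(0, 0), (1, 0), (10, 0), (12, 1)]
for a, b, d in complete_linkage(pts):
    print(a, b, d)
```

    Passing `metric=l1` switches to the L_1 metric; ties between equally close cluster pairs are broken by iteration order in this sketch.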