102,099 research outputs found

    Estimating the number of clusters using diversity

    Estimating the number of clusters in a dataset is an important and challenging problem in unsupervised learning. Knowing the number of clusters is a prerequisite for many commonly used clustering algorithms such as k-means. In this paper, we propose a novel diversity-based approach to this problem. Specifically, we show that the difference between the global diversity of clusters and the sum of the local diversities of each cluster's members can be used as an effective indicator of the optimality of the number of clusters, where diversity is measured by Rao's quadratic entropy. A notable advantage of our proposed method is that it encourages balanced clustering by taking into account both the sizes of clusters and the distances between clusters. In other words, it is less prone to very small "outlier" clusters than existing methods. Our extensive experiments on both synthetic and real-world datasets (with known ground-truth clustering) demonstrate that our proposed method is robust for clusters of different sizes, variances, and shapes, and that it is more accurate than existing methods (including elbow, Calinski-Harabasz, silhouette, and gap statistic) at finding the optimal number of clusters.
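As a rough illustration of the diversity measure named in this abstract, Rao's quadratic entropy for a distribution p over categories with pairwise dissimilarities d is the sum over all pairs (i, j) of p[i] * p[j] * d[i][j]. The sketch below computes only that quantity. The function name and the toy inputs are assumptions made for illustration, and the abstract does not say how the paper combines the global and local terms into its final indicator, so that step is not attempted here.

```python
from typing import Sequence


def rao_quadratic_entropy(p: Sequence[float], d: Sequence[Sequence[float]]) -> float:
    """Rao's quadratic entropy: sum over i, j of p[i] * p[j] * d[i][j].

    p is a probability vector (e.g. relative cluster sizes) and d is a
    symmetric dissimilarity matrix with zeros on the diagonal.
    """
    n = len(p)
    if len(d) != n or any(len(row) != n for row in d):
        raise ValueError("d must be an n x n matrix matching len(p)")
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must sum to 1")
    return sum(p[i] * p[j] * d[i][j] for i in range(n) for j in range(n))
```

For two equally weighted categories at distance 1, the value is 0.5; a single category always gives 0, since the only pair is the category with itself.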

    Proteomic fingerprinting facilitates biodiversity assessments in understudied ecosystems: A case study on integrated taxonomy of deep sea copepods

    Accurate and reliable biodiversity estimates of marine zooplankton are a prerequisite to understand how changes in diversity can affect whole ecosystems. Species identification in the deep sea is significantly impeded by high numbers of new species and decreasing numbers of taxonomic experts, hampering any assessment of biodiversity. We used morphological, genetic, and proteomic characteristics of specimens of calanoid copepods from the abyssal South Atlantic in parallel to test whether proteomic fingerprinting can accelerate estimating biodiversity. We cross-validated the respective molecular discrimination methods with morphological identifications to establish COI and proteomic reference libraries, as they are a prerequisite to assign taxonomic information to the identified molecular species clusters. Due to the high number of new species, only 37% of the individuals could be assigned to species or genus level morphologically. COI sequencing was successful for 70% of the specimens analysed, while proteomic fingerprinting was successful for all specimens examined. Predicted species richness based on morphological and molecular methods was 42 morphospecies, 56 molecular operational taxonomic units (MOTUs) and 79 proteomic operational taxonomic units (POTUs), respectively. Species diversity was predicted based on proteomic profiles using hierarchical cluster analysis followed by application of the variance ratio criterion for identification of species clusters. It was comparable to species diversity calculated based on COI sequence distances. Less than 7% of specimens were misidentified by proteomic profiles when compared with COI-derived MOTUs, indicating that unsupervised machine learning using solely proteomic data could be used for quickly assessing species diversity.
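The variance ratio criterion mentioned in this abstract (also known as the Calinski-Harabasz index) scores a partition as the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). The following is a minimal sketch of that score for a given labelling; the function name and the toy data are assumptions, and the hierarchical clustering step that would produce candidate labellings is not shown.

```python
from typing import Hashable, List, Sequence


def variance_ratio_criterion(points: Sequence[Sequence[float]],
                             labels: Sequence[Hashable]) -> float:
    """Calinski-Harabasz score: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    clusters = sorted(set(labels), key=str)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")
    dim = len(points[0])

    def mean(rows: List[Sequence[float]]) -> List[float]:
        return [sum(r[a] for r in rows) / len(rows) for a in range(dim)]

    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    overall = mean(list(points))
    between = 0.0  # B: size-weighted squared distances of centroids to the overall mean
    within = 0.0   # W: squared distances of points to their own centroid
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    if within == 0.0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))
```

In practice one would cut the dendrogram at several values of k, score each labelling, and keep the k with the highest score. On the toy points 0, 1, 10, 11, the labelling that separates {0, 1} from {10, 11} scores 200, far above a labelling that mixes the two groups.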

    Molecular characterization and genetic diversity in some Egyptian wheat (Triticum aestivum L.) using microsatellite markers

    Wheat (Triticum aestivum L.) is the most important and strategic cereal crop in Egypt, where many bread wheat varieties are grown. Seventeen Egyptian bread wheat varieties were analysed in this study with a set of sixteen wheat microsatellite markers to examine their utility in detecting DNA polymorphism, estimating genetic diversity and identifying genotypes. A total of 190 alleles were detected at 16 loci using 16 microsatellite primer pairs. The number of alleles per locus ranged from 8 to 20, with an average of 11.875. The polymorphic information content (PIC) and marker index (MI) averaged 0.8669 and 0.8530, respectively. The (GA)n microsatellites were the most polymorphic (20 alleles). Jaccard's coefficient of genetic similarity ranged from 0.109 to 0.524, with an average of 0.375. A dendrogram built from the similarity matrix using the UPGMA algorithm divided the cultivars into two major clusters. The results demonstrate the utility of microsatellite markers for detecting polymorphism, as shown by their discrimination of the cultivars, and for estimating genetic diversity.
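Two of the statistics reported in this abstract can be sketched briefly. The abstract does not state which PIC formula was used; the code below assumes the commonly cited form, PIC = 1 - sum(p_i^2) - sum over i < j of 2 * p_i^2 * p_j^2, where p_i are allele frequencies at one locus. Jaccard's coefficient is computed on sets of alleles scored as present in each variety. The function names and allele labels are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Set


def polymorphic_information_content(freqs: Sequence[float]) -> float:
    """PIC for one locus, assuming the commonly used Botstein-style formula."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(f * f for f in freqs)
    cross_term = sum(2 * fi ** 2 * fj ** 2 for fi, fj in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Shared alleles divided by alleles present in either variety."""
    union = a | b
    if not union:
        raise ValueError("at least one allele must be present")
    return len(a & b) / len(union)
```

A locus with two equally frequent alleles gives a PIC of 0.375 under this formula, and two varieties sharing two of four observed alleles have a Jaccard similarity of 0.5. A UPGMA dendrogram would then be built from the matrix of pairwise similarities, which is not shown here.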

    Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations.

    Background: Estimation of individual ancestry from genetic data is useful for the analysis of disease association studies, understanding human population history and interpreting personal genomic variation. New, computationally efficient methods are needed for ancestry inference that can effectively utilize existing information about allele frequencies associated with different human populations and can work directly with DNA sequence reads. Results: We describe a fast method for estimating the relative contribution of known reference populations to an individual's genetic ancestry. Our method utilizes allele frequencies from the reference populations and individual genotype or sequence data to obtain a maximum likelihood estimate of the global admixture proportions using the BFGS optimization algorithm. It accounts for the uncertainty in genotypes present in sequence data by using genotype likelihoods and does not require individual genotype data from external reference panels. Simulation studies and application of the method to real datasets demonstrate that our method is significantly faster than previous methods and has comparable accuracy. Using data from the 1000 Genomes project, we show that estimates of the genome-wide average ancestry for admixed individuals are consistent between exome sequence data and whole-genome low-coverage sequence data. Finally, we demonstrate that our method can be used to estimate admixture proportions using pooled sequence data, making it a valuable tool for controlling for population stratification in sequencing-based association studies that utilize DNA pooling. Conclusions: Our method is an efficient and versatile tool for estimating ancestry from DNA sequence data and is available from https://sites.google.com/site/vibansal/software/iAdmix
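To make the estimation problem in this abstract concrete, here is a simplified sketch of maximum likelihood estimation of admixture proportions from known population allele frequencies. It is not the authors' implementation: it assumes hard genotype calls (0, 1 or 2 copies of an allele) rather than genotype likelihoods, and it swaps in a plain EM update in place of the BFGS optimizer named in the abstract, so that the example needs only the standard library. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def estimate_admixture(genotypes: Sequence[int],
                       freqs: Sequence[Sequence[float]],
                       n_iter: int = 200,
                       eps: float = 1e-9) -> List[float]:
    """EM estimate of admixture proportions q for one individual.

    genotypes[j] is the allele count (0, 1 or 2) at locus j.
    freqs[j][k] is the frequency of that allele at locus j in population k.
    Model: each allele copy at locus j is drawn with probability
    sum_k q[k] * freqs[j][k].
    """
    n_pops = len(freqs[0])
    q = [1.0 / n_pops] * n_pops  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * n_pops
        for g, f in zip(genotypes, freqs):
            p_alt = sum(qk * fk for qk, fk in zip(q, f))
            p_alt = min(max(p_alt, eps), 1.0 - eps)  # avoid division by zero
            for k in range(n_pops):
                # Expected number of this individual's allele copies at the
                # locus that originate from population k.
                new_q[k] += (g * q[k] * f[k] / p_alt
                             + (2 - g) * q[k] * (1.0 - f[k]) / (1.0 - p_alt))
        total = sum(new_q)
        if total == 0.0:
            raise ValueError("degenerate input: no population explains the data")
        q = [v / total for v in new_q]
    return q
```

With two reference populations whose allele frequencies are 0.9 and 0.1 at every locus, an individual homozygous for the allele at all loci is assigned almost entirely to the first population, while an individual heterozygous at all loci is split evenly. Extending the sketch to genotype likelihoods would mean weighting the three possible genotypes at each locus instead of using a single call.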

    PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data.

    Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discovered novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity.

    Topicality and Social Impact: Diverse Messages but Focused Messengers

    Are users who comment on a variety of matters more likely to achieve high influence than those who delve into one focused field? Do general Twitter hashtags, such as #lol, tend to be more popular than novel ones, such as #instantlyinlove? Questions like these demand a way to detect topics hidden behind messages associated with an individual or a hashtag, and a gauge of similarity among these topics. Here we develop such an approach to identify clusters of similar hashtags by detecting communities in the hashtag co-occurrence network. Then the topical diversity of a user's interests is quantified by the entropy of her hashtags across different topic clusters. A similar measure is applied to hashtags, based on co-occurring tags. We find that high topical diversity of early adopters or co-occurring tags implies high future popularity of hashtags. In contrast, low diversity helps an individual accumulate social influence. In short, diverse messages and focused messengers are more likely to gain impact. Comment: 9 pages, 7 figures, 6 tables
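The diversity measure described in this abstract can be sketched as the Shannon entropy of a user's hashtag usage across topic clusters. The example below assumes the hashtag-to-cluster mapping has already been obtained (the community detection step is not shown), uses base-2 logarithms as an arbitrary choice since the abstract does not state the base, and uses made-up cluster names.

```python
import math
from collections import Counter
from typing import Iterable, Mapping


def topical_diversity(hashtags: Iterable[str],
                      tag_to_cluster: Mapping[str, str]) -> float:
    """Shannon entropy (in bits) of hashtag usage across topic clusters.

    Hashtags missing from tag_to_cluster are ignored.
    """
    counts = Counter(tag_to_cluster[t] for t in hashtags if t in tag_to_cluster)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        share = c / total
        entropy -= share * math.log2(share)
    return entropy
```

A user whose hashtags all fall in one cluster has entropy 0, while a user splitting evenly across two clusters has entropy of 1 bit; higher values indicate more diverse interests.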