3,762 research outputs found

    Parametric t-Distributed Stochastic Exemplar-centered Embedding

    Full text link
    Parametric embedding methods such as parametric t-SNE (pt-SNE) have been widely adopted for data visualization and out-of-sample data embedding without further computationally expensive optimization or approximation. However, the performance of pt-SNE is highly sensitive to the batch-size hyper-parameter due to conflicting optimization goals, and it often produces dramatically different embeddings for different choices of the user-defined perplexity. To effectively solve these issues, we present parametric t-distributed stochastic exemplar-centered embedding methods. Our strategy learns embedding parameters by comparing given data only with precomputed exemplars, resulting in a cost function with linear computational and memory complexity, which is further reduced by noise-contrastive samples. Moreover, we propose a shallow embedding network with high-order feature interactions for data visualization, which is much easier to tune yet produces performance comparable to the deep neural network employed by pt-SNE. We empirically demonstrate, using several benchmark datasets, that our proposed methods significantly outperform pt-SNE in terms of robustness, visual effects, and quantitative evaluations.
    Comment: fixed typo
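
    As a hedged illustration of the exemplar-centered idea -- comparing each sample only with a small set of precomputed exemplars instead of with all other points -- a minimal PyTorch sketch of such an objective is given below. The Gaussian bandwidth, the Student-t kernel, and the shallow linear map are illustrative assumptions, not the paper's exact formulation.

    # Minimal sketch of an exemplar-centered, t-distributed embedding objective
    # (kernels, bandwidth and the shallow map are assumptions for illustration).
    import torch

    def exemplar_tsne_loss(X, E, f, sigma=1.0):
        """KL divergence between high-dimensional point-to-exemplar affinities
        and Student-t similarities of their embeddings.
        X: (n, d) data batch, E: (m, d) precomputed exemplars, f: embedding map."""
        d2 = torch.cdist(X, E) ** 2                       # (n, m) squared distances
        P = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)  # row-normalized affinities
        Y, Z = f(X), f(E)                                 # embed points and exemplars
        q = 1.0 / (1.0 + torch.cdist(Y, Z) ** 2)          # (n, m) Student-t kernel
        Q = q / q.sum(dim=1, keepdim=True)
        kl = P * (P.clamp_min(1e-12).log() - Q.clamp_min(1e-12).log())
        return kl.sum(dim=1).mean()

    # Example with a shallow linear map as the embedding network; the cost of one
    # batch is O(n * m), i.e. linear in the data, with m exemplars rather than n points.
    n, m, d = 256, 32, 50
    X, E = torch.randn(n, d), torch.randn(m, d)
    f = torch.nn.Linear(d, 2)
    loss = exemplar_tsne_loss(X, E, f)
    loss.backward()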

    Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data

    Get PDF
    Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-distributed stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated on several simulated datasets and a real metagenomic dataset.
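
    The LBP, RSVD and BH-tSNE pipeline can be approximated with standard Python tooling, as in the rough sketch below. The nucleotide-to-signal mapping and the 1-D LBP variant are assumptions made for illustration; the descriptor and parameters used in the paper may differ.

    # Illustrative LBP -> randomized SVD -> Barnes-Hut t-SNE pipeline on toy contigs.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    NUC = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

    def lbp_histogram(seq, radius=2):
        """1-D local binary pattern histogram of a DNA sequence (assumed variant)."""
        x = np.array([NUC.get(c, 0.0) for c in seq])
        codes = np.zeros(len(x) - 2 * radius, dtype=int)
        offsets = [o for o in range(-radius, radius + 1) if o != 0]
        for bit, o in enumerate(offsets):
            neighbor = x[radius + o : len(x) - radius + o]
            codes |= (neighbor >= x[radius : len(x) - radius]).astype(int) << bit
        hist = np.bincount(codes, minlength=2 ** len(offsets)).astype(float)
        return hist / hist.sum()

    rng = np.random.default_rng(0)
    contigs = ["".join(rng.choice(list("ACGT"), size=500)) for _ in range(200)]
    features = np.vstack([lbp_histogram(s) for s in contigs])

    reduced = TruncatedSVD(n_components=10, algorithm="randomized").fit_transform(features)
    coords = TSNE(n_components=2, method="barnes_hut").fit_transform(reduced)
    print(coords.shape)  # (200, 2) coordinates, ready for scatter plotting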

    Making Sense Of The New Cosmology

    Get PDF
    Over the past three years we have determined the basic features of the Universe -- spatially flat; accelerating; comprised of 1/3 a new form of matter, 2/3 a new form of energy, with some ordinary matter and a dash of massive neutrinos; and apparently born from a burst of rapid expansion during which quantum noise was stretched to astrophysical size, seeding cosmic structure. The New Cosmology greatly extends the highly successful hot big-bang model. Now we have to make sense of all this: What is the dark matter particle? What is the nature of the dark energy? Why this mixture? How did the matter-antimatter asymmetry arise? What is the underlying cause of inflation (if it indeed occurred)?
    Comment: 17 pages, LaTeX (sprocl.sty). To appear in the Proceedings of 2001: A Spacetime Odyssey (U. Michigan, May 2001), World Scientific.

    Fast k-means based on KNN Graph

    Full text link
    In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can become prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in the operation of seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest-neighbor graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors we consider is much smaller than k, the processing cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. Most interestingly, the k-nearest-neighbor graph is constructed by iteratively calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves a speed-up of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional samples, it takes only 5.2 hours to produce 1 million clusters; in contrast, traditional k-means would take about 3 years to complete the same clustering.
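
    The assignment trick -- comparing each sample only with centroids of the clusters that its nearest neighbors currently belong to -- can be illustrated with the simplified sketch below. The graph here comes from an ordinary kNN search and the seeding is plain random sampling; both are simplifications of the procedure described in the abstract, where the graph is built by calling the fast k-means itself.

    # Simplified kNN-graph-restricted k-means assignment step (illustrative only).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_restricted_kmeans(X, k, n_neighbors=10, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Each sample's nearest neighbors, precomputed with a standard kNN search.
        knn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
        nbrs = knn.kneighbors(X, return_distance=False)
        for _ in range(n_iter):
            for i, x in enumerate(X):
                # Candidate clusters: only those containing one of x's neighbors.
                cand = np.unique(labels[nbrs[i]])
                d = np.linalg.norm(centroids[cand] - x, axis=1)
                labels[i] = cand[np.argmin(d)]
            for c in range(k):
                members = X[labels == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        return labels, centroids

    X = np.random.default_rng(1).normal(size=(2000, 16))
    labels, centroids = knn_restricted_kmeans(X, k=50)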

    Toward a generic representation of random variables for machine learning

    Full text link
    This paper presents a pre-processing step and a distance measure which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to representing random variables which splits apart dependency and distribution without losing any information. We also propound an associated metric leveraging this representation and its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution are illustrated through the example of clustering financial time series, for instance prices from the credit default swaps market. Results are available on the website www.datagrapple.com and an IPython Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.
    Comment: submitted to Pattern Recognition Letters
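
    One way to realise such a split is to represent each sample of a random variable by its normalized ranks (the dependence part, an empirical copula transform) and its sorted values (the marginal part, the empirical quantiles), then blend a distance on each. The sketch below follows that idea; the specific normalization and the alpha weighting are illustrative assumptions, not necessarily the metric proposed in the paper.

    # Illustrative split into dependence (ranks) and distribution (sorted values),
    # with a blended distance; equal sample sizes are assumed.
    import numpy as np
    from scipy.stats import rankdata

    def represent(x):
        """Return (normalized ranks, sorted values) of a 1-D sample."""
        return rankdata(x) / len(x), np.sort(x)

    def distance(x, y, alpha=0.5):
        """Blend a rank-based dependence distance with a quantile-based
        marginal distance; alpha trades off the two components."""
        rx, qx = represent(x)
        ry, qy = represent(y)
        dep = np.sqrt(3 * np.mean((rx - ry) ** 2))   # distance between rank transforms
        marg = np.sqrt(np.mean((qx - qy) ** 2))      # L2 between empirical quantiles
        return alpha * dep + (1 - alpha) * marg

    rng = np.random.default_rng(0)
    a = rng.normal(size=1000)
    b = 2 * a + 0.1 * rng.normal(size=1000)   # strongly dependent on a, different marginal
    print(distance(a, b))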