4 research outputs found

    Confidence estimation for t-SNE embeddings using random forest


    Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data

    Arguably one of the most famous dimensionality reduction algorithms today is t-distributed stochastic neighbor embedding (t-SNE). Although widely used for visualizing scRNA-seq data, it is prone to errors like any algorithm and may lead to inaccurate interpretations of the visualized data. A reasonable way to avoid misinterpretation is to quantify the reliability of the visualizations. The focus of this work is first to find the best way to predict sample-based confidence scores for t-SNE embeddings and then to use these confidence scores to improve clustering algorithms. We adopt a random forest (RF) regression algorithm that uses seven distance measures as features to obtain sample-based confidence scores under a variety of distance measures. The best configuration is used to assess the clustering improvement with K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), based on Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and accuracy (ACC) scores. The experimental results show that distance measures have a considerable effect on the precision of the confidence scores and that clustering performance can be improved substantially if these confidence scores are incorporated before the clustering algorithm is applied. Our findings reveal the usefulness of these confidence scores in downstream analyses of scRNA-seq data.
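
For orientation, the ARI score named in this abstract can be computed from the contingency table of two labelings. The sketch below is a minimal, standard-library illustration of the usual ARI formula; the function name `adjusted_rand_index` and the handling of degenerate inputs are assumptions made here and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two samples counts as perfect agreement

    # Contingency table cells and its row/column marginals.
    cell_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    # Number of sample pairs that share a cell, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # upper bound on agreement
    if max_index == expected:
        return 1.0  # assumption: degenerate partitions are treated as agreeing
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1 means the two partitions agree up to a renaming of the labels, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.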

    Confidence estimation for t-SNE embeddings using random forest

    Dimensionality reduction algorithms are commonly used to reduce the dimension of multi-dimensional data so that it can be visualized on a standard display. Although many dimensionality reduction algorithms, such as t-distributed Stochastic Neighbor Embedding (t-SNE), aim to preserve close neighborhoods in the low-dimensional space, they might not accomplish that for every sample of the data and can eventually produce erroneous representations. In this study, we developed a supervised confidence estimation algorithm for detecting erroneous samples in embeddings. Our algorithm generates a confidence score for each sample in an embedding based on a distance-oriented score and a random forest regressor. We evaluate its performance on both intra- and inter-domain data and compare it with the neighborhood preservation ratio as our baseline. Our results show that the resulting confidence score provides distinctive information about the correctness of any sample in an embedding compared to the baseline. The source code is available at https://github.com/gsaygili/dimred
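
The baseline mentioned in this abstract, the neighborhood preservation ratio, is commonly defined per sample as the fraction of its k nearest neighbors in the original space that are still among its k nearest neighbors in the embedding. The sketch below assumes that common definition with Euclidean distance; the paper's exact formulation may differ, and the function names `knn_indices` and `neighborhood_preservation` are hypothetical rather than taken from the linked repository.

```python
import math


def knn_indices(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by its distance to p and keep the k closest.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def neighborhood_preservation(high_dim, low_dim, k):
    """Per-sample share of high-dimensional k-NN that remain k-NN in the embedding."""
    high_nn = knn_indices(high_dim, k)
    low_nn = knn_indices(low_dim, k)
    return [len(h & l) / k for h, l in zip(high_nn, low_nn)]


# Toy data: the last sample is far from the others in the original space
# but is placed between the first two samples in the 2-D embedding.
high = [(0, 0, 0), (1, 0, 0), (2.5, 0, 0), (10, 0, 0)]
low = [(0, 0), (1, 0), (2.5, 0), (0.4, 0)]
scores = neighborhood_preservation(high, low, k=1)
```

A score of 1.0 means a sample kept all of its k original neighbors, while 0.0 means it kept none. The brute-force neighbor search is quadratic in the number of samples and is only meant for small illustrations.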

    Weighted t-Distributed Stochastic Neighbor Embedding for Projection-Based Clustering

    This paper presents a projection-based clustering method for visualizing high-dimensional data points in lower-dimensional spaces while preserving the data’s structural properties. The proposed method modifies the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm by adding a weight function that adjusts the dissimilarity between high-dimensional data points to obtain more realistic lower-dimensional representations. In our algorithm, the centroids obtained with a prototype-based clustering algorithm attract the high-dimensional data points allocated to their respective clusters while repelling the points assigned to other clusters. Simulations using real-world datasets show that the Weighted t-SNE produces better projections than similar algorithms, without the need for any prior dimensionality reduction step.