1,022 research outputs found
Compressive Embedding and Visualization using Graphs
Visualizing high-dimensional data has been a focus in data analysis
communities for decades, which has led to the design of many algorithms, some
of which are now considered references (such as t-SNE for example). In our era
of overwhelming data volumes, the scalability of such methods have become more
and more important. In this work, we present a method which allows to apply any
visualization or embedding algorithm on very large datasets by considering only
a fraction of the data as input and then extending the information to all data
points using a graph encoding its global similarity. We show that in most
cases, using only samples is sufficient to diffuse the
information to all data points. In addition, we propose quantitative
methods to measure the quality of embeddings and demonstrate the validity of
our technique on both synthetic and real-world datasets
Multiclass Total Variation Clustering
Ideas from the image processing literature have recently motivated a new set
of clustering algorithms that rely on the concept of total variation. While
these algorithms perform well for bi-partitioning tasks, their recursive
extensions yield unimpressive results for multiclass clustering tasks. This
paper presents a general framework for multiclass total variation clustering
that does not rely on recursion. The results greatly outperform previous total
variation algorithms and compare well with state-of-the-art NMF approaches
Semi-Supervised Kernel PCA
We present three generalisations of Kernel Principal Components Analysis
(KPCA) which incorporate knowledge of the class labels of a subset of the data
points. The first, MV-KPCA, penalises within class variances similar to Fisher
discriminant analysis. The second, LSKPCA is a hybrid of least squares
regression and kernel PCA. The final LR-KPCA is an iteratively reweighted
version of the previous which achieves a sigmoid loss function on the labeled
points. We provide a theoretical risk bound as well as illustrative experiments
on real and toy data sets
Hypergraph Learning with Line Expansion
Previous hypergraph expansions are solely carried out on either vertex level
or hyperedge level, thereby missing the symmetric nature of data co-occurrence,
and resulting in information loss. To address the problem, this paper treats
vertices and hyperedges equally and proposes a new hypergraph formulation named
the \emph{line expansion (LE)} for hypergraphs learning. The new expansion
bijectively induces a homogeneous structure from the hypergraph by treating
vertex-hyperedge pairs as "line nodes". By reducing the hypergraph to a simple
graph, the proposed \emph{line expansion} makes existing graph learning
algorithms compatible with the higher-order structure and has been proven as a
unifying framework for various hypergraph expansions. We evaluate the proposed
line expansion on five hypergraph datasets, the results show that our method
beats SOTA baselines by a significant margin
spa: Semi-Supervised Semi-Parametric Graph-Based Estimation in R
In this paper, we present an R package that combines feature-based (X) data and graph-based (G) data for prediction of the response Y . In this particular case, Y is observed for a subset of the observations (labeled) and missing for the remainder (unlabeled). We examine an approach for fitting Y = Xò + f(G) where ò is a coefficient vector and f is a function over the vertices of the graph. The procedure is semi-supervised in nature (trained on the labeled and unlabeled sets), requiring iterative algorithms for fitting this estimate. The package provides several key functions for fitting and evaluating an estimator of this type. The package is illustrated on a text analysis data set, where the observations are text documents (papers), the response is the category of paper (either applied or theoretical statistics), the X information is the name of the journal in which the paper resides, and the graph is a co-citation network, with each vertex an observation and each edge the number of times that the two papers cite a common paper. An application involving classification of protein location using a protein interaction graph and an application involving classification on a manifold with part of the feature data converted to a graph are also presented.
- …