Spectral Clustering via Ensemble Deep Autoencoder Learning (SC-EDAE)
Recently, a number of works have studied clustering strategies that combine
classical clustering algorithms and deep learning methods. These approaches
follow either a sequential way, where a deep representation is learned using a
deep autoencoder before obtaining clusters with k-means, or a simultaneous way,
where deep representation and clusters are learned jointly by optimizing a
single objective function. Both strategies improve clustering performance;
however, the robustness of these approaches is impeded by several deep
autoencoder setting issues, including weight initialization, the width and
number of layers, and the number of epochs. To alleviate the impact of such
hyperparameter settings on clustering performance, we propose a new model
which combines the spectral clustering and deep autoencoder strengths in an
ensemble learning framework. Extensive experiments on various benchmark
datasets demonstrate the potential and robustness of our approach compared to
state-of-the-art deep clustering methods.
Comment: Revised manuscript
Initialization for Network Embedding: A Graph Partition Approach
Network embedding has been intensively studied in the literature and widely
used in various applications, such as link prediction and node classification.
While previous work focuses on the design of new algorithms or is tailored to
various problem settings, the discussion of initialization strategies in the
learning process is often overlooked. In this work, we address this important issue
of initialization for network embedding that could dramatically improve the
performance of the algorithms on both effectiveness and efficiency.
Specifically, we first exploit the graph partition technique that divides the
graph into several disjoint subsets, and then construct an abstract graph based
on the partitions. We obtain the initialization of the embedding for each node
in the graph by computing the network embedding on the abstract graph, which is
much smaller than the input graph, and then propagating the embedding among the
nodes in the input graph. With extensive experiments on various datasets, we
demonstrate that our initialization technique significantly improves the
performance of the state-of-the-art algorithms on the evaluations of link
prediction and node classification by up to 7.76% and 8.74% respectively.
In addition, we show that the initialization technique reduces the running time
of state-of-the-art algorithms by at least 20%.
Comment: Full Research Paper accepted in the 13th ACM International Conference
on Web Search and Data Mining (WSDM 2020)
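The pipeline described in this abstract (partition, build an abstract graph, embed it, propagate back) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the partition `part` is taken as given rather than computed, `coarse_emb` stands in for whatever embedding algorithm is run on the abstract graph, and the neighbour-averaging rule with weight `alpha` is a hypothetical propagation step, not necessarily the one used in the paper.

```python
from collections import defaultdict


def build_abstract_graph(edges, part):
    """Collapse each partition into one super-node.

    Returns {(p, q): weight} where weight counts the edges crossing
    between partitions p and q (p < q).
    """
    weights = defaultdict(int)
    for u, v in edges:
        pu, pv = part[u], part[v]
        if pu != pv:
            weights[(min(pu, pv), max(pu, pv))] += 1
    return dict(weights)


def propagate_init(edges, part, coarse_emb, rounds=1, alpha=0.5):
    """Seed each node with its partition's vector, then smooth over neighbours.

    The smoothing rule (hypothetical) mixes a node's own vector with the
    mean of its neighbours' vectors using weight `alpha`.
    """
    neighbours = defaultdict(list)
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    emb = {node: list(coarse_emb[p]) for node, p in part.items()}
    for _ in range(rounds):
        new_emb = {}
        for node, vec in emb.items():
            nbrs = neighbours[node]
            if not nbrs:
                new_emb[node] = vec
                continue
            mean = [sum(emb[m][d] for m in nbrs) / len(nbrs)
                    for d in range(len(vec))]
            new_emb[node] = [alpha * x + (1 - alpha) * y
                             for x, y in zip(vec, mean)]
        emb = new_emb
    return emb


# Toy input: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # assumed partition
abstract = build_abstract_graph(edges, part)
coarse_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # placeholder abstract-graph embedding
init = propagate_init(edges, part, coarse_emb)
```

In this toy run, nodes deep inside a partition keep their partition's vector, while the boundary nodes 2 and 3 are pulled slightly toward the neighbouring partition. The resulting `init` would then be handed to the embedding algorithm as its starting point.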
Learning A Task-Specific Deep Architecture For Clustering
While sparse coding-based clustering methods have been shown to be successful,
their bottlenecks in both efficiency and scalability limit their practical use.
In recent years, deep learning has proven to be a highly effective,
efficient and scalable feature learning tool. In this paper, we propose to
emulate the sparse coding-based clustering pipeline in the context of deep
learning, leading to a carefully crafted deep model benefiting from both. A
feed-forward network structure, named TAGnet, is constructed based on a
graph-regularized sparse coding algorithm. It is then trained with
task-specific loss functions from end to end. We discover that connecting deep
learning to sparse coding benefits not only the model performance, but also its
initialization and interpretation. Moreover, by introducing auxiliary
clustering tasks to the intermediate feature hierarchy, we formulate DTAGnet
and obtain a further performance boost. Extensive experiments demonstrate that
the proposed model gains remarkable margins over several state-of-the-art
methods.
Towards Scalable Spectral Clustering via Spectrum-Preserving Sparsification
The eigendecomposition of nearest-neighbor (NN) graph Laplacian matrices is
the main computational bottleneck in spectral clustering. In this work, we
introduce a highly scalable, spectrum-preserving graph sparsification algorithm
that builds ultra-sparse NN (u-NN) graphs with guaranteed
preservation of the original graph spectra, such as the first few
eigenvectors of the original graph Laplacian. Our approach can immediately lead
to scalable spectral clustering of large data networks without sacrificing
solution quality. The proposed method starts from constructing low-stretch
spanning trees (LSSTs) from the original graphs, which is followed by
iteratively recovering small portions of "spectrally critical" off-tree edges
to the LSSTs by leveraging a spectral off-tree embedding scheme. To determine
a suitable number of off-tree edges to recover to the LSSTs, an
eigenvalue stability checking scheme is proposed, which robustly
preserves the first few Laplacian eigenvectors within the sparsified graph.
Additionally, an incremental graph densification scheme is proposed for
identifying extra edges that have been missing in the original NN graphs but
can still play important roles in spectral clustering tasks. Our experimental
results for a variety of well-known data sets show that the proposed method can
dramatically reduce the complexity of NN graphs, leading to significant
speedups in spectral clustering.
Graph Reordering for Cache-Efficient Near Neighbor Search
Graph search is one of the most successful algorithmic trends in near
neighbor search. Several of the most popular and empirically successful
algorithms are, at their core, a simple walk along a pruned near neighbor
graph. Such algorithms consistently perform at the top of industrial speed
benchmarks for applications such as embedding search. However, graph traversal
applications often suffer from poor memory access patterns, and near neighbor
search is no exception to this rule. Our measurements show that popular search
indices such as the hierarchical navigable small-world graph (HNSW) can have
poor cache miss performance. To address this problem, we apply graph reordering
algorithms to near neighbor graphs. Graph reordering is a memory layout
optimization that groups commonly-accessed nodes together in memory. We present
exhaustive experiments applying several reordering algorithms to a leading
graph-based near neighbor method based on the HNSW index. We find that
reordering improves the query time by up to 40%, and we demonstrate that the
time needed to reorder the graph is negligible compared to the time required to
construct the index.
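To make the idea of reordering as a memory layout optimization concrete, here is a minimal Python sketch that relabels nodes in breadth-first order so that neighbours receive nearby IDs. Breadth-first ordering is only one simple choice, used here as an assumption; the abstract does not name the reordering algorithms evaluated. The helper `id_gap` is a hypothetical, crude proxy for locality (sum of ID distances across edges), not a cache-miss measurement.

```python
from collections import deque


def bfs_order(adj, start=0):
    """Return new_id, where new_id[old] is the breadth-first visit rank of `old`."""
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    # Try `start` first, then any node left unvisited (disconnected graphs).
    for root in [start] + [i for i in range(n) if i != start]:
        if new_id[root] != -1:
            continue
        new_id[root] = next_id
        next_id += 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = next_id
                    next_id += 1
                    queue.append(v)
    return new_id


def relabel(adj, vectors, new_id):
    """Permute adjacency lists and stored vectors into the new ID order."""
    n = len(adj)
    new_adj = [None] * n
    new_vectors = [None] * n
    for old, new in enumerate(new_id):
        new_adj[new] = [new_id[v] for v in adj[old]]
        new_vectors[new] = vectors[old]
    return new_adj, new_vectors


def id_gap(adj):
    """Crude locality proxy: total |u - v| over all directed edges."""
    return sum(abs(u - v) for u, nbrs in enumerate(adj) for v in nbrs)


# Toy graph: the path 0-3-1-4-2, i.e. a path whose IDs are scrambled.
adj = [[3], [3, 4], [4], [0, 1], [1, 2]]
vectors = ["v0", "v1", "v2", "v3", "v4"]   # stand-ins for stored embeddings

new_id = bfs_order(adj)
new_adj, new_vectors = relabel(adj, vectors, new_id)
```

After relabeling, the path becomes 0-1-2-3-4 and the ID gap drops from 20 to 8. In a real index the vectors would sit in a contiguous array, so a walk along the graph would touch nearby memory more often.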
Vertex nomination: The canonical sampling and the extended spectral nomination schemes
Suppose that one particular block in a stochastic block model is of interest,
but block labels are only observed for a few of the vertices in the network.
Utilizing a graph realized from the model and the observed block labels, the
vertex nomination task is to order the vertices with unobserved block labels
into a ranked nomination list with the goal of having an abundance of
interesting vertices near the top of the list. There are vertex nomination
schemes in the literature, including the optimally precise canonical nomination
scheme and the consistent spectral partitioning nomination scheme. While the
canonical nomination scheme
is provably optimally precise, it is computationally intractable, being
impractical to implement even on modestly sized graphs. With this in mind, an
approximation of the canonical scheme, denoted the canonical sampling
nomination scheme, is introduced; it
relies on a scalable, Markov chain Monte Carlo-based approximation of
the canonical scheme, and converges to the canonical scheme as the amount of
sampling goes to infinity. The spectral partitioning nomination scheme is also
extended to the extended spectral partitioning nomination scheme,
which introduces a novel semisupervised clustering
framework to improve upon the precision of the spectral partitioning scheme.
Real-data and
simulation experiments are employed to illustrate the precision of these vertex
nomination schemes, as well as their empirical computational complexity.
Keywords: vertex nomination, Markov chain Monte Carlo, spectral partitioning,
Mclust
MSC[2010]: 60J22, 65C40, 62H30, 62H2
MCNE: An End-to-End Framework for Learning Multiple Conditional Network Representations of Social Network
Recently, Network Representation Learning (NRL) techniques, which
represent graph structure via low-dimensional vectors to support social-oriented
applications, have attracted wide attention. Although considerable effort has
been made, they may fail to describe the multiple aspects of similarity between
social users, as each node is represented by only a single vector for one
unique aspect. To that end, in this paper, we propose a novel
end-to-end framework named MCNE to learn multiple conditional network
representations, so that various preferences for multiple behaviors could be
fully captured. Specifically, we first design a binary mask layer to divide the
single vector into conditional embeddings for multiple behaviors. Then, we
introduce an attention network to model the interaction relationships among
multiple preferences, and further utilize the adapted message sending and
receiving operations of graph neural networks, so that multi-aspect preference
information from high-order neighbors is captured. Finally, we utilize the
Bayesian Personalized Ranking loss function to learn the preference similarity
on each behavior, and jointly learn multiple conditional node embeddings via
a multi-task learning framework. Extensive experiments on public datasets
validate that our MCNE framework could significantly outperform several
state-of-the-art baselines, and further support the visualization and transfer
learning tasks with excellent interpretability and robustness.
Comment: Accepted by KDD 2019 Research Track. In Proceedings of the 25th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'19)
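For readers unfamiliar with the Bayesian Personalized Ranking loss mentioned in this abstract, the following is a minimal Python sketch of its standard pairwise form: the negative log-sigmoid of the score gap between an observed (positive) item and an unobserved (negative) one. The function name `bpr_loss`, the averaging over pairs, and the omission of any regularization term are assumptions made for illustration; how MCNE applies the loss per behavior is not specified here.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)) over paired positive/negative scores."""
    total = 0.0
    for pos, neg in zip(pos_scores, neg_scores):
        diff = pos - neg
        # -log(sigmoid(diff)) equals log(1 + exp(-diff)); the form below
        # avoids overflow when diff is a large negative number.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)


# The loss is small when positives outscore negatives and large otherwise.
well_ranked = bpr_loss([5.0], [0.0])
badly_ranked = bpr_loss([0.0], [5.0])
tied = bpr_loss([1.0], [1.0])
```

A tied pair gives a loss of ln 2, and the loss decreases monotonically as the positive item's score rises above the negative item's.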
Device Placement Optimization with Reinforcement Learning
The past few years have witnessed a growth in size and computational
requirements for training and inference with neural networks. Currently, a
common approach to address these requirements is to use a heterogeneous
distributed environment with a mixture of hardware devices such as CPUs and
GPUs. Importantly, the decision of placing parts of the neural models on
devices is often made by human experts based on simple heuristics and
intuitions. In this paper, we propose a method which learns to optimize device
placement for TensorFlow computational graphs. Key to our method is the use of
a sequence-to-sequence model to predict which subsets of operations in a
TensorFlow graph should run on which of the available devices. The execution
time of the predicted placements is then used as the reward signal to optimize
the parameters of the sequence-to-sequence model. Our main result is that on
Inception-V3 for ImageNet classification, and on RNN LSTM, for language
modeling and neural machine translation, our model finds non-trivial device
placements that outperform hand-crafted heuristics and traditional algorithmic
methods.
Comment: To appear at ICML 201
Arabesque: A System for Distributed Graph Mining - Extended version
Distributed data processing platforms such as MapReduce and Pregel have
substantially simplified the design and deployment of certain classes of
distributed graph analytics algorithms. However, these platforms are not
a good match for distributed graph mining problems, such as
finding frequent subgraphs in a graph. Given an input graph, these problems
require exploring a very large number of subgraphs and finding patterns that
match some "interestingness" criteria desired by the user. These algorithms are
very important for areas such as social networks, the semantic web, and
bioinformatics. In this paper, we present Arabesque, the first distributed data
processing platform for implementing graph mining algorithms. Arabesque
automates the process of exploring a very large number of subgraphs. It defines
a high-level filter-process computational model that simplifies the development
of scalable graph mining algorithms: Arabesque explores subgraphs and passes
them to the application, which must simply compute outputs and decide whether
the subgraph should be further extended. We use Arabesque's API to produce
distributed solutions to three fundamental graph mining problems: frequent
subgraph mining, counting motifs, and finding cliques. Our implementations
require a handful of lines of code, scale to trillions of subgraphs, and
represent in some cases the first available distributed solutions.
Comment: A short version of this report appeared in the Proceedings of the
25th ACM Symp. on Operating Systems Principles (SOSP), 201
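The filter-process model described in this abstract can be illustrated with a single-machine toy in Python. This is a sketch under stated assumptions, not Arabesque's API: the names `explore`, `filter_fn`, and `process_fn` are hypothetical, subgraphs are represented as vertex sets, and duplicates are avoided simply by storing candidates in a set. The application supplies only the two callbacks; the exploration loop does the rest.

```python
def explore(adj, filter_fn, process_fn, max_size):
    """Grow connected vertex sets one vertex at a time.

    filter_fn(adj, sub)  -> bool: keep and further extend this subgraph?
    process_fn(sub)      -> None: emit output for a subgraph that passed.
    """
    frontier = {frozenset([v]) for v in adj}
    while frontier:
        next_frontier = set()
        for sub in frontier:
            if not filter_fn(adj, sub):
                continue  # pruned: never processed or extended
            process_fn(sub)
            if len(sub) == max_size:
                continue
            # Extend by every vertex adjacent to the current subgraph.
            for u in sub:
                for w in adj[u]:
                    if w not in sub:
                        next_frontier.add(sub | {w})
        frontier = next_frontier


def is_clique(adj, sub):
    """True when every pair of distinct vertices in `sub` is adjacent."""
    return all(b in adj[a] for a in sub for b in sub if a != b)


# Toy graph with edges 0-1, 0-2, 0-3, 1-2, 2-3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}

found = []
explore(adj, is_clique, found.append, max_size=3)
triangles = sorted(sorted(s) for s in found if len(s) == 3)
```

Pruning on `is_clique` is safe here because every subset of a clique is itself a clique, so discarding a non-clique can never discard a larger clique. On the toy graph the run reports the two triangles {0, 1, 2} and {0, 2, 3}.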
TextLuas: Tracking and Visualizing Document and Term Clusters in Dynamic Text Data
For large volumes of text data collected over time, a key knowledge discovery
task is identifying and tracking clusters. These clusters may correspond to
emerging themes, popular topics, or breaking news stories in a corpus.
Therefore, recently there has been increased interest in the problem of
clustering dynamic data. However, there exists little support for the
interactive exploration of the output of these analysis techniques,
particularly in cases where researchers wish to simultaneously explore both the
change in cluster structure over time and the change in the textual content
associated with clusters. In this paper, we propose a model for tracking
dynamic clusters characterized by the evolutionary events of each cluster.
Motivated by this model, the TextLuas system provides an implementation for
tracking these dynamic clusters and visualizing their evolution using a metro
map metaphor. To provide overviews of cluster content, we adapt the tag cloud
representation to the dynamic clustering scenario. We demonstrate the TextLuas
system on two different text corpora, where it is shown to elucidate the
evolution of key themes. We also describe how TextLuas was applied to a problem
in bibliographic network research.
Comment: 21 page version