45 research outputs found
Distributed Methods for High-dimensional and Large-scale Tensor Factorization
Given a high-dimensional large-scale tensor, how can we decompose it into
latent factors? Can we process it on commodity computers with limited memory?
These questions are closely related to recommender systems, which have modeled
rating data not as a matrix but as a tensor to utilize contextual information
such as time and location. This increase in the dimension requires tensor
factorization methods scalable with both the dimension and size of a tensor. In
this paper, we propose two distributed tensor factorization methods, SALS and
CDTF. Both methods are scalable with all aspects of data, and they show an
interesting trade-off between convergence speed and memory requirements. SALS
updates a subset of the columns of a factor matrix at a time, and CDTF, a
special case of SALS, updates one column at a time. In our experiments, only
our methods factorize a 5-dimensional tensor with 1 billion observable entries,
10M mode length, and 1K rank, while all other state-of-the-art methods fail.
Moreover, our methods require several orders of magnitude less memory than
competing methods. We implement our methods on MapReduce with two widely applicable
optimization techniques: local disk caching and greedy row assignment. They
speed up our methods by up to 98.2X and the competitors by up to 5.9X.
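The column-at-a-time idea described above can be illustrated on a single machine. The following is a minimal sketch, assuming a squared-error loss with L2 regularization (`lam`) over the observed entries; the function names, the regularizer, and the toy data are illustrative assumptions, and none of the distributed MapReduce machinery from the paper is shown.

```python
import random

def rank_one(factors, idx, k):
    """Product of the k-th column entries selected by a tensor index."""
    prod = 1.0
    for mode, i in enumerate(idx):
        prod *= factors[mode][i][k]
    return prod

def predict(factors, idx, rank):
    """Model value at idx: sum over columns of the rank-one terms."""
    return sum(rank_one(factors, idx, k) for k in range(rank))

def loss(entries, factors, rank, lam):
    """Squared error on observed entries plus L2 regularization (assumed)."""
    err = sum((x - predict(factors, idx, rank)) ** 2 for idx, x in entries)
    reg = sum(v * v for f in factors for row in f for v in row)
    return err + lam * reg

def update_column(entries, factors, rank, k, lam):
    """Update column k of every factor matrix, one mode at a time."""
    # Residuals with column k's own contribution added back; these do not
    # depend on column k, so they stay valid while column k is updated.
    resid = [x - predict(factors, idx, rank) + rank_one(factors, idx, k)
             for idx, x in entries]
    for mode in range(len(factors)):
        num = [0.0] * len(factors[mode])
        den = [lam] * len(factors[mode])  # lam > 0 keeps this non-zero
        for (idx, _), r in zip(entries, resid):
            other = 1.0
            for m, i in enumerate(idx):
                if m != mode:
                    other *= factors[m][i][k]
            num[idx[mode]] += r * other
            den[idx[mode]] += other * other
        # Closed-form least-squares solution for each row of column k.
        for i in range(len(factors[mode])):
            factors[mode][i][k] = num[i] / den[i]

# Toy 3-way tensor with 30 observed entries (hypothetical data).
random.seed(0)
dims, rank, lam = (4, 3, 5), 2, 0.1
entries = [(tuple(random.randrange(d) for d in dims), random.random())
           for _ in range(30)]
factors = [[[random.random() for _ in range(rank)] for _ in range(d)]
           for d in dims]

before = loss(entries, factors, rank, lam)
for _ in range(5):
    for k in range(rank):
        update_column(entries, factors, rank, k, lam)
after = loss(entries, factors, rank, lam)
```

Because each column update solves its subproblem exactly, the loss in this sketch cannot increase from one update to the next. Updating a subset of columns jointly instead of a single column would correspond to the more general scheme the abstract attributes to SALS.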
Hypercore Decomposition for Non-Fragile Hyperedges: Concepts, Algorithms, Observations, and Applications
Hypergraphs are a powerful abstraction for modeling high-order relations,
which are ubiquitous in many fields. A hypergraph consists of nodes and
hyperedges (i.e., subsets of nodes); and there have been a number of attempts
to extend the notion of k-cores, which proved useful with numerous
applications for pairwise graphs, to hypergraphs. However, the previous
extensions are based on an unrealistic assumption that hyperedges are fragile,
i.e., a high-order relation becomes obsolete as soon as a single member leaves
it.
In this work, we propose a new substructure model, called the (k, t)-hypercore,
based on the assumption that high-order relations remain as long as at least a
t fraction of the members remain. Specifically, it is defined as the maximal
subhypergraph where (1) every node has degree at least k in it and (2) at least
a t fraction of the nodes remain in every hyperedge. We first prove that, given
t (or k), the (k, t)-hypercore for every possible k (or t) can be found in time
linear w.r.t. the sum of the sizes of hyperedges. Then, we demonstrate that
real-world hypergraphs from the same domain share similar (k, t)-hypercore
structures, which capture different perspectives depending on t. Lastly, we
show the successful
applications of our model in identifying influential nodes, dense
substructures, and vulnerability in hypergraphs.
Comment: 24 pages, 14 figures
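The two conditions in the definition above can be turned into a simple peeling procedure. The following is a minimal sketch, writing k for the degree threshold and t for the fraction threshold (symbol and function names are assumptions made here). It is a naive fixed-point loop for one (k, t) pair, not the linear-time algorithm the abstract refers to.

```python
def hypercore(hyperedges, k, t):
    """Naive peeling: returns (surviving nodes, surviving hyperedges).

    hyperedges: list of sets of nodes.
    The second return value maps the index of each surviving hyperedge
    to the set of its surviving members.
    """
    original_size = [len(e) for e in hyperedges]
    alive_edges = {i: set(e) for i, e in enumerate(hyperedges) if e}
    alive_nodes = set().union(*alive_edges.values()) if alive_edges else set()

    changed = True
    while changed:
        changed = False
        # Degree of a node = number of surviving hyperedges containing it.
        degree = {v: 0 for v in alive_nodes}
        for members in alive_edges.values():
            for v in members:
                degree[v] += 1
        # Condition (1): remove nodes whose degree dropped below k.
        weak = {v for v in alive_nodes if degree[v] < k}
        if weak:
            changed = True
            alive_nodes -= weak
            for members in alive_edges.values():
                members -= weak
        # Condition (2): remove hyperedges that kept less than a t
        # fraction of their original members.
        dead = [i for i, m in alive_edges.items()
                if len(m) < t * original_size[i]]
        if dead:
            changed = True
            for i in dead:
                del alive_edges[i]
    return alive_nodes, alive_edges

# Hypothetical toy hypergraph with five nodes and four hyperedges.
toy = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {4, 5}]
nodes, edges = hypercore(toy, k=2, t=0.6)
```

In this toy input, node 5 has degree 1 and is removed first; the hyperedge {4, 5} then keeps only half of its members, which is below t = 0.6, so it is dropped while the rest of the hypergraph survives.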
Set2Box: Similarity Preserving Representation Learning of Sets
Sets have been used for modeling various types of objects (e.g., a document
as the set of keywords in it and a customer as the set of the items that she
has purchased). Measuring similarity (e.g., Jaccard Index) between sets has
been a key building block of a wide range of applications, including
plagiarism detection, recommendation, and graph compression. However, as sets
have grown in numbers and sizes, the computational cost and storage required
for set similarity computation have become substantial, and this has led to the
development of hashing and sketching based solutions. In this work, we propose
Set2Box, a learning-based approach for compressed representations of sets from
which various similarity measures can be estimated accurately in constant time.
The key idea is to represent sets as boxes to precisely capture overlaps of
sets. Additionally, based on the proposed box quantization scheme, we design
Set2Box+, which yields more concise yet more accurate box representations of
sets. Through extensive experiments on 8 real-world datasets, we show that,
compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to
40.8X smaller estimation error while requiring 60% fewer bits to encode sets,
(b) Concise: yielding up to 96.8X more concise representations with similar
estimation error, and (c) Versatile: enabling the estimation of four
set-similarity measures from a single representation of each set.
Comment: Accepted by ICDM 202
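To make the box idea concrete, the sketch below estimates the Jaccard index of two sets from the volumes of their boxes and of the boxes' overlap. It assumes each box is stored as a pair of corner vectors (`lo`, `hi`); the function names and example boxes are hypothetical, and the learning and quantization steps of Set2Box and Set2Box+ are not shown.

```python
def box_volume(lo, hi):
    """Volume of an axis-aligned box; empty boxes have volume 0."""
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= max(b - a, 0.0)
    return vol

def box_intersection(box_a, box_b):
    """Corners of the overlap of two boxes (may describe an empty box)."""
    lo = [max(a, b) for a, b in zip(box_a[0], box_b[0])]
    hi = [min(a, b) for a, b in zip(box_a[1], box_b[1])]
    return lo, hi

def estimate_jaccard(box_a, box_b):
    """Jaccard estimate: overlap volume divided by union volume."""
    inter = box_volume(*box_intersection(box_a, box_b))
    union = box_volume(*box_a) + box_volume(*box_b) - inter
    return inter / union if union > 0 else 0.0

# Two hypothetical 2-dimensional boxes standing in for two sets.
box_x = ([0.0, 0.0], [2.0, 2.0])   # volume 4
box_y = ([1.0, 0.0], [3.0, 2.0])   # volume 4, overlap volume 2
similarity = estimate_jaccard(box_x, box_y)  # 2 / (4 + 4 - 2)
```

The cost depends only on the number of box dimensions, not on the sizes of the underlying sets, which is what makes constant-time estimation possible. Other measures built from the same intersection and union sizes could be estimated from the same three volumes.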
Incremental Lossless Graph Summarization
Given a fully dynamic graph, represented as a stream of edge insertions and
deletions, how can we obtain and incrementally update a lossless summary of its
current snapshot? As large-scale graphs are prevalent, concisely representing
them is inevitable for efficient storage and analysis. Lossless graph
summarization is an effective graph-compression technique with many desirable
properties. It aims to compactly represent the input graph as (a) a summary
graph consisting of supernodes (i.e., sets of nodes) and superedges (i.e.,
edges between supernodes), which provide a rough description, and (b) edge
corrections which fix errors induced by the rough description. While a number
of batch algorithms, suited for static graphs, have been developed for rapid
and compact graph summarization, they are highly inefficient in terms of time
and space for dynamic graphs, which are common in practice. In this work, we
propose MoSSo, the first incremental algorithm for lossless summarization of
fully dynamic graphs. In response to each change in the input graph, MoSSo
updates the output representation by repeatedly moving nodes among supernodes.
MoSSo decides nodes to be moved and their destinations carefully but rapidly
based on several novel ideas. Through extensive experiments on 10 real graphs,
we show MoSSo is (a) Fast and 'any time': processing each change in
near-constant time (less than 0.1 millisecond), up to 7 orders of magnitude
faster than running state-of-the-art batch methods, (b) Scalable: summarizing
graphs with hundreds of millions of edges, requiring sub-linear memory during
the process, and (c) Effective: achieving comparable compression ratios even to
state-of-the-art batch methods.
Comment: to appear at the 26th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '20)
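The output representation described above (summary graph plus edge corrections) can be decoded back into the exact input graph. The following is a minimal sketch of that decoding step only, assuming corrections are split into edges to add (`c_plus`) and edges to remove (`c_minus`); these names and the toy data are illustrative assumptions, and the incremental update logic of MoSSo is not shown.

```python
from itertools import combinations

def restore_edges(supernodes, superedges, c_plus, c_minus):
    """Rebuild the undirected edge set from a lossless summary.

    supernodes: dict mapping supernode id -> set of nodes
    superedges: iterable of (supernode id, supernode id) pairs
    c_plus / c_minus: edges to add to / remove from the rough description
    """
    edges = set()
    for a, b in superedges:
        if a == b:
            # A self-loop superedge connects all pairs inside the supernode.
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            pairs = ((u, v) for u in supernodes[a] for v in supernodes[b])
        edges.update(frozenset(p) for p in pairs)
    edges |= {frozenset(e) for e in c_plus}
    edges -= {frozenset(e) for e in c_minus}
    return edges

# Hypothetical summary: one superedge plus two corrections.
supernodes = {"A": {1, 2, 3}, "B": {4, 5}}
superedges = [("A", "B")]
restored = restore_edges(supernodes, superedges,
                         c_plus=[(1, 2)], c_minus=[(3, 5)])
```

Here three stored items (one superedge and two corrections) describe a graph with six edges, which illustrates where the compression comes from when many nodes share similar neighborhoods.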
Spear and Shield: Adversarial Attacks and Defense Methods for Model-Based Link Prediction on Continuous-Time Dynamic Graphs
Real-world graphs are dynamic, constantly evolving with new interactions,
such as financial transactions in financial networks. Temporal Graph Neural
Networks (TGNNs) have been developed to effectively capture the evolving
patterns in dynamic graphs. While these models have demonstrated their
superiority, being widely adopted in various important fields, their
vulnerabilities against adversarial attacks remain largely unexplored. In this
paper, we propose T-SPEAR, a simple and effective adversarial attack method for
link prediction on continuous-time dynamic graphs, focusing on investigating
the vulnerabilities of TGNNs. Specifically, before the training procedure of a
victim model, which is a TGNN for link prediction, we inject edge perturbations
to the data that are unnoticeable in terms of the four constraints we propose,
and yet effective enough to cause malfunction of the victim model. Moreover, we
propose a robust training approach T-SHIELD to mitigate the impact of
adversarial attacks. By using edge filtering and enforcing temporal smoothness
to node embeddings, we enhance the robustness of the victim model. Our
experimental study shows that T-SPEAR significantly degrades the victim model's
performance on link prediction tasks, and even more, our attacks are
transferable to other TGNNs, which differ from the victim model assumed by the
attacker. Moreover, we demonstrate that T-SHIELD effectively filters out
adversarial edges and exhibits robustness against adversarial attacks,
surpassing the link prediction performance of the naive TGNN by up to 11.2%
under T-SPEAR.
NeuKron: Constant-Size Lossy Compression of Sparse Reorderable Matrices and Tensors
Many real-world data are naturally represented as a sparse reorderable
matrix, whose rows and columns can be arbitrarily ordered (e.g., the adjacency
matrix of a bipartite graph). Storing a sparse matrix in conventional ways
requires an amount of space linear in the number of non-zeros, and lossy
compression of sparse matrices (e.g., Truncated SVD) typically requires an
amount of space linear in the number of rows and columns. In this work, we
propose NeuKron for compressing a sparse reorderable matrix into a
constant-size space. NeuKron generalizes Kronecker products using a recurrent
neural network with a constant number of parameters. NeuKron updates the
parameters so that a given matrix is approximated by the product and reorders
the rows and columns of the matrix to facilitate the approximation. The updates
take time linear in the number of non-zeros in the input matrix, and the
approximation of each entry can be retrieved in logarithmic time. We also
extend NeuKron to compress sparse reorderable tensors (e.g., multi-layer
graphs), which generalize matrices. Through experiments on ten real-world
datasets, we show that NeuKron is (a) Compact: requiring up to five orders of
magnitude less space than its best competitor with similar approximation
errors, (b) Accurate: giving up to 10x smaller approximation error than its
best competitors with similar size outputs, and (c) Scalable: successfully
compressing a matrix with over 230 million non-zero entries.
Comment: Accepted to WWW 2023 - The Web Conference 2023
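The logarithmic-time retrieval mentioned above is easiest to see for a plain Kronecker power of a small seed matrix. The sketch below covers only that base case, with a hypothetical 2x2 seed; NeuKron itself generalizes Kronecker products with a recurrent neural network, which is not modeled here.

```python
def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    rb, cb = len(b), len(b[0])
    return [[a[i // rb][j // cb] * b[i % rb][j % cb]
             for j in range(len(a[0]) * cb)]
            for i in range(len(a) * rb)]

def kron_power_entry(seed, power, i, j):
    """Entry (i, j) of the `power`-th Kronecker power of `seed`.

    Reads one base-n digit of i and one base-m digit of j per step, so the
    cost is `power` multiplications, i.e. logarithmic in the matrix size.
    """
    n, m = len(seed), len(seed[0])
    value = 1.0
    for _ in range(power):
        value *= seed[i % n][j % m]
        i //= n
        j //= m
    return value

# Hypothetical seed; its 3rd Kronecker power is an 8 x 8 matrix.
seed = [[0.9, 0.5], [0.5, 0.1]]
full = kron(kron(seed, seed), seed)
entry = kron_power_entry(seed, 3, 5, 2)
```

Only the four seed values need to be stored, regardless of how large the represented matrix is, which is the sense in which such a model occupies constant space. The dense `kron` helper is included solely to check the digit-based lookup against an explicit product.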