Incremental Lossless Graph Summarization
Given a fully dynamic graph, represented as a stream of edge insertions and
deletions, how can we obtain and incrementally update a lossless summary of its
current snapshot? As large-scale graphs are prevalent, concisely representing
them is essential for efficient storage and analysis. Lossless graph
summarization is an effective graph-compression technique with many desirable
properties. It aims to compactly represent the input graph as (a) a summary
graph consisting of supernodes (i.e., sets of nodes) and superedges (i.e.,
edges between supernodes), which provide a rough description, and (b) edge
corrections which fix errors induced by the rough description. While a number
of batch algorithms, suited for static graphs, have been developed for rapid
and compact graph summarization, they are highly inefficient in terms of time
and space for dynamic graphs, which are common in practice. In this work, we
propose MoSSo, the first incremental algorithm for lossless summarization of
fully dynamic graphs. In response to each change in the input graph, MoSSo
updates the output representation by repeatedly moving nodes among supernodes.
MoSSo carefully but rapidly decides which nodes to move and where to move them,
based on several novel ideas. Through extensive experiments on 10 real graphs,
we show MoSSo is (a) Fast and 'any time': processing each change in
near-constant time (less than 0.1 millisecond), up to 7 orders of magnitude
faster than running state-of-the-art batch methods, (b) Scalable: summarizing
graphs with hundreds of millions of edges, requiring sub-linear memory during
the process, and (c) Effective: achieving comparable compression ratios even to
state-of-the-art batch methods.
Comment: to appear at the 26th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD '20)
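
The output representation described in this abstract (a summary graph plus edge corrections) can be illustrated with a small decoding sketch. This is not MoSSo itself, which concerns how to maintain such a representation incrementally; it only shows how an exact edge set could be recovered from one. The function name `reconstruct`, the split of corrections into "plus" and "minus" sets, and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def reconstruct(supernodes, superedges, plus_corrections, minus_corrections):
    """Rebuild the exact undirected edge set from a summary and its corrections.

    supernodes: dict mapping a supernode id to the set of nodes it contains.
    superedges: iterable of (supernode id, supernode id) pairs.
    plus_corrections / minus_corrections: edges to add to / remove from the
    rough description given by the superedges (hypothetical naming).
    """
    edges = set()
    for a, b in superedges:
        if a == b:
            # A superedge from a supernode to itself covers all internal pairs.
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            pairs = ((u, v) for u in supernodes[a] for v in supernodes[b])
        for u, v in pairs:
            edges.add(frozenset((u, v)))
    edges |= {frozenset(e) for e in plus_corrections}
    edges -= {frozenset(e) for e in minus_corrections}
    return edges


# Toy example: 6 original edges stored as 1 superedge and 2 corrections.
supernodes = {"A": {1, 2}, "B": {3, 4, 5}}
superedges = [("A", "B")]
restored = reconstruct(supernodes, superedges,
                       plus_corrections=[(4, 5)],
                       minus_corrections=[(2, 5)])
original = {frozenset(e) for e in [(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (4, 5)]}
```

In this toy case the superedge alone over-approximates the graph by one edge and misses one, and the two corrections make the reconstruction exact, which is what "lossless" refers to.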
NeuKron: Constant-Size Lossy Compression of Sparse Reorderable Matrices and Tensors
Many real-world data are naturally represented as a sparse reorderable
matrix, whose rows and columns can be arbitrarily ordered (e.g., the adjacency
matrix of a bipartite graph). Storing a sparse matrix in conventional ways
requires an amount of space linear in the number of non-zeros, and lossy
compression of sparse matrices (e.g., Truncated SVD) typically requires an
amount of space linear in the number of rows and columns. In this work, we
propose NeuKron for compressing a sparse reorderable matrix into a
constant-size space. NeuKron generalizes Kronecker products using a recurrent
neural network with a constant number of parameters. NeuKron updates the
parameters so that a given matrix is approximated by the product and reorders
the rows and columns of the matrix to facilitate the approximation. The updates
take time linear in the number of non-zeros in the input matrix, and the
approximation of each entry can be retrieved in logarithmic time. We also
extend NeuKron to compress sparse reorderable tensors (e.g. multi-layer
graphs), which generalize matrices. Through experiments on ten real-world
datasets, we show that NeuKron is (a) Compact: requiring up to five orders of
magnitude less space than its best competitor with similar approximation
errors, (b) Accurate: giving up to 10x smaller approximation error than its
best competitors with similar size outputs, and (c) Scalable: successfully
compressing a matrix with over 230 million non-zero entries.
Comment: Accepted to WWW 2023 - The Web Conference 2023
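
To see why a Kronecker-style model can return a single entry in logarithmic time, consider the plain (non-neural) Kronecker power of a small seed matrix. The sketch below covers only that classical case, not NeuKron's recurrent-network generalization; the names `kron_power_entry` and `kron` and the seed values are assumptions for illustration.

```python
def kron_power_entry(seed, k, i, j):
    """Entry (i, j) of the k-th Kronecker power of a square seed matrix.

    The entry is the product of seed[i_d][j_d] over the k base-n digits of
    i and j, so the cost is k = log_n(N) multiplications for an N x N matrix.
    """
    n = len(seed)
    value = 1
    for _ in range(k):
        value *= seed[i % n][j % n]
        i //= n
        j //= n
    return value


def kron(a, b):
    """Explicit Kronecker product of two square matrices (for checking only)."""
    m = len(b)
    size = len(a) * m
    return [[a[r // m][c // m] * b[r % m][c % m] for c in range(size)]
            for r in range(size)]


seed = [[1, 2], [3, 4]]
full = kron(kron(seed, seed), seed)  # 8 x 8 matrix, materialized explicitly
matches = all(kron_power_entry(seed, 3, r, c) == full[r][c]
              for r in range(8) for c in range(8))
```

The explicit product needs space quadratic in the matrix side, whereas the lookup only needs the seed, which is the kind of saving a constant-size model aims for.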
TensorCodec: Compact Lossy Compression of Tensors without Strong Data Assumptions
Many real-world datasets are represented as tensors, i.e., multi-dimensional
arrays of numerical values. Storing them without compression often requires
substantial space, which grows exponentially with the order. While many tensor
compression algorithms are available, many of them rely on strong assumptions
about the data's order, sparsity, rank, and smoothness. In this work,
we propose TENSORCODEC, a lossy compression algorithm for general tensors that
do not necessarily adhere to strong input data assumptions. TENSORCODEC
incorporates three key ideas. The first idea is Neural Tensor-Train
Decomposition (NTTD) where we integrate a recurrent neural network into
Tensor-Train Decomposition to enhance its expressive power and alleviate the
limitations imposed by the low-rank assumption. Another idea is to fold the
input tensor into a higher-order tensor to reduce the space required by NTTD.
Finally, the mode indices of the input tensor are reordered to reveal patterns
that can be exploited by NTTD for improved approximation. Our analysis and
experiments on 8 real-world datasets demonstrate that TENSORCODEC is (a)
Concise: it gives up to 7.38x more compact compression than the best competitor
with similar reconstruction error, (b) Accurate: given the same budget for
compressed size, it yields up to 3.33x more accurate reconstruction than the
best competitor, (c) Scalable: its empirical compression time is linear in the
number of tensor entries, and it reconstructs each entry in logarithmic time.
Our code and datasets are available at https://github.com/kbrother/TensorCodec.
Comment: Accepted to ICDM 2023 - IEEE International Conference on Data Mining
2023
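
For readers unfamiliar with Tensor-Train Decomposition, the following minimal sketch shows how one entry is reconstructed in the standard (non-neural) format: each mode contributes one small matrix chosen by the index in that mode, and the entry is the product of those matrices. In NTTD, per the abstract, a recurrent neural network is integrated into this scheme; that part is not modeled here. The helper names and the toy cores are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def tt_entry(cores, index):
    """Reconstruct one tensor entry from Tensor-Train cores.

    cores[k][i] is the r_{k-1} x r_k matrix selected by index value i in
    mode k, with r_0 = r_d = 1, so the chained product is a 1 x 1 matrix.
    """
    result = [[1]]
    for core, i in zip(cores, index):
        result = matmul(result, core[i])
    return result[0][0]


# Toy 2 x 2 x 2 tensor with TT ranks (1, 2, 2, 1).
cores = [
    [[[1, 0]], [[0, 1]]],                      # mode 1: two 1 x 2 matrices
    [[[1, 0], [0, 1]], [[0, 1], [1, 0]]],      # mode 2: two 2 x 2 matrices
    [[[1], [2]], [[3], [4]]],                  # mode 3: two 2 x 1 matrices
]
value = tt_entry(cores, (1, 0, 1))
```

Each lookup multiplies one small matrix per mode, so its cost grows with the order of the tensor rather than with the number of entries.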
BeGin: Extensive Benchmark Scenarios and An Easy-to-use Framework for Graph Continual Learning
Continual Learning (CL) is the process of ceaselessly learning a sequence of
tasks. Most existing CL methods deal with independent data (e.g., images and
text) for which many benchmark frameworks and results under standard
experimental settings are available. Compared to them, however, CL methods for
graph data (graph CL) are relatively underexplored because of (a) the lack of
standard experimental settings, especially regarding how to deal with the
dependency between instances, (b) the lack of benchmark datasets and scenarios,
and (c) high complexity in implementation and evaluation due to the dependency.
In this paper, regarding (a) we define four standard incremental settings
(task-, class-, domain-, and time-incremental) for node-, link-, and
graph-level problems, extending the previously explored scope. Regarding (b),
we provide 31 benchmark scenarios based on 20 real-world graphs. Regarding (c),
we develop BeGin, an easy and fool-proof framework for graph CL. BeGin is
easily extended since it is modularized with reusable modules for data
processing, algorithm design, and evaluation. In particular, the evaluation
module is completely separated from user code to eliminate potential mistakes.
Regarding benchmark results, we cover 3X more combinations of incremental
settings and levels of problems than the latest benchmark. All assets for the
benchmark framework are publicly available at
https://github.com/ShinhwanKang/BeGin
Hypergraph Motifs and Their Extensions Beyond Binary
Hypergraphs naturally represent group interactions, which are omnipresent in
many domains: collaborations of researchers, co-purchases of items, and joint
interactions of proteins, to name a few. In this work, we propose tools for
answering the following questions: (Q1) what are the structural design
principles of real-world hypergraphs? (Q2) how can we compare local structures
of hypergraphs of different sizes? (Q3) how can we identify the domains that
hypergraphs come from? We first define hypergraph motifs (h-motifs), which describe
the overlapping patterns of three connected hyperedges. Then, we define the
significance of each h-motif in a hypergraph as its occurrences relative to
those in properly randomized hypergraphs. Lastly, we define the characteristic
profile (CP) as the vector of the normalized significance of every h-motif.
Regarding Q1, we find that h-motifs' occurrences in 11 real-world hypergraphs
from 5 domains are clearly distinguished from those of randomized hypergraphs.
Then, we demonstrate that CPs capture local structural patterns unique to each
domain, and thus comparing CPs of hypergraphs addresses Q2 and Q3. The concept
of CP is extended to represent the connectivity pattern of each node or
hyperedge as a vector, which proves useful in node classification and hyperedge
prediction. Our algorithmic contribution is to propose MoCHy, a family of
parallel algorithms for counting h-motifs' occurrences in a hypergraph. We
theoretically analyze their speed and accuracy and show empirically that the
advanced approximate version MoCHy-A+ is more accurate and faster than the
basic approximate and exact versions, respectively. Furthermore, we explore
ternary hypergraph motifs that extend h-motifs by taking into account not only
the presence but also the cardinality of intersections among hyperedges. This
extension proves beneficial for all previously mentioned applications.
Comment: Extended version of VLDB 2020 paper arXiv:2003.0185
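
The abstract defines significance as an h-motif's occurrences relative to those in randomized hypergraphs, and the characteristic profile (CP) as the vector of normalized significances, without giving formulas. The sketch below uses one plausible instantiation: a relative difference with a small smoothing constant, followed by scaling to unit length. The exact formula, the constant `eps`, the function name, and the counts are all assumptions for illustration, not values from the paper.

```python
import math


def characteristic_profile(real_counts, random_counts, eps=1.0):
    """Normalized significance vector for a list of motif counts.

    Assumed significance: (real - random) / (real + random + eps).
    Assumed normalization: divide by the Euclidean norm of the vector.
    """
    significance = [(m - r) / (m + r + eps)
                    for m, r in zip(real_counts, random_counts)]
    norm = math.sqrt(sum(s * s for s in significance))
    if norm == 0:
        return significance  # all motifs occur exactly as often as at random
    return [s / norm for s in significance]


# Hypothetical counts for three motifs in a real and a randomized hypergraph.
cp = characteristic_profile([30, 5, 10], [10, 15, 10])
```

Under these assumptions, a positive component marks a motif that is over-represented relative to the randomized baseline, a negative one marks under-representation, and the unit-length scaling is what would make profiles of hypergraphs of different sizes comparable.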
- …