218 research outputs found
Truss Decomposition in Massive Networks
The k-truss is a type of cohesive subgraphs proposed recently for the study
of networks. While the problem of computing most cohesive subgraphs is NP-hard,
there exists a polynomial time algorithm for computing k-truss. Compared with
k-core which is also efficient to compute, k-truss represents the "core" of a
k-core that keeps the key information of, while filtering out less important
information from, the k-core. However, existing algorithms for computing
k-truss are inefficient for handling today's massive networks. We first improve
the existing in-memory algorithm for computing k-truss in networks of moderate
size. Then, we propose two I/O-efficient algorithms to handle massive networks
that cannot fit in main memory. Our experiments on real datasets verify the
efficiency of our algorithms and the value of k-truss.Comment: VLDB201
Large Graph Analysis in the GMine System
Current applications have produced graphs on the order of hundreds of
thousands of nodes and millions of edges. To take advantage of such graphs, one
must be able to find patterns, outliers and communities. These tasks are better
performed in an interactive environment, where human expertise can guide the
process. For large graphs, though, there are some challenges: the excessive
processing requirements are prohibitive, and drawing hundred-thousand nodes
results in cluttered images hard to comprehend. To cope with these problems, we
propose an innovative framework suited for any kind of tree-like graph visual
design. GMine integrates (a) a representation for graphs organized as
hierarchies of partitions - the concepts of SuperGraph and Graph-Tree; and (b)
a graph summarization methodology - CEPS. Our graph representation deals with
the problem of tracing the connection aspects of a graph hierarchy with sub
linear complexity, allowing one to grasp the neighborhood of a single node or
of a group of nodes in a single click. As a proof of concept, the visual
environment of GMine is instantiated as a system in which large graphs can be
investigated globally and locally
Network Sampling: From Static to Streaming Graphs
Network sampling is integral to the analysis of social, information, and
biological networks. Since many real-world networks are massive in size,
continuously evolving, and/or distributed in nature, the network structure is
often sampled in order to facilitate study. For these reasons, a more thorough
and complete understanding of network sampling is critical to support the field
of network science. In this paper, we outline a framework for the general
problem of network sampling, by highlighting the different objectives,
population and units of interest, and classes of network sampling methods. In
addition, we propose a spectrum of computational models for network sampling
methods, ranging from the traditionally studied model based on the assumption
of a static domain to a more challenging model that is appropriate for
streaming domains. We design a family of sampling methods based on the concept
of graph induction that generalize across the full spectrum of computational
models (from static to streaming) while efficiently preserving many of the
topological properties of the input graphs. Furthermore, we demonstrate how
traditional static sampling algorithms can be modified for graph streams for
each of the three main classes of sampling methods: node, edge, and
topology-based sampling. Our experimental results indicate that our proposed
family of sampling methods more accurately preserves the underlying properties
of the graph for both static and streaming graphs. Finally, we study the impact
of network sampling algorithms on the parameter estimation and performance
evaluation of relational classification algorithms
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
- …