Drawing Binary Tanglegrams: An Experimental Evaluation
A binary tanglegram is a pair of binary trees whose leaf sets are in
one-to-one correspondence; matching leaves are connected by inter-tree edges.
In applications such as phylogenetics or software engineering, the individual trees are required to be drawn without crossings, so a natural optimization problem, called the tanglegram layout problem, is to minimize the number of crossings between inter-tree edges.
The tanglegram layout problem is NP-hard and is studied both in application domains and in theory. In this paper we present an experimental comparison of the recursive algorithm of Buchin et al., our variant of their algorithm, the hierarchy-sort algorithm of Holten and van Wijk, and an integer
quadratic program that yields optimal solutions.
Comment: see http://www.siam.org/proceedings/alenex/2009/alx09_011_nollenburgm.pd
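For fixed leaf orders on both sides, the quantity being minimized is just the number of inverted pairs between the two matched leaf sequences. A minimal sketch of that objective (function and argument names are hypothetical; the layout problem itself additionally searches over subtree rotations to change the leaf orders):

```python
def count_crossings(order_left, order_right, matching):
    """Count pairwise crossings of inter-tree edges for fixed leaf orders.

    order_left / order_right: leaf labels listed top-to-bottom in each tree.
    matching: dict mapping each left leaf to its partner on the right.
    Two inter-tree edges cross iff their endpoints appear in opposite
    relative order on the two sides, i.e. crossings = inversions.
    """
    pos_right = {leaf: i for i, leaf in enumerate(order_right)}
    seq = [pos_right[matching[leaf]] for leaf in order_left]
    crossings = 0
    for i in range(len(seq)):          # O(n^2) inversion count; a
        for j in range(i + 1, len(seq)):  # merge-sort variant gives O(n log n)
            if seq[i] > seq[j]:
                crossings += 1
    return crossings
```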
Overlap Removal of Dimensionality Reduction Scatterplot Layouts
Dimensionality Reduction (DR) scatterplot layouts have become a ubiquitous
visualization tool for analyzing multidimensional data across many application
areas. Despite their popularity, scatterplots suffer from occlusion, especially
when markers convey information: occlusion makes it hard for users to estimate
the sizes of groups of items and, more importantly, can obscure items that are
critical to the analysis at hand. Different strategies have been devised to
address this issue, either producing overlap-free layouts, which lack the power
of contemporary DR techniques at uncovering interesting data patterns, or
eliminating overlaps as a post-processing step. Despite
the good results of post-processing techniques, the best methods typically
expand or distort the scatterplot area, thus reducing markers' size (sometimes)
to unreadable dimensions, defeating the purpose of removing overlaps. This
paper presents a novel post-processing strategy to remove DR layouts' overlaps
that faithfully preserves the original layout's characteristics and markers'
sizes. We show that the proposed strategy surpasses the state-of-the-art in
overlap removal through an extensive comparative evaluation considering
multiple metrics, while running two to three orders of magnitude faster on
large datasets.
Comment: 11 pages and 9 figures
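For reference, the crudest possible post-processing overlap-removal pass can be sketched as a pairwise push-apart over circular markers. This is not the paper's method and, unlike it, makes no attempt to preserve the layout's global characteristics; it only shows the problem setup (fixed marker sizes, positions adjusted until no pair overlaps):

```python
import math

def remove_overlaps(points, radii, iters=200):
    """Naive pairwise push-apart (a sketch, NOT the paper's strategy).
    Overlapping circular markers are moved apart along the line joining
    their centers until no pair overlaps or the iteration budget runs
    out. Marker radii are kept fixed."""
    pts = [list(p) for p in points]
    n = len(pts)
    for _ in range(iters):
        moved = False
        for i in range(n):
            for j in range(i + 1, n):
                dx = pts[j][0] - pts[i][0]
                dy = pts[j][1] - pts[i][1]
                d = math.hypot(dx, dy) or 1e-9   # avoid division by zero
                overlap = radii[i] + radii[j] - d
                if overlap > 1e-9:
                    ux, uy = dx / d, dy / d      # unit separation vector
                    shift = overlap / 2          # move each marker half-way
                    pts[i][0] -= ux * shift; pts[i][1] -= uy * shift
                    pts[j][0] += ux * shift; pts[j][1] += uy * shift
                    moved = True
        if not moved:
            break
    return pts
```

A real post-processing technique must additionally bound how far markers drift from their DR positions, which is exactly the quality axis the paper's evaluation measures.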
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the \emph{tightest} lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations \textit{directly} in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
\emph{exact} solution to the problem. The suggested solution provides the
tightest estimation of the $\ell_2$-norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal data
transformation used for the underlying data compression scheme.
Comment: 25 pages, 20 figures, accepted in VLDB
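To make the bounding problem concrete, the sketch below computes simple, deliberately loose lower and upper bounds on the Euclidean distance when each object keeps a different subset of orthonormal-basis coefficients. The paper's contribution is an exact algorithm for the tightest such bounds; this textbook-style version (hypothetical function name) only illustrates why bounds exist at all:

```python
import math

def distance_bounds(coeffs_x, energy_x, coeffs_y, energy_y):
    """Loose lower/upper bounds on ||x - y|| under orthonormal compression.

    coeffs_*: dict basis-index -> kept coefficient value
    energy_*: total squared norm of the discarded coefficients
    By orthonormality the squared distance decomposes coordinate-wise,
    so summing over any subset of coordinates is a valid lower bound;
    the upper bound charges each unknown coordinate its worst case."""
    shared = set(coeffs_x) & set(coeffs_y)
    only_x = set(coeffs_x) - shared
    only_y = set(coeffs_y) - shared
    lb_sq = sum((coeffs_x[i] - coeffs_y[i]) ** 2 for i in shared)
    rx, ry = math.sqrt(energy_x), math.sqrt(energy_y)
    ub_sq = lb_sq
    ub_sq += sum((abs(coeffs_x[i]) + ry) ** 2 for i in only_x)  # y unknown here
    ub_sq += sum((abs(coeffs_y[i]) + rx) ** 2 for i in only_y)  # x unknown here
    ub_sq += (rx + ry) ** 2   # coordinates discarded by both (triangle ineq.)
    return math.sqrt(lb_sq), math.sqrt(ub_sq)
```

Tightening the gap between these two values, optimally and fast, is precisely the optimization problem the paper solves.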
Rectangular Layouts and Contact Graphs
Contact graphs of isothetic rectangles unify many concepts from applications
including VLSI and architectural design, computational geometry, and GIS.
Minimizing the area of their corresponding {\em rectangular layouts} is a key
problem. We study the area-optimization problem and show that it is NP-hard to
find a minimum-area rectangular layout of a given contact graph. We present
O(n)-time algorithms that construct O(n^2)-area rectangular layouts for
general contact graphs and O(n log n)-area rectangular layouts for trees.
(For trees, this is an O(log n)-approximation algorithm.) We also present an
infinite family of graphs (resp., trees) that require Omega(n^2) (resp.,
Omega(n log n)) area.
We derive these results by presenting a new characterization of graphs that
admit rectangular layouts using the related concept of {\em rectangular duals}.
A corollary to our results relates the class of graphs that admit rectangular
layouts to {\em rectangle of influence drawings}.
Comment: 28 pages, 13 figures, 55 references, 1 appendix
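A small sketch of the underlying object: given an isothetic-rectangle layout, recover its contact graph, where two rectangles are adjacent exactly when their boundaries share a segment of positive length (function name hypothetical; interiors are assumed disjoint and not checked):

```python
def contact_graph(rects):
    """Edge set of the contact graph of isothetic rectangles.

    rects: list of (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    Rectangles i < j are adjacent iff they touch along a vertical or
    horizontal side with positive-length overlap (corner-only contacts
    do not count)."""
    def overlap_len(a1, a2, b1, b2):
        return min(a2, b2) - max(a1, b1)

    edges = set()
    for i, (ax1, ay1, ax2, ay2) in enumerate(rects):
        for j in range(i + 1, len(rects)):
            bx1, by1, bx2, by2 = rects[j]
            touch_v = (ax2 == bx1 or bx2 == ax1) and \
                overlap_len(ay1, ay2, by1, by2) > 0
            touch_h = (ay2 == by1 or by2 == ay1) and \
                overlap_len(ax1, ax2, bx1, bx2) > 0
            if touch_v or touch_h:
                edges.add((i, j))
    return edges
```

The hard direction studied in the paper is the inverse: given the graph, construct a layout of minimum area.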
Qd-tree: Learning Data Layouts for Big Data Analytics
Corporations today collect data at an unprecedented and accelerating scale,
making the need to run queries on large datasets increasingly important.
Technologies such as columnar block-based data organization and compression
have become standard practice in most commercial database systems. However, the
problem of best assigning records to data blocks on storage is still open. For
example, today's systems usually partition data by arrival time into row
groups, or range/hash partition the data based on selected fields. For a given
workload, however, such techniques are unable to optimize for the important
metric of the number of blocks accessed by a query. This metric directly
relates to the I/O cost, and therefore performance, of most analytical queries.
Further, they are unable to exploit additional available storage to drive this
metric down further.
In this paper, we propose a new framework called a query-data routing tree,
or qd-tree, to address this problem, and propose two algorithms for its
construction based on greedy and deep reinforcement learning techniques.
Experiments over benchmark and real workloads show that a qd-tree can provide
physical speedups of more than an order of magnitude compared to current
blocking schemes, and can reach within 2X of the lower bound for data skipping
based on selectivity, while providing complete semantic descriptions of created
blocks.
Comment: ACM SIGMOD 202
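The routing idea behind the qd-tree can be sketched in a few lines. This is a hypothetical single-column toy: real qd-trees cut on arbitrary predicates over many columns, and the paper's algorithms (greedy, deep RL) choose the cuts; here the tree is just given:

```python
class QdNode:
    """Minimal qd-tree sketch over one numeric column. An internal node
    holds a cut value v: records with value < v route left, others
    right. Leaves correspond to storage blocks."""
    def __init__(self, cut=None, left=None, right=None):
        self.cut, self.left, self.right = cut, left, right

    def route(self, rec):
        """Leaf (block) that stores record value rec."""
        if self.cut is None:
            return self
        child = self.left if rec < self.cut else self.right
        return child.route(rec)

    def blocks_for_range(self, lo, hi):
        """Leaves a range query [lo, hi) must read: subtrees whose key
        range is disjoint from the query are skipped entirely."""
        if self.cut is None:
            return [self]
        out = []
        if lo < self.cut:
            out += self.left.blocks_for_range(lo, hi)
        if hi > self.cut:
            out += self.right.blocks_for_range(lo, hi)
        return out
```

The metric the paper optimizes is exactly `len(blocks_for_range(...))` summed over the workload, which is why cut selection matters.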
Expansion of layouts of complete binary trees into grids
Let T_h be the complete binary tree of height h. Let M be the infinite grid graph with vertex set Z^2, where two vertices (x_1, y_1) and (x_2, y_2) of M are adjacent if and only if |x_1 - x_2| + |y_1 - y_2| = 1. Suppose that T is a tree which is a subdivision of T_h and is also isomorphic to a subgraph of M. Motivated by issues in optimal VLSI design, we show that the point expansion ratio n(T)/n(T_h) = n(T)/(2^(h+1) - 1) is bounded below by 1.122 for h sufficiently large. That is, we give bounds on how many vertices of degree 2 must be inserted along the edges of T_h in order that the resulting tree can be laid out in the grid.
Concerning the constructive end of VLSI design, suppose that T is a tree which is a subdivision of T_h and is also isomorphic to a subgraph of the n x n grid graph. Define the expansion ratio of such a layout to be n^2/n(T_h) = n^2/(2^(h+1) - 1). We show constructively that the minimum possible expansion ratio over all layouts of T_h is bounded above by 1.4656 for sufficiently large h. That is, we give efficient layouts of complete binary trees into square grids, improving upon the previous work of others. We also give bounds for the point expansion and expansion problems for layouts of T_h into extended grids, i.e. grids with added diagonals.
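For calibration, the classic H-tree construction already lays out T_h in a square grid with a bounded expansion ratio. The sketch below (standard construction, not the improved layouts of the abstract) computes that ratio; it tends to 2 as h grows, so an upper bound of 1.4656 is a genuine improvement:

```python
def h_tree_expansion(h):
    """Expansion ratio of the classic H-tree layout of T_h (even h)
    into a square grid: grid area side^2 over n(T_h) = 2^(h+1) - 1.
    The side length satisfies s(h) = 2*s(h-2) + 1 with s(0) = 1, i.e.
    two half-size H's plus a joining row/column at each doubling."""
    assert h % 2 == 0, "classic H-tree doubles two levels at a time"
    side = 1
    for _ in range(h // 2):
        side = 2 * side + 1
    nodes = 2 ** (h + 1) - 1
    return side * side / nodes
```

Since side = 2^(h/2 + 1) - 1, the ratio approaches (2^(h/2+1))^2 / 2^(h+1) = 2 from below.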
Optimal column layout for hybrid workloads
Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload is not well-suited for a read-mostly one, and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column's data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.
http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf
Published version
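The shape of this optimization can be illustrated with a toy cost model (hypothetical and far simpler than Casper's): finer partitioning lets reads skip more data but adds per-partition overhead, while writes pay to rewrite the partition they land in; the optimizer would pick the boundaries minimizing total workload cost.

```python
import bisect

def workload_cost(boundaries, domain, reads, writes, overhead=1.0):
    """Toy cost of a range-partitioned column (a sketch, not Casper's model).

    boundaries: split keys inside domain = (lo, hi).
    reads: (lo, hi) range queries; each pays the size of every partition
           it overlaps plus a fixed per-partition overhead.
    writes: point keys; each pays the size of its partition (rewrite).
    Partition size here is its key-range width, a stand-in for row count
    under uniformly distributed data."""
    cuts = [domain[0]] + sorted(boundaries) + [domain[1]]
    sizes = [cuts[i + 1] - cuts[i] for i in range(len(cuts) - 1)]

    def part(key):
        return bisect.bisect_right(cuts, key) - 1

    cost = 0.0
    for lo, hi in reads:
        cost += sum(sizes[i] + overhead for i in range(part(lo), part(hi) + 1))
    for key in writes:
        cost += sizes[part(key)]
    return cost
```

Sweeping candidate boundary sets through such a model and keeping the cheapest is the essence of framing layout design as an optimization problem.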