Discovering structure without labels
The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction.
The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space.
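Mod Shift's exact attraction/repulsion loss is not given in the abstract, but the shared mechanic with Mean Shift, iteratively moving points until they collapse into crisp clusters, can be sketched as follows. This is a minimal, illustrative Mean Shift step with a Gaussian kernel; the function name and bandwidth parameter are assumptions, and Mod Shift's distance-based repulsion is not modeled here.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, steps=50):
    """Repeatedly move every point toward the Gaussian-weighted mean
    of the original data; well-separated groups collapse into crisp
    clusters, as in the iterative scheme the abstract describes."""
    data = points.astype(float)
    x = data.copy()
    for _ in range(steps):
        # pairwise squared distances from current positions to the data
        d2 = ((x[:, None, :] - data[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))  # Gaussian kernel weights
        x = (w @ data) / w.sum(axis=1, keepdims=True)
    return x
```

On two well-separated blobs, all points of a blob converge to that blob's mode, so cluster assignments can be read off from the final positions.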
The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success.
Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions of negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain why UMAP embeddings appear more compact than t-SNE plots: there is increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global structure to more discrete, local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both with few noise samples.
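The negative-sampling objective discussed above can be illustrated with a short sketch. This is not the thesis's exact loss; the Cauchy low-dimensional similarity and the helper name are assumptions, but the structure, one attractive term per neighbor edge plus repulsive terms for randomly sampled non-neighbors, is the standard UMAP-style form, and the number of negatives controls the attraction/repulsion balance.

```python
import numpy as np

def neg_sampling_loss(emb, edges, n_neg=5, rng=None):
    """Per-edge negative-sampling loss on a neighbor graph.

    For each edge (i, j), the low-dimensional similarity
    q = 1 / (1 + ||y_i - y_j||^2) is pulled toward 1; for randomly
    sampled points it is pushed toward 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    loss = 0.0
    for i, j in edges:
        q = 1.0 / (1.0 + np.sum((emb[i] - emb[j]) ** 2))  # Cauchy similarity
        loss -= np.log(q)  # attraction: pull neighbors together
        for k in rng.integers(len(emb), size=n_neg):
            q_neg = 1.0 / (1.0 + np.sum((emb[i] - emb[k]) ** 2))
            loss -= np.log(1.0 - q_neg + 1e-12)  # repulsion from samples
    return loss
```

Setting n_neg = 0 leaves only attraction; raising it strengthens repulsion, mirroring the attraction-repulsion spectrum described in the abstract.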
The PACE 2022 Parameterized Algorithms and Computational Experiments Challenge: Directed Feedback Vertex Set
The Parameterized Algorithms and Computational Experiments challenge (PACE) 2022 was devoted to engineering algorithms for the NP-hard Directed Feedback Vertex Set (DFVS) problem. The DFVS problem is to find a minimum subset S of vertices in a given directed graph G such that, when all vertices of S and their adjacent edges are deleted from G, the remainder is acyclic.
Overall, the challenge had 90 participants from 26 teams, 12 countries, and 3 continents that submitted their implementations to this year's competition. In this report, we briefly describe the setup of the challenge, the selection of benchmark instances, as well as the ranking of the participating teams. We also briefly outline the approaches used in the submitted solvers.
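For readers unfamiliar with the problem, the DFVS definition can be made concrete with a tiny, exhaustive sketch (illustrative names; usable only on toy instances, nothing like the engineered solvers the challenge produced):

```python
from itertools import combinations

def is_acyclic(n, edges, removed=frozenset()):
    """Kahn-style check that the graph minus `removed` has no directed cycle."""
    adj = {v: [] for v in range(n) if v not in removed}
    indeg = {v: 0 for v in adj}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].append(v)
            indeg[v] += 1
    queue = [v for v in adj if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(adj)  # all vertices peeled off => acyclic

def min_dfvs(n, edges):
    """Smallest S with G - S acyclic, by brute force over all subsets."""
    for size in range(n + 1):
        for S in combinations(range(n), size):
            if is_acyclic(n, edges, frozenset(S)):
                return set(S)
```

A directed triangle needs one deleted vertex; an already acyclic graph needs none.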
Bridge Girth: A Unifying Notion in Network Design
A classic 1993 paper by Althöfer et al. proved a tight reduction from spanners, emulators, and distance oracles to the extremal function of high-girth graphs. This paper initiated a large body of work in network design, in which problems are attacked by reduction to this extremal function or to the analogous extremal function for other girth concepts. In this paper, we introduce and study a new girth concept that we call the bridge girth of path systems, and we show that it can be used to significantly expand and improve this web of connections between girth problems and network design. We prove two kinds of results:
1) We write the maximum possible size of an n-node, p-path system with bridge girth > k as β(n, p, k), and we write a certain variant for "ordered" path systems as β*(n, p, k). We identify several arguments in
the literature that implicitly show upper or lower bounds on β and β*, and we provide some polynomial improvements to these bounds. In particular, we construct a tight lower bound for β(n, p, 4), and we polynomially improve the upper bounds for β(n, p, 6) and β*(n, p, ∞).
2) We show that many state-of-the-art results in network design can be recovered or improved via black-box reductions to β or β*. Examples include bounds for distance/reachability preservers, exact hopsets, shortcut sets, the flow-cut gaps for directed multicut and sparsest cut, and an integrality gap for directed Steiner forest.
We believe that the concept of bridge girth can lead to a stronger and more organized map of the research area. Towards this end, we leave many open problems, related to both bridge girth reductions and extremal bounds on the size of path systems with high bridge girth.
Solving Directed Feedback Vertex Set by Iterative Reduction to Vertex Cover
In the Directed Feedback Vertex Set (DFVS) problem, one is given a directed graph G = (V,E) and wants to find a minimum cardinality set S ⊆ V such that G - S is acyclic. DFVS is a fundamental problem in computer science and finds applications in areas such as deadlock detection. The problem was the subject of the 2022 PACE coding challenge. We develop a novel exact algorithm for the problem that is tailored to perform well on instances that are mostly bi-directed. For such instances, we adapt techniques from the well-researched vertex cover problem. Our core idea is an iterative reduction to vertex cover. To this end, we also develop a new reduction rule that reduces the number of non-bi-directed edges. With the resulting algorithm, we were able to win third place in the exact track of the PACE challenge. We perform computational experiments and compare the running time to other exact algorithms, in particular to the winning algorithm of PACE. Our experiments show that we outpace the other algorithms on instances that have a low density of uni-directed edges.
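The core observation behind such a reduction can be sketched: every bi-directed edge forms a 2-cycle, so any feedback vertex set must contain one of its endpoints, and on a fully bi-directed instance DFVS coincides with vertex cover on the underlying undirected graph. The toy, exhaustive version below is illustrative only (names assumed); the paper's actual algorithm is far more involved.

```python
from itertools import combinations

def bidirected_edges(edges):
    """Undirected pairs {u, v} present in both directions; each such
    2-cycle forces u or v into any feedback vertex set."""
    es = set(edges)
    return {frozenset((u, v)) for (u, v) in es if (v, u) in es and u != v}

def min_vertex_cover(vertices, und_edges):
    """Exhaustive minimum vertex cover (toy instances only). On a fully
    bi-directed graph this already is an optimal DFVS."""
    for size in range(len(vertices) + 1):
        for C in combinations(sorted(vertices), size):
            cs = set(C)
            if all(cs & e for e in und_edges):  # every edge covered
                return cs
```

On the path of 2-cycles 0 <-> 1 <-> 2, the middle vertex alone covers both bi-directed edges, and deleting it indeed makes the digraph acyclic.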
Tree Drawings with Columns
Our goal is to visualize an additional data dimension of a tree with
multifaceted data through superimposition on vertical strips, which we call
columns. Specifically, we extend upward drawings of unordered rooted trees
where vertices have assigned heights by mapping each vertex to a column. Under
an orthogonal drawing style and with every subtree within a column drawn
planar, we consider different natural variants concerning the arrangement of
subtrees within a column. We show that minimizing the number of crossings in
such a drawing can be achieved in fixed-parameter tractable (FPT) time in the
maximum vertex degree for the most restrictive variant, while becoming
NP-hard (even to approximate) already for a slightly relaxed variant. However, we provide an FPT algorithm in the number of crossings plus the maximum vertex degree, and an
FPT-approximation algorithm via a reduction to feedback arc set.
Comment: Appears in the Proceedings of the 31st International Symposium on Graph Drawing and Network Visualization (GD 2023).
Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization
Data clustering is an unsupervised machine learning task whose objective is to group together similar items. As a versatile data mining tool, data clustering has numerous applications, such as object detection and localization using data from 3D laser-based sensors, finding popular routes using geolocation data, and finding similar patterns of electricity consumption using smart meters.

The datasets in modern IoT-based applications are getting more and more challenging for conventional clustering schemes. Big Data is a term used to loosely describe hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, large numbers of dimensions, high skewness, and distributed data sources are aspects that challenge classical data processing schemes, including clustering methods. This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data.

Regarding distributed clustering, the thesis proposes MAD-C, abbreviating Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time, compared to a baseline that centrally gathers and clusters the data. The thesis shows MAD-C can be used to detect and localize objects with high accuracy using data from distributed 3D laser-based sensors. Furthermore, the work in the thesis shows how to utilize MAD-C to efficiently detect objects within a restricted area for geofencing purposes.

Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, abbreviating Parallel Multistage Approximate Cluster Combining. Using approximation-based data synopses, PARMA-CC algorithms achieve scalability on multi-core systems by facilitating parallel execution of threads with limited dependencies, which are resolved using fine-grained synchronization techniques. To further enhance efficiency, PARMA-CC algorithms can be configured with respect to different data properties. Analytical and empirical evaluations show that PARMA-CC algorithms achieve significantly higher scalability than state-of-the-art methods while preserving high accuracy.

On parallel high-dimensional clustering, the thesis proposes IP.LSH.DBSCAN, abbreviating Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the creation of an LSH index into the clustering process itself, taking advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show that IP.LSH.DBSCAN facilitates parallel density-based clustering of massive datasets under desired distance measures, with several orders of magnitude lower latency than the state of the art for high-dimensional data.

In essence, the thesis proposes methods and algorithmic implementations targeting the problem of big data clustering and its applications using distributed and parallel processing. The proposed methods (available as open-source software) are extensible and can be used in combination with other methods.
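The synopsis-then-combine pattern described above can be sketched generically. This is not MAD-C's actual synopsis; using local k-means centroids with counts as the synopsis and a greedy distance-based merge at the central node are illustrative assumptions, chosen only to show why shipping synopses instead of raw points saves bandwidth.

```python
import numpy as np

def local_synopsis(points, k=2, iters=20, seed=0):
    """Cheap local k-means on one node; only (centroid, count) pairs
    are shipped upstream instead of the raw points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers, np.bincount(labels, minlength=k)

def combine(synopses, merge_dist=1.0):
    """Central greedy merge: centroids closer than merge_dist are fused
    by count-weighted averaging."""
    merged = []  # list of [centroid, count]
    for centers, counts in synopses:
        for c, n in zip(centers, counts):
            if n == 0:
                continue
            for m in merged:
                if np.linalg.norm(m[0] - c) < merge_dist:
                    tot = m[1] + n
                    m[0] = (m[0] * m[1] + c * n) / tot
                    m[1] = tot
                    break
            else:
                merged.append([c.astype(float), int(n)])
    return merged
```

Two nodes each observing the same two well-separated groups send four centroids in total, and the central merge recovers two global clusters covering all points.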
Networks, Communication, and Computing Vol. 2
Networks, communications, and computing have become ubiquitous and inseparable parts of everyday life. This book is based on a Special Issue of the Algorithms journal devoted to exploring the many-faceted relationship of networks, communications, and computing. The included papers explore the current state-of-the-art research in these areas, with a particular interest in the interactions among the fields.
Query Complexity of Inversion Minimization on Trees
We consider the following computational problem: Given a rooted tree and a
ranking of its leaves, what is the minimum number of inversions of the leaves
that can be attained by ordering the tree? This variation of the problem of
counting inversions in arrays originated in mathematical psychology, with the
evaluation of the Mann–Whitney statistic for detecting differences between
distributions as a special case.
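The problem statement above can be made concrete with a small brute-force solver (illustrative names; it tries every child permutation at every node, so it is exponential in the maximum degree and only meant for tiny trees):

```python
from bisect import bisect_right
from itertools import permutations

def min_inversions(children, rank, root):
    """Minimum number of leaf inversions over all orderings of each
    node's children; `children` maps a node to its child list and
    `rank` gives each leaf's position in the given ranking."""
    def solve(v):
        kids = children.get(v, [])
        if not kids:
            return [rank[v]], 0          # a leaf contributes no inversions
        sub = [solve(c) for c in kids]   # (sorted leaf ranks, cost) per child
        inner = sum(cost for _, cost in sub)
        best = None
        for order in permutations(range(len(sub))):
            cross, placed = 0, []
            for i in order:
                ranks = sub[i][0]
                # pairs (earlier leaf, later leaf) with the earlier rank larger
                cross += sum(len(placed) - bisect_right(placed, r) for r in ranks)
                placed = sorted(placed + ranks)
            best = cross if best is None else min(best, cross)
        merged = sorted(r for ranks, _ in sub for r in ranks)
        return merged, inner + best
    return solve(root)[1]
```

For a root whose children are a subtree with leaf ranks {1, 3} and a single leaf of rank 2, either child order leaves exactly one inversion, so the minimum is 1.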
We study the complexity of the problem in the comparison-query model, used for problems like sorting and selection. For many types of trees with n leaves, we establish lower bounds close to the strongest known in the model, namely the Ω(n log n) lower bound for sorting n items. We show:
(a) near-maximal numbers of queries are needed whenever the tree has a subtree that contains a constant fraction of the leaves; this implies a corresponding lower bound for trees of bounded degree.
(b) a lower bound of the same order holds in case the tree is binary.
(c) such bounds also hold for certain classes of trees of bounded degree, including perfect trees with even degree.
The lower bounds are obtained by developing two novel techniques for a generic problem in the comparison-query model and applying them to inversion minimization on trees. Both techniques can be described in terms of the Cayley graph of the symmetric group with adjacent-rank transpositions as the generating set. Consider the subgraph consisting of the edges between vertices on which the problem takes the same value. We show that the size of any decision tree for the problem must be at least:
(i) the number of connected components of this subgraph, and
(ii) the factorial of the average degree of the complementary subgraph, up to a normalizing factor.
Lower bounds on query complexity then follow by taking the base-2 logarithm.
Comment: 54 pages, 18 figures, full version of paper appearing in the Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms.