132 research outputs found

    Discovering structure without labels

    The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction. The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space. The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success. Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. 
Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global structure to more discrete, local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both of them with few noise samples.

    The PACE 2022 Parameterized Algorithms and Computational Experiments Challenge: Directed Feedback Vertex Set

    The Parameterized Algorithms and Computational Experiments challenge (PACE) 2022 was devoted to engineering algorithms for the NP-hard Directed Feedback Vertex Set (DFVS) problem. The DFVS problem is to find a minimum subset $X \subseteq V$ in a given directed graph $G = (V,E)$ such that, when all vertices of $X$ and their adjacent edges are deleted from $G$, the remainder is acyclic. Overall, the challenge had 90 participants from 26 teams, 12 countries, and 3 continents who submitted their implementations to this year’s competition. In this report, we briefly describe the setup of the challenge, the selection of benchmark instances, as well as the ranking of the participating teams. We also briefly outline the approaches used in the submitted solvers.
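
The DFVS definition above can be checked mechanically. The following is a minimal sketch and is not taken from the report: `is_acyclic_after_removal` and `brute_force_dfvs` are hypothetical helper names, the graph is assumed to be a dictionary mapping each vertex to its out-neighbours, and the exhaustive search is only meant for tiny instances, not for challenge-sized inputs.

```python
from itertools import combinations


def is_acyclic_after_removal(graph, removed):
    """Return True if deleting `removed` from `graph` leaves no directed cycle.

    `graph` maps each vertex to an iterable of its out-neighbours.
    Acyclicity is tested with Kahn's algorithm on the remaining vertices.
    """
    removed = set(removed)
    remaining = [v for v in graph if v not in removed]
    indegree = {v: 0 for v in remaining}
    for u in remaining:
        for v in graph[u]:
            if v in indegree:
                indegree[v] += 1
    stack = [v for v in remaining if indegree[v] == 0]
    processed = 0
    while stack:
        u = stack.pop()
        processed += 1
        for v in graph[u]:
            if v in indegree:
                indegree[v] -= 1
                if indegree[v] == 0:
                    stack.append(v)
    # Every remaining vertex is processed exactly when no cycle is left.
    return processed == len(remaining)


def brute_force_dfvs(graph):
    """Try vertex subsets by increasing size and return the first feasible one."""
    vertices = list(graph)
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_acyclic_after_removal(graph, candidate):
                return set(candidate)


# Two cycles (1 -> 2 -> 3 -> 1 and 3 <-> 4) that share vertex 3.
example = {1: [2], 2: [3], 3: [1, 4], 4: [3]}
solution = brute_force_dfvs(example)
```

In this example, vertex 3 lies on both cycles, so deleting it alone leaves an acyclic remainder.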

    Bridge Girth: A Unifying Notion in Network Design

    A classic 1993 paper by Althöfer et al. proved a tight reduction from spanners, emulators, and distance oracles to the extremal function $\gamma$ of high-girth graphs. This paper initiated a large body of work in network design, in which problems are attacked by reduction to $\gamma$ or the analogous extremal function for other girth concepts. In this paper, we introduce and study a new girth concept that we call the bridge girth of path systems, and we show that it can be used to significantly expand and improve this web of connections between girth problems and network design. We prove two kinds of results: 1) We write the maximum possible size of an $n$-node, $p$-path system with bridge girth $> k$ as $\beta(n, p, k)$, and we write a certain variant for "ordered" path systems as $\beta^*(n, p, k)$. We identify several arguments in the literature that implicitly show upper or lower bounds on $\beta, \beta^*$, and we provide some polynomial improvements to these bounds. In particular, we construct a tight lower bound for $\beta(n, p, 2)$, and we polynomially improve the upper bounds for $\beta(n, p, 4)$ and $\beta^*(n, p, \infty)$. 2) We show that many state-of-the-art results in network design can be recovered or improved via black-box reductions to $\beta$ or $\beta^*$. Examples include bounds for distance/reachability preservers, exact hopsets, shortcut sets, the flow-cut gaps for directed multicut and sparsest cut, and an integrality gap for directed Steiner forest. We believe that the concept of bridge girth can lead to a stronger and more organized map of the research area. Towards this, we leave many open problems, related to both bridge girth reductions and extremal bounds on the size of path systems with high bridge girth.

    Solving Directed Feedback Vertex Set by Iterative Reduction to Vertex Cover

    In the Directed Feedback Vertex Set (DFVS) problem, one is given a directed graph G = (V,E) and wants to find a minimum cardinality set S ⊆ V such that G-S is acyclic. DFVS is a fundamental problem in computer science and finds applications in areas such as deadlock detection. The problem was the subject of the 2022 PACE coding challenge. We develop a novel exact algorithm for the problem that is tailored to perform well on instances that are mostly bi-directed. For such instances, we adapt techniques from the well-researched vertex cover problem. Our core idea is an iterative reduction to vertex cover. To this end, we also develop a new reduction rule that reduces the number of edges that are not bi-directed. With the resulting algorithm, we were able to win third place in the exact track of the PACE challenge. We perform computational experiments and compare the running time to other exact algorithms, in particular to the winning algorithm in PACE. Our experiments show that we outpace the other algorithms on instances that have a low density of uni-directed edges.
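
One way to see why vertex cover is a natural target for mostly bi-directed instances: if both (u,v) and (v,u) are edges, they form a 2-cycle, so every feedback vertex set must contain u or v. On a graph in which every edge is bi-directed, a minimum feedback vertex set is therefore exactly a minimum vertex cover of the underlying undirected graph; in general, the vertex cover of the bi-directed pairs only gives a lower bound. The sketch below illustrates just this observation. It is not the authors' algorithm or their reduction rule, and the names `bidirected_pairs` and `min_vertex_cover` as well as the adjacency-dictionary encoding are assumptions made for the example.

```python
from itertools import combinations


def bidirected_pairs(graph):
    """Collect unordered pairs {u, v} for which both u->v and v->u exist."""
    pairs = set()
    for u, successors in graph.items():
        for v in successors:
            if u != v and u in graph.get(v, ()):
                pairs.add(frozenset((u, v)))
    return pairs


def min_vertex_cover(pairs):
    """Exhaustively find a smallest vertex set touching every pair (tiny inputs only)."""
    vertices = sorted(set().union(*pairs)) if pairs else []
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(pair & chosen for pair in pairs):
                return chosen


# A fully bi-directed path a <-> b <-> c <-> d.
example = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
cover = min_vertex_cover(bidirected_pairs(example))
```

Here the cover has two vertices, and deleting them leaves no edges at all, so it is also a feedback vertex set for this instance.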

    Tree Drawings with Columns

    Our goal is to visualize an additional data dimension of a tree with multifaceted data through superimposition on vertical strips, which we call columns. Specifically, we extend upward drawings of unordered rooted trees where vertices have assigned heights by mapping each vertex to a column. Under an orthogonal drawing style and with every subtree within a column drawn planar, we consider different natural variants concerning the arrangement of subtrees within a column. We show that minimizing the number of crossings in such a drawing can be achieved in fixed-parameter tractable (FPT) time in the maximum vertex degree $\Delta$ for the most restrictive variant, while becoming NP-hard (even to approximate) already for a slightly relaxed variant. However, we provide an FPT algorithm in the number of crossings plus $\Delta$, and an FPT-approximation algorithm in $\Delta$ via a reduction to feedback arc set. Comment: Appears in the Proceedings of the 31st International Symposium on Graph Drawing and Network Visualization (GD 2023)

    Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization

    Data clustering is an unsupervised machine learning task whose objective is to group together similar items. As a versatile data mining tool, data clustering has numerous applications, such as object detection and localization using data from 3D laser-based sensors, finding popular routes using geolocation data, and finding similar patterns of electricity consumption using smart meters. The datasets in modern IoT-based applications are getting more and more challenging for conventional clustering schemes. Big Data is a term used to loosely describe hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, large numbers of dimensions, high skewness, and distributed data sources are aspects that challenge classical data processing schemes, including clustering methods. This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data. Regarding distributed clustering, the thesis proposes MAD-C, abbreviating Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time, compared to a baseline that centrally gathers and clusters the data. The thesis shows MAD-C can be used to detect and localize objects using data from distributed 3D laser-based sensors with high accuracy. Furthermore, the work in the thesis shows how to utilize MAD-C to efficiently detect the objects within a restricted area for geofencing purposes. Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, abbreviating Parallel Multistage Approximate Cluster Combining. Using approximation-based data synopsis, PARMA-CC algorithms achieve scalability on multi-core systems by facilitating parallel execution of threads with limited dependencies, which get resolved using fine-grained synchronization techniques. To further enhance the efficiency, PARMA-CC algorithms can be configured with respect to different data properties. Analytical and empirical evaluations show PARMA-CC algorithms achieve significantly higher scalability than the state-of-the-art methods while preserving a high accuracy. On parallel high dimensional clustering, the thesis proposes IP.LSH.DBSCAN, abbreviating Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the process of creating an LSH index into the process of data clustering, and it takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show IP.LSH.DBSCAN facilitates parallel density-based clustering of massive datasets using desired distance measures, resulting in several orders of magnitude lower latency than the state of the art for high dimensional data. In essence, the thesis proposes methods and algorithmic implementations targeting the problem of big data clustering and applications using distributed and parallel processing. The proposed methods (available as open source software) are extensible and can be used in combination with other methods.

    Networks, Communication, and Computing Vol. 2

    Networks, communications, and computing have become ubiquitous and inseparable parts of everyday life. This book is based on a Special Issue of the Algorithms journal, and it is devoted to the exploration of the many-faceted relationship of networks, communications, and computing. The included papers explore the current state-of-the-art research in these areas, with a particular interest in the interactions among the fields.

    Query Complexity of Inversion Minimization on Trees

    We consider the following computational problem: Given a rooted tree and a ranking of its leaves, what is the minimum number of inversions of the leaves that can be attained by ordering the tree? This variation of the problem of counting inversions in arrays originated in mathematical psychology, with the evaluation of the Mann-Whitney statistic for detecting differences between distributions as a special case. We study the complexity of the problem in the comparison-query model, used for problems like sorting and selection. For many types of trees with $n$ leaves, we establish lower bounds close to the strongest known in the model, namely the lower bound of $\log_2(n!)$ for sorting $n$ items. We show: (a) $\log_2((\alpha(1-\alpha)n)!) - O(\log n)$ queries are needed whenever the tree has a subtree that contains a fraction $\alpha$ of the leaves. This implies a lower bound of $\log_2((\frac{k}{(k+1)^2}n)!) - O(\log n)$ for trees of degree $k$. (b) $\log_2(n!) - O(\log n)$ queries are needed in case the tree is binary. (c) $\log_2(n!) - O(k \log k)$ queries are needed for certain classes of trees of degree $k$, including perfect trees with even $k$. The lower bounds are obtained by developing two novel techniques for a generic problem $\Pi$ in the comparison-query model and applying them to inversion minimization on trees. Both techniques can be described in terms of the Cayley graph of the symmetric group with adjacent-rank transpositions as the generating set. Consider the subgraph consisting of the edges between vertices with the same value under $\Pi$. We show that the size of any decision tree for $\Pi$ must be at least: (i) the number of connected components of this subgraph, and (ii) the factorial of the average degree of the complementary subgraph, divided by $n$. Lower bounds on query complexity then follow by taking the base-2 logarithm. Comment: 54 pages, 18 figures, full version of paper appearing in the Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms
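
To make the problem statement concrete, here is a small brute-force sketch. It is an illustration only and is unrelated to the paper's lower-bound techniques. It assumes a hypothetical nested-list encoding in which an integer is a leaf carrying its rank and a list is an internal node whose children may be reordered freely; the function names are likewise made up for the example. The sketch relies on the fact that the number of inversions between two sibling subtrees depends only on which of the two comes first, not on how each subtree is ordered internally.

```python
from itertools import permutations


def leaves(tree):
    """Return the leaf ranks of `tree` (an int leaf or a list of subtrees)."""
    if isinstance(tree, int):
        return [tree]
    return [rank for child in tree for rank in leaves(child)]


def cross_inversions(left, right):
    """Count pairs (a, b) with a in `left`, b in `right`, and a > b."""
    return sum(1 for a in left for b in right if a > b)


def min_inversions(tree):
    """Minimum number of leaf inversions over all orderings of the tree."""
    if isinstance(tree, int):
        return 0
    # Inversions inside each child subtree are minimized independently.
    inside = sum(min_inversions(child) for child in tree)
    child_leaves = [leaves(child) for child in tree]
    k = len(child_leaves)
    # cost[i][j]: inversions between children i and j when i is placed before j.
    cost = [
        [cross_inversions(child_leaves[i], child_leaves[j]) for j in range(k)]
        for i in range(k)
    ]
    # Try every order of the children (exponential in the degree k).
    best = min(
        sum(cost[order[a]][order[b]] for a in range(k) for b in range(a + 1, k))
        for order in permutations(range(k))
    )
    return inside + best


# Binary tree with leaf ranks 3, 1 | 2, 4: the best ordering is 1, 3, 2, 4.
result = min_inversions([[3, 1], [2, 4]])
```

For a binary tree, each internal node only has two orders to compare, so the permutation step reduces to taking the smaller of the two cross-inversion counts.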