Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

Authors: Munteanu Alexander; Schwiegelshohn Chris
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

We present a technical survey of state-of-the-art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower bounding techniques.

Archivio della ricerca- Università di Roma La Sapienza

The Power of Migration for Online Slack Scheduling

Authors: Schwiegelshohn Chris; Schwiegelshohn Uwe
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th Annual European Symposium on Algorithms (ESA 2016)
Publication date: 01/01/2016

We investigate the power of migration in online scheduling for parallel identical machines. Our objective is to maximize the total processing time of accepted jobs. Once we decide to accept a job, we have to complete it before its deadline d that satisfies d >= (1+epsilon)p + r, where p is the processing time, r the submission time and the slack epsilon > 0 a system parameter. Typically, the hard case arises for small slack epsilon << 1, i.e., for near-tight deadlines. Without migration, a greedy acceptance policy is known to be an optimal deterministic online algorithm with a competitive factor of (1+epsilon)/epsilon (DasGupta and Palis, APPROX 2000). Our first contribution is to show that migrations do not improve the competitive ratio of the greedy acceptance policy, i.e., the competitive ratio remains (1+epsilon)/epsilon for any number of machines. Our main contribution is a deterministic online algorithm with almost tight competitive ratio on any number of machines. For a single machine, the competitive factor matches the optimal bound of (1+epsilon)/epsilon of the greedy acceptance policy. The competitive ratio improves with an increasing number of machines. It approaches (1+epsilon) ln((1+epsilon)/epsilon) as the number of machines tends to infinity. This is an exponential improvement over the greedy acceptance policy for small epsilon. Moreover, we show a matching lower bound on the competitive ratio for deterministic algorithms on any number of machines.
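To make the gap between the two bounds in this abstract concrete, the following Python sketch evaluates both formulas for a few slack values. It only computes the stated expressions; the function names are hypothetical, and it does not implement the scheduling algorithm itself.

```python
import math


def greedy_ratio(eps: float) -> float:
    """Competitive factor (1 + eps) / eps of the greedy acceptance policy."""
    if eps <= 0:
        raise ValueError("slack eps must be positive")
    return (1 + eps) / eps


def limiting_ratio(eps: float) -> float:
    """Limit (1 + eps) * ln((1 + eps) / eps) for a growing number of machines."""
    if eps <= 0:
        raise ValueError("slack eps must be positive")
    return (1 + eps) * math.log((1 + eps) / eps)


def meets_slack(p: float, r: float, d: float, eps: float) -> bool:
    """Check the deadline condition d >= (1 + eps) * p + r from the abstract."""
    return d >= (1 + eps) * p + r


if __name__ == "__main__":
    # Compare the two bounds as the slack shrinks.
    for eps in (0.5, 0.1, 0.01):
        print(f"eps={eps}: greedy={greedy_ratio(eps):.2f}, "
              f"limit={limiting_ratio(eps):.2f}")
```

With x = (1+epsilon)/epsilon, the greedy bound is x and the limiting bound is (1+epsilon) ln x, which is why the abstract describes the improvement as exponential for small epsilon.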

Dagstuhl Research Online Publication Server

Archivio della ricerca- Università di Roma La Sapienza

On Finding the Jaccard Center

Authors: Bury Marc; Schwiegelshohn Chris
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017)
Publication date: 01/01/2017

We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X, Y, the Jaccard index is defined as |X ∩ Y|/|X ∪ Y| and the corresponding distance is 1 - |X ∩ Y|/|X ∪ Y|. The Jaccard center is a set C minimizing the maximum distance to any set of N. We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP. Furthermore, we show that the problem is fixed-parameter tractable in the maximum Hamming norm between Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem. In addition, we also show that, contrary to the previously studied median problem by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial-time algorithm.
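The definitions in this abstract can be illustrated with a short Python sketch. It computes the Jaccard distance and finds a Jaccard center by exhaustive search over subsets of the union of the input sets. This brute-force search takes exponential time and is only an illustration of the objective, not the PTAS or the fixed-parameter algorithm from the paper; the function names and the convention for two empty sets are assumptions.

```python
from itertools import combinations


def jaccard_distance(x: frozenset, y: frozenset) -> float:
    """Return 1 - |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    if not union:
        return 0.0  # assumed convention: two empty sets are at distance 0
    return 1 - len(x & y) / len(union)


def center_cost(c: frozenset, collection) -> float:
    """Maximum Jaccard distance from candidate center c to any input set."""
    return max(jaccard_distance(c, x) for x in collection)


def brute_force_center(collection):
    """Try every subset of the union of a non-empty collection of sets.

    Elements outside the union can only enlarge |X ∪ C|, so they never help.
    """
    universe = sorted(set().union(*collection))
    best = frozenset()
    best_cost = center_cost(best, collection)
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            c = frozenset(candidate)
            cost = center_cost(c, collection)
            if cost < best_cost:
                best, best_cost = c, cost
    return best, best_cost


if __name__ == "__main__":
    sets = [frozenset({1, 2}), frozenset({2, 3})]
    print(brute_force_center(sets))
```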

Dagstuhl Research Online Publication Server

Archivio della ricerca- Università di Roma La Sapienza

Logistic Regression in Datastreams

Authors: Schwiegelshohn Chris; Sohler Christian
Publication date: 01/01/2014

Learning from data streams is a well-researched task both in theory and practice. As remarked by Clarkson, Hazan and Woodruff, many classification problems cannot be solved very well in a streaming setting. For previous model assumptions, there exist simple, yet highly artificial lower bounds prohibiting space-efficient one-pass algorithms. At the same time, several classification algorithms are often successfully used in practice. To overcome this gap, we give a model relaxing the constraints that previously made classification impossible from a theoretical point of view and, under these model assumptions, provide the first (1 + epsilon)-approximate algorithms for sketching the objective values of logistic regression and perceptron classifiers in data streams.

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung

An Empirical Evaluation of k-Means Coresets

Authors: Schwiegelshohn Chris; Sheikh-Omar Omar Ali
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual European Symposium on Algorithms (ESA 2022)
Publication date: 01/01/2022

Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work comparing the quality of available k-means coresets. In this paper we perform such an evaluation. No algorithm is currently known for measuring the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
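As a rough illustration of what a heuristic evaluation of a coreset might look like, the Python sketch below compares the k-means cost of one fixed candidate solution on the full data set with its cost on a weighted coreset. The definition of distortion used here (the larger of the two cost ratios, for a single solution) is an assumption made for this example and is not confirmed by the abstract; the true distortion would range over all solutions, which is what makes it hard to measure. All names are hypothetical.

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for point, weight in zip(points, weights):
        nearest = min(
            sum((a - b) ** 2 for a, b in zip(point, center))
            for center in centers
        )
        total += weight * nearest
    return total


def distortion_for_solution(points, coreset, weights, centers):
    """Assumed measure: max(cost_P / cost_S, cost_S / cost_P) for one solution."""
    full = kmeans_cost(points, centers)
    approx = kmeans_cost(coreset, centers, weights)
    if full == 0 or approx == 0:
        # Costs agree only if both are zero; otherwise the ratio is unbounded.
        return 1.0 if full == approx else float("inf")
    return max(full / approx, approx / full)


if __name__ == "__main__":
    data = [(0, 0), (2, 0), (10, 0), (12, 0)]
    candidate_coreset = [(1, 0), (11, 0)]  # hypothetical two-point summary
    candidate_weights = [2.0, 2.0]
    solution = [(0, 0), (10, 0)]
    print(distortion_for_solution(data, candidate_coreset,
                                  candidate_weights, solution))
```

A value close to 1 for many candidate solutions suggests, but does not prove, that the coreset preserves the k-means cost well.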

Dagstuhl Research Online Publication Server

On algorithms for large-scale graph and clustering problems

Author: Schwiegelshohn Chris

This thesis deals with algorithmic methods of modern data analysis. It mainly addresses two overarching topics: data stream algorithms with compression properties and approximation algorithms for clustering. Data stream algorithms process a data set sequentially and aim to determine properties of the data set (approximately) without storing the entire data set. Clustering refers to partitioning a data set into groups. The first problem presented concerns matching in graphs. Here the data set consists of a sequence of edge insertions and deletions. The task is to determine the size of a maximum matching as accurately as possible. An algorithm is presented that, under the assumption that the matching has size at most k, determines the exact size using k² units of space. This algorithm can further be used to compute a constant-factor approximation of the matching size in planar graphs. In addition, lower bounds on the required space are established and a reduction from weighted matching to unweighted matching is given. Next, data stream algorithms for similarity search are considered, where the task is to find, among n given sets, the pairs with high similarity in nearly linear time. The similarity measure for two sets A and B is the Jaccard index |A ∩ B|/|A ∪ B|. The thesis describes a data structure that achieves this for the first time in dynamic data streams with low space usage. It uses random numbers with only 2-wise independence, which permits a very efficient implementation. The third problem lies at the interface between the two topics of this thesis and concerns the k-center clustering problem in data streams with a sliding window. The task is to find k centers such that the maximum distance from any point to its nearest center is minimized. The results are a 6-approximation algorithm for arbitrary k and an optimal 4-approximation algorithm for k = 2. The techniques developed also apply to the diameter problem and yield an optimal algorithm for it. After that, clustering problems with respect to the Jaccard distance are analyzed. Again, a collection N of subsets of a universe U is given, and the task is to find a subset C that minimizes the maximum over X in N of 1 - |X ∩ C|/|X ∪ C|. It is shown that although solving the problem exactly is NP-hard, it admits a PTAS. Finally, the widely used local search heuristic for k-median and k-means clustering is studied. Although these problems are hard in general to solve exactly or even approximately, they are considered relatively tractable in practice, which suggests that the hardness results rely on pathological inputs. Because of this discrepancy, there have been attempts in the past to characterize practically relevant data sets. For three of the most important characterizations, the behavior of a local search heuristic is examined, with the result that local search finds optimal or nearly optimal clusters in these cases.

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung