Net and Prune: A Linear Time Algorithm for Euclidean Distance Problems
We provide a general framework for obtaining expected linear time constant
factor approximations (and in many cases FPTASs) to several well-known
problems in Computational Geometry, such as k-center clustering and farthest
nearest neighbor. The new approach is robust to variations in the input
problem, and yet it is simple, elegant, and practical. In particular, many of
these well-studied problems, which fit easily into our framework, either
previously had no linear time approximation algorithm or required rather
involved algorithms and analysis. A short list of the problems we consider
includes farthest nearest neighbor, k-center clustering, smallest disk
enclosing k points, k-th largest distance, k-th smallest m-nearest
neighbor distance, k-th heaviest edge in the MST and other spanning forest
type problems, problems involving upward closed set systems, and more. Finally,
we show how to extend our framework such that the linear running time bound
holds with high probability.
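To make one of the listed problems concrete, the following is a naive quadratic-time baseline for farthest nearest neighbor (for each point, find the distance to its nearest neighbor, then take the maximum); the framework in the paper targets expected linear time constant factor approximations of problems like this one. A minimal brute-force sketch, not the paper's algorithm:

```python
import math

def farthest_nearest_neighbor(points):
    """Return max over points p of the distance from p to its nearest neighbor.

    Naive O(n^2) baseline; the paper's net-and-prune framework gives expected
    linear time constant factor approximations for this kind of problem.
    """
    worst = 0.0
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        worst = max(worst, nearest)
    return worst

pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(farthest_nearest_neighbor(pts))  # dominated by the isolated point (5, 5)
```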
Efficient Computation of Multiple Density-Based Clustering Hierarchies
HDBSCAN*, a state-of-the-art density-based hierarchical clustering method,
produces a hierarchical organization of clusters in a dataset w.r.t. a
parameter mpts. While the performance of HDBSCAN* is robust w.r.t. mpts in the
sense that a small change in mpts typically leads to only a small or no change
in the clustering structure, choosing a "good" mpts value can be challenging:
depending on the data distribution, a high or low value for mpts may be more
appropriate, and certain data clusters may reveal themselves at different
values of mpts. To explore results for a range of mpts values, however, one has
to run HDBSCAN* for each value in the range independently, which is
computationally inefficient. In this paper, we propose an efficient approach to
compute all HDBSCAN* hierarchies for a range of mpts values by replacing the
graph used by HDBSCAN* with a much smaller graph that is guaranteed to contain
the required information. An extensive experimental evaluation shows that with
our approach one can obtain over one hundred hierarchies for the computational
cost equivalent to running HDBSCAN* about 2 times.
Comment: A short version of this paper appears at IEEE ICDM 2017.
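The graph HDBSCAN* builds is weighted by the mutual reachability distance, which depends on mpts through the core distance; recomputing this graph from scratch for every mpts value is the cost the paper's smaller replacement graph avoids. A minimal sketch of these two quantities, assuming the common convention that the core distance is the distance to the mpts-th nearest neighbor counting the point itself:

```python
import math

def core_distance(points, i, mpts):
    """Distance from points[i] to its mpts-th nearest neighbor (self included)."""
    dists = sorted(math.dist(points[i], q) for q in points)
    return dists[mpts - 1]  # dists[0] == 0.0 is the point itself

def mutual_reachability(points, i, j, mpts):
    """Edge weight of HDBSCAN*'s mutual reachability graph for one mpts."""
    return max(core_distance(points, i, mpts),
               core_distance(points, j, mpts),
               math.dist(points[i], points[j]))

pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
# The same edge gets heavier as mpts grows, reshaping the whole graph:
print(mutual_reachability(pts, 0, 1, mpts=2))  # distance-dominated
print(mutual_reachability(pts, 0, 1, mpts=3))  # core-distance-dominated
```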
Screening Rules for Convex Problems
We propose a new framework for deriving screening rules for convex
optimization problems. Our approach covers a large class of constrained and
penalized optimization formulations, and works in two steps. First, given any
approximate point, the structure of the objective function and the duality gap
is used to gather information on the optimal solution. In the second step, this
information is used to produce screening rules, i.e. safely identifying
unimportant weight variables of the optimal solution. Our general framework
leads to a large variety of useful existing as well as new screening rules for
many applications. For example, we provide new screening rules for general
simplex- and ℓ1-constrained problems, Elastic Net, squared-loss Support
Vector Machines, minimum enclosing ball, as well as structured norm regularized
problems, such as group lasso.
Design and analysis of algorithms for similarity search based on intrinsic dimension
One of the most fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, is that of similarity search. It has been used in numerous fields of application such as multimedia, information retrieval, recommender systems and pattern recognition. Specifically, a similarity query aims to retrieve from the database the most similar objects to a query object, where the underlying similarity measure is usually expressed as a distance function.
The cost of processing similarity queries has been typically assessed in terms of the representational dimension of the data involved, that is, the number of features used to represent individual data objects. It is generally the case that high representational dimension would result in a significant increase in the processing cost of similarity queries. This relation is often attributed to an amalgamation of phenomena, collectively referred to as the curse of dimensionality. However, the observed effects of dimensionality in practice may not be as severe as expected. This has led to the development of models quantifying the complexity of data in terms of some measure of the intrinsic dimensionality.
The generalized expansion dimension (GED) is one such model; it estimates the intrinsic dimension in the vicinity of a query point q through the observation of the ranks and distances of pairs of neighbors with respect to q. This dissertation is mainly concerned with the design and analysis of search algorithms based on the GED model. In particular, three variants of the similarity search problem are considered: adaptive similarity search, flexible aggregate similarity search, and subspace similarity search. The good practical performance of the proposed algorithms demonstrates the effectiveness of dimensionality-driven design of search algorithms.
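The rank-and-distance flavor of such an estimate can be illustrated with a small sketch. The exact estimator used in the dissertation may differ; this follows the common expansion-dimension form dim ≈ ln(k2/k1) / ln(r2/r1) for two neighbors of q at ranks k1 < k2 with distances r1 < r2 (function name and argument names here are illustrative, not taken from the dissertation):

```python
import math

def ged_estimate(k1, r1, k2, r2):
    """Local dimension estimate from two (rank, distance) neighbor pairs.

    Intuition: in a uniform d-dimensional space the neighborhood "volume"
    (rank) grows like radius**d, so d ≈ ln(k2/k1) / ln(r2/r1).
    """
    if not (0 < k1 < k2 and 0 < r1 < r2):
        raise ValueError("require 0 < k1 < k2 and 0 < r1 < r2")
    return math.log(k2 / k1) / math.log(r2 / r1)

# Points spread in the plane: quadrupling the rank while doubling the
# radius suggests a local dimension of about 2.
print(ged_estimate(1, 1.0, 4, 2.0))
```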
On Finding the Jaccard Center
We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X, Y, the Jaccard index is defined as |X ∩ Y|/|X ∪ Y| and the corresponding distance is 1 − |X ∩ Y|/|X ∪ Y|. The Jaccard center is a set C minimizing the maximum distance to any set of N.
We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP.
Furthermore, we show that the problem is fixed parameter tractable in the maximum Hamming norm between Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem.
In addition, we show that, contrary to the previously studied median problem by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial time algorithm.
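As a concrete illustration of the objective, the following sketch computes the Jaccard distance and a brute-force 1-center restricted to the input collection itself. The actual problem optimizes over all candidate sets C, which is what makes it NP-hard; restricting candidates to N is only a simple baseline, not any of the paper's algorithms:

```python
def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y|, with the convention d(∅, ∅) = 0."""
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)

def restricted_jaccard_center(collection):
    """Input set minimizing the maximum Jaccard distance to the collection."""
    return min(collection,
               key=lambda c: max(jaccard_distance(c, s) for s in collection))

N = [{1, 2}, {1, 2, 3}, {1}]
print(restricted_jaccard_center(N))  # {1, 2}: max distance 1/2, the best in N
```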
Improved Search of Relevant Points for Nearest-Neighbor Classification
Given a training set P ⊂ R^d, the nearest-neighbor
classifier assigns any query point q ∈ R^d to the class of its
closest point in P. To answer these classification queries, some training
points are more relevant than others. We say a training point is relevant if
its omission from the training set could induce the misclassification of some
query point in R^d. These relevant points are commonly known as
border points, as they define the boundaries of the Voronoi diagram of P that
separate points of different classes. Being able to compute this set of points
efficiently is crucial to reduce the size of the training set without affecting
the accuracy of the nearest-neighbor classifier.
Improving over a decades-long result by Clarkson, in a recent paper by
Eppstein an output-sensitive algorithm was proposed to find the set of border
points of P in O(n^2 + nk^2) time, where k is the size of such set. In
this paper, we improve this algorithm to have time complexity equal to
O(nk^2) by proving that the first steps of their algorithm, which require
O(n^2) time, are unnecessary.
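The notion of relevance above can be illustrated with a deliberately naive sketch: flag a training point if deleting it flips the leave-one-out nearest-neighbor label at some other training point's location. This is only a crude heuristic probed at training points (it can both over- and under-flag relative to the true border-point definition), not the exact computation nor the output-sensitive algorithms discussed in the abstract:

```python
import math

def nn_label(query, training):
    """Label of the training point closest to query; training is [(point, label)]."""
    return min(training, key=lambda t: math.dist(query, t[0]))[1]

def naive_relevant_points(training):
    """Flag point i if removing it changes the leave-one-out NN label at some j."""
    n = len(training)
    relevant = []
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            q = training[j][0]
            with_i = [training[k] for k in range(n) if k != j]
            without_i = [training[k] for k in range(n) if k != j and k != i]
            if nn_label(q, with_i) != nn_label(q, without_i):
                relevant.append(training[i])
                break
    return relevant

training = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"),
            ((3.0, 0.0), "b"), ((4.0, 0.0), "b")]
# The two points facing the opposite class are flagged as relevant.
print(naive_relevant_points(training))
```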