    Self-improving Algorithms for Coordinate-wise Maxima

    Computing the coordinate-wise maxima of a planar point set is a classic and well-studied problem in computational geometry. We give an algorithm for this problem in the self-improving setting. We have $n$ (unknown) independent distributions $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n$ of planar points. An input point set $(p_1, p_2, \ldots, p_n)$ is generated by taking an independent sample $p_i$ from each $\mathcal{D}_i$, so the input distribution $\mathcal{D}$ is the product $\prod_i \mathcal{D}_i$. A self-improving algorithm repeatedly gets input sets from the distribution $\mathcal{D}$ (which is a priori unknown) and tries to optimize its running time for $\mathcal{D}$. Our algorithm uses the first few inputs to learn salient features of the distribution, and then becomes an optimal algorithm for distribution $\mathcal{D}$. Let $\mathrm{OPT}_{\mathcal{D}}$ denote the expected depth of an optimal linear comparison tree computing the maxima for distribution $\mathcal{D}$. Our algorithm eventually has an expected running time of $O(\mathrm{OPT}_{\mathcal{D}} + n)$, even though it did not know $\mathcal{D}$ to begin with. Our result requires new tools to understand linear comparison trees for computing maxima. We show how to convert general linear comparison trees to very restricted versions, which can then be related to the running time of our algorithm. An interesting feature of our algorithm is an interleaved search, where the algorithm tries to determine the likeliest point to be maximal with minimal computation. This allows the running time to be truly optimal for the distribution $\mathcal{D}$. Comment: To appear in Symposium of Computational Geometry 2012 (17 pages, 2 figures)
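
For readers unfamiliar with the underlying problem, the following is a minimal sketch of the classic sort-and-sweep baseline for planar coordinate-wise maxima. It is not the self-improving algorithm from the abstract; it only illustrates what is being computed. The function name `planar_maxima` and the convention that a point is dominated when another distinct point is at least as large in both coordinates are assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def planar_maxima(points: List[Point]) -> List[Point]:
    """Return the points that no other distinct point dominates.

    Assumed convention: q dominates p if q != p, q.x >= p.x and q.y >= p.y.
    This is the textbook O(n log n) sweep, not the self-improving algorithm.
    """
    best_y = float("-inf")
    maxima: List[Point] = []
    # Sweep from the largest x to the smallest; ties are broken by larger y
    # first, so the dominating point of a tie group is seen before the others.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        # A point is maximal exactly when its y beats every y seen so far.
        if y > best_y:
            maxima.append((x, y))
            best_y = y
    return maxima
```

Calling `planar_maxima([(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)])` returns the staircase `[(3, 1), (2, 2), (1, 3)]`, listed in decreasing order of x.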

    RRR: Rank-Regret Representative

    Selecting the best items in a dataset is a common task in data exploration. However, the concept of "best" lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove "dominated" items and create a "representative" subset of the data set, comprising the "best items" in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be almost as big as the full data. A smaller representative can be found if we relax the requirement to include the best item for every possible user, and instead just limit the users' "regret". Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full data set, for any chosen ranking function. However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. Therefore, we instead use the position of the items in the ranked list to define the regret, and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-$k$ items of any possible ranking function. This problem is NP-complete. We use the geometric interpretation of items to bound their ranks on ranges of functions and to utilize combinatorial geometry notions for developing effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
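
The rank-based definition of regret can be made concrete with a small sketch. The code below does not implement the paper's approximation algorithms; it only measures the rank-regret of a given subset, and only over a finite, hypothetical list of linear ranking functions (weight vectors), which merely approximates "any possible ranking function". The name `rank_regret` and the choice of linear scoring are assumptions made for this example.

```python
from typing import List, Sequence


def rank_regret(
    items: Sequence[Sequence[float]],
    subset_idx: Sequence[int],
    weight_vectors: Sequence[Sequence[float]],
) -> int:
    """Worst rank, over the given linear functions, of the subset's best item.

    A subset has rank-regret at most k (for these functions) when it holds
    at least one top-k item of the full data set under every function.
    """
    worst = 0
    for w in weight_vectors:
        # Score every item with the linear ranking function defined by w.
        scores = [sum(wi * xi for wi, xi in zip(w, item)) for item in items]
        best_in_subset = max(scores[i] for i in subset_idx)
        # Rank is 1 plus the number of items that score strictly higher.
        rank = 1 + sum(1 for s in scores if s > best_in_subset)
        worst = max(worst, rank)
    return worst
```

For example, with items `[(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.2, 0.2)]` and weight vectors `[(1, 0), (0, 1), (1, 1)]`, the single-item subset `[2]` has rank-regret 2: the balanced item is never worse than second under any of the three functions.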

    Phase-space structures II: Hierarchical Structure Finder

    A new multi-dimensional Hierarchical Structure Finder (HSF) to study the phase-space structure of dark matter in N-body cosmological simulations is presented. The algorithm depends mainly on two parameters, which control the level of connectivity of the detected structures and their significance compared to Poisson noise. By working in 6D phase-space, where contrasts are much more pronounced than in 3D position space, our HSF algorithm is capable of detecting subhaloes including their tidal tails, and can recognise other phase-space structures such as pure streams and candidate caustics. If an additional unbinding criterion is added, the algorithm can be used as a self-consistent halo and subhalo finder. As a test, we apply it to a large halo of the Millennium Simulation, where 19% of the halo mass is found to belong to bound substructures, which is more than what is detected with conventional 3D substructure finders, and an additional 23-36% of the total mass belongs to unbound HSF structures. The distribution of identified phase-space density peaks is clearly bimodal: high peaks are dominated by the bound structures and low peaks belong mostly to tidal streams. In order to better understand what HSF provides, we examine the time evolution of structures, based on the merger tree history. Bound structures typically make only up to 6 orbits inside the main halo. Still, HSF can identify at the present time at least 80% of the original content of structures with a redshift of infall as high as z <= 0.3, which illustrates the significant power of this tool to perform dynamical analyses in phase-space. Comment: Submitted to MNRAS, 24 pages, 18 figures

    On smoothed analysis of quicksort and Hoare's find

    We provide a smoothed analysis of Hoare's find algorithm, and we revisit the smoothed analysis of quicksort. Hoare's find algorithm, often called quickselect or one-sided quicksort, is an easy-to-implement algorithm for finding the k-th smallest element of a sequence. While the worst-case number of comparisons that Hoare's find needs is Theta(n^2), the average-case number is Theta(n). We analyze what happens between these two extremes by providing a smoothed analysis. In the first perturbation model, an adversary specifies a sequence of n numbers in [0,1], and then, to each number of the sequence, we add a random number drawn independently from the interval [0,d]. We prove that Hoare's find needs Theta(n/(d+1) sqrt(n/d) + n) comparisons in expectation if the adversary may also specify the target element (even after seeing the perturbed sequence) and slightly fewer comparisons for finding the median. In the second perturbation model, each element is marked with a probability of p, and then a random permutation is applied to the marked elements. We prove that the expected number of comparisons to find the median is Omega((1−p)n/p log n). Finally, we provide lower bounds for the smoothed number of comparisons of quicksort and Hoare's find for the median-of-three pivot rule, which usually yields faster algorithms than always selecting the first element: the pivot is the median of the first, middle, and last element of the sequence. We show that median-of-three does not yield a significant improvement over the classic rule.
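
As an illustration of the two pivot rules discussed in the abstract, here is a minimal sketch of Hoare's find that also counts comparisons. The function names (`hoare_find`, `first_pivot`, `median_of_three`), the 0-based index `k`, and the counting convention (one comparison per non-pivot element in each partitioning step, ignoring the comparisons used to pick the pivot) are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def first_pivot(items: Sequence[float]) -> int:
    """Classic rule: always use the first element as the pivot."""
    return 0


def median_of_three(items: Sequence[float]) -> int:
    """Index of the median of the first, middle, and last element."""
    lo, mid, hi = 0, len(items) // 2, len(items) - 1
    candidates = sorted([(items[lo], lo), (items[mid], mid), (items[hi], hi)])
    return candidates[1][1]


def hoare_find(
    seq: Sequence[float],
    k: int,
    pivot_rule: Callable[[Sequence[float]], int] = first_pivot,
) -> Tuple[float, int]:
    """Return (k-th smallest element, comparisons used); k is 0-based."""
    if not 0 <= k < len(seq):
        raise IndexError("k out of range")
    items: List[float] = list(seq)
    comparisons = 0
    while True:
        p = pivot_rule(items)
        pivot = items[p]
        rest = items[:p] + items[p + 1:]
        # Partition around the pivot; each remaining element is compared once.
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        comparisons += len(rest)
        if k < len(smaller):
            items = smaller          # target lies left of the pivot
        elif k == len(smaller):
            return pivot, comparisons
        else:
            k -= len(smaller) + 1    # skip the smaller part and the pivot
            items = larger
```

On the already sorted input `[1, 2, 3, 4, 5]` with `k = 4`, the first-element rule uses 10 comparisons (the quadratic worst case, n(n-1)/2), while `median_of_three` uses 5 under the same counting convention.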

    $\mathcal{G}$-SELC: Optimization by sequential elimination of level combinations using genetic algorithms and Gaussian processes

    Identifying promising compounds from a vast collection of feasible compounds is an important and yet challenging problem in the pharmaceutical industry. An efficient solution to this problem will help reduce the expenditure at the early stages of drug discovery. In an attempt to solve this problem, Mandal, Wu and Johnson [Technometrics 48 (2006) 273--283] proposed the SELC algorithm. Although powerful, it fails to extract substantial information from the data to guide the search efficiently, as this methodology is not based on any statistical modeling. The proposed approach uses Gaussian Process (GP) modeling to improve upon SELC, and is hence named $\mathcal{G}$-SELC. The performance of the proposed methodology is illustrated using four- and five-dimensional test functions. Finally, we implement the new algorithm on a real pharmaceutical data set for finding a group of chemical compounds with optimal properties. Comment: Published at http://dx.doi.org/10.1214/08-AOAS199 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)