
    On smoothed analysis of quicksort and Hoare's find

    We provide a smoothed analysis of Hoare's find algorithm, and we revisit the smoothed analysis of quicksort. Hoare's find algorithm, often called quickselect or one-sided quicksort, is an easy-to-implement algorithm for finding the k-th smallest element of a sequence. While the worst-case number of comparisons that Hoare's find needs is Theta(n^2), the average-case number is Theta(n). We analyze what happens between these two extremes by providing a smoothed analysis. In the first perturbation model, an adversary specifies a sequence of n numbers in [0, 1], and then, to each number of the sequence, we add a random number drawn independently from the interval [0, d]. We prove that Hoare's find needs Theta(n/(d+1) * sqrt(n/d) + n) comparisons in expectation if the adversary may also specify the target element (even after seeing the perturbed sequence), and slightly fewer comparisons for finding the median. In the second perturbation model, each element is marked with probability p, and then a random permutation is applied to the marked elements. We prove that the expected number of comparisons to find the median is Omega((1-p) * (n/p) * log n). Finally, we provide lower bounds for the smoothed number of comparisons of quicksort and Hoare's find under the median-of-three pivot rule, which usually yields faster algorithms than always selecting the first element: the pivot is the median of the first, middle, and last element of the sequence. We show that median-of-three does not yield a significant improvement over the classic rule.
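
    The following sketch (in Python, as an illustration only) implements Hoare's find with the classic first-element pivot rule, counts comparisons, and feeds it an input generated according to the first perturbation model described above. The sorted adversarial sequence and the uniform noise on [0, d] are assumptions made for the sake of the example; the sketch is not the paper's analysis.

    import random

    def quickselect(seq, k):
        """Hoare's find: return the k-th smallest element (1-indexed) and the comparison count."""
        comparisons = 0
        seq = list(seq)
        while True:
            pivot = seq[0]
            comparisons += len(seq) - 1                 # the pivot is compared with every other element
            smaller = [x for x in seq[1:] if x < pivot]
            larger = [x for x in seq[1:] if x >= pivot]
            if k <= len(smaller):
                seq = smaller                           # the k-th smallest lies among the smaller elements
            elif k == len(smaller) + 1:
                return pivot, comparisons               # the pivot itself is the k-th smallest
            else:
                k -= len(smaller) + 1                   # discard the pivot and all smaller elements
                seq = larger

    def perturb(adversarial, d):
        """First perturbation model: add independent noise from [0, d] to every adversarial number."""
        return [x + random.uniform(0, d) for x in adversarial]

    n, d = 1000, 0.1
    adversarial = [i / n for i in range(n)]             # a sorted sequence, hard for the first-element pivot rule
    median, cost = quickselect(perturb(adversarial, d), n // 2)
    print(cost)                                         # typically far below the Theta(n^2) cost on the unperturbed input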

    07391 Abstracts Collection -- Probabilistic Methods in the Design and Analysis of Algorithms

    From 23.09.2007 to 28.09.2007, the Dagstuhl Seminar 07391 "Probabilistic Methods in the Design and Analysis of Algorithms" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. The seminar brought together leading researchers in probabilistic methods to strengthen and foster collaborations among various areas of Theoretical Computer Science. The interaction between researchers using randomization in algorithm design and researchers studying known algorithms and heuristics in probabilistic models enhanced the research of both groups in developing new complexity frameworks and in obtaining new algorithmic results. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

    Efficient estimation algorithms for large and complex data sets

    The recent world-wide surge in available data allows the investigation of many new and sophisticated questions that were inconceivable just a few years ago. However, two types of data sets often complicate the subsequent analysis: data that is simple in structure but large in size, and data that is small in size but complex in structure. These two kinds of problems also apply to biological data. For example, data sets acquired from family studies, where the data can be visualized as pedigrees, are small in size but, because of the dependencies within families, complex in structure. By comparison, next-generation sequencing data, such as data from chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq), is simple in structure but large in size. Even though the available computational power is increasing steadily, it often cannot keep up with the massive amounts of new data being acquired. In these situations, ordinary methods are no longer applicable or scale badly with increasing sample size. The challenge is then to adapt common algorithms to modern data sets. This dissertation considers the challenge of performing inference on modern data sets and approaches the problem in two parts: first with a problem from genetics, and then with one from molecular biology. In the first part, we focus on data of a complex nature. Specifically, we analyze data from a family study on colorectal cancer (CRC). To model familial clusters of increased cancer risk, we assume inheritable but latent variables for a risk factor that increases the hazard rate for the occurrence of CRC. During parameter estimation, the inheritability of this latent variable necessitates a marginalization of the likelihood that is costly in time for large families. We first approached this problem by implementing computational accelerations that reduced the time for an optimization by the Nelder-Mead method to about 10% of that of a naive implementation. In a second step, we developed an expectation-maximization (EM) algorithm that works on data obtained from pedigrees. To achieve this, we used factor graphs to factorize the likelihood into a product of “local” functions, which enabled us to apply the sum-product algorithm in the E-step, reducing the computational complexity from exponential to linear. Our algorithm thus enables parameter estimation for family studies in a feasible amount of time. In the second part, we turn to ChIP-Seq data. Previously, practitioners had to assemble a set of tools based on different statistical assumptions and dedicated to specific applications, such as calling protein occupancy peaks or testing for differential occupancy between experimental conditions. To remove these restrictions and create a unified framework for ChIP-Seq analysis, we developed GenoGAM (Genome-wide Generalized Additive Model), which extends generalized additive models to work efficiently on data spread along a long genomic axis by reducing the scaling from cubic to linear and by employing a data-parallelism strategy. Our software makes the well-established and flexible GAM framework available for a number of genomic applications. Furthermore, the statistical framework allows for significance testing of differential occupancy. In conclusion, I show how developing algorithms of lower complexity can open the door to analyses that were previously intractable. On this basis, it is recommended that subsequent research efforts focus both on lowering the complexity of existing algorithms and on designing new, lower-complexity algorithms.
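
    The sketch below illustrates the sum-product idea behind the E-step speed-up mentioned above on the simplest possible structure: a chain of binary latent risk states with placeholder emission and transition factors (these factors, and the chain shape itself, are assumptions for illustration, not the dissertation's pedigree model). The forward pass marginalizes the likelihood in time linear in the chain length, whereas naive enumeration of all latent configurations is exponential.

    import numpy as np
    from itertools import product

    def chain_marginal_likelihood(emission, transition):
        """Sum-product (forward) pass over a chain of latent states.

        emission[t, k]   = P(observation_t | z_t = k)
        transition[j, k] = P(z_t = k | z_{t-1} = j)
        Runs in O(T * K^2) time instead of the O(K^T) of brute-force enumeration.
        """
        T, K = emission.shape
        message = emission[0] / K                            # uniform prior over the first latent state
        for t in range(1, T):
            message = emission[t] * (message @ transition)   # sum out z_{t-1}
        return message.sum()

    def brute_force_likelihood(emission, transition):
        """Exponential-time reference: enumerate every latent configuration."""
        T, K = emission.shape
        total = 0.0
        for path in product(range(K), repeat=T):
            p = emission[0, path[0]] / K
            for t in range(1, T):
                p *= transition[path[t - 1], path[t]] * emission[t, path[t]]
            total += p
        return total

    rng = np.random.default_rng(0)
    T, K = 8, 2                                              # 8 latent states, each a binary risk indicator
    emission = rng.uniform(0.1, 1.0, size=(T, K))
    transition = rng.uniform(size=(K, K))
    transition /= transition.sum(axis=1, keepdims=True)

    print(chain_marginal_likelihood(emission, transition))   # agrees with the brute-force value up to rounding
    print(brute_force_likelihood(emission, transition))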

    In pursuit of linear complexity in discrete and computational geometry

    Many computational problems arise naturally from geometric data. In this thesis, we consider three such problems: (i) distance optimization problems over point sets, (ii) computing contour trees over simplicial meshes, and (iii) bounding the expected complexity of weighted Voronoi diagrams. While these topics are broad, the focus here is on identifying structure that implies linear (or near-linear) algorithmic and descriptive complexity. The first topic we consider is geometric optimization. More specifically, we define a large class of distance problems for which we provide linear-time exact or approximate solutions. Roughly speaking, the class of problems facilitates either clustering close points together (i.e., netting) or throwing out outliers (i.e., pruning), allowing for successively smaller summaries of the relevant information in the input. A surprising number of classical geometric optimization problems are unified under this framework, including finding the optimal k-center clustering, the k-th ranked distance, the k-th heaviest edge of the MST, the minimum-radius ball enclosing k points, and many others. In several cases we obtain the first known linear-time approximation algorithm for a given problem, with an approximation ratio matching that of previous work. The second topic we investigate is contour trees, a fundamental structure in computational topology. Contour trees give a compact summary of the evolution of level sets on a mesh and are typically used on massive data sets. Previous algorithms for computing contour trees took Θ(n log n) time and were worst-case optimal. Here we provide an algorithm whose running time lies between Θ(n α(n)) and Θ(n log n) and varies depending on the shape of the tree, where α(n) is the inverse Ackermann function. In particular, this is the first algorithm with O(n α(n)) running time on instances with balanced contour trees. Our algorithmic results are complemented by lower bounds indicating that, up to a factor of α(n), our algorithm performs optimally on all instance types. For the final topic, we consider the descriptive complexity of weighted Voronoi diagrams. Such diagrams have quadratic (or higher) worst-case complexity; however, as was the case for contour trees, here we push beyond worst-case analysis. A new diagram, called the candidate diagram, is introduced, which allows us to bound the complexity of weighted Voronoi diagrams arising from a particular probabilistic input model. Specifically, we assume weights are randomly permuted among fixed Voronoi sites, an assumption which is weaker than the more typical assumption of sampled locations. Under this assumption, the expected complexity is shown to be near-linear.
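
    As a toy illustration of the "netting" idea mentioned above, the sketch below snaps 2-d points to a grid of cell width r and keeps one representative per non-empty cell, yielding a small summary of the point set. The grid-based scheme and the parameter r are assumptions chosen for the example; this shows only the general principle, not the thesis's framework.

    def grid_net(points, r):
        """Keep one representative point per grid cell of side length r (2-d points)."""
        representatives = {}
        for (x, y) in points:
            cell = (int(x // r), int(y // r))          # which cell the point falls into
            representatives.setdefault(cell, (x, y))   # first point seen in a cell becomes its representative
        return list(representatives.values())

    points = [(0.1, 0.2), (0.15, 0.22), (3.0, 3.1), (3.05, 3.0), (9.9, 0.0)]
    print(grid_net(points, r=0.5))                     # three representatives, one per occupied cell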

    Multi-feature approach for writer-independent offline signature verification

    Some of the fundamental problems facing handwritten signature verification are the large number of users, the large number of features, the limited number of reference signatures for training, the high intra-personal variability of the signatures, and the unavailability of forgeries as counterexamples. This research first presents a survey of offline signature verification techniques, focusing on feature extraction and verification strategies. The goal is to present the most important advances, as well as the current challenges, in this field. Of particular interest are techniques that allow a signature verification system to be designed from a limited amount of data. Next, a novel offline signature verification system is presented, based on multiple feature extraction techniques, dichotomy transformation, and boosting feature selection. Using multiple feature extraction techniques increases the diversity of information extracted from the signature, thereby producing features that mitigate intra-personal variability, while dichotomy transformation ensures writer-independent classification, thus relieving the verification system from the burden of a large number of users. Finally, using boosting feature selection allows for a low-cost, writer-independent verification system that selects features while learning. As such, the proposed system provides a practical framework to explore and learn from problems with numerous potential features. Comparison with simulation results from systems found in the literature confirms the viability of the proposed system, even when only a single reference signature is available. The proposed system provides an efficient solution to a wide range of problems (e.g., biometric authentication) with limited training samples, new training samples emerging during operations, numerous classes, and few or no counterexamples.
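
    The sketch below gives a minimal illustration of the dichotomy transformation mentioned above: a pair of signature feature vectors is mapped to its element-wise absolute difference, so that "same writer" and "different writer" pairs become the two classes of a single writer-independent problem. The random feature values are placeholders, and the actual features and the boosting-based classifier of this research are not reproduced here.

    import numpy as np

    def dichotomy_transform(reference, questioned):
        """Map a (reference, questioned) pair of feature vectors into distance space."""
        return np.abs(np.asarray(reference) - np.asarray(questioned))

    rng = np.random.default_rng(1)
    writer_a = rng.normal(0.0, 1.0, size=(2, 16))             # two signatures from the same writer
    writer_b = rng.normal(0.5, 1.0, size=(1, 16))             # one signature from another writer

    within = dichotomy_transform(writer_a[0], writer_a[1])    # class: genuine (same-writer pair)
    between = dichotomy_transform(writer_a[0], writer_b[0])   # class: different writers

    # A single two-class classifier (boosting with feature selection in the thesis)
    # is then trained on such distance vectors, independently of any particular writer.
    print(within.mean(), between.mean())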

    Seventh Biennial Report : June 2003 - March 2005

