80,998 research outputs found

    Patterns in random binary search trees

    Get PDF

    On the sub-permutations of pattern avoiding permutations

    Full text link
    There is a deep connection between permutations and trees. Certain sub-structures of permutations, called sub-permutations, bijectively map to sub-trees of binary increasing trees. This opens a powerful tool set to study enumerative and probabilistic properties of sub-permutations and to investigate the relationships between 'local' and 'global' features using the concept of pattern avoidance. First, given a pattern {\mu}, we study how the avoidance of {\mu} in a permutation {\pi} affects the presence of other patterns in the sub-permutations of {\pi}. More precisely, considering patterns of length 3, we solve instances of the following problem: given a class of permutations K and a pattern {\mu}, we ask for the number of permutations πAvn(μ)\pi \in Av_n(\mu) whose sub-permutations in K satisfy certain additional constraints on their size. Second, we study the probability for a generic pattern to be contained in a random permutation {\pi} of size n without being present in the sub-permutations of {\pi} generated by the entry 1kn1 \leq k \leq n. These theoretical results can be useful to define efficient randomized pattern-search procedures based on classical algorithms of pattern-recognition, while the general problem of pattern-search is NP-complete

    Maximal clades in random binary search trees

    Full text link
    We study maximal clades in random phylogenetic trees with the Yule-Harding model or, equivalently, in binary search trees. We use probabilistic methods to reprove and extend earlier results on moment asymptotics and asymptotic normality. In particular, we give an explanation of the curious phenomenon observed by Drmota, Fuchs and Lee (2014) that asymptotic normality holds, but one should normalize using half the variance.Comment: 25 page

    Limit Laws for Functions of Fringe trees for Binary Search Trees and Recursive Trees

    Full text link
    We prove limit theorems for sums of functions of subtrees of binary search trees and random recursive trees. In particular, we give simple new proofs of the fact that the number of fringe trees of size k=kn k=k_n in the binary search tree and the random recursive tree (of total size n n ) asymptotically has a Poisson distribution if k k\rightarrow\infty , and that the distribution is asymptotically normal for k=o(n) k=o(\sqrt{n}) . Furthermore, we prove similar results for the number of subtrees of size k k with some required property P P , for example the number of copies of a certain fixed subtree T T . Using the Cram\'er-Wold device, we show also that these random numbers for different fixed subtrees converge jointly to a multivariate normal distribution. As an application of the general results, we obtain a normal limit law for the number of \ell-protected nodes in a binary search tree or random recursive tree. The proofs use a new version of a representation by Devroye, and Stein's method (for both normal and Poisson approximation) together with certain couplings

    Dynamic load balancing in parallel KD-tree k-means

    Get PDF
    One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy

    The Case for Learned Index Structures

    Full text link
    Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Repeated patterns in tree genetic programming

    Get PDF
    We extend our analysis of repetitive patterns found in genetic programming genomes to tree based GP. As in linear GP, repetitive patterns are present in large numbers. Size fair crossover limits bloat in automatic programming, preventing the evolution of recurring motifs. We examine these complex properties in detail: e.g. using depth v. size Catalan binary tree shape plots, subgraph and subtree matching, information entropy, syntactic and semantic fitness correlations and diffuse introns. We relate this emergent phenomenon to considerations about building blocks in GP and how GP works
    corecore