214 research outputs found

    A Faster Implementation of Online Run-Length Burrows-Wheeler Transform

    Run-length encoding of Burrows-Wheeler transformed strings, resulting in the Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive strings. We propose a new algorithm for online RLBWT working in run-compressed space, which runs in O(n lg r) time and O(r lg n) bits of space, where n is the length of the input string S received so far and r is the number of runs in the BWT of the reversed S. We improve on the state-of-the-art algorithm for online RLBWT in terms of empirical construction time. By adopting a dynamic list for maintaining a total order, we can replace rank queries in a dynamic wavelet tree on a run-length compressed string with direct comparisons of labels in the dynamic list. The empirical results on various benchmarks show the efficiency of our algorithm, especially for highly repetitive strings. Comment: In Proc. IWOCA201
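
To make the quantity r in this abstract concrete, here is a minimal Python sketch that builds a BWT naively (by sorting rotations) and then run-length encodes it. It is an offline illustration only, not the online algorithm the abstract proposes, and it works on the string itself rather than its reverse. The function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Assumes the sentinel does not occur in text and sorts before every
    character of text (true for '$' versus ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse s into a list of (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


# A highly repetitive input: the number of runs r stays far below n.
repetitive = "abracadabra" * 8
rlbwt = run_length_encode(bwt(repetitive))
n, r = len(repetitive) + 1, len(rlbwt)
```

Sorting rotations takes far more time and space than the bounds quoted above; the sketch only shows why r can be much smaller than n on repetitive input.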

    Fast Label Extraction in the CDAWG

    The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log log n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n + occ) to O(m + occ) in the time needed to locate all the occ occurrences of the pattern. We also reduce from O(k log log n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from O(m log log n) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG. Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864
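
For readers unfamiliar with matching statistics, the following Python sketch computes them by brute force. It is a naive reference definition, not the CDAWG-based O(m) method of the abstract; the function name `matching_statistics` is an assumption for illustration, and substring tests use Python's built-in `in` operator.

```python
def matching_statistics(query, text):
    """Return ms, where ms[i] is the length of the longest prefix of
    query[i:] that occurs somewhere in text.

    Naive sketch: relies on the fact that ms[i + 1] >= ms[i] - 1, so the
    previous match length minus one is a valid starting point.
    """
    ms = []
    length = 0
    for i in range(len(query)):
        # Dropping the first character of the previous match keeps a valid match.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("cadabrax", "abracadabra")
```

Each substring test scans the text, so this sketch is much slower than the index-based bounds discussed above; it is only meant to pin down what is being computed.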

    Composite repetition-aware data structures

    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend on only one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting the RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size again depends on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal. Comment: (the name of the third co-author was inadvertently omitted from previous version

    Forecasting species distributions : correlation does not equal causation

    This research was funded by the U.S. Department of the Interior Northeast Climate Adaptation Science Center, which is managed by the U.S. Geological Survey National Climate Adaptation Science Center. Additional funding was provided by T-2-3R grants for Nongame Species Monitoring and Management through the New Hampshire Fish and Game Department and E-1-25 grants for Investigations and Population Recovery through the Vermont Fish and Wildlife Department. Aim: Identifying the mechanisms influencing species' distributions is critical for accurate climate change forecasts. However, current approaches are limited by correlative models that cannot distinguish between direct and indirect effects. Location: New Hampshire and Vermont, USA. Methods: Using causal and correlational models and new theory on range limits, we compared current (2014–2019) and future (2080s) distributions of ecologically important mammalian carnivores and competitors along range limits in the northeastern US under two global climate models (GCMs) and a high-emission scenario (RCP8.5) of projected snow and forest biomass change. Results: Our hypothesis that causal models of climate-mediated competition would result in different distribution predictions than correlational models, both in the current and future periods, was well supported by our results; however, these patterns were prominent only for species pairs that exhibited strong interactions. The causal model predicted the current distribution of Canada lynx (Lynx canadensis) more accurately, likely because it incorporated the influence of competitive interactions mediated by snow with the closely related bobcat (Lynx rufus). Both modeling frameworks predicted an overall decline in lynx occurrence in the central high-elevation regions and increased occurrence in the northeastern region in the 2080s due to changes in land use that provided optimal habitat. However, these losses and gains were less substantial in the causal model due to the inclusion of an indirect buffering effect of snow on lynx. Main conclusions: Our comparative analysis indicates that a causal framework, steeped in ecological theory, can be used to generate spatially explicit predictions of species distributions. This approach can be used to disentangle correlated predictors that have previously hampered understanding of range limits and species' response to climate change. Publisher PDF. Peer reviewe

    Lightweight BWT and LCP merging via the gap algorithm

    Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint, this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory-intensive construction algorithm.

    Cognition in adults with Williams syndrome — A 20-year follow-up study

    Background: Williams syndrome (WBS) is a genetic multisystem disorder. The main symptom is borderline (intelligence quotient, IQ 70–79) or abnormally low intelligence (IQ Methods: We followed 25 adults (age at baseline 19–68, median 38) with genetically confirmed WBS for about 20 years. The study subjects underwent medical and neuropsychological assessments at baseline and at the end of follow‐up. Results: The mean VIQ remained quite stable from early adulthood up to 40 years of age, after which it declined. The mean PIQ kept improving from early adulthood until 50 years of age, after which it gradually declined. At the end of the study, all study subjects had at least two longstanding health problems, of which hypertension, psychiatric disorder, and scoliosis or kyphosis occurred most frequently. At the end of the study, two patients suffered from vascular dementia. Seven patients died during the follow‐up. Conclusions: In adults with WBS, the course of cognition is uneven across the cognitive profile. Their verbal functions both develop and deteriorate earlier than performance/nonverbal functions. Frequent somatic co‐morbidities may increase the risk of a shortened life span.

    Storage and retrieval of individual genomes

    Volume: 5541. A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies. Peer reviewe

    BENCHOP - The BENCHmarking project in Option Pricing

    The aim of the BENCHOP project is to provide the finance community with a common suite of benchmark problems for option pricing. We provide a detailed description of the six benchmark problems together with methods to compute reference solutions. We have implemented fifteen different numerical methods for these problems and compare their relative performance. All implementations are available online and can be used for future development and comparison.

    Flexible Indexing of Repetitive Collections

    Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.
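
The count queries mentioned in this abstract are typically answered on a BWT by backward search. The Python sketch below shows that idea on a plain, uncompressed BWT string; it does not reproduce the RLBWT or LZ77 structures of the paper. The names `bwt` and `count_occurrences` and the `$` sentinel are assumptions for this example, and rank queries are answered by scanning a slice rather than by a succinct structure.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations; assumes sentinel sorts before all of text."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the indexed text by backward search."""
    # first_row[c] = number of characters in the text strictly smaller than c.
    counts = Counter(bwt_str)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt_str)  # half-open range of sorted suffixes
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # rank(c, i) is computed naively as the count of c in bwt_str[:i].
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("abracadabra")
hits = count_occurrences(index, "abra")
```

A practical index would replace the slice scans with rank structures over the run-length encoded BWT, so that each of the m steps costs little more than constant time.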

    Compressed Suffix Arrays for Massive Data

    We present a fast, space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n log n) time in the worst case, and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.