34 research outputs found

    Towards optimal packed string matching

    Dedicated to Professor Gad M. Landau, on the occasion of his 60th birthday. Keywords: String matching; Word-RAM; Packed strings. In the packed string matching problem, it is assumed that each machine word can accommodate up to α characters, thus an n-character string occupies n/α memory words. The main word-size string-matching instruction wssm is available in contemporary commodity processors. The other word-size maximum-suffix instruction wslm is only required during the pattern pre-processing. Benchmarks show that our solution can be efficiently implemented, unlike some prior theoretical packed string matching work. (b) We also consider the complexity of the packed string matching problem in the classical word-RAM model in the absence of the specialized micro-level instructions wssm and wslm. We propose micro-level algorithms for the theoretically efficient emulation using parallel algorithms techniques to emulate wssm and using the Four-Russians technique to emulate wslm. Surprisingly, our bit-parallel emulation of wssm also leads to a new simplified parallel random access machine string-matching algorithm. As a byproduct to facilitate our results we develop a new algorithm for finding the leftmost (most significant) 1 bits in consecutive non-overlapping blocks of uniform size inside a word. This latter problem is not known to be reducible to finding the rightmost 1, which can be easily solved, since we do not know how to reverse the bits of a word in O(1) time.
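    To make the "leftmost 1 bits in consecutive non-overlapping blocks" problem concrete, here is a minimal reference sketch in Python. It only specifies the expected output with a plain loop over blocks; it is not the constant-time algorithm the abstract refers to. The function name, the 64-bit default word size, and the convention that bit 0 is the least significant bit are assumptions made for illustration.

```python
def leftmost_ones_in_blocks(word: int, block_size: int, word_size: int = 64) -> int:
    """Return a word that keeps only the most significant 1 bit of each block.

    Reference specification only: runs in O(word_size / block_size) steps,
    not the O(1) word-RAM time discussed in the abstract.
    Blocks are counted from the least significant end (an assumption).
    """
    block_mask = (1 << block_size) - 1
    result = 0
    for start in range(0, word_size, block_size):
        block = (word >> start) & block_mask
        if block:
            # bit_length() - 1 is the index of the highest set bit in the block
            result |= 1 << (start + block.bit_length() - 1)
    return result


if __name__ == "__main__":
    w = 0b0110_0001_0000_1010
    print(bin(leftmost_ones_in_blocks(w, block_size=4, word_size=16)))
```

    Empty blocks contribute nothing to the result, so the output word has at most one set bit per block.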

    Optimal Packed String Matching

    In the packed string matching problem, each machine word accommodates α characters, thus an n-character text occupies n/α memory words. We extend the Crochemore-Perrin constant-space, O(n)-time string matching algorithm to run in optimal O(n/α) time and even in real-time, achieving a factor α speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC^0 instructions (i.e., no multiplication) plus two specialized AC^0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e., Intel’s SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose theoretically-efficient emulation using integer multiplication (not AC^0) and table lookup.
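    The packing idea itself can be illustrated with a small Python sketch: a text is stored α = 8 characters per 64-bit word, and one word is tested against a single character in all eight positions at once using a standard bit-parallel zero-byte test. This is not the paper's algorithm or its specialized instructions; the names `pack`, `match_byte_mask`, and `find_char`, the 8-bit character size, and the little-endian lane order are assumptions made for illustration. The multiplication only broadcasts the character and could be precomputed.

```python
ALPHA = 8                      # characters per 64-bit word (assumes 8-bit characters)
MASK64 = (1 << 64) - 1
LOW7 = 0x7F7F7F7F7F7F7F7F      # low 7 bits of every byte lane
ONES = 0x0101010101010101      # 0x01 in every byte lane


def pack(text: bytes) -> list[int]:
    """Pack text into 64-bit words, ALPHA characters per word, zero-padded."""
    padded = text + b"\x00" * (-len(text) % ALPHA)
    return [int.from_bytes(padded[i:i + ALPHA], "little")
            for i in range(0, len(padded), ALPHA)]


def match_byte_mask(word: int, c: int) -> int:
    """Return a word with 0x80 set in each byte lane of `word` that equals c."""
    x = word ^ (c * ONES)              # lanes equal to c become 0x00
    t = (x & LOW7) + LOW7              # bit 7 set iff the lane's low 7 bits are nonzero
    return ~(t | x | LOW7) & MASK64    # bit 7 set iff the whole lane is zero


def find_char(text: bytes, c: int) -> list[int]:
    """List every position of byte c in text, scanning one packed word at a time."""
    positions = []
    for wi, word in enumerate(pack(text)):
        m = match_byte_mask(word, c)
        while m:
            low = m & -m                           # lowest set marker bit
            pos = wi * ALPHA + (low.bit_length() - 1) // 8
            if pos < len(text):                    # ignore zero padding
                positions.append(pos)
            m ^= low
    return positions


if __name__ == "__main__":
    print(len(pack(b"abracadabra")), find_char(b"abracadabra", ord("a")))
```

    The 11-character example occupies two words, matching the n/α word count in the abstract (rounded up), and the per-word test touches all eight characters with a constant number of word operations.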

    Sequence dependence of isothermal DNA amplification via EXPAR

    Isothermal nucleic acid amplification is becoming increasingly important for molecular diagnostics. Therefore, new computational tools are needed to facilitate assay design. In the isothermal EXPonential Amplification Reaction (EXPAR), template sequences with similar thermodynamic characteristics perform very differently. To understand what causes this variability, we characterized the performance of 384 template sequences, and used this data to develop two computational methods to predict EXPAR template performance based on sequence: a position weight matrix approach with support vector machine classifier, and RELIEF attribute evaluation with Naïve Bayes classification. The methods identified well and poorly performing EXPAR templates with 67–70% sensitivity and 77–80% specificity. We combined these methods into a computational tool that can accelerate new assay design by ruling out likely poor performers. Furthermore, our data suggest that variability in template performance is linked to specific sequence motifs. Cytidine, a pyrimidine base, is over-represented in certain positions of well-performing templates. Guanosine and adenosine, both purine bases, are over-represented in similar regions of poorly performing templates, frequently as GA or AG dimers. Since polymerases have a higher affinity for purine oligonucleotides, polymerase binding to GA-rich regions of a single-stranded DNA template may promote non-specific amplification in EXPAR and other nucleic acid amplification reactions

    SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way Galaxy, and Extra-Solar Planetary Systems

    Building on the legacy of the Sloan Digital Sky Survey (SDSS-I and II), SDSS-III is a program of four spectroscopic surveys on three scientific themes: dark energy and cosmological parameters, the history and structure of the Milky Way, and the population of giant planets around other stars. In keeping with SDSS tradition, SDSS-III will provide regular public releases of all its data, beginning with SDSS DR8 (which occurred in Jan 2011). This paper presents an overview of the four SDSS-III surveys. BOSS will measure redshifts of 1.5 million massive galaxies and Lyα forest spectra of 150,000 quasars, using the BAO feature of large scale structure to obtain percent-level determinations of the distance scale and Hubble expansion rate at z<0.7 and at z~2.5. SEGUE-2, which is now completed, measured medium-resolution (R=1800) optical spectra of 118,000 stars in a variety of target categories, probing chemical evolution, stellar kinematics and substructure, and the mass profile of the dark matter halo from the solar neighborhood to distances of 100 kpc. APOGEE will obtain high-resolution (R~30,000), high signal-to-noise (S/N>100 per resolution element), H-band (1.51-1.70 micron) spectra of 10^5 evolved, late-type stars, measuring separate abundances for ~15 elements per star and creating the first high-precision spectroscopic survey of all Galactic stellar populations (bulge, bar, disks, halo) with a uniform set of stellar tracers and spectral diagnostics. MARVELS will monitor radial velocities of more than 8000 FGK stars with the sensitivity and cadence (10-40 m/s, ~24 visits per star) needed to detect giant planets with periods up to two years, providing an unprecedented data set for understanding the formation and dynamical evolution of giant planet systems. (Abridged) Comment: Revised to version published in The Astronomical Journal.

    String pattern Matching For A Deluge Survival Kit

    String Pattern Matching concerns itself with algorithmic and combinatorial issues related to matching and searching on linearly arranged sequences of symbols, arguably the simplest possible discrete structures. As unprecedented volumes of sequence data are amassed, disseminated and shared at an increasing pace, effective access to, and manipulation of such data depend crucially on the efficiency with which strings are structured, compressed, transmitted, stored, searched and retrieved. This paper samples from this perspective, and with the authors' own bias, a rich arsenal of ideas and techniques developed in more than three decades of history.

    Writing information into DNA

    Abstract. The time is approaching when information can be written into DNA. This tutorial work surveys the methods for designing code words using DNA, and proposes a simple code that avoids unwanted hybridization in the presence of shift and concatenation of DNA words and their complements.