2,072 research outputs found

    The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences.

    Get PDF
    Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data

    JACKIE: Fast Enumeration of Genome-Wide Single- and Multicopy CRISPR Target Sites and Their Off-Target Numbers.

    Get PDF
    Zinc finger protein-, transcription activator like effector-, and CRISPR-based methods for genome and epigenome editing and imaging have provided powerful tools to investigate functions of genomes. Targeting sequence design is vital to the success of these experiments. Although existing design software mainly focus on designing target sequence for specific elements, we report here the implementation of Jackie and Albert\u27s Comprehensive K-mer Instances Enumerator (JACKIE), a suite of software for enumerating all single- and multicopy sites in the genome that can be incorporated for genome-scale designs as well as loaded onto genome browsers alongside other tracks for convenient web-based graphic-user-interface-enabled design. We also implement fast algorithms to identify sequence neighborhoods or off-target counts of targeting sequences so that designs with low probability of off-target can be identified among millions of design sequences in reasonable time. We demonstrate the application of JACKIE-designed CRISPR site clusters for genome imaging

    Using multiple alignments to improve seeded local alignment algorithms

    Get PDF
    Multiple alignments among genomes are becoming increasingly prevalent. This trend motivates the development of tools for efficient homology search between a query sequence and a database of multiple alignments. In this paper, we present an algorithm that uses the information implicit in a multiple alignment to dynamically build an index that is weighted most heavily towards the promising regions of the multiple alignment. We have implemented Typhon, a local alignment tool that incorporates our indexing algorithm, which our test results show to be more sensitive than algorithms that index only a sequence. This suggests that when applied on a whole-genome scale, Typhon should provide improved homology searches in time comparable to existing algorithms

    A perceptual hash function to store and retrieve large scale DNA sequences

    Full text link
    This paper proposes a novel approach for storing and retrieving massive DNA sequences.. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, that we adapted for DNA sequences. Perceptual hash function presented here is based on a Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray level intensity pixel and the hash is calculated from its significant frequency characteristics. This results to a drastic data reduction between the sequence and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by "avalanche effect" and thus can be compared. The similarity distance between two hashes is estimated with the Hamming Distance, which is used to retrieve DNA sequences. Experiments that we conducted show that our approach is relevant for storing massive DNA sequences, and retrieving them
    • …
    corecore