
    On Longest Repeat Queries Using GPU

    Repeat finding in strings has important applications in subfields such as computational biology. The challenge of finding the longest repeats covering particular string positions was recently proposed and solved by İleri et al., using an optimal total of O(n) time and space, where n is the string size. However, their solution can only find the leftmost longest repeat for each of the n string positions, and it is not known how to parallelize it. In this paper, we propose a new solution for longest repeat finding which, although theoretically suboptimal in time, is conceptually simpler and in practice runs faster and uses less memory than the optimal solution. Further, our solution can find all longest repeats of every string position, while still maintaining a faster processing speed and lower memory usage. Moreover, our solution is parallelizable in the shared-memory architecture (SMA), enabling it to take advantage of modern multi-processor computing platforms such as general-purpose graphics processing units (GPUs). We have implemented both the sequential and parallel versions of our solution. Experiments with both biological and non-biological data show that our sequential and parallel solutions are faster than the optimal solution by factors of 2–3.5 and 6–14, respectively, and use less memory.
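    To make the problem concrete: a repeat is a substring that occurs at least twice, and the query asks for the longest repeat covering a given position. The brute-force sketch below is our own illustration of the problem statement, not the paper's algorithm or İleri et al.'s O(n) solution; it simply tries candidate lengths from longest to shortest.

```python
def count_occurrences(s, t):
    """Count possibly overlapping occurrences of t in s."""
    count, start = 0, 0
    while True:
        i = s.find(t, start)
        if i < 0:
            return count
        count += 1
        start = i + 1

def longest_repeats_covering(s, p):
    """Return [(start, substring), ...] for every longest substring of s
    that covers position p and occurs at least twice in s (naive sketch)."""
    n = len(s)
    for length in range(n, 0, -1):  # try longer candidates first
        hits = []
        for start in range(max(0, p - length + 1), min(p, n - length) + 1):
            t = s[start:start + length]
            if count_occurrences(s, t) >= 2:
                hits.append((start, t))
        if hits:
            return hits
    return []
```

    For example, in "abcabc" the longest repeat covering position 1 is "abc" starting at 0; the papers above achieve the same answers far faster.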

    Repetition Detection in a Dynamic String

    A string of the form UU, for a non-empty string U, is called a square. Squares have been well studied from both a combinatorial and an algorithmic perspective. In this paper, we are the first to consider the problem of maintaining a representation of the squares in a dynamic string S of length at most n. We present an algorithm that updates this representation in n^{o(1)} time. This representation allows us to report a longest square substring of S in O(1) time and all square substrings of S in O(output) time. We achieve this by introducing a novel tool: maintaining prefix-suffix matches of two dynamic strings. We extend the above result to the problem of maintaining a representation of all runs (maximal repetitions) of the string. Runs are known to capture the periodic structure of a string and, as an application, we show that our representation of runs allows us to efficiently answer periodicity queries for substrings of a dynamic string. These queries have proven useful in static pattern matching problems, and our techniques have the potential to offer solutions to these problems in a dynamic text setting.
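    The static version of the reporting problem can be stated in a few lines. The sketch below (our illustration only; the paper's contribution is doing this under dynamic updates in n^{o(1)} time) enumerates all distinct square substrings naively.

```python
def squares(s):
    """Naively enumerate all distinct square substrings UU of s
    by checking whether each half equals the other (O(n^3) sketch)."""
    found = set()
    n = len(s)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                found.add(s[i:i + 2 * half])
    return found
```

    A dynamic data structure must keep this set of answers current as S is edited, which is what makes the problem hard.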

    Update Query Time Trade-Off for Dynamic Suffix Arrays

    The suffix array SA(S) of a string S[1 ... n] is an array containing all the suffixes of S sorted in lexicographic order. The suffix array is one of the best-known indexing data structures and serves as a key tool in many string algorithms. In this paper, we present a data structure for maintaining the suffix array of a dynamic string. For every 0 ≤ ε ≤ 1, our data structure reports SA[i] in Õ(n^ε) time and handles a text modification in Õ(n^{1−ε}) time. Additionally, our data structure enables the same query time for reporting iSA[i], where iSA is the inverse suffix array of S[1 ... n]. Our data structure can be used to construct sub-linear dynamic variants of static string algorithms and data structures that are based on the suffix array and the inverse suffix array.
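    The two arrays being maintained are easy to define statically. The sketch below (0-indexed, naive O(n² log n) construction; the paper's point is supporting queries and updates on a changing string without rebuilding) shows SA and its inverse.

```python
def suffix_array(s):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def inverse_suffix_array(sa):
    """iSA[i] = rank of the suffix starting at i in sorted order."""
    isa = [0] * len(sa)
    for rank, i in enumerate(sa):
        isa[i] = rank
    return isa
```

    For "banana" this gives SA = [5, 3, 1, 0, 4, 2]; the dynamic structure answers SA[i] and iSA[i] queries in Õ(n^ε) time instead of storing the arrays explicitly.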

    CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats

    Background: Clustered Regularly Interspaced Palindromic Repeats (CRISPRs) are a novel type of direct repeat found in a wide range of bacteria and archaea. CRISPRs are beginning to attract attention because of their proposed mechanism: defending their hosts against invading extrachromosomal elements such as viruses. Existing repeat detection tools do a poor job of identifying CRISPRs due to the presence of unique spacer sequences separating the repeats. In this study, a new tool, CRT, is introduced that rapidly and accurately identifies CRISPRs in large DNA strings, such as genomes and metagenomes.
    Results: CRT was compared to the CRISPR detection tools Patscan and Pilercr. In terms of correctness, CRT was shown to be very reliable, demonstrating significant improvements over Patscan on the precision, recall and quality measures. When compared to Pilercr, CRT showed improved performance for recall and quality. In terms of speed, CRT proved to be a huge improvement over Patscan. CRT and Pilercr were comparable in speed, although CRT was faster for genomes containing large numbers of repeats.
    Conclusion: In this paper a new tool was introduced for the automatic detection of CRISPR elements. This tool, CRT, showed some important improvements over current techniques for CRISPR identification. CRT's approach to detecting repetitive sequences is straightforward: it uses a simple sequential scan of a DNA sequence and detects repeats directly, without any major conversion or preprocessing of the input. This leads to a program that is easy to describe and understand, yet very accurate, fast and memory efficient, being O(n) in space and O(nm/l) in time.
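    The "simple sequential scan" idea can be sketched as follows. This is our simplified illustration of detecting a direct repeat separated by a spacer-sized gap, with made-up parameter names and defaults; it is not the published CRT algorithm.

```python
def find_crispr_candidates(dna, k=8, min_gap=10, max_gap=60):
    """Illustrative sequential scan: for each k-mer, look for an exact
    recurrence after a spacer-sized gap, suggesting a direct repeat.
    Parameters and defaults are hypothetical, not CRT's."""
    hits = []
    for i in range(len(dna) - k + 1):
        seed = dna[i:i + k]
        # search only the window where a spacer-separated copy could start
        window = dna[i + k + min_gap : i + k + max_gap + k]
        j = window.find(seed)
        if j >= 0:
            hits.append((i, i + k + min_gap + j, seed))
    return hits
```

    A real detector would then extend and merge seeds into full repeat-spacer arrays; the scan above only shows why unique spacers do not confuse this approach the way they confuse generic repeat finders.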

    De novo identification of LTR retrotransposons in eukaryotic genomes

    BACKGROUND: LTR retrotransposons are a class of mobile genetic elements containing two similar long terminal repeats (LTRs). Currently, LTR retrotransposons are annotated in eukaryotic genomes mainly through the conventional homology-searching approach, which is therefore limited to annotating known elements. RESULTS: In this paper, we report a de novo computational method that can identify new LTR retrotransposons without relying on a library of known elements. Specifically, our method identifies intact LTR retrotransposons using an approximate string matching technique and protein domain analysis. In addition, it identifies partially deleted or solo LTRs using profile Hidden Markov Models (pHMMs). As a result, this method can identify all types of LTR retrotransposons de novo. We tested the method on two pairs of eukaryotic genomes: C. elegans vs. C. briggsae and D. melanogaster vs. D. pseudoobscura. LTR retrotransposons in C. elegans and D. melanogaster have been intensively studied using conventional annotation methods. Compared with previous work, we identified new intact LTR retroelements and new putative families, which suggests that new retroelements may remain to be discovered even in well-studied organisms. To assess the sensitivity and accuracy of our method, we compared our results with those of a previously published method, LTR_STRUC, which predominantly identifies full-length LTR retrotransposons. In summary, both methods identified a comparable number of intact LTR retroelements, but our method identified nearly all known elements in C. elegans, while LTR_STRUC missed about a third of them. Our method also identified more known LTR retroelements than LTR_STRUC in the D. melanogaster genome. We further identified some LTR retroelements in the other two genomes, C. briggsae and D. pseudoobscura, which have not been completely finished; the conventional method failed to identify those elements. Finally, the phylogenetic and chromosomal distributions of the identified elements are discussed. CONCLUSION: We report a novel method for de novo identification of LTR retrotransposons in eukaryotic genomes with favorable performance over existing methods.
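    The core structural signal, two highly similar flanking repeats, is what the approximate string matching step exploits. The sketch below is a hypothetical stand-in using Python's difflib similarity ratio; the paper's actual matching technique and threshold are not specified here.

```python
from difflib import SequenceMatcher

def similar_ltrs(flank_5prime, flank_3prime, threshold=0.85):
    """Check whether two candidate flanking regions are near-identical,
    as the two LTRs of an intact retrotransposon should be.
    The 0.85 threshold is an illustrative assumption."""
    ratio = SequenceMatcher(None, flank_5prime, flank_3prime).ratio()
    return ratio >= threshold, ratio
```

    Candidates whose flanks pass this similarity test would then be screened by protein domain analysis, per the abstract.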

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but the analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to distribute computation across large compute grids. This dissertation demonstrates how these technologies can accelerate these computations by orders of magnitude and can make otherwise infeasible computations practical.
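    Read alignment, as defined above, is commonly done by seed-and-extend: index the reference, look up an exact seed from the read, then verify the full read at each candidate position. The sketch below is a minimal generic illustration under assumed parameters, not any specific aligner from the dissertation.

```python
from collections import defaultdict

def build_seed_index(ref, k):
    """Index every k-mer of the reference genome by its positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def align_read(read, ref, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify the whole read,
    allowing a small number of mismatches (threshold is illustrative)."""
    hits = []
    for pos in index.get(read[:k], []):
        window = ref[pos:pos + len(read)]
        if len(window) == len(read):
            mm = sum(a != b for a, b in zip(read, window))
            if mm <= max_mismatches:
                hits.append((pos, mm))
    return hits
```

    The index lookups are independent per read, which is exactly why this workload parallelizes well onto GPUs and MapReduce-style grids.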

    Analysis Of DNA Motifs In The Human Genome

    DNA motifs include repeat elements, promoter elements and gene regulatory elements, and play a critical role in the human genome. This thesis describes a genome-wide computational study of two groups of motifs: tandem repeats and core promoter elements. Tandem repeats in DNA sequences are extremely relevant to biological phenomena and diagnostic tools. Computational programs that discover tandem repeats generate a huge volume of data, which can be difficult to decipher without further organization. A new method is presented here to organize and rank detected tandem repeats through clustering and classification. Our work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of the clusters for the tandem repeats in the human genome shows that the method yields a well-defined grouping in which similarity among repeats is apparent. Our new, alignment-free method facilitates the analysis of the myriad tandem repeats replete in the human genome. We believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats. As with tandem repeats, promoter sequences of genes contain binding sites for proteins that play critical roles in mediating expression levels. Promoter-region binding proteins and their co-factors influence the timing and context of transcription. Despite the critical regulatory role of these non-coding sequences, computational methods to identify and predict DNA binding sites are extremely limited. The work reported here analyzes the relative occurrence of core promoter elements (CPEs) in and around transcription start sites. We found that, across all the data sets, 49%–63% of upstream regions have either TATA box or DPE elements. Our results suggest the possibility of predicting transcription start sites by combining CPE signals with other promoter signals such as CpG islands and clusters of specific transcription binding sites.
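    The alignment-free n-gram representation mentioned above can be sketched directly: map each repeat sequence to a vector of n-gram counts and cluster by a vector distance. The distance choice below (cosine) is one of several measures the thesis could use; treat it as an illustrative assumption.

```python
import math
from collections import Counter

def ngram_vector(seq, n=2):
    """Represent a sequence by its n-gram counts (alignment-free)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def cosine_distance(u, v):
    """1 - cosine similarity between two n-gram count vectors."""
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / (norm_u * norm_v)
```

    Two repeats of the same motif (e.g. "ATATAT" and "ATATATAT") land near distance 0 without any alignment step, while unrelated motifs land near 1, which is what makes clustering over millions of detected repeats tractable.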

    ALGORITHMS FOR CORRECTING NEXT GENERATION SEQUENCING ERRORS

    The advent of next generation sequencing (NGS) technologies generated a revolution in biological research. However, new computational tools are needed in order to use the data they produce. Because of the significantly shorter read lengths and higher per-base error rates, more complicated approaches are required, and critical problems such as genome assembly are still not satisfactorily solved. We therefore focus our attention on improving the quality of NGS data; more precisely, we address the error correction problem. Current methods for correcting errors are not very accurate and do not adapt to the data. We propose a novel tool, HiTEC, to correct errors in NGS data. HiTEC is based on the suffix array data structure accompanied by a statistical analysis. HiTEC's accuracy is significantly higher than that of all previous methods, and it is the only tool able to adjust to the given data set. HiTEC is also time and space efficient.
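    The underlying intuition of read error correction can be shown with a simplified spectrum-style sketch. HiTEC itself works on a suffix array with a statistically chosen threshold; the version below is our own k-mer-count stand-in with hypothetical parameters, shown only to illustrate the idea of replacing a rare k-mer with a nearby frequent one.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, threshold=2):
    """Flag positions covered by a rare k-mer and try the single-base
    substitution that makes the k-mer frequent again (illustrative)."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = ''.join(read[i:i + k])
        if counts[kmer] >= threshold:
            continue  # k-mer is well supported; assume it is correct
        for j in range(k):
            for b in 'ACGT':
                trial = kmer[:j] + b + kmer[j + 1:]
                if counts[trial] >= threshold:
                    read[i + j] = b
                    break
            else:
                continue
            break
    return ''.join(read)
```

    With five error-free copies of a read and one copy carrying a single substitution, the rare k-mer spanning the error is repaired toward the well-supported variant; HiTEC's statistical analysis chooses such thresholds automatically per data set.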