16 research outputs found

    On Longest Repeat Queries Using GPU

    Repeat finding in strings has important applications in subfields such as computational biology. The problem of finding the longest repeats covering particular string positions was recently proposed and solved by İleri et al. in optimal O(n) total time and space, where n is the string size. However, their solution can only find the leftmost longest repeat for each of the n string positions, and it is not known how to parallelize it. In this paper, we propose a new solution for longest repeat finding that, although theoretically suboptimal in time, is conceptually simpler, runs faster, and uses less memory in practice than the optimal solution. Further, our solution can find all longest repeats of every string position while still running faster and using less memory. Moreover, our solution is parallelizable in the shared memory architecture (SMA), enabling it to take advantage of modern multi-processor computing platforms such as general-purpose graphics processing units (GPUs). We have implemented both the sequential and parallel versions of our solution. Experiments with both biological and non-biological data show that our sequential and parallel solutions are faster than the optimal solution by factors of 2-3.5 and 6-14, respectively, and use less memory. Comment: 14 pages.
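
    To make the query in this abstract concrete, the following is a minimal brute-force sketch of "all longest repeats covering position k". It is not the algorithm of the paper or of İleri et al.; the function name, the 0-based indexing, and the choice to count overlapping occurrences as repeats are assumptions made for illustration only.

```python
def longest_repeats_covering(s: str, k: int) -> list[tuple[int, int]]:
    """Return (start, length) for every longest repeated substring of s
    that covers index k (0-based). Brute force, for illustration only."""
    n = len(s)
    # Try lengths from longest to shortest; the whole string cannot repeat.
    for length in range(n - 1, 0, -1):
        hits = []
        # Candidate starts must satisfy start <= k < start + length.
        first = max(0, k - length + 1)
        last = min(k, n - length)
        for start in range(first, last + 1):
            sub = s[start:start + length]
            # Assumption: a substring is a repeat if it occurs at any other
            # position in s (overlapping occurrences are allowed).
            occurs_earlier = s.find(sub) != start
            occurs_later = s.find(sub, start + 1) != -1
            if occurs_earlier or occurs_later:
                hits.append((start, length))
        if hits:
            return hits  # all repeats of the maximal length covering k
    return []


print(longest_repeats_covering("banana", 3))  # [(1, 3), (3, 3)]
```

    The first pair in the returned list corresponds to the leftmost longest repeat; the full list corresponds to the "all longest repeats" variant mentioned in the abstract.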

    Parallel random projection using R high performance computing for planted motif search

    Motif discovery in DNA sequences is one of the most important problems in bioinformatics, and algorithms that solve it accurately and quickly have long been a research goal. This study modifies the random projection algorithm so that it can be implemented with R high performance computing (i.e., the R package pbdMPI). The approach consists of several steps: preprocessing the data, splitting the data into a given number of batches, modifying and implementing random projection in the pbdMPI package, and then aggregating the results. To validate the proposed approach, experiments were conducted on several benchmark data sets, with a sensitivity analysis on the number of cores and batches. The experimental results show that computational cost can be reduced: with 6 cores, the computation runs about 34 times faster than in standalone mode. Thus, the proposed approach can be used for motif discovery effectively and efficiently.

    Epidemics of enterovirus infection in Chungnam Korea, 2008 and 2009

    Previously, we explored the epidemic pattern and molecular characterization of enteroviruses isolated in Chungnam, Korea from 2005 to 2006. The present study extended these observations to 2008 and 2009. In this study, enteroviruses showed a seasonal prevalence pattern (summer to fall) and an age distribution similar to those of the previous investigation. The most prevalent month was July: 42.9% in 2008 and 31.9% in 2009. The highest rate of enterovirus-positive samples occurred in children under 1 year of age. Enterovirus-positive samples were subjected to sequence determination of the VP1 region, which resolved the isolated enteroviruses into 10 types in 2008 (coxsackievirus A4, A16, B1, B3, echovirus 6, 7, 9, 11, 16, and 30) and 8 types in 2009 (coxsackievirus A2, A4, A5, A16, B1, B5, echovirus 11, and enterovirus 71). The most prevalent enterovirus serotypes in 2008 and 2009 were echovirus 30 and coxsackievirus B1, respectively, whereas echovirus 18 and echovirus 5 were the most prevalent types in 2005 and 2006, respectively. The coxsackievirus B1 and B5 strains prevalent in Korea in 2009 were compared with reference strains of the same serotypes by phylogenetic tree analysis. The sequences of coxsackievirus B1 strains segregated into four distinct clusters (A, B, C, and D) with some temporal and regional sub-clustering. Most of the Korean coxsackievirus B1 strains in 2008 and 2009 were in cluster D, while only "Kor08-CVB1-001CN" was in cluster C. The coxsackievirus B5 strains segregated into five distinct genetic groups (clusters A-E) supported by high bootstrap values. The Korean strains isolated in 2001 belonged to cluster D, whereas Korean strains isolated in 2005 and 2009 belonged to cluster E. Comparison of the VP1 amino acid sequences of the Korean coxsackievirus B5 isolates with reference strains revealed amino acid substitutions at nine amino acid positions (532, 562, 570, 571, 576-578, 582, 583, and 585).

    Avian reovirus L2 genome segment sequences and predicted structure/function of the encoded RNA-dependent RNA polymerase protein

    Background: The orthoreoviruses are infectious agents that possess a genome comprised of 10 double-stranded RNA segments encased in two concentric protein capsids. Like virtually all RNA viruses, an RNA-dependent RNA polymerase (RdRp) enzyme is required for viral propagation. RdRp sequences have been determined for the prototype mammalian orthoreoviruses and for several other closely-related reoviruses, including aquareoviruses, but have not yet been reported for any avian orthoreoviruses. Results: We determined the L2 genome segment nucleotide sequences, which encode the RdRp proteins, of two different avian reoviruses, strains ARV138 and ARV176, in order to define conserved and variable regions within reovirus RdRp proteins and to better delineate structure/function of this important enzyme. The ARV138 L2 genome segment was 3829 base pairs long, whereas the ARV176 L2 segment was 3830 nucleotides long. Both segments were predicted to encode λB RdRp proteins 1259 amino acids in length. Alignments of these newly-determined ARV genome segments, and their corresponding proteins, were performed with all currently available homologous mammalian reovirus (MRV) and aquareovirus (AqRV) genome segment and protein sequences. There was ~55% amino acid identity between ARV λB and MRV λ3 proteins, making the RdRp protein the most highly conserved of currently known orthoreovirus proteins, and there was ~28% identity between ARV λB and homologous MRV and AqRV RdRp proteins. Predictive structure/function mapping of identical and conserved residues within the known MRV λ3 atomic structure indicated most identical amino acids and conservative substitutions were located near and within predicted catalytic domains and lining RdRp channels, whereas non-identical amino acids were generally located on the molecule's surfaces. Conclusion: The ARV λB and MRV λ3 proteins showed the highest ARV:MRV identity values (~55%) amongst all currently known ARV and MRV proteins. This implies significant evolutionary constraints are placed on dsRNA RdRp molecules, particularly in regions comprising the canonical polymerase motifs and residues thought to interact directly with template and nascent mRNA. This may point the way to improved design of anti-viral agents specifically targeting this enzyme.

    Efficient implementation of lazy suffix trees

    Giegerich R, Kurtz S, Stoye J. Efficient implementation of lazy suffix trees. SOFTWARE-PRACTICE & EXPERIENCE. 2003;33(11):1035-1049. We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different types. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods. Copyright (C) 2003 John Wiley & Sons, Ltd.

    Implementation of the SAX and Random Projection Algorithms for Time Series Motif Discovery on a Big Data Platform, Case Study: Resonance of Asteroid Orbital Elements

    The Big Data phenomenon has reached many fields of knowledge, including astronomy. One very large astronomical data set is the resonance data of asteroid orbital elements. These data can be processed so that scientists can find the mean motion resonance of an asteroid particle and determine in which year the asteroid will resonate with a particular planet. Processing such a large volume of data, however, takes a long time. For this reason, this research builds a computational model that obtains the mean motion resonance quickly and effectively by modifying and implementing the SAX algorithm and the random projection motif discovery algorithm on a Big Data platform, namely Apache Hadoop and Apache Spark. The results show a very significant speedup of the Big Data platform over standalone use in two scenarios. The first scenario uses a cluster with 4 cores and a varying number of worker nodes; the second uses a cluster with 2 worker nodes and a varying number of cores. The study also shows that the computational model, compared with the output of the SwiftVis software, achieves an average accuracy of 83%.

    New Algorithms for EST clustering

    Philosophiae Doctor - PhD. The expressed sequence tag (EST) database is a rich and fast-growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modern computational biology. A few systems have been designed for this purpose, and a few more are currently under development. These systems are reviewed in the "Literature and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as those based on alignment or oligonucleotide composition. Potential bottlenecks are analysed and the computational complexity of EST clustering is estimated in Chapter 2. This chapter also states the goals of the research and justifies the need for a new algorithm that is fast but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has linear computational complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm uses an asymmetric approach, in which one of the compared sequences is represented as an oligonucleotide table while the second sequence is kept in standard, linear form. A short window is moved along the linear sequence, and all overlapping oligonucleotides of a constant length in the frame are compared against the oligonucleotide table. The result of comparing two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity, the probabilities of false positives and false negatives can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40 bp with 95% identity, which is better than the resolution of contemporary EST clustering tools. This algorithm was used as the sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies: stringent and loose clustering. Both were tested on small but realistic benchmark data sets and show results similar to those of one of the best existing clustering programs, d2_cluster, but with a significant advantage in speed and in sensitivity to small overlapping regions of ESTs. On three different CPUs the new algorithm ran at least two times faster, leaving fewer singletons and producing bigger clusters. With parallel optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as d2_cluster and can be immediately included in the STACKPack package for EST clustering. The stringent clustering program produces already assembled clusters and can identify alternatively processed variants during the clustering process.
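
    The asymmetric comparison described in this abstract (one sequence as an oligonucleotide table, a window sliding along the other, a single score compared to a threshold) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the use of a Python set as the table, the scoring rule (maximum number of matching oligonucleotides in any window), and the parameter values in the example are all hypothetical.

```python
def oligo_table(seq: str, k: int) -> set[str]:
    """Build the oligonucleotide table: every k-mer that occurs in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def window_score(table: set[str], seq: str, k: int, window: int) -> int:
    """Slide a window of `window` bases along seq and return the largest
    number of overlapping k-mers inside any one window found in the table."""
    # 1 where the k-mer starting at position i is present in the table.
    hits = [1 if seq[i:i + k] in table else 0
            for i in range(len(seq) - k + 1)]
    per_window = window - k + 1  # number of k-mers inside one window
    if per_window <= 0 or not hits:
        return 0
    per_window = min(per_window, len(hits))
    # Running sum, so the scan is linear in the length of seq.
    current = sum(hits[:per_window])
    best = current
    for i in range(per_window, len(hits)):
        current += hits[i] - hits[i - per_window]
        best = max(best, current)
    return best


# Hypothetical parameters: 4-mers, 8-base window, threshold of 3 matches.
table = oligo_table("ACGTACGTTT", k=4)
score = window_score(table, "GGGACGTACGGG", k=4, window=8)
print(score, score >= 3)  # 4 True
```

    Keeping one sequence as a lookup table and scanning the other once is what makes the comparison linear in sequence length, which matches the complexity claim in the abstract; the specific threshold and window size would need to be tuned to the 40 bp, 95% identity target mentioned there.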