279 research outputs found

    A Novel Signal Processing Measure to Identify Exact and Inexact Tandem Repeat Patterns in DNA Sequences

    The identification and analysis of repetitive patterns are active areas of biological and computational research. Tandem repeats in telomeres play a role in cancer, and hypervariable trinucleotide tandem repeats are linked to over a dozen major neurodegenerative genetic disorders. In this paper, we present an algorithm that identifies exact and inexact repeat patterns in DNA sequences using an orthogonal exactly periodic subspace decomposition technique. Using the new measure, our algorithm resolves several problems that affected previous signal-processing-based algorithms, such as determining whether a repeat pattern has period P or a multiple of it (2P, 3P, etc.). The algorithm identifies repeats in O(N Lw log Lw) time, where N is the length of the DNA sequence and Lw is the window length. It operates in two stages. In the first stage, each nucleotide is analyzed separately for periodicity; in the second stage, the periodicity information for each nucleotide is combined to identify the tandem repeats. Datasets containing exact and inexact repeats were used in the experiments. The experimental results show the effectiveness of the approach.
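
    The two-stage idea in this abstract can be illustrated with a much simpler stand-in. The sketch below does not implement the paper's periodic subspace decomposition; it swaps in a plain lag-agreement score, and the names `indicator`, `lag_agreement`, `smallest_period`, and the `threshold` parameter are assumptions made for illustration. Stage 1 scores each nucleotide's binary indicator sequence separately; stage 2 combines the four scores and returns the smallest period that passes, which is one naive way to prefer P over 2P, 3P, and so on.

```python
BASES = "ACGT"


def indicator(seq, base):
    """Binary sequence: 1 where seq has `base`, 0 elsewhere."""
    return [1 if c == base else 0 for c in seq]


def lag_agreement(x, p):
    """Fraction of positions i for which x[i] == x[i + p]."""
    n = len(x) - p
    return sum(1 for i in range(n) if x[i] == x[i + p]) / n


def smallest_period(seq, max_period, threshold=1.0):
    """Return the smallest period whose combined score reaches threshold.

    threshold=1.0 accepts only exact repeats; lower values tolerate
    mismatches (inexact repeats). Returns None if no period qualifies.
    """
    limit = min(max_period, len(seq) - 1)  # keep every lag window non-empty
    per_base = {b: indicator(seq, b) for b in BASES}
    for p in range(1, limit + 1):
        # Stage 1: score each nucleotide's indicator sequence on its own.
        scores = [lag_agreement(per_base[b], p) for b in BASES]
        # Stage 2: combine by taking the worst per-nucleotide score.
        if min(scores) >= threshold:
            return p
    return None


print(smallest_period("ACGACGACGACG", 6))                   # exact repeat
print(smallest_period("ACGACGACTACG", 6, threshold=0.75))  # one substitution
```

    Because periods are tried in increasing order, an exact repeat of period 3 is reported as 3 even though lag 6 also agrees perfectly.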

    Genome-scale computational analysis of DNA curvature and repeats in Arabidopsis and rice uncovers plant-specific genomic properties

    Background: Due to its overarching role in genome function, sequence-dependent DNA curvature continues to attract great attention. The DNA double helix is not a rigid cylinder, but presents both curvature and flexibility in different regions, depending on the sequence. More in-depth knowledge of the various orders of complexity of genomic DNA structure has allowed the design of sophisticated bioinformatics tools for its analysis and manipulation, which, in turn, have yielded a better understanding of the genome itself. Curved DNA is involved in many biologically important processes, such as transcription initiation and termination, recombination, DNA replication, and nucleosome positioning. CpG islands and tandem repeats also play significant roles in the dynamics and evolution of genomes.
    Results: In this study, we analyzed the relationship between these three structural features within rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) genomes. A genome-scale prediction of curvature distribution in rice and Arabidopsis indicated that most of the chromosomes of both genomes have maximal chromosomal DNA curvature adjacent to the centromeric region. By analyzing tandem repeats across the genome, we found that frequencies of repeats are higher in regions adjacent to those with high curvature value. Further analysis of CpG islands shows a clear interdependence between curvature value, repeat frequencies and CpG islands. Each CpG island appears in a local minimal curvature region, and CpG islands usually do not appear in the centromere or regions with high repeat frequency. A statistical evaluation demonstrates the significance and non-randomness of these features.
    Conclusions: This study represents the first systematic genome-scale analysis of DNA curvature, CpG islands and tandem repeats at the DNA sequence level in plant genomes, and finds that not all of the chromosomes in plants follow the same rules common to other eukaryotic organisms, suggesting that some of these genomic properties might be considered as specific to plants.

    Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats

    Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve prediction accuracy. The discrete Fourier transform (DFT) has been used extensively to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
    Results: We used a DFT-based method, with a mapping of the symbolic sequence into a numerical one, to identify and study alphoid higher order repeats (HORs). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16-mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16-mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an n-mer) and higher harmonics. In general, an n-mer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern, as a signature for HOR detection, is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
    Conclusion: DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the n-mer HOR, i.e., the number n of monomers contained in the consensus HOR.
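
    To make the "mapping of symbolic into numerical sequence" step concrete, here is a minimal sketch of a DNA power spectrum. The abstract does not say which mapping the authors used; the sketch assumes the common choice of four binary indicator sequences (one per nucleotide) whose spectra are summed, and the function names `power_spectrum` and `dominant_period` are hypothetical. A direct O(n^2) DFT is used so the example needs only the standard library; real genomic data would call for an FFT.

```python
import cmath


def power_spectrum(seq):
    """Summed DFT power of the four nucleotide indicator sequences."""
    n = len(seq)
    spectrum = [0.0] * n
    for base in "ACGT":
        # Symbolic-to-numerical mapping: 1 where the base occurs, else 0.
        x = [1.0 if c == base else 0.0 for c in seq]
        for k in range(n):
            coeff = sum(
                x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
            )
            spectrum[k] += abs(coeff) ** 2
    return spectrum


def dominant_period(seq):
    """Period n/k of the strongest peak among frequencies 1..n//2."""
    n = len(seq)
    spectrum = power_spectrum(seq)
    k_best = max(range(1, n // 2 + 1), key=lambda k: spectrum[k])
    return n / k_best


print(dominant_period("ACG" * 8))  # repeat unit of length 3
```

    For a perfect tandem repeat, power concentrates at the fundamental frequency n/period and its harmonics; the hierarchical HOR signature described above would appear as a second, finer comb of peaks in the same spectrum.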

    TRStalker: an Efficient Heuristic for Finding NP-Complete Tandem Repeats

    Genomic sequences in higher eukaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage, are characterized by close spatial contiguity, and play an important role in several molecular regulatory mechanisms. Certain types of tandem repeats are highly polymorphic and constitute a fingerprint feature of individuals. Abnormal TRs are known to be linked to several diseases. Over the last 20 years, researchers in bioinformatics have proposed many formal definitions for the rather loose notion of a Tandem Repeat and have proposed exact or heuristic algorithms to detect TRs in genomic sequences. The general trend has been to use formal (implicit or explicit) definitions of TR for which verification of the solution was easy (with complexity linear, or polynomial, in the TR's length and substitution+indel rates), while the effort was directed towards efficiently identifying the substrings of the input to submit to the verification phase (either implicitly or explicitly). In this paper we take a step forward: we use a definition of TR for which the verification step is also difficult (in effect, NP-complete), and we develop new filtering techniques for coping with high error levels. The resulting heuristic algorithm, christened TRStalker, is approximate, since it cannot guarantee that all NP-complete Tandem Repeats satisfying the target definition in the input string will be found. However, in synthetic experiments with 30% of errors allowed, TRStalker has demonstrated a very high recall (ranging from 100% to 60%, depending on motif length and repetition number) for the NP-complete TRs. TRStalker has consistently better performance than some state-of-the-art methods for a large range of parameters on the class of NP-complete Tandem Repeats. TRStalker aims at improving the capability of TR detection for classes of TRs for which existing methods do not perform well.

    TRStalker: an efficient heuristic for finding fuzzy tandem repeats

    Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but the analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.

    Software and Hardware Acceleration of the Genomic Motif Finding Tool PhyloNet


    Detecting short adjacent repeats in multiple sequences: a Bayesian approach.

    Li, Qiwei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. Includes bibliographical references (p. 75-85). Abstracts in English and Chinese.
    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 Repetitive DNA Sequence
            1.1.1 Definition and Categorization of Repetitive DNA Sequence
            1.1.2 Definition and Categorization of Tandem Repeats
            1.1.3 Definition and Categorization of Interspersed Repeats
        1.2 Research Significance
        1.3 Contributions
        1.4 Thesis Organization
    Chapter 2: Literature Review and Overview of Our Method
        2.1 Existing Methods
        2.2 Overview of Our Method
    Chapter 3: Theoretical Background
        3.1 Multinomial Distributions
        3.2 Dirichlet Distribution
        3.3 Metropolis-Hastings Sampling
        3.4 Gibbs Sampling
    Chapter 4: Problem Description
        4.1 Generative Model
            4.1.1 Input Data R
            4.1.2 Parameters A (Repeat Segment Starting Positions)
            4.1.3 Parameters S (Repeat Segment Structures)
            4.1.4 Parameters θ (Motif Matrix)
            4.1.5 Parameters Φ (Background Distribution)
            4.1.6 An Example of the Model Schematic Diagram
        4.2 Parameter Structure
        4.3 Posterior Distribution
            4.3.1 The Full Posterior Distribution
            4.3.2 The Collapsed Posterior Distribution
        4.4 Conclusion
    Chapter 5: Methodology
        5.1 Schematic Procedure
            5.1.1 The Basic Schematic Procedure
            5.1.2 The Improved Schematic Procedure
        5.2 Initialization
        5.3 Predictive Update Step for θn and Φn
        5.4 Gibbs Sampling Step for an
        5.5 Metropolis-Hastings Sampling Step for sn
            5.5.1 Rear Indel Move
            5.5.2 Partial Shift Move
            5.5.3 Front Indel Move
        5.6 Phase Shifts
        5.7 Conclusion
    Chapter 6: Results and Discussion
        6.1 Settings
        6.2 Experiment on Synthetic Data
        6.3 Experiment on Real Data
    Chapter 7: Conclusion and Future Work
        7.1 Conclusion
        7.2 Future Work
    Bibliography

    The plastic genome of Bordetella pertussis


    Whole-genome sequence analysis for pathogen detection and diagnostics

    This dissertation focuses on computational methods for improving the accuracy of commonly used nucleic acid tests for pathogen detection and diagnostics. Three specific biomolecular techniques are addressed: polymerase chain reaction, microarray comparative genomic hybridization, and whole-genome sequencing. These methods are potentially the future of diagnostics, but each requires sophisticated computational design or analysis to operate effectively. This dissertation presents novel computational methods that unlock the potential of these diagnostics by efficiently analyzing whole-genome DNA sequences. Improvements in the accuracy and resolution of each of these diagnostic tests promise more effective diagnosis of illness and rapid detection of pathogens in the environment. For designing real-time detection assays, an efficient data structure and search algorithm are presented to identify the most distinguishing sequences of a pathogen that are absent from all other sequenced genomes. Results are presented that show these "signature" sequences can be used to detect pathogens in complex samples and differentiate them from their non-pathogenic, phylogenetic near neighbors. For microarrays, novel pan-genomic design and analysis methods are presented for the characterization of unknown microbial isolates. To demonstrate the effectiveness of these methods, pan-genomic arrays are applied to the study of multiple strains of the foodborne pathogen Listeria monocytogenes, revealing new insights into the diversity and evolution of the species. Finally, multiple methods are presented for the validation of whole-genome sequence assemblies, which are capable of identifying assembly errors even in finished genomes. These validated assemblies provide the ultimate nucleic acid diagnostic, revealing the entire sequence of a genome.
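
    The notion of a "signature" sequence (present in the pathogen, absent from every other genome) can be sketched with plain set operations. This is only a naive illustration under stated assumptions: the dissertation's own data structure and search algorithm are not described in the abstract, the helper names `kmers` and `signature_kmers` are hypothetical, signatures are restricted here to fixed-length k-mers, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(target, backgrounds, k):
    """k-mers found in target but in none of the background genomes."""
    candidates = kmers(target, k)
    for genome in backgrounds:
        # Discard any candidate that also occurs in this background genome.
        candidates -= kmers(genome, k)
    return candidates


# Toy sequences standing in for genomes.
print(signature_kmers("ACGTAC", ["ACGTTT", "GGTAAA"], 3))
```

    Holding every k-mer of every genome in memory does not scale to whole-genome collections, which is presumably why an efficient dedicated index is needed in practice.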