Search CORE

279 research outputs found

A Novel Signal Processing Measure to Identify Exact and Inexact Tandem Repeat Patterns in DNA Sequences

Author: Gupta Ravi
Mittal Ankush
Sarthi Divya
Singh Kuldip
Publication venue: BioMed Central
Publication date: 01/01/2007

The identification and analysis of repetitive patterns are active areas of biological and computational research. Tandem repeats in telomeres play a role in cancer, and hypervariable trinucleotide tandem repeats are linked to over a dozen major neurodegenerative genetic disorders. In this paper, we present an algorithm to identify exact and inexact repeat patterns in DNA sequences based on the orthogonal exactly periodic subspace decomposition technique. Using the new measure, our algorithm resolves problems such as whether the repeat pattern is of period P or a multiple of it (i.e., 2P, 3P, etc.), along with several other problems that were present in previous signal-processing-based algorithms. We present an efficient algorithm of complexity O(N Lw log Lw), where N is the length of the DNA sequence and Lw is the window length, for identifying repeats. The algorithm operates in two stages. In the first stage, each nucleotide is analyzed separately for periodicity, and in the second stage, the periodicity information of each nucleotide is combined to identify the tandem repeats. Datasets containing exact and inexact repeats were used for the experiments. The experimental results show the effectiveness of the approach.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Genome-scale computational analysis of DNA curvature and repeats in Arabidopsis and rice uncovers plant-specific genomic properties

Author: Jáuregui Ruy
Masoudi-Nejad Ali
Movahedi Sara
Publication venue: BioMed Central
Publication date: 01/05/2011

Background: Due to its overarching role in genome function, sequence-dependent DNA curvature continues to attract great attention. The DNA double helix is not a rigid cylinder, but presents both curvature and flexibility in different regions, depending on the sequence. More in-depth knowledge of the various orders of complexity of genomic DNA structure has allowed the design of sophisticated bioinformatics tools for its analysis and manipulation, which, in turn, have yielded a better understanding of the genome itself. Curved DNA is involved in many biologically important processes, such as transcription initiation and termination, recombination, DNA replication, and nucleosome positioning. CpG islands and tandem repeats also play significant roles in the dynamics and evolution of genomes.
Results: In this study, we analyzed the relationship between these three structural features within rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) genomes. A genome-scale prediction of curvature distribution in rice and Arabidopsis indicated that most of the chromosomes of both genomes have maximal chromosomal DNA curvature adjacent to the centromeric region. By analyzing tandem repeats across the genome, we found that frequencies of repeats are higher in regions adjacent to those with high curvature value. Further analysis of CpG islands shows a clear interdependence between curvature value, repeat frequencies and CpG islands. Each CpG island appears in a local minimal curvature region, and CpG islands usually do not appear in the centromere or regions with high repeat frequency. A statistical evaluation demonstrates the significance and non-randomness of these features.
Conclusions: This study represents the first systematic genome-scale analysis of DNA curvature, CpG islands and tandem repeats at the DNA sequence level in plant genomes, and finds that not all of the chromosomes in plants follow the same rules common to other eukaryote organisms, suggesting that some of these genomic properties might be considered as specific to plants.

Helmholtz Zentrum für Infektionsforschung Repository

Directory of Open Access Journals

PubMed Central

Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats

Author: Ivan Basar
Marija Rosandić
Matko Glunčić
Nenad Pavin
Nils Paar
Vladimir Paar
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
Results: We used a method based on the DFT, with mapping of the symbolic sequence into a numerical one, to identify and study alphoid higher order repeats (HORs). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern, as a signature for HOR detection, is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
Conclusion: The DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
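The two-level peak pattern described here can be reproduced on a toy sequence. The sketch below assumes one common symbolic-to-numerical mapping (four binary indicator sequences, one per base, with their power spectra summed); the abstract does not say which mapping the authors used, and the toy "HOR" of two 6-base monomers differing in one position is invented purely for illustration.

```python
import cmath


def power_spectrum(seq):
    """Summed DFT power of the four base-indicator sequences of `seq`.

    Uses a direct O(N^2) DFT so the example needs only the standard
    library; a real analysis would use an FFT.
    """
    n = len(seq)
    power = [0.0] * n
    for base in "ACGT":
        u = [1.0 if c == base else 0.0 for c in seq]
        for k in range(n):
            x_k = sum(
                u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
            )
            power[k] += abs(x_k) ** 2
    return power


# Toy 2mer "HOR": two 6-base monomers differing at one position,
# repeated 6 times (N = 72, HOR period 12, monomer period 6).
toy = ("ACGTAC" + "ACGTAG") * 6
spectrum = power_spectrum(toy)
```

With N = 72, power appears only at multiples of N/12 = 6 (secondary periodicity), and the peak at N/6 = 12 (primary periodicity) is stronger than the one at 6. Exactly one lower peak sits below the first primary peak, matching n - 1 for this toy 2mer.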

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

TRStalker: an Efficient Heuristic for Finding NP-Complete Tandem Repeats

Author: Pellegrini Marco
Renda Maria Elena
Vecchio Alessio

Genomic sequences in higher eucaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage, are characterized by close spatial contiguity, and play an important role in several molecular regulatory mechanisms. Certain types of tandem repeats are highly polymorphic and constitute a fingerprint feature of individuals. Abnormal TRs are known to be linked to several diseases. Over the last 20 years, researchers in bioinformatics have proposed many formal definitions for the rather loose notion of a Tandem Repeat and have proposed exact or heuristic algorithms to detect TRs in genomic sequences. The general trend has been to use formal (implicit or explicit) definitions of TR for which verification of the solution was easy (with complexity linear, or polynomial, in the TR's length and substitution+indel rates), while the effort was directed towards efficiently identifying the substrings of the input to submit to the verification phase (either implicitly or explicitly). In this paper we take a step forward: we use a definition of TR for which the verification step is also difficult (in effect, NP-complete) and we develop new filtering techniques for coping with high error levels. The resulting heuristic algorithm, christened TRStalker, is approximate since it cannot guarantee that all NP-Complete Tandem Repeats satisfying the target definition in the input string will be found. However, in synthetic experiments with 30% of errors allowed, TRStalker has demonstrated a very high recall (ranging from 100% to 60%, depending on motif length and repetition number) for the NP-complete TRs. TRStalker has consistently better performance than some state-of-the-art methods for a large range of parameters on the class of NP-complete Tandem Repeats. TRStalker aims at improving the capability of TR detection for classes of TRs for which existing methods do not perform well.

PUblication MAnagement

TRStalker: an efficient heuristic for finding fuzzy tandem repeats

Author: Alessio Vecchio
M. Elena Renda
Marco Pellegrini
Publication venue: Oxford University Press
Publication date: 01/01/2010

Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events

CiteSeerX

Crossref

PubMed Central

Archivio della Ricerca - Università di Pisa

High Performance Computing for DNA Sequence Alignment and Assembly

Author: Schatz Michael Christopher
Publication date: 01/01/2010

Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but these analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
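As a rough illustration of what "read alignment" means here, the sketch below maps a short read to a reference using a k-mer seed index followed by a mismatch count. It is a naive single-threaded example and not the GPU or MapReduce methods of the dissertation; the function names and the `max_mismatches` parameter are assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each acceptable placement of `read`.

    Seeds on the read's first k bases only, so a mismatch inside the seed
    is missed; practical aligners use several seeds per read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

Each read is aligned independently of the others, which is the property that lets this kind of computation be spread across many processors.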

Digital Repository at the University of Maryland

Software and Hardware Acceleration of the Genomic Motif Finding Tool PhyloNet

Author: Brown Justin
Publication venue: Washington University Open Scholarship
Publication date: 01/01/2008

Washington University St. Louis: Open Scholarship

Detecting short adjacent repeats in multiple sequences: a Bayesian approach.

Author: Li Qiwei
Publication date: 01/01/2010

Li, Qiwei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. Includes bibliographical references (p. 75-85). Abstracts in English and Chinese.

Contents:
Abstract
Acknowledgement
Chapter 1. Introduction
1.1 Repetitive DNA Sequence
1.1.1 Definition and Categorization of Repetitive DNA Sequence
1.1.2 Definition and Categorization of Tandem Repeats
1.1.3 Definition and Categorization of Interspersed Repeats
1.2 Research Significance
1.3 Contributions
1.4 Thesis Organization
Chapter 2. Literature Review and Overview of Our Method
2.1 Existing Methods
2.2 Overview of Our Method
Chapter 3. Theoretical Background
3.1 Multinomial Distributions
3.2 Dirichlet Distribution
3.3 Metropolis-Hastings Sampling
3.4 Gibbs Sampling
Chapter 4. Problem Description
4.1 Generative Model
4.1.1 Input Data R
4.1.2 Parameters A (Repeat Segment Starting Positions)
4.1.3 Parameters S (Repeat Segment Structures)
4.1.4 Parameters θ (Motif Matrix)
4.1.5 Parameters Φ (Background Distribution)
4.1.6 An Example of the Model Schematic Diagram
4.2 Parameter Structure
4.3 Posterior Distribution
4.3.1 The Full Posterior Distribution
4.3.2 The Collapsed Posterior Distribution
4.4 Conclusion
Chapter 5. Methodology
5.1 Schematic Procedure
5.1.1 The Basic Schematic Procedure
5.1.2 The Improved Schematic Procedure
5.2 Initialization
5.3 Predictive Update Step for θ_n and Φ_n
5.4 Gibbs Sampling Step for a_n
5.5 Metropolis-Hastings Sampling Step for s_n
5.5.1 Rear Indel Move
5.5.2 Partial Shift Move
5.5.3 Front Indel Move
5.6 Phase Shifts
5.7 Conclusion
Chapter 6. Results and Discussion
6.1 Settings
6.2 Experiment on Synthetic Data
6.3 Experiment on Real Data
Chapter 7. Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
Bibliography

CUHK Digital Repository

The plastic genome of Bordetella pertussis

Author: Abrahams Jonathan Simon
Publication date: 22/10/2020

OPUS

Whole-genome sequence analysis for pathogen detection and diagnostics

Author: Phillippy Adam Michael
Publication date: 01/01/2010

This dissertation focuses on computational methods for improving the accuracy of commonly used nucleic acid tests for pathogen detection and diagnostics. Three specific biomolecular techniques are addressed: polymerase chain reaction, microarray comparative genomic hybridization, and whole-genome sequencing. These methods are potentially the future of diagnostics, but each requires sophisticated computational design or analysis to operate effectively. This dissertation presents novel computational methods that unlock the potential of these diagnostics by efficiently analyzing whole-genome DNA sequences. Improvements in the accuracy and resolution of each of these diagnostic tests promise more effective diagnosis of illness and rapid detection of pathogens in the environment. For designing real-time detection assays, an efficient data structure and search algorithm are presented to identify the most distinguishing sequences of a pathogen that are absent from all other sequenced genomes. Results are presented that show these "signature" sequences can be used to detect pathogens in complex samples and differentiate them from their non-pathogenic, phylogenetic near neighbors. For microarrays, novel pan-genomic design and analysis methods are presented for the characterization of unknown microbial isolates. To demonstrate the effectiveness of these methods, pan-genomic arrays are applied to the study of multiple strains of the foodborne pathogen Listeria monocytogenes, revealing new insights into the diversity and evolution of the species. Finally, multiple methods are presented for the validation of whole-genome sequence assemblies, which are capable of identifying assembly errors in even finished genomes. These validated assemblies provide the ultimate nucleic acid diagnostic, revealing the entire sequence of a genome.
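The notion of a "signature" sequence (present in the target pathogen, absent from all other genomes) can be illustrated with a set difference over fixed-length substrings. This is a toy sketch only: the dissertation's own data structure and search algorithm are not described in this abstract, and the function names and the fixed k-mer length are assumptions.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(target, background_genomes, k):
    """k-mers found in `target` but in none of the background genomes."""
    signatures = kmers(target, k)
    for genome in background_genomes:
        signatures -= kmers(genome, k)
    return signatures


# Toy usage with invented sequences: only the k-mers spanning the
# target-specific suffix survive the subtraction.
found = signature_kmers("ACGTACGGA", ["ACGTACGTT", "TTACGTAC"], 4)
```

Holding explicit k-mer sets in memory does not scale to whole-genome collections, which is why an efficient index is needed in practice.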

Digital Repository at the University of Maryland