Computing distribution of scale independent motifs in biological sequences
The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of the distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representations of sequences in which their statistical properties can be conveniently investigated. In this study, a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework for a diverse suite of sequence analysis techniques.
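As a rough illustration of the coordinate system the abstract refers to, the sketch below computes standard CGR coordinates for a DNA string: each point is the midpoint between the previous point and the corner assigned to the current nucleotide. The corner assignment and the function name `cgr_points` are assumptions made for this example; the kernel density functions proposed in the paper are not reproduced here.

```python
# Minimal CGR sketch. The corner layout below is one common convention,
# chosen here as an assumption; other layouts are equally valid.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the list of CGR coordinates visited while reading seq."""
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cell(point, k):
    """Index of the 2^-k by 2^-k sub-square containing point."""
    scale = 2 ** k
    return (int(point[0] * scale), int(point[1] * scale))
```

Sequences that end in the same k symbols land in the same sub-square of side 2^-k, which is the scale-dependent link between distance and suffix similarity that the abstract describes as inexact.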
Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants
Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNSs is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.
Information content based model for the topological properties of the gene regulatory network of Escherichia coli
Gene regulatory networks (GRN) are being studied with increasingly precise
quantitative tools and can provide a testing ground for ideas regarding the
emergence and evolution of complex biological networks. We analyze the global
statistical properties of the transcriptional regulatory network of the
prokaryote Escherichia coli, identifying each operon with a node of the
network. We propose a null model for this network using the content-based
approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et
al., 2007). Random sequences that represent promoter regions and binding
sequences are associated with the nodes. The length distributions of these
sequences are extracted from the relevant databases. The network is constructed
by testing for the occurrence of binding sequences within the promoter regions.
The ensemble of emergent networks yields an exponentially decaying in-degree
distribution and a putative power law dependence for the out-degree
distribution with a flat tail, in agreement with the data. The clustering
coefficient, degree-degree correlation, rich club coefficient and k-core
visualization all agree qualitatively with the empirical network to an extent
not yet achieved by any other computational model, to our knowledge. The
significant statistical differences can point the way to further research into
non-adaptive and adaptive processes in the evolution of the E. coli GRN.
Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical
Biology (2009)
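The construction described in this abstract can be sketched in a few lines, under simplifying assumptions: every node gets one random promoter and one random binding sequence of fixed length, and a directed edge i -> j is drawn when the binding sequence of node i occurs in the promoter of node j. The fixed lengths, the four-letter alphabet, and all function names are hypothetical choices for illustration; the paper draws length distributions from databases, which is not reproduced here.

```python
import random


def random_seq(rng, length, alphabet="ACGT"):
    """Draw a uniformly random sequence (alphabet is an assumption)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_network(n_nodes, promoter_len, binding_len, seed=0):
    """Build one content-based random network.

    Fixed lengths are a simplification; the abstract states that length
    distributions are extracted from the relevant databases.
    """
    rng = random.Random(seed)
    promoters = [random_seq(rng, promoter_len) for _ in range(n_nodes)]
    binders = [random_seq(rng, binding_len) for _ in range(n_nodes)]
    edges = set()
    for i, binding in enumerate(binders):
        for j, promoter in enumerate(promoters):
            # Edge i -> j when node i's binding sequence matches inside
            # node j's promoter region.
            if binding in promoter:
                edges.add((i, j))
    return promoters, binders, edges


def degree_counts(n_nodes, edges):
    """Return (in_degree, out_degree) lists for the edge set."""
    in_deg = [0] * n_nodes
    out_deg = [0] * n_nodes
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    return in_deg, out_deg
```

Repeating `build_network` over many seeds gives an ensemble whose in-degree and out-degree distributions could then be compared with the empirical network.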
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
Analysis of Three-Dimensional Protein Images
A fundamental goal of research in molecular biology is to understand protein
structure. Protein crystallography is currently the most successful method for
determining the three-dimensional (3D) conformation of a protein, yet it
remains labor intensive and relies on an expert's ability to derive and
evaluate a protein scene model. In this paper, the problem of protein structure
determination is formulated as an exercise in scene analysis. A computational
methodology is presented in which a 3D image of a protein is segmented into a
graph of critical points. Bayesian and certainty factor approaches are
described and used to analyze critical point graphs and identify meaningful
substructures, such as alpha-helices and beta-sheets. Results of applying the
methodologies to protein images at low and medium resolution are reported. The
research is related to approaches to representation, segmentation and
classification in vision, as well as to top-down approaches to protein
structure prediction.
Comment: See http://www.jair.org/ for any accompanying file
Probabilistic methods in the analysis of protein interaction networks
motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences
Next-generation sequencing technology enables the identification of thousands
of gene regulatory sequences in many cell types and organisms. We consider the
problem of testing if two such sequences differ in their number of binding site
motifs for a given transcription factor (TF) protein. Binding site motifs
impart regulatory function by providing TFs the opportunity to bind to genomic
elements and thereby affect the expression of nearby genes. Evolutionary
changes to such functional DNA are hypothesized to be major contributors to
phenotypic diversity within and between species; but despite the importance of
TF motifs for gene expression, no method exists to test for motif loss or gain.
Assuming that motif counts are Binomially distributed, and allowing for
dependencies between motif instances in evolutionarily related sequences, we
derive the probability mass function of the difference in motif counts between
two nucleotide sequences. We provide a method to numerically estimate this
distribution from genomic data and show through simulations that our estimator
is accurate. Finally, we introduce the R package motifDiverge that
implements our methodology and illustrate its application to gene regulatory
enhancers identified by a mouse developmental time course experiment. While
this study was motivated by analysis of regulatory motifs, our results can be
applied to any problem involving two correlated Bernoulli trials.
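To make the quantity in this abstract concrete, the sketch below computes the probability mass function of the difference between two Binomial counts in the simplest special case, where the two counts are independent. This is an assumption made for illustration only: the paper explicitly allows for dependence between motif instances in related sequences, and the code is not the motifDiverge package or its API. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def diff_pmf(n1, p1, n2, p2):
    """PMF of D = X - Y for independent X ~ Bin(n1, p1), Y ~ Bin(n2, p2).

    Independence is an assumption of this sketch; the correlated case
    treated in the paper is not handled here.
    """
    pmf = {}
    for x in range(n1 + 1):
        px = binom_pmf(x, n1, p1)
        for y in range(n2 + 1):
            d = x - y
            pmf[d] = pmf.get(d, 0.0) + px * binom_pmf(y, n2, p2)
    return pmf


def two_sided_pvalue(pmf, observed):
    """Probability of a difference at least as large in magnitude."""
    return sum(prob for d, prob in pmf.items() if abs(d) >= abs(observed))
```

With equal parameters for both sequences, `diff_pmf` is symmetric around zero, so `two_sided_pvalue` gives a simple test of whether an observed difference in motif counts is unusually large under this independent null.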