Computing distribution of scale independent motifs in biological sequences

Author: Almeida Jonas S
Vinga Susana
Publication venue: BioMed Central
Publication date: 01/01/2006

The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of the distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representations of sequences in which their statistical properties can be conveniently investigated. In this study, a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework for a diverse suite of sequence analysis techniques.
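As a rough illustration of the coordinate system this abstract refers to, the following is a minimal sketch of the standard CGR iteration for DNA, not the kernel density proposed by the authors. The corner assignment in `CORNERS`, the starting point, and the function names `cgr_points` and `quadrant_suffix` are assumptions chosen for illustration. It shows why the map is scale independent: each step halves the distance to a corner, so the last k symbols determine a sub-square of side 2^-k.

```python
# Corner assignment is one common convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after each symbol of `sequence`."""
    x, y = start
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]
        # Move halfway from the current point towards the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def quadrant_suffix(point, length):
    """Recover the last `length` symbols encoded in a CGR point.

    Each iteration reads the quadrant the point lies in (the most recent
    symbol) and then undoes one halving step.
    """
    x, y = point
    symbols = []
    for _ in range(length):
        cx = 1.0 if x >= 0.5 else 0.0
        cy = 1.0 if y >= 0.5 else 0.0
        symbols.append(next(s for s, c in CORNERS.items() if c == (cx, cy)))
        x, y = 2.0 * x - cx, 2.0 * y - cy
    return "".join(reversed(symbols))
```

Two sequences that end in the same k symbols land in the same sub-square, but two points can also be close across a sub-square boundary without sharing a suffix, which is the inexact link between distance and dissimilarity that the abstract describes.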

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Repositório da Universidade Nova de Lisboa

Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants

Author: Barrington Christopher
Baxter Laura
Beynon Jim
Buchanan-Wollaston Vicky
Denby Katherine J.
Dyer Nigel
Hickman R. D. G.
Jironkin Aleksey
Krusche Peter
Moore Jonathan D.
Ott Sascha
Tiskin Alexander
Publication venue: American Society of Plant Biologists (ASPB)
Publication date: 01/10/2012

Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNSs is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.

Crossref

PubMed Central

Warwick Research Archives Portal Repository

Information content based model for the topological properties of the gene regulatory network of Escherichia coli

Author: Albert
Alberts
Almirantis
Avery
Ayşe Erzan
Babu
Balcan
Banzhaf
Barabasi
Benos
Berg
Bergmann
Berkin Malkoç
Bilu
Bollobás
Browning
Buldyrev
Colizza
Dawkins
Dobrin
Dodd
Dorogovtsev
Duygu Balcan
Erdös
Gama-Castro
Gerland
Gershenzon
Guelzim
Harbison
Jeong
Kashtan
Kauffman
Kim
Koralov
Kugiumtzis
Li
Lynch
Ma
Matsumoto
Milo
Mungan
Münch
Okuda
O’Flanagan
Pachkov
Reil
Rudd
Salgado
Samal
Sengun
Sengupta
Shannon
Shearwin
Sneppen
Spirin
Stormo
Teixeira
van Nimwegen
van Noort
Vazquez
Wagner
Warren
Watson
Wernicke
Zhou
Publication venue: Elsevier BV
Publication date: 29/12/2009

Gene regulatory networks (GRNs) are being studied with increasingly precise quantitative tools and can provide a testing ground for ideas regarding the emergence and evolution of complex biological networks. We analyze the global statistical properties of the transcriptional regulatory network of the prokaryote Escherichia coli, identifying each operon with a node of the network. We propose a null model for this network using the content-based approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et al., 2007). Random sequences that represent promoter regions and binding sequences are associated with the nodes. The length distributions of these sequences are extracted from the relevant databases. The network is constructed by testing for the occurrence of binding sequences within the promoter regions. The ensemble of emergent networks yields an exponentially decaying in-degree distribution and a putative power law dependence for the out-degree distribution with a flat tail, in agreement with the data. The clustering coefficient, degree-degree correlation, rich club coefficient and k-core visualization all agree qualitatively with the empirical network to an extent not yet achieved by any other computational model, to our knowledge. The significant statistical differences can point the way to further research into non-adaptive and adaptive processes in the evolution of the E. coli GRN. Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical Biology (2009).
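The construction step described in this abstract (random promoter and binding sequences attached to nodes, with an edge wherever a binding sequence occurs in a promoter) can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' model: the four-letter alphabet, the edge direction (binding-sequence owner regulates promoter owner), and the helper names `random_sequence`, `build_network` and `degree_counts` are hypothetical, and sequence lengths are left to the caller rather than drawn from database-derived distributions.

```python
import random


def random_sequence(length, rng, alphabet="ACGT"):
    """Draw a uniformly random sequence (the alphabet is an assumption)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_network(promoters, binding_seqs):
    """Return directed edges (i, j): node i's binding sequence occurs in
    node j's promoter. Nodes with an empty binding sequence regulate nothing."""
    edges = set()
    for i, site in enumerate(binding_seqs):
        if not site:
            continue
        for j, promoter in enumerate(promoters):
            if site in promoter:  # plain substring test
                edges.add((i, j))
    return edges


def degree_counts(edges, n_nodes):
    """Compute in-degree and out-degree lists from a set of directed edges."""
    in_deg = [0] * n_nodes
    out_deg = [0] * n_nodes
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    return in_deg, out_deg
```

For an ensemble, one would call `random_sequence` with a seeded `random.Random` for every node, rebuild the network many times, and pool the degree lists. The lengths passed in strongly affect how often a short binding sequence matches, which is why the abstract takes them from empirical distributions.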

arXiv.org e-Print Archive

Crossref

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: The Royal Society
Publication date: 20/01/2020

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
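To make the two patterns named at the end of this abstract concrete, here is a small serial sketch that counts k-mers once with a hash table and once by sorting. The choice of k-mer counting as the example task and the function names `count_kmers_hash` and `count_kmers_sort` are assumptions for illustration; the abstract does not specify this workload, and nothing here is parallel.

```python
from collections import Counter
from itertools import groupby


def iter_kmers(reads, k):
    """Yield every length-k substring of every read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]


def count_kmers_hash(reads, k):
    """Hashing pattern: each k-mer updates its slot in a hash table."""
    return dict(Counter(iter_kmers(reads, k)))


def count_kmers_sort(reads, k):
    """Sorting pattern: sort all k-mers, then count runs of equal neighbours."""
    ordered = sorted(iter_kmers(reads, k))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(ordered)}
```

Both functions return the same counts. In a distributed setting the hash version would tend to produce the irregular, fine-grained updates to shared structures that the abstract mentions, while the sort version trades them for bulk data movement; that reading is an interpretation, not a claim from the paper.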

arXiv.org e-Print Archive

eScholarship - University of California

Analysis of Three-Dimensional Protein Images

Author: Baxter K.
Fortier S.
Glasgow J.
Leherte L.
Steeg E.
Publication date: 01/01/1997

A fundamental goal of research in molecular biology is to understand protein structure. Protein crystallography is currently the most successful method for determining the three-dimensional (3D) conformation of a protein, yet it remains labor intensive and relies on an expert's ability to derive and evaluate a protein scene model. In this paper, the problem of protein structure determination is formulated as an exercise in scene analysis. A computational methodology is presented in which a 3D image of a protein is segmented into a graph of critical points. Bayesian and certainty factor approaches are described and used to analyze critical point graphs and identify meaningful substructures, such as alpha-helices and beta-sheets. Results of applying the methodologies to protein images at low and medium resolution are reported. The research is related to approaches to representation, segmentation and classification in vision, as well as to top-down approaches to protein structure prediction. Comment: See http://www.jair.org/ for any accompanying file

arXiv.org e-Print Archive

CiteSeerX

Repository of the University of Namur

Probabilistic methods in the analysis of protein interaction networks

Author: Thorne Thomas
Publication date: 01/01/2010

Imperial Users onl

Spiral - Imperial College Digital Repository

motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences

Author: Friedrich Tara
Holloway Alisha K.
Kostka Dennis
Pollard Katherine S.
Publication date: 31/01/2014

Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing if two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain. Assuming that motif counts are Binomially distributed, and allowing for dependencies between motif instances in evolutionarily related sequences, we derive the probability mass function of the difference in motif counts between two nucleotide sequences. We provide a method to numerically estimate this distribution from genomic data and show through simulations that our estimator is accurate. Finally, we introduce the R package motifDiverge that implements our methodology and illustrate its application to gene regulatory enhancers identified by a mouse developmental time course experiment. While this study was motivated by analysis of regulatory motifs, our results can be applied to any problem involving two correlated Bernoulli trials.
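The quantity at the centre of this abstract, the probability mass function of a difference in motif counts, can be sketched for the simplest special case. The code below assumes the two counts are independent Binomial variables, which is explicitly not the authors' model (they allow dependence between related sequences) and is not the motifDiverge API. The function names and parameters (`n1`, `p1`, `n2`, `p2`) are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def count_difference_pmf(n1, p1, n2, p2):
    """PMF of D = X1 - X2 for INDEPENDENT X1 ~ Bin(n1, p1), X2 ~ Bin(n2, p2).

    Independence is a simplifying assumption of this sketch; correlated
    counts would need a joint distribution instead of a product.
    """
    pmf = {}
    for d in range(-n2, n1 + 1):
        pmf[d] = sum(
            binomial_pmf(x2 + d, n1, p1) * binomial_pmf(x2, n2, p2)
            for x2 in range(n2 + 1)
        )
    return pmf
```

Given an observed difference d, a one-sided tail such as the sum of `pmf[x]` for x >= d gives a rough sense of how surprising a motif gain would be under this independent null. With positively correlated sequences the true distribution of the difference would typically be narrower than this sketch suggests.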

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California