
arXiv.org e-Print Archive

A statistical fat-tail test of predicting regulatory regions in the Drosophila genome

Author: Li Yajing
Shu Jian-Jun
Publication venue: Elsevier BV
Publication date: 07/03/2014

A statistical study of cis-regulatory modules (CRMs) is presented, based on estimating the distribution of similar-word sets. It is observed that CRMs tend to have a fat-tail distribution. A new statistical fat-tail test with two kurtosis-based fatness coefficients is proposed to distinguish CRMs from non-CRMs. Compared with the existing fluffy-tail test, the first fatness coefficient is designed to reduce computation time, which makes the fat-tail test well suited to long sequences and large-database analysis in the post-genome era; the second is designed to improve the separation accuracy between CRMs and non-CRMs. These two fatness coefficients may serve as valuable filtering indexes for predicting CRMs experimentally.
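
The abstract above does not define its two fatness coefficients, so the following is only a minimal sketch of the general idea of a kurtosis-based tail measure. It assumes exact overlapping k-mers in place of the paper's similar-word sets, and it uses plain population excess kurtosis; the function names `word_counts` and `excess_kurtosis` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def word_counts(seq, k=4):
    """Count overlapping words of length k (exact k-mers, an assumption;
    the paper groups similar words into sets instead)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3.

    Positive values indicate heavier tails than a normal distribution,
    negative values indicate lighter tails.
    """
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("all values are identical; kurtosis is undefined")
    return m4 / (m2 * m2) - 3.0


def tail_score(seq, k=4):
    """Excess kurtosis of the word-count distribution of one sequence."""
    return excess_kurtosis(word_counts(seq, k).values())
```

Under these assumptions, a sequence could be screened by comparing `tail_score(seq)` against a threshold chosen on known CRMs and non-CRMs; how the paper actually sets its thresholds is not stated in the abstract.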

A statistical thin-tail test of predicting regulatory regions in the Drosophila genome

Author: Li Yajing
Shu Jian-Jun
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Background: The identification of transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs) is a crucial step in studying gene expression, but distinguishing CRMs from non-coding non-regulatory regions (NCNRs) computationally remains a challenging problem because knowledge of the specific interactions involved is limited. Methods: The statistical properties of CRMs are explored by estimating the similar-word set distribution with overrepresentation (Z-score). It is observed that CRMs tend to have a thin-tail Z-score distribution. A new statistical thin-tail test with two thinness coefficients is proposed to distinguish CRMs from NCNRs. Results: Compared with the existing fluffy-tail test, the first thinness coefficient is designed to reduce computation time, which makes the thin-tail test well suited to long sequences and large-database analysis in the post-genome era; the second is designed to improve the separation accuracy between CRMs and NCNRs. These two thinness coefficients may serve as valuable filtering indexes for predicting CRMs experimentally. Conclusions: The thin-tail test provides an efficient and effective means of distinguishing CRMs from NCNRs based on the specific statistical properties of CRMs and can guide future experiments aimed at finding new CRMs in the post-genome era. Comment: arXiv admin note: substantial text overlap with arXiv:1402.533

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Identifying Cis-Regulatory Sequences by Word Profile Similarity

Author: A Ivan
A Nasiadka
A Sosinsky
AG Nazina
AP Lifanov
BP Berman
BY Chan
C Zhang
D Bachtrog
DL Halligan
DS Johnson
E Emberly
EA Glazov
EE Hare
EH Davidson
F Poulin
Garmay Leung
H Janssens
I Abnizova
L Li
M Klingler
Michael B. Eisen
MR Kantorovitz
MS Halfon
N Pierstorff
N Rajewsky
Nicholas James Provart
S Prabhakar
S Sinha
XY Li
YH Grad
Publication venue: Public Library of Science
Publication date: 01/09/2009

Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples. We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila. Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz
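
To make the idea of compositional (word-profile) similarity concrete, here is a minimal sketch that compares k-mer count profiles with cosine similarity and slides a window along a longer sequence. This is an illustration under stated assumptions and is not the WPH-finder algorithm: the choice of cosine similarity, the word length, and the names `word_profile`, `cosine_similarity`, and `scan` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def word_profile(seq, k=4):
    """Counts of overlapping words of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p, q):
    """Cosine similarity of two word-count profiles (0.0 if either is empty)."""
    dot = sum(count * q[word] for word, count in p.items())  # Counter gives 0 for absent words
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def scan(genome, query, window, step, k=4):
    """Score each window of `genome` by profile similarity to a known sequence."""
    query_profile = word_profile(query, k)
    return [
        (start, cosine_similarity(query_profile, word_profile(genome[start:start + window], k)))
        for start in range(0, len(genome) - window + 1, step)
    ]
```

In this sketch, windows with the highest scores would be the candidates for sequences under similar regulatory control; measuring the local dissimilarity described in the abstract would require comparing neighbouring windows with each other rather than with a query.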

Erroneous attribution of relevant transcription factor binding sites despite successful prediction of cis-regulatory modules

Author: A Ochoa-Espinosa
A Siepel
A Visel
AA Philippakis
B Estrada
B Morgenstern
BP Berman
Elizabeth R Brennan
GA Maston
J Su
J Zeitlinger
JP Noonan
L Li
M Haeussler
M Markstein
Marc S Halfon
MD Schroeder
MR Kantorovitz
MS Halfon
N Bray
N Negre
P Van Loo
Qianqian Zhu
R Niwa
S Kahana
T Sandmann
T Vavouri
W Krivan
WJ Kent
WW Wasserman
XY Li
YH Grad
YH Liu
Yiyun Zhou
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Cis-regulatory modules are bound by transcription factors to regulate gene expression. Characterizing these DNA sequences is central to understanding gene regulatory networks and gaining insight into mechanisms of transcriptional regulation, but genome-scale regulatory module discovery remains a challenge. One popular approach is to scan the genome for clusters of transcription factor binding sites, especially those conserved in related species. When such approaches are successful, it is typically assumed that the activity of the modules is mediated by the identified binding sites and their cognate transcription factors. However, the validity of this assumption is often not assessed. Results: We successfully predicted five new cis-regulatory modules by combining binding site identification with sequence conservation and compared these to unsuccessful predictions from a related approach not utilizing sequence conservation. Despite greatly improved predictive success, the positive set had similar degrees of sequence and binding site conservation as the negative set. We explored the reasons for this by mutagenizing putative binding sites in three cis-regulatory modules. A large proportion of the tested sites had little or no demonstrable role in mediating regulatory element activity. Examination of loss-of-function mutants also showed that some transcription factors supposedly binding to the modules are not required for their function. Conclusions: Our results raise important questions about interpreting regulatory module predictions obtained by finding clusters of conserved binding sites. Attribution of function to these sites and their cognate transcription factors may be incorrect even when modules are successfully identified. Our study underscores the importance of empirical validation of computational results even when these results are in line with expectation.

Using hexamers to predict cis-regulatory motifs in Drosophila

Author: Chan Bob Y
Kibler Dennis
Publication venue: BioMed Central
Publication date: 01/10/2005

BACKGROUND: Cis-regulatory modules (CRMs) are short stretches of DNA that help regulate gene expression in higher eukaryotes. They have been found up to 1 megabase away from the genes they regulate and can be located upstream, downstream, and even within their target genes. Due to the difficulty of finding CRMs using biological and computational techniques, even well-studied regulatory systems may contain CRMs that have not yet been discovered. RESULTS: We present a simple, efficient method (HexDiff) based only on hexamer frequencies of known CRMs and non-CRM sequence to predict novel CRMs in regulatory systems. On a data set of 16 gap and pair-rule genes containing 52 known CRMs, predictions made by HexDiff had a higher correlation with the known CRMs than several existing CRM prediction algorithms: Ahab, Cluster Buster, MSCAN, MCAST, and LWF. After combining the results of the different algorithms, 10 putative CRMs were identified and are strong candidates for future study. The hexamers used by HexDiff to distinguish between CRMs and non-CRM sequence were also analyzed and were shown to be enriched in regulatory elements. CONCLUSION: HexDiff provides an efficient and effective means for finding new CRMs based on known CRMs, rather than known binding sites.
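
The abstract above says only that HexDiff relies on hexamer frequencies in known CRMs and in non-CRM sequence; it does not give the scoring rule. The following sketch therefore shows one plausible way to use such frequencies, a smoothed log-ratio score per hexamer summed over a window. The log-ratio, the pseudocount, and the names `kmer_counts`, `log_ratio_scores`, and `score_window` are assumptions for illustration, not HexDiff itself.

```python
from collections import Counter
from math import log


def kmer_counts(seqs, k=6):
    """Pooled counts of overlapping k-mers over a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def log_ratio_scores(crm_seqs, background_seqs, k=6, pseudocount=1.0):
    """Log of (frequency in CRMs / frequency in background) for each k-mer seen.

    The pseudocount keeps the ratio finite for k-mers missing from one set.
    """
    pos = kmer_counts(crm_seqs, k)
    neg = kmer_counts(background_seqs, k)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        word: log(((pos[word] + pseudocount) / pos_total)
                  / ((neg[word] + pseudocount) / neg_total))
        for word in vocab
    }


def score_window(seq, scores, k=6):
    """Sum of per-k-mer scores in a window; unseen k-mers contribute 0."""
    seq = seq.upper()
    return sum(scores.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))
```

With scores trained on known CRMs and background sequence, windows with a high `score_window` value would be the putative CRMs in this simplified setting.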

eScholarship - University of California

Simple Shared Motifs (SSM) in conserved region of promoters: a new approach to identify co-regulation patterns

Author: A Atfi
A Coppe
A Fadda
A Sandelin
A Subramanian
AB Georges
C Dieterich
C Huttenhower
D Boffelli
D Cora
DA Tagle
DJ Reiss
E Davidson
E Eskin
E Wingender
G Kreiman
G Robertson
G Thijs
GL Hager
GZ Hertz
H Le Pabic
HK Lee
JD Thompson
JM Vaquerizas
JS Michaloski
Jérémy Gruel
K Quandt
L Marino-Ramirez
M Blanchette
M Endoh
M Kanehisa
M Kazemian
M Rebeiz
M Tompa
MC Frith
Michel LeBorgne
MM Babu
Nathalie Théret
Nolwenn LeMeur
O Hallikas
Q Zhou
RW Hamming
S Falcon
S Hannenhalli
T Knittel
TA Down
TGO Consortium
TL Bailey
VK Mootha
W Thompson
Y Halperin
YH Grad
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Results: Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally, we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Conclusions: Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.
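
The abstract above describes SSM as groups of sequences defined by length and similarity and counted in pairs of genes, without giving the algorithm. As a rough sketch of what counting shared words under a similarity criterion can look like, the code below counts the distinct words of one sequence that have a near match in another. The use of Hamming distance as the similarity measure, the default parameters, and the names `hamming`, `words`, and `shared_motif_count` are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def words(seq, length):
    """Set of distinct overlapping words of the given length."""
    seq = seq.upper()
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def shared_motif_count(seq_a, seq_b, length=6, max_mismatch=1):
    """Count distinct words of seq_a with a match in seq_b within max_mismatch."""
    words_b = words(seq_b, length)
    return sum(
        1
        for word in words(seq_a, length)
        if any(hamming(word, other) <= max_mismatch for other in words_b)
    )
```

In this simplified setting, a pair of promoter regions with an unusually high count relative to a background of random pairs would be flagged as sharing a regulatory pattern; the abstract does not say how "exceptional" is assessed.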

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments

Author: Eisen Michael B
Iyer Venky N
Moses Alan M
Pollard Daniel A
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Molecular evolutionary studies of noncoding sequences rely on multiple alignments. Yet how multiple alignment accuracy varies across sequence types, tree topologies, divergences and tools, and further how this variation impacts specific inferences, remains unclear. RESULTS: Here we develop a molecular evolution simulation platform, CisEvolver, with models of background noncoding and transcription factor binding site evolution, and use simulated alignments to systematically examine multiple alignment accuracy and its impact on two key molecular evolutionary inferences: transcription factor binding site conservation and divergence estimation. We find that the accuracy of multiple alignments is determined almost exclusively by the pairwise divergence distance of the two most diverged species and that additional species have a negligible influence on alignment accuracy. Conserved transcription factor binding sites align better than surrounding noncoding DNA yet are often found to be misaligned at relatively short divergence distances, such that studies of binding site gain and loss could easily be confounded by alignment error. Divergence estimates from multiple alignments tend to be overestimated at short divergence distances but reach a tool specific divergence at which they cease to increase, leading to underestimation at long divergences. Our most striking finding was that overall alignment accuracy, binding site alignment accuracy and divergence estimation accuracy vary greatly across branches in a tree and are most accurate for terminal branches connecting sister taxa and least accurate for internal branches connecting sub-alignments. CONCLUSION: Our results suggest that variation in alignment accuracy can lead to errors in molecular evolutionary inferences that could be construed as biological variation. These findings have implications for which species to choose for analyses, what kind of errors would be expected for a given set of species and how multiple alignment tools and phylogenetic inference methods might be improved to minimize or control for alignment errors.

eScholarship - University of California

UNT Digital Library

Principal component analysis for predicting transcription-factor binding motifs from array-derived data

Author: Liu Yunlong
Vincenti Matthew P
Yokota Hiroki
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: The responses to interleukin 1 (IL-1) in human chondrocytes constitute a complex regulatory mechanism in which multiple transcription factors interact combinatorially with transcription-factor binding motifs (TFBMs). To select a critical set of TFBMs from genomic DNA information and array-derived data, an efficient algorithm for solving a combinatorial optimization problem is required. Although computational approaches based on evolutionary algorithms are commonly employed, an analytical algorithm would be useful to predict TFBMs at nearly no computational cost and to evaluate varying modelling conditions. Singular value decomposition (SVD) is a powerful method to derive primary components of a given matrix. Applying SVD to a promoter matrix defined from regulatory DNA sequences, we derived a novel method to predict the critical set of TFBMs. RESULTS: The promoter matrix was defined to establish a quantitative relationship between the IL-1-driven mRNA alteration and genomic DNA sequences of the IL-1 responsive genes. The matrix was decomposed with SVD, and the effects of 8 potential TFBMs (5'-CAGGC-3', 5'-CGCCC-3', 5'-CCGCC-3', 5'-ATGGG-3', 5'-GGGAA-3', 5'-CGTCC-3', 5'-AAAGG-3', and 5'-ACCCA-3') were predicted from a pool of 512 random DNA sequences. The prediction included matches to the core binding motifs of biologically known TFBMs such as AP2, SP1, EGR1, KROX, GC-BOX, ABI4, ETF, E2F, SRF, STAT, IK-1, PPARγ, STAF, ROAZ, and NFκB, and their significance was evaluated numerically using Monte Carlo simulation and a genetic algorithm. CONCLUSION: The described SVD-based prediction is an analytical method to provide a set of potential TFBMs involved in transcriptional regulation. The results would be useful for analytically evaluating the contribution of individual DNA sequences.