
253 research outputs found

Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency

Author: Nazina Anna G
Papatsenko Dmitri A
Publication venue: BioMed Central
Publication date: 22/12/2003

BACKGROUND: Transcription regulatory regions in higher eukaryotes are often represented by cis-regulatory modules (CRMs) and are responsible for the formation of specific spatial and temporal gene expression patterns. These extended, ~1 kb, regions are found far from coding sequences and cannot be extracted from the genome on the basis of their position relative to the coding regions. RESULTS: To explore the feasibility of CRM extraction from a genome, we generated an original training set, containing annotated sequence data for most of the known developmental CRMs from Drosophila. Based on this set of experimental data, we developed a strategy for statistical extraction of cis-regulatory modules from the genome, using exhaustive analysis of local word frequency (LWF). To assess the performance of our analysis, we measured the correlation between predictions generated by the LWF algorithm and the distribution of conserved non-coding regions in a number of Drosophila developmental genes. CONCLUSIONS: In most of the cases tested, we observed high correlation (up to 0.6–0.8, measured on the entire gene locus) between the two independent techniques. We discuss computational strategies available for extraction of Drosophila CRMs and possible extensions of these methods.

Springer - Publisher Connector

PubMed Central

A statistical fat-tail test of predicting regulatory regions in the Drosophila genome

Author: Li Yajing
Shu Jian-Jun
Publication venue: 'Elsevier BV'
Publication date: 07/03/2014

A statistical study of cis-regulatory modules (CRMs) is presented based on the estimation of similar-word set distribution. It is observed that CRMs tend to have a fat-tail distribution. A new statistical fat-tail test with two kurtosis-based fatness coefficients is proposed to distinguish CRMs from non-CRMs. Compared with the existing fluffy-tail test, the first fatness coefficient is designed to reduce computational time, making the novel fat-tail test very suitable for long sequences and large database analysis in the post-genome time, and the second is designed to improve separation accuracy between CRMs and non-CRMs. These two fatness coefficients may serve as valuable filtering indexes to predict CRMs experimentally.
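The abstract above does not define its two fatness coefficients, so the following is only a generic illustration of the quantity they are said to build on: the sample excess kurtosis of a list of values (for instance, word counts). The function name `excess_kurtosis` is hypothetical and this is not the authors' test.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3.

    Positive values indicate heavier tails than a normal distribution,
    negative values indicate lighter tails.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Second and fourth central moments (population form, divide by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("zero variance: kurtosis is undefined")
    return m4 / (m2 * m2) - 3.0


if __name__ == "__main__":
    # One extreme value among many zeros gives a heavy tail.
    print(excess_kurtosis([0, 0, 0, 0, 0, 0, 0, 0, 0, 10]))
    # Evenly spread values give a light tail.
    print(excess_kurtosis([1, 2, 3, 4, 5]))
```

A single number like this is cheap to compute per sequence window, which is consistent with the abstract's emphasis on reduced computational time, but how the published coefficients are actually constructed would have to be taken from the paper itself.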

arXiv.org e-Print Archive

Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test.

Author: Abnizova Irina
Gilks Walter R
te Boekhorst Rene
Walter Klaudia
Publication venue: BMC Bioinformatics
Publication date: 01/01/2005

BACKGROUND: This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. RESULTS: We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se. Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA. CONCLUSION: We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

PubMed Central

Apollo (Cambridge)

University of Hertfordshire Research Archive

A statistical thin-tail test of predicting regulatory regions in the Drosophila genome

Author: Li Yajing
Shu Jian-Jun
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Background: The identification of transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs) is a crucial step in studying gene expression, but computationally distinguishing CRMs from non-coding non-regulatory regions (NCNRs) remains a challenging problem due to the limited knowledge of the specific interactions involved. Methods: The statistical properties of CRMs are explored by estimating the similar-word set distribution with overrepresentation (Z-score). It is observed that CRMs tend to have a thin-tail Z-score distribution. A new statistical thin-tail test with two thinness coefficients is proposed to distinguish CRMs from NCNRs. Results: Compared with the existing fluffy-tail test, the first thinness coefficient is designed to reduce computational time, making the novel thin-tail test very suitable for long sequences and large database analysis in the post-genome time, and the second is designed to improve the separation accuracy between CRMs and NCNRs. These two thinness coefficients may serve as valuable filtering indexes to predict CRMs experimentally. Conclusions: The novel thin-tail test provides an efficient and effective means for distinguishing CRMs from NCNRs based on the specific statistical properties of CRMs and can guide future experiments aimed at finding new CRMs in the post-genome time. Comment: arXiv admin note: substantial text overlap with arXiv:1402.533
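The abstract above mentions a Z-score for word overrepresentation without giving its formula. As a rough sketch under a stated assumption (an independent-base background model estimated from the sequence itself, with a binomial approximation that ignores word self-overlap), a per-word Z-score can be computed as follows. The function name `word_zscores` and the background model are illustrative choices, not the authors' definition.

```python
import math
from collections import Counter


def word_zscores(seq, k=4):
    """Z-score of each observed k-mer count against an i.i.d. base model.

    Assumption: the expected count is n * p, where n is the number of
    k-mer positions and p is the product of single-base frequencies;
    the variance is approximated as n * p * (1 - p).
    """
    seq = seq.upper()
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("sequence shorter than word length")
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    counts = Counter(seq[i:i + k] for i in range(n))
    scores = {}
    for word, observed in counts.items():
        p = math.prod(base_freq[b] for b in word)
        sd = math.sqrt(n * p * (1 - p))
        # A single-letter sequence gives p == 1 and sd == 0; report 0.0.
        scores[word] = (observed - n * p) / sd if sd > 0 else 0.0
    return scores


if __name__ == "__main__":
    for word, z in sorted(word_zscores("ACGTACGTACGTACGT", k=2).items()):
        print(word, round(z, 3))
```

The distribution of these scores over all words in a window is the kind of object whose tail shape the abstract describes; a more careful background model (for example, a Markov chain) would change the values.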

arXiv.org e-Print Archive

Springer - Publisher Connector

Using hexamers to predict cis-regulatory motifs in Drosophila

Author: Chan Bob Y
Kibler Dennis
Publication venue: BioMed Central
Publication date: 01/10/2005

BACKGROUND: Cis-regulatory modules (CRMs) are short stretches of DNA that help regulate gene expression in higher eukaryotes. They have been found up to 1 megabase away from the genes they regulate and can be located upstream, downstream, and even within their target genes. Due to the difficulty of finding CRMs using biological and computational techniques, even well-studied regulatory systems may contain CRMs that have not yet been discovered. RESULTS: We present a simple, efficient method (HexDiff) based only on hexamer frequencies of known CRMs and non-CRM sequence to predict novel CRMs in regulatory systems. On a data set of 16 gap and pair-rule genes containing 52 known CRMs, predictions made by HexDiff had a higher correlation with the known CRMs than several existing CRM prediction algorithms: Ahab, Cluster Buster, MSCAN, MCAST, and LWF. After combining the results of the different algorithms, 10 putative CRMs were identified and are strong candidates for future study. The hexamers used by HexDiff to distinguish between CRMs and non-CRM sequence were also analyzed and were shown to be enriched in regulatory elements. CONCLUSION: HexDiff provides an efficient and effective means for finding new CRMs based on known CRMs, rather than known binding sites
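The idea described above, scoring sequence by hexamers that are more frequent in known CRMs than in non-CRM sequence, can be sketched as follows. This is a minimal illustration and not the HexDiff implementation: the function names, the pseudocount, the frequency-ratio ranking, and the choice to count hits per window are all assumptions made for the example.

```python
from collections import Counter


def hexamer_counts(seqs, k=6):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def discriminating_hexamers(crm_seqs, bg_seqs, k=6, top=5, pseudo=1.0):
    """Return the `top` k-mers with the highest CRM/background frequency ratio."""
    pos = hexamer_counts(crm_seqs, k)
    neg = hexamer_counts(bg_seqs, k)
    pos_total = sum(pos.values()) or 1
    neg_total = sum(neg.values()) or 1
    ratio = {
        w: ((pos[w] + pseudo) / pos_total) / ((neg[w] + pseudo) / neg_total)
        for w in pos
    }
    # Sort by descending ratio, breaking ties alphabetically for determinism.
    ranked = sorted(ratio, key=lambda w: (-ratio[w], w))
    return set(ranked[:top])


def window_score(seq, words, k=6):
    """Number of positions in `seq` whose k-mer is in the selected set."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in words)


if __name__ == "__main__":
    words = discriminating_hexamers(["GGATTAGGATTAGGATTA"],
                                    ["ACACACACACACACACAC"], top=1)
    print(words, window_score("TTGGATTACC", words))
```

Sliding `window_score` along a locus and thresholding the result would give candidate regions; the toy sequences here are made up and carry no biological meaning.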

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Identifying Cis-Regulatory Sequences by Word Profile Similarity

Author: A Ivan
A Nasiadka
A Sosinsky
AG Nazina
AP Lifanov
BP Berman
BY Chan
C Zhang
D Bachtrog
DL Halligan
DS Johnson
E Emberly
EA Glazov
EE Hare
EH Davidson
F Poulin
Garmay Leung
H Janssens
I Abnizova
L Li
M Klingler
Michael B. Eisen
MR Kantorovitz
MS Halfon
N Pierstorff
N Rajewsky
Nicholas James Provart
S Prabhakar
S Sinha
XY Li
YH Grad
Publication venue: Public Library of Science
Publication date: 01/09/2009

Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples. We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila. Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs

Author: Halfon Marc S
Ivan Andra
Sinha Saurabh
Publication venue: BioMed Central
Publication date: 01/01/2008

Prediction of cis-regulatory modules ab initio, without any input of relevant motifs, is achieved with two novel methods

Crossref

Springer - Publisher Connector

PubMed Central

A Machine Learning Approach for Identifying Novel Cell Type–Specific Transcriptional Regulators of Myogenesis

Author: A Carmena
A Dastjerdi
A Erives
A Ivan
A Nose
A Paululat
A Siepel
A Subramanian
A Visel
A Woolfe
AA Philippakis
AC Groth
AG Nazina
AK Holloway
Alan M. Michelson
AM Michelson
B Estrada
B Hanczar
BL Black
BP Berman
Brian W. Busser
BW Busser
C Bourgouin
C Chang
C Jiang
C Klämbt
CA Berkes
CI Swanson
CT Ong
DN Arnosti
DT Odom
E Davidson
EE Hare
EN Olson
FC Wardle
G Hon
G Junion
G Leung
G Ranganayakulu
GE Crawford
GG Loots
H Brohmann
H Rouault
HP Shih
I Abnizova
I Costello
I Guyon
I Ovcharenko
I Reim
Ivan Ovcharenko
J Bischof
J Crocker
J Enriquez
J Ernst
J Shawe-Taylor
J Zeitlinger
JA Pederson
James W. Posakony
JD Pederson
JM Claycomb
JS Jakobsen
JW Mahaffey
K Jagla
K Robasky
K Senger
L Dubois
L Li
L Narlikar
Leila Taher
M Capovilla
M Frasch
M Ludwig
M Markstein
M Porsch
M Ruiz-Gomez
M Schwaiger
MA Beer
MB Noyes
MD Biggin
MF Berger
MI Arnone
MJ Blow
MK Baylies
MK Gross
Molly J. Bloom
MR Kantorovitz
MS Halfon
MV Taylor
N Negre
N Reeves
OL Griffith
P Tomancak
PJ Clyne
R Bodmer
R Galant
RG Ramsay
RJ Bryson-Richardson
RP Zinzen
S Barolo
S Knirr
S MacArthur
S Mahony
SA Ness
SB Carroll
SD Weatherbee
SJ Raudys
SM Gallo
SY Kim
T Jagla
T Sandmann
Terese Tansey
TL Bailey
U Grossniklaus
V Matys
V Tixier
Y Benjamini
YH Liu
Yongsok Kim
Z Han
Publication venue: Public Library of Science
Publication date: 08/03/2012

Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA–based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes; and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers. 
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type–specific developmental gene expression patterns

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

A Novel Ensemble Learning Method for de Novo Computational Identification of DNA Binding Sites

Author: Carlson Jonathan M
Chakravarty Arijit
Gross Robert H H
Khetani Radhika S
Publication venue: Dartmouth Digital Commons
Publication date: 01/01/2007

Despite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought. Results: We present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs. SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms each aimed at the discovery of one of these classes of motifs. We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms. SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.

Crossref

Directory of Open Access Journals

PubMed Central

Dartmouth Digital Commons (Dartmouth College)

Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs

Author: Dojer Norbert
Patelak Mateusz
Tiuryn Jerzy
Wilczynski Bartek
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Finding functional regulatory elements in DNA sequences is a very important problem in computational biology, and providing a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on a genome-wide scale. Major obstacles in this respect are that the amount of non-coding DNA is vast, and that the methods for predicting functional transcription factor binding sites tend to produce results with a high percentage of false positives. This makes the problem of finding regions significantly enriched in binding sites difficult. Results: We develop a novel method for predicting regulatory regions in DNA sequences, which is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms. Conclusion: We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Also, our method is robust enough to perform well in predicting CRMs, despite differences in tissue specificity and even across species, provided that the evolutionary distances between compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central