30 research outputs found

Inferring Binding Energies from Selected Binding Sites

Author: A Sarai
AE Kel
C Tuerk
Christopher Workman
DA Gilchrist
David Granas
DS Fields
DSF Homsi
E Roulet
E Sharon
Gary D. Stormo
GD Stormo
GD Stormo
GD Stormo
GD Stormo
H Ji
HF Teh
HG Roider
J Linnell
J Liu
JB Kinney
JJ Moré
L van Oeffelen
M Djordjevic
M Djordjevic
MF Berger
ML Lee
MQ Zhang
O Berg
PH von Hippel
PV Benos
PV Benos
Q Zhou
R Staden
SJ Maerkl
TH Cormen
TK Blackwell
TK Man
U Gerland
V Mustonen
VH Nagaraj
WE Wright
X Liu
X Meng
Y Takeda
Yue Zhao
Publication venue: Public Library of Science
Publication date: 01/01/2009

We employ a biophysical model that accounts for the non-linear relationship between binding energy and the statistics of selected binding sites. The model includes the chemical potential of the transcription factor, non-specific binding affinity of the protein for DNA, as well as sequence-specific parameters that may include non-independent contributions of bases to the interaction. We obtain maximum likelihood estimates for all of the parameters and compare the results to standard probabilistic methods of parameter estimation. On simulated data, where the true energy model is known and samples are generated with a variety of parameter values, we show that our method returns much more accurate estimates of the true parameters and much better predictions of the selected binding site distributions. We also introduce a new high-throughput SELEX (HT-SELEX) procedure to determine the binding specificity of a transcription factor in which the initial randomized library and the selected sites are sequenced with next generation methods that return hundreds of thousands of sites. We show that after a single round of selection our method can estimate binding parameters that give very good fits to the selected site distributions, much better than standard motif identification algorithms

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Digital Commons@Becker

Decoding of Superimposed Traces Produced by Direct Sequencing of Heterozygous Indels

Author: AE Tenney
B Ewing
C Manaster
C Sousa-Santos
D Bhattramakki
Dmitry A. Dmitriev
E Dicks
E Seroussi
EN Moriyama
ER Mardis
Gary Stormo
GM Cooper
GR Brown
J Parsch
J Sorenson
J-F Flot
J-F Flot
K Chen
K Müller
KS Small
M Pop
R Staden
RE Mills
Roman A. Rakitov
S Creer
S Weckx
SF Altschul
T Bhangale
TR Bhangale
Y Seroussi
Z Zhao
Publication venue: Public Library of Science
Publication date: 01/01/2008

Direct Sanger sequencing of a diploid template containing a heterozygous insertion or deletion results in a difficult-to-interpret mixed trace formed by two allelic traces superimposed onto each other. Existing computational methods for deconvolution of such traces require knowledge of a reference sequence or the availability of both direct and reverse mixed sequences of the same template. We describe a simple yet accurate method, which uses dynamic programming optimization to predict superimposed allelic sequences solely from a string of letters representing peaks within an individual mixed trace. We used the method to decode 104 human traces (mean length 294 bp) containing heterozygous indels 5 to 30 bp with a mean of 99.1% bases per allelic sequence reconstructed correctly and unambiguously. Simulations with artificial sequences have demonstrated that the method yields accurate reconstructions when (1) the allelic sequences forming the mixed trace are sufficiently similar, (2) the analyzed fragment is significantly longer than the indel, and (3) multiple indels, if present, are well-spaced. Because these conditions occur in most encountered DNA sequences, the method is widely applicable. It is available as a free Web application Indelligent at http://ctap.inhs.uiuc.edu/dmitriev/indel.asp

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression

Author: A Wagner
AE Kel
B Lenhard
B Shea
BP Berman
DA Papatsenko
DS Prestridge
DS Prestridge
F Larsen
GD Stormo
GG Loots
JA Warrington
JM Claverie
K Quandt
KD Pruitt
L Ponger
LL Hsiao
M Gardiner-Garden
MC Frith
MC Frith
MI Arnone
MS Halfon
N Rajewsky
O Johansson
R Ihaka
RR Sokal
S Aerts
S Hannenhalli
S Levy
S Levy
TD Schneider
V Matys
V Solovyev
W Krivan
WH Press
WJ Ewens
WJ Kent
WJ Kent
WW Wasserman
Y Suzuki
Y Suzuki
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. It is also unclear which TFs have TFBS concentrated in promoters and to what extent this occurs. This study addresses some of these questions. RESULTS: We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups: those strongly correlated with CGI and those not correlated with CGI. CONCLUSION: Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Optimized mixed Markov models for motif identification

Author: AE Kel
B Matthews
B Negre
C Burge
D Cai
David M Umbach
E Roulet
E Wingender
G Schwarz
G Yeo
GA Wray
GD Stormo
GE Crooks
H Akaike
I Carmel
J Rissanen
JP Staley
K Ellrott
K Nandabalan
K Nelson
K Quandt
Leping Li
M Kellis
MG Reese
ML Bulyk
MP Ponomarenko
MQ Zhang
N Saitou
P Agarwal
P Bühlmann
PV Benos
Q Zhou
R Staden
RP Ketterling
S Salzberg
T Thanaraj
TD Schneider
TK Man
U Ohler
Uwe Ohler
W Krivan
Weichun Huang
X Xie
X Zhao
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Identifying functional elements, such as transcriptional factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to limited availability of training samples. RESULTS: We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods to allow adjustment of model complexity for different motifs. In comparison with other leading methods, OMiMa can incorporate more than the NNSplice's pairwise dependencies; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our OMiMa system, to our knowledge, is the only motif finding tool that incorporates automatic selection of the best model. OMiMa is freely available at [1]. CONCLUSION: Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif. Our model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MDC Repository

Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions

Author: A Hoglund
AE Kel
AE Kel
AE Vinogradov
B Efron
B Jaruga
BJ Deroo
C Burge
CD Schmid
CR Calladine
D Cai
D GuhaThakurta
DM Graunke
E Fayard
Elena A Ananko
Elena V Ignatieva
FA Wright
GD Stormo
HP Ko
I Abnizova
I Ben-Gal
IA Udalova
Igor I Turnaev
J Duarte
J Hu
JV Ponomarenko
K Ellrott
K Morohashi
K Quandt
KJ Campbell
L Quintana-Murci
LC Platanias
LG Cowell
M Beato
M Blanchette
M Costantini
M Ganapathi
M Lohoff
M Stepanova
M-LT Lee
ML Bulyk
MP Ponomarenko
MQ Zhang
MQ Zhang
NA Kolchanov
NI Gershenzon
Nikolay A Kolchanov
NV Klimova
O Kel-Margoulis
OA Podkolodnaia
OD King
OG Berg
P Val
PV Benos
Q Zhou
R Castelo
R Kiyama
R Osada
R Pudimat
RV Davuluri
S Kamalakaran
Tatyana I Merkulova
TC Hodgman
TK Man
TM Chen
TV Busygina
VG Levitskii
VG Levitsky
VG Levitsky
VG Levitsky
VG Levitsky
Victor G Levitsky
VV Solovyev
W Huang
WH Shen
WW Wasserman
X Xie
Y Barash
Publication venue: BioMed Central
Publication date: 01/12/2007

Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. Results: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to bio-medically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure a higher confidence in our approach, we applied resampling-jackknife and bootstrap tests for the comparison; optimized PWM and SiteGA showed similar recognition performances. Then we applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for BSs using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs, and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which spanned both core and flanking regions. When SiteGA and optimized PWM models were applied together, this substantially reduced false positives, at least at higher stringencies. Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Unifying generative and discriminative learning principles

Author: A Bernal
A Culotta
A Feelders
A Mccallum
AE Kel
AY Ng
BP Lewis
C Burge
CM Bishop
D Cai
D Grossman
E Redhead
E Segal
E Wingender
F Pernkopf
G Bouchard
G Bouchard
G Stormo
G Yeo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Aldrich
J Cerquides
J Grau
J Keilwagen
J Keilwagen
JA Lasserre
Jan Grau
Jens Keilwagen
JH Xue
M Maragkakis
M Tompa
M Zhang
Marc Strickert
O Yakhnenko
P Grünwald
R Greiner
R Raina
R Staden
RA Fisher
S Sonnenburg
SL Salzberg
Stefan Posch
T Abeel
T Hastie
TH Kim
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites. Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Author: A Bernal
A Culotta
A Feelders
AE Kel
AL Berger
AY Ng
C Burge
CM Bishop
D Cai
D Grossman
D Heckerman
D Klein
E Redhead
E Segal
F Pernkopf
G Yeo
GD Stormo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Cerquides
J Davis
J Goodman
J Grau
J Keilwagen
Jan Grau
Jens Keilwagen
L Narlikar
M Arita
M Meila-Predoviciu
M Tompa
M Zhang
MI Jordan
NK Kim
O Schulte
O Yakhnenko
P Grünwald
R Castelo
R Castelo
R Greiner
R Staden
S Chen
S Sonnenburg
SL Salzberg
Stefan Posch
T Fawcett
TH Kim
TM Chen
WL Buntine
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, and has possibly led to questionable conclusions. Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites. Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Local Gene Regulation Details a Recognition Code within the LacI Transcriptional Factor Family

Author: A Glasfeld
A Sandelin
A Sarai
A Ureta-Vidal
AE Kazakov
AV Morozov
BM Hall
BW Matthews
C Francke
CE Bell
CG Kalodimos
CI Jørgensen
CO Pabo
CO Pabo
EJ Alm
Eric J. Alm
FM Camas
Francisco M. Camas
G Kolesov
G Paillard
Gary D. Stormo
GP Smith
J Boch
J Castresana
J Nardelli
J Sartorius
J Schultz
JL Betz
JO Korbel
JR Desjarlais
Juan F. Poyatos
L Milk
M Lewis
M Lewis
M Lewis
M Perros
M Suzuki
MA Schumacher
MA Schumacher
MJ Moscou
MJ Weickert
MM Gromiha
NC Seeman
NM Luscombe
P Baldi
PB Warren
PV Benos
R Hershberg
RC Edgar
RK Salinas
S Mahony
S Mahony
SA Wolfe
SJ Maerlk
T Sera
TA Desai
V Espinosa Angarica
W Thompson
WW Wasserman
Y Choo
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2010

The specific binding of regulatory proteins to DNA sequences exhibits no clear patterns of association between amino acids (AAs) and nucleotides (NTs). This complexity of protein-DNA interactions raises the question of whether a simple set of wide-coverage recognition rules can ever be identified. Here, we analyzed this issue using the extensive LacI family of transcriptional factors (TFs). We searched for recognition patterns by introducing a new approach to phylogenetic footprinting, based on the pervasive presence of local regulation in prokaryotic transcriptional networks. We identified a set of specificity correlations, determined by two AAs of the TFs and two NTs in the binding sites, that is conserved throughout a dominant subgroup within the family regardless of the evolutionary distance, and that acts as a relatively consistent recognition code. The proposed rules are confirmed with data of previous experimental studies and by events of convergent evolution in the phylogenetic tree. The presence of a code emphasizes the stable structural context of the LacI family, while defining a precise blueprint to reprogram TF specificity with many practical applications. Ministerio de Ciencia e Innovación, Spain (Formación de Profesorado Universitario fellowship); Ministerio de Ciencia e Innovación, Spain (grant BFU2008-03632/BMC); Madrid (Spain : Region) (grant CCG08-CSIC/SAL-3651)

CiteSeerX

Public Library of Science (PLOS)

DSpace@MIT

Crossref

Directory of Open Access Journals

PubMed Central

Digital.CSIC

Facilitated Variation: How Evolution Learns from Past Environments To Generalize to New Environments

Author: A Abzhanov
A Gardner
A Kreimer
A Wagner
A Wagner
AE Mayo
BM Stadler
C Adami
C Reidys
CH Waddington
CH Waddington
CK Griswold
CO Wilke
D Goldberg
EA Variano
G Schlosser
G Schlosser
G Simpson
Gary Stormo
GP Wagner
GP Wagner
I Tagkopoulos
IL Hofacker
IV Hofacker
J Draghi
J Gerhart
J Gerhart
J Gerhart
JG Miller
LA Meyers
LA Meyers
LH Hartwell
LW Ancel
M Baldwin
M Conrad
M Kaern
M Kirschner
M Kirschner
M Mitchell
M Parter
MEJ Newman
Merav Parter
MJ Cohn
MJ West-Eberhard
MJ West-Eberhard
ML Dichtel-Danjoy
N Kashtan
N Kashtan
Nadav Kashtan
P Schuster
RG Winther
S Ciliberti
S Wuchty
SJ Gould
SL Rutherford
Sumedha
T Flatt
T Jiang
TF Hansen
U Alon
Uri Alon
W Fontana
Publication venue: Public Library of Science
Publication date: 01/01/2008

One of the striking features of evolution is the appearance of novel structures in organisms. Recently, Kirschner and Gerhart have integrated discoveries in evolution, genetics, and developmental biology to form a theory of facilitated variation (FV). The key observation is that organisms are designed such that random genetic changes are channeled in phenotypic directions that are potentially useful. An open question is how FV spontaneously emerges during evolution. Here, we address this by means of computer simulations of two well-studied model systems, logic circuits and RNA secondary structure. We find that evolution of FV is enhanced in environments that change from time to time in a systematic way: the varying environments are made of the same set of subgoals but in different combinations. We find that organisms that evolve under such varying goals not only remember their history but also generalize to future environments, exhibiting high adaptability to novel goals. Rapid adaptation is seen to goals composed of the same subgoals in novel combinations, and to goals where one of the subgoals was never seen in the history of the organism. The mechanisms for such enhanced generation of novelty (generalization) are analyzed, as is the way that organisms store information in their genomes about their past environments. Elements of facilitated variation theory, such as weak regulatory linkage, modularity, and reduced pleiotropy of mutations, evolve spontaneously under these conditions. Thus, environments that change in a systematic, modular fashion seem to promote facilitated variation and allow evolution to generalize to novel conditions

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Global Prediction of Tissue-Specific Gene Expression and Context-Dependent Gene Networks in Caenorhabditis elegans

Author: A Rogers
AE Sluder
AI Su
AJ Finn
AR Saltiel
BP Lewis
C Huttenhower
CC Fowlkes
CE Schaner
Coleen T. Murphy
CS Thummel
CT Murphy
CT Murphy
Curtis Huttenhower
D Dupuy
D Dupuy
DL Church
DM Ferkey
DS Portman
E Hiley
EJ Cram
ES Lein
F Pauli
F Piano
Gary D. Stormo
I Lee
ID Broadbent
IM Cheeseman
J Gaudet
J Gaudet
JA Smith
JD McGhee
JF Chen
JS Gilleard
JS Gilleard
K Kim
KF Aoki
L Smirnova
LA Liotta
M Labouesse
Maria D. Chikina
MB Eisen
NJ Martinez
O Elemento
O Elemento
Olga G. Troyanskaya
P Sood
P Tomancak
PJ Roy
PM Loria
R Blelloch
R Hunt-Newbury
RC Friedman
RJ Kaufman
RM Fox
RM Fox
RP Johnson
S Bamps
S Griffiths-Jones
S Lall
S Vadakkadath Meethal
S Vasudevan
SB Pierce
SE Von Stetina
SJ Russell
SK Kim
SL Bauer Huang
T Hirotsu
V Reinke
VL Stroeher
W Chi
W Zhong
WB Raich
WK Kim
WL Johnston
X Shen
X Shen
Y Kohara
Y Shi
YV Budovskaya
Publication venue: Public Library of Science
Publication date: 01/01/2009

Tissue-specific gene expression plays a fundamental role in metazoan biology and is an important aspect of many complex diseases. Nevertheless, an organism-wide map of tissue-specific expression remains elusive due to difficulty in obtaining these data experimentally. Here, we leveraged existing whole-animal Caenorhabditis elegans microarray data representing diverse conditions and developmental stages to generate accurate predictions of tissue-specific gene expression and experimentally validated these predictions. These patterns of tissue-specific expression are more accurate than existing high-throughput experimental studies for nearly all tissues; they also complement existing experiments by addressing tissue-specific expression present at particular developmental stages and in small tissues. We used these predictions to address several experimentally challenging questions, including the identification of tissue-specific transcriptional motifs and the discovery of potential miRNA regulation specific to particular tissues. We also investigate the role of tissue context in gene function through tissue-specific functional interaction networks. To our knowledge, this is the first study producing high-accuracy predictions of tissue-specific expression and interactions for a metazoan organism based on whole-animal data

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central