61,617 research outputs found

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence

Author: Down Thomas A.
Hubbard Tim J. P.
Publication venue: Oxford University Press
Publication date: 01/01/2005

NestedMICA is a new, scalable pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.

CiteSeerX

King's Research Portal

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication date: 09/05/2018

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over (a) the motif scanning method and (b) the coexpression-based association method. The top motif outperformed the five component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross-species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (Problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information. Comment: 33 page
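The abstract above reports target-gene identification performance as an F1 score. As a reminder of what that metric measures, here is a minimal sketch; the function name and the use of plain sets of gene identifiers are assumptions made for illustration, not details taken from the paper.

```python
def f1_score(true_targets, predicted_targets):
    """Harmonic mean of precision and recall over two collections of gene IDs.

    Both arguments are hypothetical inputs: any iterables of hashable gene IDs.
    """
    true_set, pred_set = set(true_targets), set(predicted_targets)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        # No overlap: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)
```

With four known targets and three predictions of which two are correct, precision is 2/3, recall is 1/2, and the F1 score is 4/7.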

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Cis-regulatory module detection using constraint programming

Author: Guns Tias
Marchal Kathleen
Nijssen Siegfried
Sun Hong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010

We propose a method for finding cis-regulatory modules (CRMs) in a set of co-regulated genes. Each CRM consists of a set of binding sites of transcription factors. We wish to find CRMs involving the same transcription factors in multiple sequences. Finding such a combination of transcription factors is inherently a combinatorial problem. We solve this problem by combining the principles of itemset mining and constraint programming. The constraints involve the putative binding sites of transcription factors, the number of sequences in which they co-occur and the proximity of the binding sites. Genomic background sequences are used to assess the significance of the modules. We experimentally validate our approach and compare it with state-of-the-art techniques.

Wide-Scale Analysis of Human Functional Transcription Factor Binding Reveals a Strong Bias towards the Transcription Start Site

Author: A Ambesi-Impiombato
A Blais
A Eto
A Subramanian
AE Kel
AG Clark
AL Lam
AM McGuire
Anat Reiner
Assif Yitzhaky
B Ren
C Kimura-Yoshida
C Plessy
C Yang
CT Harbison
D Pfeifer
D Wang
DB Allison
E Emberly
E Segal
Eytan Domany
FP Roth
GC Pipes
GC Yuan
GQ Yao
GZ Hertz
H Li
H Lodish
J Zheng
JD Hughes
JL DeRisi
JQ Ling
K Frech
K Quandt
KD MacIsaac
L Amir-Zilberstein
L Elnitski
L Marino-Ramirez
L McCue
M Ashburner
M Kellis
M Milyavsky
MA Nobrega
Mark Koudritsky
MC Frith
ML Howard
ML Whitfield
N Rajewsky
Or Zuk
P Carninci
P Cliften
PM Haverty
PR Buckland
R Elkon
R Liu
R Sharan
Ran Brosh
S Aerts
S Rashi-Elkeles
S Tavazoie
SJ Cooper
SJ Ho Sui
Sui Huang
U Gerland
Varda Rotter
WW Wasserman
X Xie
Y Barash
Y Benjamini
Y Tabach
Yossi Buganim
Yuval Tabach
Z Wang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2007

We introduce a novel method to screen the promoters of a set of genes with shared biological function against a precompiled library of motifs, and find those motifs which are statistically over-represented in the gene set. The gene sets were obtained from the functional Gene Ontology (GO) classification; for each set and motif we optimized the sequence similarity score threshold, independently for every location window (measured with respect to the transcription start site, TSS), taking into account the location-dependent nucleotide heterogeneity along the promoters of the target genes. We performed a high-throughput analysis, searching the promoters (from 200 bp downstream to 1000 bp upstream of the TSS) of more than 8000 human and 23,000 mouse genes, for 134 functional Gene Ontology classes and for 412 known DNA motifs. When combined with binding site and location conservation between human and mouse, the method identifies with high probability functional binding sites that regulate groups of biologically related genes. We found many location-sensitive functional binding events and showed that they clustered close to the TSS. Our method and findings were put to several experimental tests. By allowing a "flexible" threshold and combining our functional class and location-specific search method with conservation between human and mouse, we are able to reliably identify functional TF binding sites. This is an essential step towards constructing regulatory networks and elucidating the design principles that govern transcriptional regulation of expression. The promoter region proximal to the TSS appears to be of central importance for regulation of transcription in human and mouse, just as it is in bacteria and yeast. Comment: 31 pages, including Supplementary Information and figure
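The abstract above does not name the statistical test used to call a motif over-represented in a gene set. One common choice for this kind of question is a hypergeometric upper-tail probability, sketched below under that assumption. The function and parameter names are hypothetical, and the paper's threshold optimization and location windows are not modelled here.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    Illustrative mapping (an assumption, not the paper's procedure):
      population = all genes screened
      successes  = genes whose promoter contains the motif
      draws      = size of the functional gene set
      observed   = genes in the set whose promoter contains the motif
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total
```

A small tail probability suggests the motif occurs in the gene set more often than random sampling from the screened genes would explain. In a screen over many motifs and gene sets, the resulting p-values would still need a multiple-testing correction.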

arXiv.org e-Print Archive

CiteSeerX

Directory of Open Access Journals

An intuitionistic approach to scoring DNA sequences against transcription factor binding site motifs

Author: A Sandelin
A Sharov
A Tomovic
Adrian J Shepherd
Armando Blanco
C Lawrence
D Denning
E Baker
E Szmidt
E Wingender
F Garcia
F Lam
F Lopez
F Offner
F Zare-Mirakabad
Fernando Garcia-Alcalde
G Chamilos
G Diop
G Hertz
J Hanley
J Hughes
J Sainz
J Van Helden
J Zhao
K Atanassov
K Won
L Liang
L Zadeh
M Bulyk
M Das
M Eisen
N Dror
N Kim
P Benos
P Bochud
P Schling
R Gordan
S De
T Bailey
T Fawcett
T Hehlgans
T Tamura
V Khatibi
W Hung
W Wasserman
X Chen
Y Haudry
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Background: Transcription factors (TFs) control transcription by binding to specific regions of DNA called transcription factor binding sites (TFBSs). The identification of TFBSs is a crucial problem in computational biology and includes the subtask of predicting the location of known TFBS motifs in a given DNA sequence. It has previously been shown that, when scoring matches to known TFBS motifs, interdependencies between positions within a motif should be taken into account. However, this remains a challenging task owing to the fact that sequences similar to those of known TFBSs can occur by chance with a relatively high frequency. Here we present a new method for matching sequences to TFBS motifs based on intuitionistic fuzzy sets (IFS) theory, an approach that has been shown to be particularly appropriate for tackling problems that embody a high degree of uncertainty. Results: We propose SCintuit, a new scoring method for measuring sequence-motif affinity based on IFS theory. Unlike existing methods that consider dependencies between positions, SCintuit is designed to prevent overestimation of less conserved positions of TFBSs. For a given pair of bases, SCintuit is computed not only as a function of their combined probability of occurrence, but also taking into account the individual importance of each single base at its corresponding position. We used SCintuit to identify known TFBSs in DNA sequences. Our method provides excellent results when dealing with both synthetic and real data, outperforming the sensitivity and the specificity of two existing methods in all the experiments we performed. Conclusions: The results show that SCintuit improves on the prediction quality of existing approaches for TFs without compromising sensitivity. In addition, we show how SCintuit can be successfully applied to real research problems. In this study the reliability of the IFS theory for motif discovery tasks is proven.
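For context on the subtask described above (scoring a DNA sequence against a known TFBS motif), the following sketch shows the conventional baseline: a position-independent log-odds position weight matrix. It is not SCintuit and does not model dependencies between positions. The function names, the uniform background of 0.25 and the pseudocount of 1 are assumptions chosen for illustration, and the input is assumed to contain only the bases A, C, G and T.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a uniform background.

    counts: list of dicts mapping base -> observed count, one dict per motif position.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_match(sequence, pwm):
    """Return (start, score) of the highest-scoring window, or None if the sequence is too short."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Positions are scored independently and summed.
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Because each position contributes independently, a weakly conserved position can still add to the total score, which is the kind of overestimation the abstract says SCintuit is designed to prevent.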

Springer - Publisher Connector

Directory of Open Access Journals

Repositorio Institucional Universidad de Granada

Birkbeck Institutional Research Online


Bind-n-Seq: high-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing.

Author: Korf Ian
Segal David J
Zykovich Artem
Publication venue: eScholarship, University of California
Publication date: 01/12/2009

Transcription factor-DNA interactions are some of the most important processes in biology because they directly control hereditary information. The targets of most transcription factors are unknown. In this report, we introduce Bind-n-Seq, a new high-throughput method for analyzing protein-DNA interactions in vitro, with several advantages over current methods. The procedure has three steps: (i) binding proteins to randomized oligonucleotide DNA targets, (ii) sequencing the bound oligonucleotides with massively parallel technology and (iii) finding motifs among the sequences. De novo binding motifs determined by this method for the DNA-binding domains of two well-characterized zinc-finger proteins were similar to those described previously. Furthermore, calculations of the relative affinity of the proteins for specific DNA sequences correlated significantly with previous studies (R² = 0.9). These results present Bind-n-Seq as a highly rapid and parallel method for determining in vitro binding sites and relative affinities.

eScholarship - University of California

ModuleDigger: an itemset mining framework for the detection of cis-regulatory modules

Author: De Bie Tijl
De Moor Bart
Dhollander Thomas
Fu Qiang
Lemmens Karen
Marchal Kathleen
Storms Valerie
Sun Hong
Verstuyf Annemieke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Background: The detection of cis-regulatory modules (CRMs) that mediate transcriptional responses in eukaryotes remains a key challenge in the postgenomic era. A CRM is characterized by a set of co-occurring transcription factor binding sites (TFBS). In silico methods have been developed to search for CRMs by determining the combination of TFBS that are statistically overrepresented in a certain gene set. Most of these methods solve this combinatorial problem by relying on computationally intensive optimization methods. As a result their usage is limited to finding CRMs in small datasets (containing a few genes only) and using binding sites for a restricted number of transcription factors (TFs) out of which the optimal module will be selected. Results: We present an itemset-mining-based strategy for computationally detecting CRMs in a set of genes. We tested our method by applying it to a large benchmark data set, derived from a ChIP-Chip analysis, and compared its performance with other well-known cis-regulatory module detection tools. Conclusion: We show that by exploiting the computational efficiency of an itemset mining approach and combining it with a well-designed statistical scoring scheme, we were able to prioritize the biologically valid CRMs in a large set of coregulated genes using binding sites for a large number of potential TFs as input.
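The itemset-mining view of CRM detection can be illustrated with a small level-wise (Apriori-style) search: each gene is treated as a transaction whose items are the TFs with at least one predicted binding site, and a candidate module is a TF set that co-occurs in enough genes. This is a minimal sketch under those assumptions, not the ModuleDigger implementation. It leaves out binding-site proximity and the statistical scoring mentioned in the abstract, and all names are hypothetical.

```python
from itertools import combinations

def frequent_tf_sets(sites_per_gene, min_support, max_size=3):
    """Find TF sets that co-occur in at least `min_support` genes.

    sites_per_gene: dict mapping gene ID -> set of TF names with a predicted site.
    Returns a dict mapping frozenset of TF names -> number of supporting genes.
    """
    frequent = {}
    all_tfs = sorted({tf for tfs in sites_per_gene.values() for tf in tfs})
    candidates = [frozenset([tf]) for tf in all_tfs]
    size = 1
    while candidates and size <= max_size:
        level = {}
        for candidate in candidates:
            # Support = number of genes containing every TF in the candidate set.
            support = sum(1 for tfs in sites_per_gene.values() if candidate <= tfs)
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join frequent sets of the current size into candidates one item larger.
        # No subset pruning is done; the support count above filters them instead.
        size += 1
        candidates = list({a | b for a, b in combinations(level, 2) if len(a | b) == size})
    return frequent
```

Only sets built from already-frequent smaller sets are counted, so the search avoids enumerating every TF combination, which is the efficiency argument the abstract makes for itemset mining.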

Springer - Publisher Connector

Vitamin D receptor ChIP-seq in primary CD4+ cells: relationship to serum 25-hydroxyvitamin D levels and autoimmune disease

Author: A Sandelin
A Sanyal
Adam E Handel
AE Handel
Antonio J Berlanga-Taylor
AP Boyle
B Langmead
B Lehmann
BE Bernstein
C Carlberg
CE Grant
CS Ross-Innes
CY McLean
D Berglund
E Wingender
F Birzele
Finn Drabløs
G Pavesi
Gavin Giovannoni
Geir K Sandve
George C Ebers
Giulio Disanto
Giuseppe Gallone
GK Sandve
Heather Hanwell
IV Kulakovskiy
J Orgaz-Molina
J-C Souberbielle
JHA Martens
K Li
KL Munger
LA Hindorff
LL Issa
M Ashburner
M Caliskan
M Lutz
M Thomas-Chollier
MA Kriegel
MD Shirley
ML McCullough
NU Rashid
O Weth
PA Fujita
PA Marshall
R Salehi-Tabar
RM Tolón
S Gundersen
S Heikkinen
Sreeram V Ramagopalan
SV Ramagopalan
T Liu
TA Owen
TL Bailey
Y Zhang
Y-C Huang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

PMCID: PMC3710212. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Springer - Publisher Connector

Oxford University Research Archive

Spiral - Imperial College Digital Repository