Motif Discovery through Predictive Modeling of Gene Regulation
We present MEDUSA, an integrative method for learning motif models of
transcription factor binding sites by incorporating promoter sequence and gene
expression data. We use a modern large-margin machine learning approach, based
on boosting, to enable feature selection from the high-dimensional search space
of candidate binding sequences while avoiding overfitting. At each iteration of
the algorithm, MEDUSA builds a motif model whose presence in the promoter
region of a gene, coupled with activity of a regulator in an experiment, is
predictive of differential expression. In this way, we learn motifs that are
functional and predictive of regulatory response rather than motifs that are
simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model
of the transcriptional control logic that can predict the expression of any
gene in the organism, given the sequence of the promoter region of the target
gene and the expression state of a set of known or putative transcription
factors and signaling molecules. Each motif model is either a k-length
sequence, a dimer, or a PSSM that is built by agglomerative probabilistic
clustering of sequences with similar boosting loss. By applying MEDUSA to a set
of environmental stress response expression data in yeast, we learn motifs
whose ability to predict differential expression of target genes outperforms
motifs from the TRANSFAC dataset and from a previously published candidate set
of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed
binding sites associated with environmental stress response from the
literature. Comment: RECOMB 200
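As a rough illustration of the kind of rule the abstract describes (a motif in the promoter, coupled with the state of a regulator, voting on differential expression), the following minimal sketch sums weighted rules into a prediction. It is not MEDUSA's implementation: the rule format, the regulator names `REG_A` and `REG_B`, the weights, and the toy sequences are all assumptions made for illustration, and no boosting or motif learning is performed here.

```python
# Hypothetical rule format: (motif, regulator, required_state, weight).
# States are +1 (up), -1 (down), 0 (baseline); all values below are toy data.

def rule_fires(promoter, regulator_states, motif, regulator, required_state):
    """A rule fires when the motif occurs in the promoter and the
    regulator is in the required state in this experiment."""
    return motif in promoter and regulator_states.get(regulator, 0) == required_state

def predict_expression(promoter, regulator_states, rules):
    """Sum the weights of all rules that fire and return the sign:
    +1 (predicted up), -1 (predicted down), or 0 (no call)."""
    score = sum(
        weight
        for motif, regulator, required_state, weight in rules
        if rule_fires(promoter, regulator_states, motif, regulator, required_state)
    )
    return (score > 0) - (score < 0)

toy_rules = [
    ("AGGGG", "REG_A", 1, 0.8),    # motif present and REG_A up -> vote up
    ("TTTT", "REG_B", -1, -0.5),   # motif present and REG_B down -> vote down
]
toy_promoter = "ACGTAGGGGTTT"
toy_states = {"REG_A": 1, "REG_B": 1}
toy_prediction = predict_expression(toy_promoter, toy_states, toy_rules)
```

In a boosting setting, each round would add one such rule and its weight; here the rules are simply given.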
Supersparse Linear Integer Models for Optimized Medical Scoring Systems
Scoring systems are linear classification models that only require users to
add, subtract and multiply a few small numbers in order to make a prediction.
These models are in widespread use by the medical community, but are difficult
to learn from data because they need to be accurate and sparse, have coprime
integer coefficients, and satisfy multiple operational constraints. We present
a new method for creating data-driven scoring systems called a Supersparse
Linear Integer Model (SLIM). SLIM scoring systems are built by solving an
integer program that directly encodes measures of accuracy (the 0-1 loss) and
sparsity (the ℓ0-seminorm) while restricting coefficients to coprime
integers. SLIM can seamlessly incorporate a wide range of operational
constraints related to accuracy and sparsity, and can produce highly tailored
models without parameter tuning. We provide bounds on the testing and training
accuracy of SLIM scoring systems, and present a new data reduction technique
that can improve scalability by eliminating a portion of the training data
beforehand. Our paper includes results from a collaboration with the
Massachusetts General Hospital Sleep Laboratory, where SLIM was used to create
a highly tailored scoring system for sleep apnea screening. Comment: This version reflects our findings on SLIM as of January 2016
(arXiv:1306.5860 and arXiv:1405.4047 are out-of-date). The final published
version of this article is available at http://www.springerlink.co
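To make the objective in this abstract concrete, the sketch below scores a candidate scoring system by its 0-1 loss plus a penalty on the number of nonzero coefficients, then searches small integer coefficients exhaustively. This is an assumption-laden toy, not the paper's method: SLIM solves an integer program, whereas this code substitutes brute-force enumeration, and it omits the coprimality and operational constraints. The names `c0` and `max_abs` and the toy data are hypothetical.

```python
from itertools import product

def slim_style_objective(coefs, intercept, X, y, c0):
    """Fraction of misclassified points (0-1 loss) plus c0 times
    the number of nonzero coefficients (the l0 penalty)."""
    errors = 0
    for features, label in zip(X, y):
        score = intercept + sum(w * x for w, x in zip(coefs, features))
        predicted = 1 if score > 0 else -1
        errors += predicted != label
    return errors / len(X) + c0 * sum(w != 0 for w in coefs)

def brute_force_scoring_system(X, y, c0=0.01, max_abs=2):
    """Enumerate every integer coefficient vector and intercept in
    [-max_abs, max_abs]; only feasible for a handful of features."""
    values = range(-max_abs, max_abs + 1)
    best = None
    for intercept in values:
        for coefs in product(values, repeat=len(X[0])):
            obj = slim_style_objective(coefs, intercept, X, y, c0)
            if best is None or obj < best[0]:
                best = (obj, coefs, intercept)
    return best

# Toy data: the label depends only on the first binary feature.
toy_X = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
toy_y = [1, 1, -1, -1, 1, -1]
best_obj, best_coefs, best_intercept = brute_force_scoring_system(toy_X, toy_y)
```

On this toy data the search settles on a model that uses only the first feature, since any extra nonzero coefficient adds to the penalty without reducing the error.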