Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
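As a rough illustration of what an online large-margin update looks like, the sketch below implements one passive-aggressive step for flat multiclass classification. This is an assumption-laden simplification: CRAIG scores whole gene segmentations with a semi-Markov model, and the abstract does not specify its update rule. The function name `pa_update`, the unit margin, and the toy data are all hypothetical.

```python
def pa_update(weights, x, y):
    """Apply one passive-aggressive (large-margin) update in place.

    weights: list of per-class weight vectors (lists of floats).
    x: feature vector (list of floats); y: index of the correct class.
    Returns the hinge loss measured before the update.
    """
    # Score every class with a plain dot product.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    # Find the highest-scoring incorrect class.
    rival = max((r for r in range(len(weights)) if r != y),
                key=lambda r: scores[r])
    # Hinge loss with an assumed required margin of 1.
    loss = max(0.0, 1.0 - (scores[y] - scores[rival]))
    sq_norm = sum(x_i * x_i for x_i in x)
    if loss > 0.0 and sq_norm > 0.0:
        # Smallest step that restores the margin on this example.
        tau = loss / (2.0 * sq_norm)
        for i, x_i in enumerate(x):
            weights[y][i] += tau * x_i
            weights[rival][i] -= tau * x_i
    return loss


# Toy usage: two classes, two features, one training example.
w = [[0.0, 0.0], [0.0, 0.0]]
first_loss = pa_update(w, [1.0, 0.0], 0)
second_loss = pa_update(w, [1.0, 0.0], 0)
print(first_loss, second_loss, w)
```

The update changes the weights only when the correct class fails to beat its strongest rival by the margin, which is what distinguishes this style of training from fitting each component model separately.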
DSL: Discriminative Subgraph Learning via Sparse Self-Representation
The goal in network state prediction (NSP) is to classify the global state
(label) associated with features embedded in a graph. This graph structure
encoding feature relationships is the key distinctive aspect of NSP compared to
classical supervised learning. NSP arises in various applications: gene
expression samples embedded in a protein-protein interaction (PPI) network,
temporal snapshots of infrastructure or sensor networks, and fMRI coherence
network samples from multiple subjects, to name a few. Instances from these
domains are typically "wide" (more features than samples), and thus, feature
sub-selection is required for robust and generalizable prediction. How to best
employ the network structure in order to learn succinct connected subgraphs
encompassing the most discriminative features becomes a central challenge in
NSP. Prior work employs connected subgraph sampling or graph smoothing within
optimization frameworks, resulting in either large variance of quality or weak
control over the connectivity of selected subgraphs.
In this work we propose an optimization framework for discriminative subgraph
learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity
and (iii) high discriminative power of the resulting subgraphs of features. Our
optimization algorithm is a single-step solution for the NSP and the associated
feature selection problem. It is rooted in the rich literature on
maximal-margin optimization, spectral graph methods and sparse subspace
self-representation. DSL simultaneously ensures solution interpretability and
superior predictive power (up to 16% improvement in challenging instances
compared to baselines), with execution times up to an hour for large instances.
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors
with overlapping, global features. Each input's latent representation is
predicted conditional on the observable data using a feature-rich conditional
random field. Then a reconstruction of the input is (re)generated, conditional
on the latent structure, using models for which maximum likelihood estimation
has a closed form. Our autoencoder formulation enables efficient learning
without making unrealistic independence assumptions or restricting the kinds of
features that can be used. We illustrate insightful connections to traditional
autoencoders, posterior regularization and multi-view learning. We show
competitive results with instantiations of the model for two canonical NLP
tasks: part-of-speech induction and bitext word alignment, and show that
training our model can be substantially more efficient than comparable
feature-rich baselines.
Transcription Factor-DNA Binding Via Machine Learning Ensembles
We present ensemble methods in a machine learning (ML) framework combining
predictions from five known motif/binding site exploration algorithms. For a
given TF, the ensemble starts with position weight matrices (PWMs) for the
motif, collected from the component algorithms. Using dimension reduction, we
identify significant PWM-based subspaces for analysis. Within each subspace a
machine classifier is built for identifying the TF's gene (promoter) targets
(Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool.
Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string)
feature PWM-based subspaces that stand out in identifying gene targets. We
approach Problem 3 (binding sites) with a novel machine learning approach that
uses promoter string features and ML importance scores in a classification
algorithm locating binding sites across the genome. For target gene
identification, this method improves performance (measured by the F1 score) by
about 10 percentage points over (a) the motif scanning method and (b) the
coexpression-based association method. The top motif outperformed the five component
algorithms as well as two other common algorithms (BEST and DEME). For
identifying individual binding sites on a benchmark cross species database
(Tompa et al., 2005) we match the best performer without much human
intervention. The ensemble also improved performance on mammalian TFs.
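For reference, the F1 score used above for target gene identification is the harmonic mean of precision and recall. The following is a minimal sketch of that computation over sets of predicted and known target genes; the function name `f1_score` and the gene identifiers are hypothetical and are not taken from the paper.

```python
def f1_score(predicted, actual):
    """Return the F1 score of a predicted target set against a known one."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy usage with made-up gene identifiers.
predicted_targets = {"g1", "g2", "g3"}
known_targets = {"g2", "g3", "g4", "g5"}
print(f1_score(predicted_targets, known_targets))
```

An improvement of "about 10 percentage points" in this metric means, for example, moving from an F1 of 0.50 to roughly 0.60.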
The ensemble can integrate orthogonal information from different weak
learners (potentially using entirely different types of features) into a
machine learner that can perform consistently better for more TFs. The TF gene
target identification component (Problem 1 above) is useful in constructing a
transcriptional regulatory network from known TF-target associations. The
ensemble is easily extendable to include more tools as well as future PWM-based
information.