Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression
Background: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to be able to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. Results: The DynDom database of protein domain movements comprises sequences annotated to indicate whether the amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, this data was used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though it is the most rigid amino acid. As hinge-bending regions have been previously shown to occur frequently at the terminal regions of the secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that improvement in performance occurs up to very long window lengths (eighty residues) indicating long-range correlations. 
Conclusion: In contrast to the only other approach, which focused solely on interdomain hinge-bending regions, our method provides a modest but statistically significant improvement over a random classifier. An explanation of the KLR results is that in the prediction of hinge-bending regions a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk
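The class-imbalanced classification setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy two-dimensional "window features", the class-weighting scheme, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

def quad_kernel(X, Z):
    # degree-2 polynomial kernel, a stand-in for a quadratic KLR model
    return (1.0 + X @ Z.T) ** 2

def fit_klr(X, y, lam=1e-2, lr=0.1, iters=2000, pos_weight=10.0):
    """Kernel logistic regression by gradient ascent on the penalised
    log-likelihood. pos_weight up-weights the rare positive (hinge) class,
    one simple way to handle heavy class imbalance."""
    K = quad_kernel(X, X)
    alpha = np.zeros(len(X))
    w = np.where(y == 1, pos_weight, 1.0)        # per-example class weights
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))     # current class probabilities
        grad = K @ (w * (y - p)) - lam * (K @ alpha)
        alpha += lr * grad / len(X)
    return alpha

def predict_proba(alpha, Xtrain, Xnew):
    return 1.0 / (1.0 + np.exp(-quad_kernel(Xnew, Xtrain) @ alpha))

# toy data: one "hinge" example per four "intradomain" examples
rng = np.random.default_rng(0)
Xpos = rng.normal([1.0, 1.0], 0.3, size=(10, 2))
Xneg = rng.normal([-1.0, -1.0], 0.3, size=(40, 2))
X = np.vstack([Xpos, Xneg])
y = np.array([1] * 10 + [0] * 40)
alpha = fit_klr(X, y)
```

Replacing `quad_kernel` with a plain dot product recovers the linear KLR model that the quadratic one is reported to outperform.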
Transcription Factor-DNA Binding Via Machine Learning Ensembles
We present ensemble methods in a machine learning (ML) framework combining
predictions from five known motif/binding site exploration algorithms. For a
given TF the ensemble starts with position weight matrices (PWMs) for the
motif, collected from the component algorithms. Using dimension reduction, we
identify significant PWM-based subspaces for analysis. Within each subspace a
machine classifier is built for identifying the TF's gene (promoter) targets
(Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool.
Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string)
feature PWM-based subspaces that stand out in identifying gene targets. We
approach Problem 3 (binding sites) with a novel machine learning approach that
uses promoter string features and ML importance scores in a classification
algorithm locating binding sites across the genome. For target gene
identification this method improves performance (measured by the F1 score) by
about 10 percentage points over the (a) motif scanning method and (b) the
coexpression-based association method. Our top-ranked motif outperformed the
five component algorithms as well as two other common algorithms (BEST and
DEME). For
identifying individual binding sites on a benchmark cross species database
(Tompa et al., 2005) we match the best performer without much human
intervention. It also improves performance on mammalian TFs.
The ensemble can integrate orthogonal information from different weak
learners (potentially using entirely different types of features) into a
machine learner that can perform consistently better for more TFs. The TF gene
target identification component (problem 1 above) is useful in constructing a
transcriptional regulatory network from known TF-target associations. The
ensemble is easily extendable to include more tools as well as future PWM-based
information.
TF2Network: predicting transcription factor regulators and gene regulatory networks in Arabidopsis using publicly available binding site information
A gene regulatory network (GRN) is a collection of regulatory interactions between transcription factors (TFs) and their target genes. GRNs control different biological processes and have been instrumental in understanding the organization and complexity of gene regulation. Although various experimental methods have been used to map GRNs in Arabidopsis thaliana, their limited throughput combined with the large number of TFs means that for many genes our knowledge of regulating TFs is incomplete. We introduce TF2Network, a tool that exploits the vast amount of TF binding site information and enables the delineation of GRNs by detecting potential regulators for a set of co-expressed or functionally related genes. Validation using two experimental benchmarks reveals that TF2Network predicts the correct regulator in 75-92% of the test sets. Furthermore, our tool is robust to noise in the input gene sets, has a low false discovery rate, and recovers correct regulators better than other plant tools. TF2Network is accessible through a web interface where GRNs are interactively visualized and annotated with various types of experimental functional information. TF2Network was used to perform systematic functional and regulatory gene annotations, identifying new TFs involved in circadian rhythm and stress response.
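Detecting potential regulators for a gene set from binding site information is commonly framed as an over-representation test. The sketch below uses a hypergeometric tail probability as an illustration of this kind of test; the abstract does not specify TF2Network's actual statistic, so the formulation is an assumption.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric tail P(X >= k): of N genes in the genome, K carry a
    binding site for a given TF, and an input set of n co-expressed genes
    contains k of them. A small p-value flags the TF as a candidate
    regulator of the set. Illustrative only; the exact test a tool like
    TF2Network applies is not given in the abstract."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

Ranking all TFs by this p-value (with multiple-testing correction) would yield a candidate regulator list for the input gene set.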
Exact and efficient top-K inference for multi-target prediction by querying separable linear relational models
Many complex multi-target prediction problems that concern large target
spaces are characterised by a need for efficient prediction strategies that
avoid the computation of predictions for all targets explicitly. Examples of
such problems emerge in several subfields of machine learning, such as
collaborative filtering, multi-label classification, dyadic prediction and
biological network inference. In this article we analyse efficient and exact
algorithms for computing the top-K predictions in the above problem settings,
using a general class of models that we refer to as separable linear relational
models. We show how to use those inference algorithms, which are modifications
of well-known information retrieval methods, in a variety of machine learning
settings. Furthermore, we study the possibility of scoring items incompletely,
while still retaining an exact top-K retrieval. Experimental results in several
application domains reveal that the so-called threshold algorithm is very
scalable, performing often many orders of magnitude more efficiently than the
naive approach.
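For a separable linear relational model, the score of a query-target pair decomposes as a dot product, which is what makes the threshold algorithm applicable. The sketch below is a minimal version of the classic threshold algorithm for this dot-product case; the data shapes and stopping bookkeeping are illustrative assumptions.

```python
import heapq
import numpy as np

def top_k_threshold(u, V, k):
    """Threshold algorithm for the top-k targets by score s(v) = u . V[v].
    Sorted access walks down per-dimension contribution lists; the threshold
    is the best score any still-unseen target could reach, so scanning can
    stop early without scoring every target."""
    n, d = V.shape
    contrib = u * V                       # per-dimension contributions, n x d
    order = np.argsort(-contrib, axis=0)  # each column: targets by contribution
    scores = V @ u
    seen, heap = set(), []                # heap holds (score, target) of top-k
    for depth in range(n):
        for dim in range(d):
            item = order[depth, dim]
            if item not in seen:
                seen.add(item)
                s = scores[item]          # random access: full score
                if len(heap) < k:
                    heapq.heappush(heap, (s, item))
                elif s > heap[0][0]:
                    heapq.heapreplace(heap, (s, item))
        threshold = sum(contrib[order[depth, dim], dim] for dim in range(d))
        if len(heap) == k and heap[0][0] >= threshold:
            break                         # no unseen target can enter the top-k
    return sorted(heap, reverse=True)

rng = np.random.default_rng(1)
V = rng.normal(size=(200, 5))   # 200 targets, 5 latent dimensions
u = rng.normal(size=5)          # query object
top10 = top_k_threshold(u, V, 10)
```

The early-stopping condition is what makes the algorithm scale: on skewed score distributions it typically touches only a small fraction of the target space.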
A two-step learning approach for solving full and almost full cold start problems in dyadic prediction
Dyadic prediction methods operate on pairs of objects (dyads), aiming to
infer labels for out-of-sample dyads. We consider the full and almost full cold
start problem in dyadic prediction, a setting that occurs when both objects in
an out-of-sample dyad have not been observed during training, or if one of them
has been observed, but very few times. A popular approach for addressing this
problem is to train a model that makes predictions based on a pairwise feature
representation of the dyads, or, in case of kernel methods, based on a tensor
product pairwise kernel. As an alternative to such a kernel approach, we
introduce a novel two-step learning algorithm that borrows ideas from the
fields of pairwise learning and spectral filtering. We show theoretically that
the two-step method is very closely related to the tensor product kernel
approach, and experimentally that it yields a slightly better predictive
performance. Moreover, unlike existing tensor product kernel methods, the
two-step method allows closed-form solutions for training and parameter
selection via cross-validation estimates both in the full and almost full cold
start settings, making the approach much more efficient and straightforward to
implement.
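The closed-form two-step structure described above can be sketched as two successive kernel ridge regressions, one over each side of the dyad. This is a generic formulation suggested by the abstract, not the authors' exact method; kernels and regularisation values below are illustrative.

```python
import numpy as np

def two_step_fit(Ku, Kv, Y, lam_u, lam_v):
    """Two-step kernel ridge regression for dyadic prediction.
    Step 1 smooths the label matrix over the row objects, step 2 over the
    column objects; each step is a closed-form linear solve."""
    A = np.linalg.solve(Ku + lam_u * np.eye(len(Ku)), Y)       # step 1: rows
    A = np.linalg.solve(Kv + lam_v * np.eye(len(Kv)), A.T).T   # step 2: columns
    return A  # dual coefficient matrix

def two_step_predict(ku, kv, A):
    """Score a cold-start dyad: ku and kv are kernel evaluations between the
    new row/column object and the training objects, so the model applies even
    when neither object was seen during training."""
    return ku @ A @ kv

# toy dyadic data: linear kernels over random object features
rng = np.random.default_rng(2)
Xu, Xv = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))
Ku, Kv = Xu @ Xu.T, Xv @ Xv.T
Y = rng.normal(size=(5, 6))
A = two_step_fit(Ku, Kv, Y, 1e-6, 1e-6)
```

Because the two regularisers enter separate solves, cross-validation over `lam_u` and `lam_v` can reuse each factorisation, which is one source of the efficiency the abstract mentions.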
Refining interaction search through signed iterative Random Forests
Advances in supervised learning have enabled accurate prediction in
biological systems governed by complex interactions among biomolecules.
However, state-of-the-art predictive algorithms are typically black-boxes,
learning statistical interactions that are difficult to translate into testable
hypotheses. The iterative Random Forest algorithm took a step towards bridging
this gap by providing a computationally tractable procedure to identify the
stable, high-order feature interactions that drive the predictive accuracy of
Random Forests (RF). Here we refine the interactions identified by iRF to
explicitly map responses as a function of interacting features. Our method,
signed iRF, describes subsets of rules that frequently occur on RF decision
paths. We refer to these rule subsets as signed interactions. Signed
interactions share not only the same set of interacting features but also
exhibit similar thresholding behavior, and thus describe a consistent
functional relationship between interacting features and responses. We describe
stable and predictive importance metrics (SPIMs) to rank signed interactions.
For each SPIM, we define null importance metrics that characterize its expected
behavior
under known structure. We evaluate our proposed approach in biologically
inspired simulations and two case studies: predicting enhancer activity and
spatial gene expression patterns. In the case of enhancer activity, s-iRF
recovers one of the few experimentally validated high-order interactions and
suggests novel enhancer elements where this interaction may be active. In the
case of spatial gene expression patterns, s-iRF recovers all 11 reported links
in the gap gene network. By refining the process of interaction recovery, our
approach has the potential to guide mechanistic inquiry into systems whose
scale and complexity are beyond human comprehension.
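Aggregating decision paths into signed interactions can be sketched as follows. This is a minimal illustration of the bookkeeping only, with the path representation and frequency cutoff as assumptions; it omits the iRF fitting, stability selection, and importance metrics described in the abstract.

```python
from collections import Counter

def signed_interactions(paths, min_frac=0.5):
    """Aggregate decision paths into signed interactions. Each path is a list
    of (feature, sign) pairs: sign '+' if the path follows x_f > threshold
    and '-' otherwise. An interaction is the set of signed features on a
    path; frequent interactions capture consistent thresholding behaviour
    across the forest's decision paths."""
    counts = Counter(frozenset(f"{f}{s}" for f, s in path) for path in paths)
    n = len(paths)
    return {ia: c / n for ia, c in counts.items() if c / n >= min_frac}

# toy forest: the rule {x1 high, x2 low} dominates the decision paths
paths = [[("x1", "+"), ("x2", "-")]] * 3 + [[("x3", "+")]]
frequent = signed_interactions(paths, min_frac=0.5)
```

In practice the paths would be read off a fitted random forest, and the raw frequencies replaced by the stability and importance metrics the abstract defines.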