Applying MDL to Learning Best Model Granularity
The Minimum Description Length (MDL) principle is solidly based on a provably
ideal method of inference using Kolmogorov complexity. We test how the theory
behaves in practice on a general problem in model selection: that of learning
the best model granularity. The performance of a model depends critically on
the granularity, for example the choice of precision of the parameters. Too
high precision generally involves modeling of accidental noise and too low
precision may lead to confusion of models that should be distinguished. This
precision is often determined ad hoc. In MDL the best model is the one that
most compresses a two-part code of the data set: this embodies ``Occam's
Razor.'' In two quite different experimental settings the theoretical value
determined using MDL coincides with the best value found experimentally. In the
first experiment the task is to recognize isolated handwritten characters in
one subject's handwriting, irrespective of size and orientation. Based on a new
modification of elastic matching, using multiple prototypes per character, the
optimal prediction rate is predicted for the learned parameter (length of
sampling interval) considered most likely by MDL, which is shown to coincide
with the best value found experimentally. In the second experiment the task is
to model a robot arm with two degrees of freedom using a three layer
feed-forward neural network where we need to determine the number of nodes in
the hidden layer giving best modeling performance. The optimal model (the one
that extrapolates best on unseen examples) is predicted for the number of nodes
in the hidden layer considered most likely by MDL, which again is found to
coincide with the best value found experimentally.
Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, to appear.
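To make the two-part code idea concrete, here is a minimal Python sketch of MDL-based granularity selection on a toy problem (choosing the number of histogram bins for a sample). It only illustrates the principle of trading model cost against data cost; it is not the paper's elastic-matching or neural-network experiments, and all names are hypothetical.

```python
import numpy as np

def two_part_code_length(data, n_bins):
    """Two-part MDL code length (in bits) for modeling `data` with an
    equal-width histogram of `n_bins` bins: the model cost grows with the
    number of bins (the granularity), the data cost is the negative
    log-likelihood of the data under the histogram density."""
    counts, edges = np.histogram(data, bins=n_bins)
    n, width = len(data), edges[1] - edges[0]
    probs = (counts + 1) / (n + n_bins)           # add-one smoothing
    density = probs / width
    bins = np.clip(np.digitize(data, edges) - 1, 0, n_bins - 1)
    data_bits = -np.sum(np.log2(density[bins]))   # L(data | model)
    model_bits = 0.5 * (n_bins - 1) * np.log2(n)  # L(model): ~log2(n)/2 bits per parameter
    return model_bits + data_bits

# The MDL-preferred granularity is the bin count minimizing total code length;
# too many bins fits accidental noise, too few merges distinguishable structure.
data = np.random.default_rng(0).normal(size=500)
best_bins = min(range(2, 50), key=lambda b: two_part_code_length(data, b))
```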
Solving Multiclass Learning Problems via Error-Correcting Output Codes
Multiclass learning problems involve finding a definition for an unknown
function f(x) whose range is a discrete set containing k > 2 values (i.e., k
``classes''). The definition is acquired by studying collections of training
examples of the form [x_i, f (x_i)]. Existing approaches to multiclass learning
problems include direct application of multiclass algorithms such as the
decision-tree algorithms C4.5 and CART, application of binary concept learning
algorithms to learn individual binary functions for each of the k classes, and
application of binary concept learning algorithms with distributed output
representations. This paper compares these three approaches to a new technique
in which error-correcting codes are employed as a distributed output
representation. We show that these output representations improve the
generalization performance of both C4.5 and backpropagation on a wide range of
multiclass learning tasks. We also demonstrate that this approach is robust
with respect to changes in the size of the training sample, the assignment of
distributed representations to particular classes, and the application of
overfitting avoidance techniques such as decision-tree pruning. Finally, we
show that---like the other methods---the error-correcting code technique can
provide reliable class probability estimates. Taken together, these results
demonstrate that error-correcting output codes provide a general-purpose method
for improving the performance of inductive learning programs on multiclass
problems.
Comment: See http://www.jair.org/ for any accompanying files.
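A minimal sketch of the error-correcting output code idea, assuming scikit-learn's LogisticRegression as the binary base learner (the paper itself uses C4.5 and backpropagation): each class is assigned a binary codeword, one binary classifier is trained per codeword bit, and prediction picks the class whose codeword is nearest in Hamming distance to the predicted bit string.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SimpleECOC:
    """Minimal ECOC classifier: each class gets a row of `code_matrix`
    (its codeword); one binary learner is trained per codeword bit, and a
    test point is assigned to the class whose codeword is nearest in
    Hamming distance to the predicted bit string."""

    def __init__(self, code_matrix, base_learner=LogisticRegression):
        self.code = np.asarray(code_matrix)   # shape: (n_classes, n_bits)
        self.base_learner = base_learner

    def fit(self, X, y):
        y = np.asarray(y)                     # integer labels 0 .. n_classes-1
        self.learners_ = []
        for bit in range(self.code.shape[1]):
            targets = self.code[y, bit]       # relabel examples by this bit
            self.learners_.append(self.base_learner().fit(X, targets))
        return self

    def predict(self, X):
        bits = np.column_stack([clf.predict(X) for clf in self.learners_])
        # Hamming distance from each predicted bit string to every codeword
        dists = (bits[:, None, :] != self.code[None, :, :]).sum(axis=2)
        return dists.argmin(axis=1)

# A 7-bit code for 4 classes: rows are pairwise far apart in Hamming distance,
# and no column is constant, so every binary subproblem is non-trivial.
code = np.array([[0, 0, 0, 0, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1, 1],
                 [0, 1, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 1, 0, 1]])
```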
Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human-derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.
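The stability criterion ("returned in more than half of bootstrap replicates") can be illustrated with a rough Python sketch. It substitutes plain importance-ranked feature pairs for iRF's feature-weighted forests and Random Intersection Trees, so it conveys only the bootstrap stability idea, not the actual algorithm; all names are hypothetical.

```python
import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def stable_pairwise_interactions(X, y, n_boot=30, top_k=5, threshold=0.5, seed=0):
    """On each bootstrap replicate, fit a forest, take the top_k features by
    importance, and record every pair among them; pairs returned in more than
    `threshold` of replicates are reported as stable, with their stability
    score.  X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
        rf = RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx])
        top = np.argsort(rf.feature_importances_)[-top_k:]  # most important features
        counts.update(combinations(sorted(int(i) for i in top), 2))
    return {pair: c / n_boot for pair, c in counts.items() if c / n_boot > threshold}
```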
On the Use of the Dempster Shafer Model in Information Indexing and Retrieval Applications
The Dempster Shafer theory of evidence concerns the elicitation and manipulation
of degrees of belief rendered by multiple sources of evidence to a common
set of propositions. Information indexing and retrieval applications use a variety
of quantitative means - both probabilistic and quasi-probabilistic - to represent
and manipulate relevance numbers and index vectors. Recently, several
proposals were made to use the Dempster Shafer model as a relevance calculus
in such applications. The paper provides a critical review of these proposals,
pointing at several theoretical caveats and suggesting ways to resolve them.
The methodology is based on expounding a canonical indexing model whose
relevance measures and combination mechanisms are shown to be isomorphic
to Shafer's belief functions and to Dempster's rule, respectively. Hence, the
paper has two objectives: (i) to describe and resolve some caveats in the way
the Dempster Shafer theory is applied to information indexing and retrieval,
and (ii) to provide an intuitive interpretation of the Dempster Shafer theory, as
it unfolds in the simple context of a canonical indexing model.
Information Systems Working Papers Series
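For readers unfamiliar with the machinery being reviewed, a small Python sketch of Dempster's rule of combination: two mass functions over subsets of a common frame of discernment are combined by multiplying the masses of intersecting focal elements and renormalizing by the non-conflicting mass. The relevance example at the end is illustrative only and is not taken from the paper.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass over a
    common frame of discernment) with Dempster's rule: multiply masses of
    intersecting focal elements and renormalize by the non-conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y                    # mass lost to contradiction
    if conflict >= 1.0:
        raise ValueError("totally conflicting sources cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Hypothetical example: two index terms lending partial support to relevance.
frame = frozenset({"relevant", "irrelevant"})
m_term1 = {frozenset({"relevant"}): 0.6, frame: 0.4}
m_term2 = {frozenset({"relevant"}): 0.5, frozenset({"irrelevant"}): 0.2, frame: 0.3}
print(dempster_combine(m_term1, m_term2))
```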
Reverse engineering of biological signaling networks via integration of data and knowledge using probabilistic graphical models
Motivation: The postulate that biological molecules act together in intricate networks pioneered systems biology and popularized approaches to reconstruct and understand these networks. Such networks give insight into the underlying biological processes and into diseases involving aberrations in these pathways, such as cancer and neurodegenerative diseases. Networks can be reconstructed by two different approaches, namely data-driven and knowledge-driven methods, which raises the critical question of which to rely on. Relying completely on data-driven approaches brings the risk of overfitting, whereas an entirely knowledge-driven approach yields no new information. This thesis presents a hybrid approach that integrates high-throughput data and biological knowledge to reverse-engineer the structure of biological networks in a probabilistic way, and demonstrates the resulting improvement.
Accomplishments: The current work aims to learn networks from perturbation data. It extends existing Nested Effects Models (NEMs) for pathway reconstruction to use time-course data, allowing direct and indirect effects to be distinguished and feedback loops to be resolved. The thesis also introduces an approach to learn signaling networks from phenotype data in the form of images/movies, widening the scope of NEMs, which had so far been limited to gene expression data. Furthermore, the thesis introduces methodologies to integrate knowledge from different existing sources as a probabilistic prior, which improved the reconstruction accuracy of the networks and made them biologically more plausible. These methods were finally combined for reverse engineering of more accurate and realistic networks.
Conclusion: The thesis added three dimensions to the existing scope of network reverse engineering, especially Nested Effects Models: the use of time-course data, the use of phenotype data, and the incorporation of prior biological knowledge from multiple sources. The approaches developed are applied to understanding signaling in stem cells, cell division, and breast cancer. Furthermore, the integrative approach reconstructs the AMPK/EGFR pathway, which is used to identify potential drug targets in lung cancer that were also validated experimentally, meeting one of the desired goals in systems biology.
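The integration of data and knowledge can be sketched as scoring a candidate network by a data likelihood plus a knowledge-based edge prior. The snippet below is a schematic of that idea, with a hypothetical data_loglik callable standing in for the NEM marginal likelihood; it is not the thesis's actual implementation.

```python
import numpy as np

def network_log_posterior(adj, edge_prior, data_loglik):
    """Score a candidate network structure G by combining a data-driven
    likelihood with a knowledge-driven prior:
        log P(G | D) = log P(D | G) + log P(G) + const.
    `adj` is a 0/1 adjacency matrix, `edge_prior[i, j]` the prior probability
    (from existing knowledge sources) that edge i -> j exists, and
    `data_loglik` is any callable returning log P(D | G), e.g. a nested
    effects model marginal likelihood supplied elsewhere."""
    eps = 1e-12                                    # guard against log(0)
    log_prior = np.sum(adj * np.log(edge_prior + eps)
                       + (1 - adj) * np.log(1 - edge_prior + eps))
    return data_loglik(adj) + log_prior
```

A structure search (greedy edge addition/removal, MCMC, or exhaustive enumeration for small networks) can then maximize this score, so that edges supported by prior knowledge need less data evidence and novel edges require more.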