
    Applying MDL to Learning Best Model Granularity

    The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends critically on the granularity, for example the choice of precision of the parameters. Too high a precision generally involves modeling of accidental noise, and too low a precision may lead to confusion of models that should be distinguished. This precision is often determined ad hoc. In MDL the best model is the one that most compresses a two-part code of the data set: this embodies "Occam's Razor." In two quite different experimental settings the theoretical value determined using MDL coincides with the best value found experimentally. In the first experiment the task is to recognize isolated handwritten characters in one subject's handwriting, irrespective of size and orientation. Based on a new modification of elastic matching, using multiple prototypes per character, the optimal prediction rate is predicted for the learned parameter (length of sampling interval) considered most likely by MDL, which is shown to coincide with the best value found experimentally. In the second experiment the task is to model a robot arm with two degrees of freedom using a three-layer feed-forward neural network, where we need to determine the number of nodes in the hidden layer giving the best modeling performance. The optimal model (the one that extrapolates best on unseen examples) is predicted for the number of nodes in the hidden layer considered most likely by MDL, which again is found to coincide with the best value found experimentally. Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, to appear.
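
    As an illustration of the two-part coding idea, the sketch below (plain Python, with an assumed toy data set and a histogram bin count standing in for the paper's sampling-interval and hidden-layer parameters) picks the granularity that minimizes L(model) + L(data | model); the codelength approximations are standard textbook choices, not the authors' exact scheme.

        # Minimal two-part MDL sketch: choose a histogram granularity (bin count)
        # by minimizing L(model) + L(data | model). Illustrative only -- the
        # paper's experiments tune character sampling intervals and hidden-layer
        # sizes, not histogram bins.
        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.normal(loc=0.0, scale=1.0, size=2000)   # hypothetical data set

        def two_part_codelength(data, n_bins):
            n = len(data)
            counts, _ = np.histogram(data, bins=n_bins)
            probs = counts / n
            # L(data | model): Shannon codelength of the data under the fitted bins.
            nonzero = counts > 0
            data_bits = -np.sum(counts[nonzero] * np.log2(probs[nonzero]))
            # L(model): roughly (k/2) log2(n) bits for k estimated bin frequencies
            # (the usual parametric-complexity approximation).
            model_bits = 0.5 * n_bins * np.log2(n)
            return model_bits + data_bits

        candidates = range(2, 200)
        best_k = min(candidates, key=lambda k: two_part_codelength(data, k))
        print("MDL-selected number of bins:", best_k)

    Too many bins drives up the model term (fitting noise), too few drives up the data term (blurring distinct structure); the minimum of the sum is the MDL trade-off the abstract describes.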

    Solving Multiclass Learning Problems via Error-Correcting Output Codes

    Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k "classes"). The definition is acquired by studying collections of training examples of the form [x_i, f(x_i)]. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems. Comment: See http://www.jair.org/ for any accompanying file
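
    A minimal sketch of the output-coding recipe, with assumed toy data and a logistic-regression base learner standing in for the paper's C4.5 and backpropagation: each class receives a random binary codeword, one binary classifier is trained per code bit, and test points are decoded to the nearest codeword in Hamming distance.

        # Error-correcting output codes, sketched with a logistic-regression
        # base learner. Names such as `code_len` are illustrative choices.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        def fit_ecoc(X, y, code_len=15, seed=0):
            classes = np.unique(y)
            rng = np.random.default_rng(seed)
            # Random binary code matrix: row i is the codeword assigned to class i.
            code = np.empty((len(classes), code_len), dtype=int)
            for bit in range(code_len):
                col = rng.integers(0, 2, size=len(classes))
                while col.min() == col.max():      # reject columns that split nothing
                    col = rng.integers(0, 2, size=len(classes))
                code[:, bit] = col
            # One binary learner per code bit (the paper uses C4.5 / backprop).
            models = []
            for bit in range(code_len):
                target = code[np.searchsorted(classes, y), bit]
                models.append(LogisticRegression(max_iter=1000).fit(X, target))
            return classes, code, models

        def predict_ecoc(X, classes, code, models):
            # Decode the predicted bit string to the nearest codeword in Hamming
            # distance; this redundancy is what corrects individual bit errors.
            bits = np.column_stack([m.predict(X) for m in models])
            dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
            return classes[dists.argmin(axis=1)]

        X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=5, random_state=0)
        classes, code, models = fit_ecoc(X, y)
        print("training accuracy:", (predict_ecoc(X, classes, code, models) == y).mean())

    Because classes are separated by many code bits, a few misfiring binary learners still leave the correct codeword closest, which is the source of the robustness the abstract reports.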

    Iterative Random Forests to detect predictive and stable high-order interactions

    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th- and 6th-order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
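
    The sketch below is a deliberately simplified stand-in for two ingredients of the recipe, under stated assumptions: scikit-learn forests do not expose weighted feature sampling inside splits, so the iterative reweighting is approximated by drawing a feature subset in proportion to the previous round's importances, and the Random Intersection Trees step is replaced by a crude bootstrap stability score for a candidate feature set.

        # Simplified stand-in for the iRF recipe: (1) iterate forests, feeding each
        # round's Gini importances back in as feature weights, and (2) call an
        # interaction "stable" if its features are jointly top-ranked in more than
        # half of the bootstrap replicates. Real iRF samples features with these
        # weights at every split and extracts interactions from decision paths
        # with Random Intersection Trees.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)

        def iterative_importances(X, y, n_iter=3, n_keep=None):
            n_feat = X.shape[1]
            n_keep = n_keep or max(2, n_feat // 2)
            w = np.full(n_feat, 1.0 / n_feat)
            for _ in range(n_iter):
                # Draw a feature subset in proportion to the current weights.
                keep = rng.choice(n_feat, size=n_keep, replace=False, p=w)
                rf = RandomForestClassifier(n_estimators=100, random_state=0)
                rf.fit(X[:, keep], y)
                new_w = np.zeros(n_feat)
                new_w[keep] = rf.feature_importances_
                # Mix in a uniform floor so dropped features can re-enter later rounds.
                w = 0.9 * new_w / new_w.sum() + 0.1 / n_feat
            return w

        def interaction_stability(X, y, feature_set, n_boot=20, top_q=5):
            """Fraction of bootstrap replicates in which every feature of the
            candidate interaction ranks among the top_q importances."""
            hits, n = 0, len(y)
            for b in range(n_boot):
                idx = rng.integers(0, n, size=n)                 # bootstrap replicate
                rf = RandomForestClassifier(n_estimators=100, random_state=b)
                rf.fit(X[idx], y[idx])
                top = set(np.argsort(rf.feature_importances_)[-top_q:])
                hits += set(feature_set) <= top
            return hits / n_boot                                 # "stable" if > 0.5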

    ON THE USE OF THE DEMPSTER SHAFER MODEL IN INFORMATION INDEXING AND RETRIEVAL APPLICATIONS

    The Dempster-Shafer theory of evidence concerns the elicitation and manipulation of degrees of belief rendered by multiple sources of evidence to a common set of propositions. Information indexing and retrieval applications use a variety of quantitative means - both probabilistic and quasi-probabilistic - to represent and manipulate relevance numbers and index vectors. Recently, several proposals were made to use the Dempster-Shafer model as a relevance calculus in such applications. The paper provides a critical review of these proposals, pointing at several theoretical caveats and suggesting ways to resolve them. The methodology is based on expounding a canonical indexing model whose relevance measures and combination mechanisms are shown to be isomorphic to Shafer's belief functions and to Dempster's rule, respectively. Hence, the paper has two objectives: (i) to describe and resolve some caveats in the way the Dempster-Shafer theory is applied to information indexing and retrieval, and (ii) to provide an intuitive interpretation of the Dempster-Shafer theory, as it unfolds in the simple context of a canonical indexing model. Information Systems Working Papers Series
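
    A small worked example of Dempster's rule of combination, the mechanism the paper maps the canonical indexing model onto: two mass functions over a toy frame of documents (names are purely illustrative) are combined by multiplying masses of intersecting focal elements and renormalizing by one minus the conflict.

        # Dempster's rule of combination for two basic probability assignments
        # (mass functions) over a common frame of discernment. The frame and the
        # masses below are toy values, not taken from the paper.
        from itertools import product

        def combine(m1, m2):
            """Combine two mass functions given as {frozenset: mass} dicts."""
            combined, conflict = {}, 0.0
            for (a, wa), (b, wb) in product(m1.items(), m2.items()):
                inter = a & b
                if inter:
                    combined[inter] = combined.get(inter, 0.0) + wa * wb
                else:
                    conflict += wa * wb            # mass assigned to the empty set
            if conflict >= 1.0:
                raise ValueError("total conflict: sources are irreconcilable")
            # Renormalize by 1 - conflict (Dempster's normalization step).
            return {s: w / (1.0 - conflict) for s, w in combined.items()}

        frame = frozenset({"doc1", "doc2", "doc3"})
        # Evidence from two index terms, each committing belief to subsets of documents;
        # the mass on the whole frame is the portion left uncommitted.
        m_term1 = {frozenset({"doc1"}): 0.6, frame: 0.4}
        m_term2 = {frozenset({"doc1", "doc2"}): 0.7, frame: 0.3}

        for subset, mass in combine(m_term1, m_term2).items():
            print(sorted(subset), round(mass, 3))

    In the retrieval reading criticized and refined by the paper, each index term acts as a source of evidence about document relevance, and the combined belief over subsets of documents plays the role of a relevance measure.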

    Reverse engineering of biological signaling networks via integration of data and knowledge using probabilistic graphical models

    Motivation: The postulate that biological molecules act together in intricate networks pioneered systems biology and popularized approaches to reconstruct and understand these networks. Such networks give insight into the underlying biological processes and into diseases, such as cancer and neurodegenerative disorders, that involve aberrations in these pathways. Networks can be reconstructed by two different kinds of approach: data-driven and knowledge-driven methods. This raises the critical question of which to rely on. Relying entirely on data-driven approaches brings the risk of overfitting, whereas a purely knowledge-driven approach yields no new information. This thesis presents a hybrid approach that integrates high-throughput data and biological knowledge to reverse-engineer the structure of biological networks probabilistically, and it demonstrates the resulting improvement.

    Accomplishments: The current work aims to learn networks from perturbation data. It extends the existing Nested Effects Models (NEMs) for pathway reconstruction to use time-course data, allowing the differentiation between direct and indirect effects and the resolution of feedback loops. The thesis also introduces an approach to learn signaling networks from phenotype data in the form of images or movies, widening the scope of NEMs, which had so far been limited to gene expression data. Furthermore, it introduces methodologies to integrate knowledge from different existing sources as a probabilistic prior, which improved the reconstruction accuracy of the networks and made them biologically more plausible. These methods were finally combined for the reverse engineering of more accurate and realistic networks.

    Conclusion: The thesis adds three dimensions to the existing scope of network reverse engineering, in particular Nested Effects Models: the use of time-course data, the use of phenotype data, and the incorporation of prior biological knowledge from multiple sources. The approaches developed are applied to understand signaling in stem cells, cell division, and breast cancer. Furthermore, the integrative approach reconstructs the AMPK/EGFR pathway, which is used to identify potential drug targets in lung cancer that were also validated experimentally, meeting one of the desired goals of systems biology.
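
    A compact sketch of the static Nested Effects Model score that this line of work builds on, with assumed toy data and parameter names: a candidate S-gene graph is scored by the marginal likelihood of binary E-gene effect data from knockdown experiments, marginalizing uniformly over the unknown E-gene attachments, with false-positive rate alpha and false-negative rate beta. The time-course, phenotype, and prior-integration extensions described in the thesis are not shown.

        # Basic Nested Effects Model score: log marginal likelihood of binary
        # effect data D (E-genes x knockdown experiments) under a candidate
        # S-gene graph. This is the static NEM core only.
        import numpy as np
        from scipy.special import logsumexp

        def transitive_closure(adj):
            """Reflexive-transitive closure of a binary adjacency matrix."""
            n = adj.shape[0]
            reach = adj | np.eye(n, dtype=bool)
            for k in range(n):
                reach = reach | (reach[:, [k]] & reach[[k], :])
            return reach

        def nem_log_marginal(adj, D, alpha=0.05, beta=0.1):
            """adj: n_s x n_s S-gene graph; D[i, k] = 1 if E-gene i showed an
            effect when S-gene k was knocked down."""
            phi = transitive_closure(np.asarray(adj, dtype=bool))  # phi[k, j]: j downstream of k
            n_e, n_s = D.shape
            exp_eff = phi.T                                        # exp_eff[j, k]: expected effect if attached to j
            logp1 = np.where(exp_eff, np.log(1 - beta), np.log(alpha))   # log P(effect observed)
            logp0 = np.where(exp_eff, np.log(beta), np.log(1 - alpha))   # log P(no effect observed)
            loglik = D @ logp1.T + (1 - D) @ logp0.T               # (n_e, n_s): E-gene i, attachment j
            # Marginalize over the unknown attachments with a uniform prior.
            return float(np.sum(logsumexp(loglik, axis=1) - np.log(n_s)))

        # Toy example: chain A -> B -> C with six E-genes attached at random.
        true_adj = np.array([[0, 1, 0],
                             [0, 0, 1],
                             [0, 0, 0]])
        rng = np.random.default_rng(1)
        attach = rng.integers(0, 3, size=6)                        # hidden attachment of each E-gene
        phi_true = transitive_closure(true_adj.astype(bool))
        clean = phi_true[:, attach].T.astype(int)                  # noiseless effects (6 x 3)
        noise = rng.random(clean.shape)
        D = np.where(clean == 1, noise > 0.1, noise < 0.05).astype(int)  # add beta / alpha noise

        empty_adj = np.zeros((3, 3), dtype=int)
        print("chain graph score:", nem_log_marginal(true_adj, D))
        print("empty graph score:", nem_log_marginal(empty_adj, D))

    The generating chain graph should typically score higher than the empty graph; searching over candidate graphs with this score, and shaping it with time-course data, phenotype readouts, and probabilistic priors, is the direction the thesis pursues.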