
    The Data Mining OPtimization Ontology

    The Data Mining OPtimization Ontology (DMOP) has been developed to support informed decision-making at various choice points of the data mining process. The ontology can be used by data miners and deployed in ontology-driven information systems. Its primary purpose is to automate algorithm and model selection through semantic meta-mining: an ontology-based meta-analysis of complete data mining processes aimed at extracting patterns associated with mining performance. To this end, DMOP contains detailed descriptions of data mining tasks (e.g., learning, feature selection), data, algorithms, hypotheses such as mined models or patterns, and workflows. DMOP was built following a development methodology that included competency questions and foundational ontology reuse. Several non-trivial modeling problems were encountered, and the complexity of the data mining domain requires the use of the OWL 2 DL profile. DMOP was successfully evaluated for semantic meta-mining and used in constructing the Intelligent Discovery Assistant, deployed in the popular data mining environment RapidMiner.
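
    As a rough illustration of the kind of vocabulary the abstract describes, the sketch below encodes a few DMOP-style concepts (tasks, algorithms, hypotheses, workflows) as RDF triples with rdflib. All IRIs and names here are hypothetical stand-ins, not the actual DMOP vocabulary.

        # A minimal sketch using a hypothetical namespace (not the real DMOP IRIs).
        from rdflib import Graph, Namespace, RDF, RDFS

        DM = Namespace("http://example.org/dmop-sketch#")  # hypothetical namespace
        g = Graph()

        # Core categories named in the abstract: tasks, algorithms, hypotheses, workflows.
        for cls in ("DataMiningTask", "Algorithm", "Hypothesis", "Workflow"):
            g.add((DM[cls], RDF.type, RDFS.Class))

        # Example: feature selection is a kind of data mining task, addressed by
        # a concrete algorithm (the property name is illustrative).
        g.add((DM.FeatureSelectionTask, RDFS.subClassOf, DM.DataMiningTask))
        g.add((DM.ReliefF, RDF.type, DM.Algorithm))
        g.add((DM.ReliefF, DM.addressesTask, DM.FeatureSelectionTask))

        print(g.serialize(format="turtle"))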

    Stability of feature selection algorithms: a study on high-dimensional spaces

    With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences they express, in the form of feature weights/scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
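
    To make the empirical estimation idea concrete, the sketch below draws overlapping subsamples of the training set, runs a feature scorer on each, and averages pairwise similarities of the resulting preferences: Spearman correlation for rankings and Jaccard similarity for top-k subsets. The univariate F-score selector and these two measures are illustrative choices, not necessarily those studied in the paper.

        # A minimal sketch of stability estimation over training-set variations.
        import numpy as np
        from scipy.stats import spearmanr
        from sklearn.feature_selection import f_classif

        def stability_profile(X, y, n_draws=10, subsample=0.9, k=20, seed=0):
            rng = np.random.default_rng(seed)
            n = X.shape[0]
            rankings, subsets = [], []
            for _ in range(n_draws):
                idx = rng.choice(n, size=int(subsample * n), replace=False)
                scores, _ = f_classif(X[idx], y[idx])   # per-feature weights/scores
                order = np.argsort(-scores)
                rankings.append(np.argsort(order))      # rank of each feature, 0 = best
                subsets.append(set(order[:k]))          # selected top-k subset
            rank_sims, jaccards = [], []
            for i in range(n_draws):
                for j in range(i + 1, n_draws):
                    rank_sims.append(spearmanr(rankings[i], rankings[j]).correlation)
                    jaccards.append(len(subsets[i] & subsets[j]) / len(subsets[i] | subsets[j]))
            return np.mean(rank_sims), np.mean(jaccards)   # higher = more stable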

    Data Imbalance in surveillance of nosocomial infections

    An important problem that arises in hospitals is the monitoring and detection of nosocomial or hospital-acquired infections (NIs). This paper describes a retrospective analysis of a prevalence survey of NIs done in the Geneva University Hospital. Our goal is to identify patients with one or more NIs on the basis of clinical and other data collected during the survey. In this classification task, the main difficulty resides in the significant imbalance between positive or infected (11%) and negative (89%) cases. To remedy class imbalance, we propose a novel approach in which both over-sampling of rare positives and under-sampling of the non-infected majority rely on synthetic cases generated via class-specific sub-clustering. Experiments have shown this approach to be remarkably more effective than classical random re-sampling methods.
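
    The sketch below shows one plausible reading of the resampling scheme, under stated assumptions: each class is clustered separately with k-means, and the centroids act as synthetic prototypes, added to the rare positives and substituted for the bulk of the negatives. Binary 0/1 labels and the cluster counts are assumptions for illustration.

        # A minimal sketch of class-specific sub-clustering for resampling.
        import numpy as np
        from sklearn.cluster import KMeans

        def prototype_resample(X, y, minority=1, n_min=30, n_maj=100, seed=0):
            # Assumes binary labels {0, 1} with `minority` the rare class.
            X_min, X_maj = X[y == minority], X[y != minority]
            # Over-sample minority: add its subcluster centroids as synthetic positives.
            min_protos = KMeans(n_clusters=n_min, n_init=10, random_state=seed).fit(X_min).cluster_centers_
            # Under-sample majority: replace it entirely by its subcluster centroids.
            maj_protos = KMeans(n_clusters=n_maj, n_init=10, random_state=seed).fit(X_maj).cluster_centers_
            X_new = np.vstack([X_min, min_protos, maj_protos])
            y_new = np.concatenate([np.full(len(X_min) + n_min, minority),
                                    np.full(n_maj, 1 - minority)])
            return X_new, y_new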

    Using meta-mining to support data mining workflow planning and optimization

    Knowledge Discovery in Databases is a complex process that involves many different data processing and learning operators. Today's Knowledge Discovery Support Systems can contain several hundred operators. A major challenge is to assist the user in designing workflows which are not only valid but also, ideally, optimize some performance measure associated with the user's goal. In this paper we present such a system. The system relies on a meta-mining module which analyses past data mining experiments and extracts meta-mining models that associate dataset characteristics with workflow descriptors with a view to optimizing workflow performance. The meta-mining models are used within a data mining workflow planner to guide the planning process. We learn the meta-mining models using a similarity learning approach, and extract the workflow descriptors by mining the workflows for generalized relational patterns that also account for domain knowledge provided by a data mining ontology. We evaluate the quality of the data mining workflows that the system produces on a collection of real-world biological datasets and show that it produces workflows significantly better than those of alternative methods that can only do workflow selection, not planning.
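
    As a toy analogue of the meta-mining step, the sketch below characterizes a new dataset with a few meta-features, finds the most similar past experiments, and ranks their workflows by observed performance. The meta-features and the plain nearest-neighbour similarity are illustrative stand-ins for the paper's learned similarity and relational workflow patterns.

        # A minimal sketch: recommend workflows from similar past experiments.
        import numpy as np

        def meta_features(X, y):
            # Simple dataset descriptors (illustrative choices).
            return np.array([X.shape[0], X.shape[1], float(np.mean(y)),
                             float(np.mean(np.std(X, axis=0)))])

        def recommend_workflows(new_mf, past_mf, past_workflows, past_scores, top=3):
            # Scale meta-feature distances, then rank the workflows of the
            # nearest past datasets by their recorded performance.
            sd = past_mf.std(axis=0) + 1e-12
            dist = np.linalg.norm((past_mf - new_mf) / sd, axis=1)
            nearest = np.argsort(dist)[:5]
            ranked = sorted(nearest, key=lambda i: -past_scores[i])
            return [past_workflows[i] for i in ranked[:top]]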

    An Architecture for Musical Score Recognition using High-Level Domain Knowledge

    This work proposes an original approach to musical score recognition, a particular case of high-level document analysis. In order to overcome the limitations of existing systems, we propose an architecture which allows for a continuous and bidirectional interaction between high-level knowledge and low-level data, and which is able to improve itself over time by learning. This architecture is made of three cooperating layers: one consisting of parameterized feature detectors, a second serving as an object-oriented knowledge repository, and a third acting as a supervising Bayesian metaprocessor. Although the implementation is still in progress, we show how this architecture is adequate for modeling and processing knowledge.
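
    The skeleton below is one way to read the three-layer organization in code; all class and method names are hypothetical, since the abstract does not specify interfaces.

        # A skeletal sketch of the three cooperating layers (names are illustrative).
        class FeatureDetector:
            """Low-level layer: a parameterized detector over the score image."""
            def __init__(self, **params):
                self.params = params   # tunable, so the system can improve over time
            def detect(self, image):
                ...                    # return candidate graphical primitives

        class KnowledgeRepository:
            """Middle layer: object-oriented store of musical-notation knowledge."""
            def interpret(self, primitives):
                ...                    # map primitives to musical symbols

        class BayesianMetaprocessor:
            """Top layer: supervises the other two, pushing expectations down and
            adjusting detector parameters from recognition outcomes (learning)."""
            def run(self, image, detectors, repository):
                primitives = [p for d in detectors for p in (d.detect(image) or [])]
                return repository.interpret(primitives)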

    Neurosymbolic Integration: Cognitive Grounds and Computational Strategies

    The ultimate---if implicit---goal of artificial intelligence (AI) research is to model the full range of human cognitive capabilities. Symbolic AI and connectionism, the major AI paradigms, have each tried---and failed---to attain this goal. In the meantime, the idea has gained ground that this goal might still be within reach if we could harness the respective strengths of these two paradigms in integrated neurosymbolic models. This paper attempts to lay a cognitive basis for neurosymbolic integration and describes the different strategies that have been adopted to date. Unified approaches strive to attain symbol-processing capabilities using neural network techniques alone, while hybrid approaches blend symbolic and neural models in novel architectures with the hope of gleaning the best of both paradigms. Keywords: connectionism, symbolic AI, neurosymbolic integration, hybrid models, connectionist symbol processing.

    Feature weighting using margin and radius based error bound optimization in SVMs

    The Support Vector Machine error bound is a function of both the margin and the radius of the smallest sphere enclosing the training data. Standard SVM algorithms maximize the margin within a given feature space; the radius is therefore fixed and ignored in the optimization. We propose an extension of the standard SVM optimization in which we also account for the radius, in order to produce an even tighter error bound than we get by controlling the margin alone. We use a second set of parameters, μ, that control the radius, thereby introducing an explicit feature-weighting mechanism into the SVM algorithm. We impose an l1 constraint on μ, which results in a sparse weight vector and thus performs feature selection. The original formulation is not convex; we give a convex approximation and show how to solve it. We experiment with real-world datasets and report very good predictive performance compared to the standard SVM.
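
    For orientation, the radius-margin bound the abstract alludes to has a standard form for hard-margin SVMs, sketched below in LaTeX; the paper's exact bound and weighted formulation may differ in detail.

        % Radius-margin bound (standard form; illustrative).
        \[
          \text{LOO error} \;\le\; \frac{1}{n}\,\frac{R^2}{\gamma^2}
        \]
        % where $\gamma$ is the separating margin and $R$ the radius of the
        % smallest sphere enclosing the training points. Feature weighting
        % enters through a scaled map $x \mapsto \sqrt{\mu}\odot x$ with
        % $\mu \ge 0$ and $\lVert \mu \rVert_1 \le 1$, so that both
        % $R = R(\mu)$ and $\gamma = \gamma(\mu)$ depend on the learned weights.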

    Margin and radius based multiple Kernel Learning

    A serious drawback of kernel methods, and Support Vector Machines (SVMs) in particular, is the difficulty of choosing a suitable kernel function for a given dataset. One approach proposed to address this problem is Multiple Kernel Learning (MKL), in which several kernels are combined adaptively for a given dataset. Many existing MKL methods use the SVM objective function and try to find a linear combination of basic kernels such that the separating margin between the classes is maximized. However, these methods ignore the fact that the theoretical error bound depends not only on the margin, but also on the radius of the smallest sphere that contains all the training instances. We present a novel MKL algorithm that optimizes the error bound taking both the margin and the radius into account. The empirical results show that the proposed method compares favorably with other state-of-the-art MKL methods.
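
    The LaTeX sketch below gives the illustrative shape of such an objective: a linear kernel combination whose radius-margin ratio is minimized. The paper's precise formulation may differ.

        % Margin-and-radius MKL (illustrative form).
        \[
          K_{\mu} \;=\; \sum_{k=1}^{m} \mu_k K_k, \qquad \mu_k \ge 0,
        \]
        \[
          \min_{\mu}\; \frac{R^2(K_{\mu})}{\gamma^2(K_{\mu})}
        \]
        % where $\gamma(K_{\mu})$ is the separating margin and $R(K_{\mu})$ the
        % radius of the smallest sphere enclosing the data in the feature space
        % induced by the combined kernel $K_{\mu}$.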

    Learning from imbalanced data in surveillance of nosocomial infection

    OBJECTIVE: An important problem that arises in hospitals is the monitoring and detection of nosocomial or hospital-acquired infections (NIs). This paper describes a retrospective analysis of a prevalence survey of NIs done in the Geneva University Hospital. Our goal is to identify patients with one or more NIs on the basis of clinical and other data collected during the survey. METHODS AND MATERIAL: Standard surveillance strategies are time-consuming and cannot be applied hospital-wide; alternative methods are required. In NI detection viewed as a classification task, the main difficulty resides in the significant imbalance between positive or infected (11%) and negative (89%) cases. To remedy class imbalance, we explore two distinct avenues: (1) a new re-sampling approach in which both over-sampling of rare positives and under-sampling of the non-infected majority rely on synthetic cases (prototypes) generated via class-specific sub-clustering, and (2) a support vector algorithm in which asymmetrical margins are tuned to improve recognition of rare positive cases. RESULTS AND CONCLUSION: Experiments have shown both approaches to be effective for the NI detection problem. Our novel re-sampling strategies perform remarkably better than classical random re-sampling. However, they are outperformed by asymmetrical soft-margin support vector machines, which attained a sensitivity of 92%, significantly better than the highest sensitivity (87%) obtained via prototype-based re-sampling.
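
    The asymmetric-margin avenue can be approximated with off-the-shelf tools: scikit-learn's SVC exposes per-class misclassification costs via class_weight, which shifts the decision boundary in favor of the rare positives. This is a stand-in sketch, not the paper's exact asymmetric-margin formulation, and the weight below is an assumed value.

        # A minimal sketch: cost-asymmetric SVM favoring the rare positive class.
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        # Penalize errors on the ~11% positive (infected) class roughly 8x more,
        # approximately inverting the class ratio (assumed setting).
        clf = SVC(kernel="rbf", C=1.0, class_weight={1: 8.0, 0: 1.0})

        # Sensitivity (recall on the positive class) could then be estimated, e.g.:
        # scores = cross_val_score(clf, X, y, cv=5, scoring="recall")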