
    Learning vector quantization for proximity data

    Hofmann D. Learning vector quantization for proximity data. Bielefeld: Universität Bielefeld; 2016.
    Prototype-based classifiers such as learning vector quantization (LVQ) often display intuitive and flexible classification and learning rules. However, classical techniques are restricted to vectorial data only and are hence not suited to more complex data structures. Therefore, several extensions of diverse LVQ variants to more general data, characterized only by pairwise similarities or dissimilarities, have recently been proposed in the literature. In this contribution, we propose a novel extension of LVQ to similarity data which is based on the kernelization of an underlying probabilistic model: kernel robust soft LVQ (KRSLVQ). Relying on the notion of a pseudo-Euclidean embedding of proximity data, we put this specific approach as well as existing alternatives into a general framework which characterizes the fundamental possibilities for extending LVQ towards proximity data: the main characteristics are given by the choice of the cost function, the interface to the data in terms of similarities or dissimilarities, and the way in which optimization takes place. The latter strategy in particular highlights the difference between popular kernel approaches and so-called relational approaches.
    While KRSLVQ and its alternatives lead to state-of-the-art results, these extensions have two drawbacks compared to their vectorial counterparts: (i) a quadratic training complexity is encountered due to the dependency of the methods on the full proximity matrix; (ii) prototypes are no longer given by vectors but are represented as an implicit linear combination of data, i.e. the interpretability of the prototypes is lost. We investigate different techniques to deal with these challenges. We consider a speed-up of training by means of a low-rank approximation of the Gram matrix via its Nyström approximation; in benchmarks, this strategy is successful if the considered data are intrinsically low-dimensional, and we propose a quick check to efficiently test this property prior to training. We extend KRSLVQ by sparse approximations of the prototypes: instead of the full coefficient vectors, a few exemplars which represent the prototypes can be inspected directly by practitioners in the same way as data. We compare different paradigms for inferring a sparse approximation: sparsity priors during training, geometric approaches including orthogonal matching pursuit and core techniques, and heuristic approximations based on the coefficients or proximities. We demonstrate the performance of these LVQ techniques on benchmark data, reaching state-of-the-art results. We discuss the behavior of the methods with respect to quality, sparsity, and representativity, and we propose different measures to quantitatively evaluate the performance of the approaches. Our findings were presented in international publication venues, including three journal articles [6, 9, 2], four conference papers [8, 5, 7, 1], and two workshop contributions [4, 3].
    References
    [1] A. Gisbrecht, D. Hofmann, and B. Hammer. Discriminative dimensionality reduction mappings. Advances in Intelligent Data Analysis, 7619:126–138, 2012.
    [2] B. Hammer, D. Hofmann, F.-M. Schleif, and X. Zhu. Learning vector quantization for (dis-)similarities. Neurocomputing, 131:43–51, 2014.
    [3] D. Hofmann. Sparse approximations for kernel robust soft LVQ. Mittweida Workshop on Computational Intelligence, 2013.
    [4] D. Hofmann, A. Gisbrecht, and B. Hammer. Discriminative probabilistic prototype based models in kernel space. New Challenges in Neural Computation, TR Machine Learning Reports, 2012.
    [5] D. Hofmann, A. Gisbrecht, and B. Hammer. Efficient approximations of kernel robust soft LVQ. Workshop on Self-Organizing Maps, 198:183–192, 2012.
    [6] D. Hofmann, A. Gisbrecht, and B. Hammer. Efficient approximations of robust soft learning vector quantization for non-vectorial data. Neurocomputing, 147:96–106, 2015.
    [7] D. Hofmann and B. Hammer. Kernel robust soft learning vector quantization. Artificial Neural Networks in Pattern Recognition, 7477:14–23, 2012.
    [8] D. Hofmann and B. Hammer. Sparse approximations for kernel learning vector quantization. European Symposium on Artificial Neural Networks, 549–554, 2013.
    [9] D. Hofmann, F.-M. Schleif, B. Paaßen, and B. Hammer. Learning interpretable kernelized prototype-based models. Neurocomputing, 141:84–96, 2014.
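    The Nyström speed-up mentioned in the abstract above can be sketched in a few lines. The following is a minimal sketch, not the thesis implementation: the RBF kernel, the synthetic intrinsically low-dimensional data, and all function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.05):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def nystroem_gram(X, n_landmarks=100, gamma=0.05, seed=0):
    """Factors of a rank-m Nystroem approximation K_hat = C @ W_pinv @ C.T.

    Only the n x m block C of the Gram matrix is computed, instead of the
    full n x n matrix that causes the quadratic training cost.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # n x m landmark columns of K
    W_pinv = np.linalg.pinv(C[idx])    # pseudo-inverse of the m x m sub-Gram
    return C, W_pinv

# Synthetic data with intrinsic dimension 2, the regime where the
# thesis reports the Nystroem strategy to be successful.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))

C, W_pinv = nystroem_gram(X)
K_hat = C @ W_pinv @ C.T
K = rbf_kernel(X, X)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))  # small relative error
```

    Training then works with the factors C and W_pinv directly, so the full Gram matrix never needs to be formed.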

    Toward Accident Prevention Through Machine Learning Analysis of Accident Reports

    Occupational safety remains of interest in the construction sector. The frequency of accidents in Sweden has decreased, but only to a level that has remained constant over the last ten years. Although Sweden performs better than other European countries, the construction industry still accounts for a fifth of fatal occupational accidents in Europe. This situation underlines the need to reduce the frequency and fatality of occupational accidents in the construction sector. In the Swedish context, several initiatives have been established for prevention and accident frequency reduction; however, risk analysis models and causal links have been found to be rare in this context.
    The continuous reporting of accidents and near-misses creates large datasets with potentially useful information about accidents and their causes. There has also been increasing research interest in analysing these data through machine learning (ML). State-of-the-art research efforts include applying ML to analyse the textual data within accumulated accident reports, identify contributing factors, and extract accident information. However, solutions created by ML models can lead to changes for a company and the industry, and ML modelling includes prototype development guided by the requirements of the industry and of domain experts. The aim of this thesis is to investigate how ML-based methods and techniques could be used to develop a research-based prototype for occupational accident prevention in a contracting company. The thesis focuses on exploring a development process that bridges the technical side of ML data analysis with the context of safety in a contracting company. The thesis builds on accident causation models (ACMs) and ML methods, employing the Cross-Industry Standard Process for Data Mining (CRISP-DM) to interpret and understand the empirical material of accident reports and interviews within the health and safety (H&S) unit. A sketch of the kind of text analysis involved follows below.
    The results of the thesis show that analysing accident reports via ML can lead to the discovery of knowledge about accidents. However, several challenges were found to hinder the extraction of knowledge and the application of ML, mainly relating to the standardization of the development process and the feasibility of implementation and evaluation. Moreover, the tendency of the ML-related literature to focus on predicting severity was found to be incompatible both with the function of ML analysis and with the findings of the accident causation literature, which considers severity a stochastic element. The analysis further concluded that ACMs seem to have reached a mature stage, where a new approach is needed to understand the rules that govern the relationships between emergent new risks, rather than the systemization of the risks themselves. The analysis of accident reports by ML needs further research into systemized methods for such analysis in the construction domain and in the context of contracting companies, as only a few research efforts have focused on this area with respect to ML evaluation metrics and data pre-processing.
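    To make the kind of ML analysis discussed above concrete, here is a minimal, hypothetical sketch of classifying free-text accident reports into accident categories with TF-IDF features and a linear classifier. The miniature corpus, labels, and category names are invented for illustration and are not taken from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented miniature corpus of accident-report texts and category labels.
reports = [
    "worker slipped on wet scaffold plank and fell two metres",
    "operative fell from an unsecured ladder while fixing cladding",
    "roofer lost footing near an unguarded edge and fell",
    "crane hook detached and the load struck a worker on the shoulder",
    "a falling brick from the third floor struck a worker below",
    "excavator bucket swung and struck a banksman",
]
labels = ["fall", "fall", "fall", "struck-by", "struck-by", "struck-by"]

# TF-IDF turns report texts into sparse word-frequency features;
# logistic regression learns a linear decision rule over them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reports, labels)

print(model.predict(["ladder tipped over and the worker fell"]))  # likely 'fall'
```

    In practice the reports, labels, and evaluation protocol would come from a company's incident database, which is exactly where the thesis locates the standardization, pre-processing, and evaluation challenges.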

    Simple integrative preprocessing preserves what is shared in data sources

    Background: The bioinformatics data analysis toolbox needs general-purpose, fast, and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or not interesting. Principal components analysis of all sources combined together is an obvious choice if it is not important to distinguish between data source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.
    Results: It turns out that the components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses together several sources such that the properties they share are preserved, while source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.
    Conclusion: We introduced a method for the task of data fusion during exploratory data analysis, when statistical dependencies between the sources, and not within a source, are interesting. The method uses canonical correlation analysis in a new way for dimensionality reduction, and it inherits CCA's good properties of being simple, fast, and easily interpretable as a linear projection.
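    As a rough illustration of the idea, the sketch below extracts shared variation from two synthetic sources with CCA and fuses the per-source canonical components. Averaging the components is one simple way to combine them; it stands in for the combination step of drCCA, whose exact weighting is not reproduced here, and all data and variable names are invented.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 2))   # variation common to both sources

# Each source = the shared signal mixed into its own variables + private noise.
X = shared @ rng.normal(size=(2, 10)) + 0.3 * rng.normal(size=(200, 10))
Y = shared @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(200, 8))

cca = CCA(n_components=2)
Xc, Yc = cca.fit_transform(X, Y)     # one set of components per source

# Fuse the per-source components: correlated (shared) directions reinforce
# each other while source-specific noise tends to cancel.
fused = (Xc + Yc) / 2.0
print(fused.shape)                    # (200, 2) fused low-dimensional features
```

    The fused features remain a linear function of the original variables, which is what makes the projection fast to apply and easy to interpret.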