1,508 research outputs found

    Basics of Feature Selection and Statistical Learning for High Energy Physics

    This document introduces the basics of data preparation, feature selection, and statistical learning for high energy physics tasks. The emphasis is on feature selection by principal component analysis, information gain, and significance measures for features. As examples of basic statistical learning algorithms, the maximum a posteriori and maximum likelihood classifiers are shown. Furthermore, a simple rule-based classification is introduced as a means of automated cut finding. Finally, two toolboxes for the application of statistical learning techniques are introduced. Comment: 12 pages, 8 figures. Part of the proceedings of the Track 'Computational Intelligence for HEP Data Analysis' at iCSC 200

    Understanding Slow Feature Analysis: A Mathematical Framework

    Slow feature analysis is an algorithm for unsupervised learning of invariant representations from data with temporal correlations. Here, we present a mathematical analysis of slow feature analysis for the case where the input-output functions are not restricted in complexity. We show that the optimal functions obey a partial differential eigenvalue problem of a type that is common in theoretical physics. This analogy allows the transfer of mathematical techniques and intuitions from physics to concrete applications of slow feature analysis, thereby providing the means for analytical predictions and a better understanding of simulation results. We put particular emphasis on the situation where the input data are generated from a set of statistically independent sources. The dependence of the optimal functions on the sources is calculated analytically for the cases where the sources have Gaussian or uniform distribution.

    Porting concepts from DNNs back to GMMs

    Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show enough complementarity to allow effective system combination.

    Decorrelation using Optimal Transport

    Decorrelating a feature space from protected attributes is an area of active research in ethics, fairness, and the natural sciences. We introduce a novel decorrelation method using Convex Neural Optimal Transport Solvers (Cnots) that is able to decorrelate a continuous feature space against protected attributes with optimal transport. We demonstrate how well it performs in the context of jet classification in high energy physics, where classifier scores are desired to be decorrelated from the mass of a jet. The decorrelation achieved in binary classification approaches the levels achieved by the state of the art using conditional normalising flows. When moving to multiclass outputs, the optimal transport approach performs significantly better than the state of the art, suggesting substantial gains in decorrelating multidimensional feature spaces.

    On the spatial modelling of mixed and constrained geospatial data

    Spatial uncertainty modelling and prediction of a set of regionalized dependent variables from various sample spaces (e.g. continuous and categorical) is a common challenge for geoscience modellers and for many geoscience applications, such as evaluation of mineral resources, characterization of oil reservoirs, or hydrology of groundwater. To account for the complex statistical and spatial relationships, categorical data such as rock types, soil types, alteration units, and continental crustal blocks should be modelled jointly with other continuous attributes (e.g. porosity, permeability, seismic velocity, mineral and geochemical compositions, or pollutant concentration). These multivariate geospatial data normally have complex statistical and spatial relationships which should be honoured in the predicted models. Continuous variables in the form of percentages, proportions, frequencies, and concentrations are compositional, which means they are non-negative values representing parts of a whole. Such data carry only relative information, and the constant-sum constraint forces at least one covariance to be negative and induces spurious statistical and spatial correlations. As a result, classical (geo)statistical techniques should not be applied to the original compositional data. Several geostatistical techniques have been developed recently for the spatial modelling of compositional data. However, few of these consider the joint statistical and/or spatial relationships of regionalized compositional data with the other dependent categorical information. This PhD thesis explores and introduces approaches to spatial modelling of regionalized compositional and categorical data. The first proposed approach is in the multiple-point geostatistics framework, where the direct sampling algorithm is developed for joint simulation of compositional and categorical data. The second proposed method is based on two-point geostatistics and is useful for situations where a large and representative training image is not available or is difficult to build. Approaches to geostatistical simulation of regionalized compositions consisting of several populations are explored and investigated. The multi-population characteristic is usually related to a dependent categorical variable (e.g. rock type, soil type, or land use). Finally, a hybrid predictive model based on advanced geostatistical simulation techniques for compositional data and on machine learning is introduced. Such a hybrid model has the ability to rank and select features internally, which is useful for geoscience process discovery analysis. The proposed techniques were evaluated via several case studies, and the results supported their usefulness and applicability.

    Data compression in remote sensing applications

    A survey of current data compression techniques used to reduce the amount of data in remote sensing applications is provided. The survey is far from complete, reflecting the substantial activity in this area. Its purpose is to exemplify the different approaches being taken rather than to provide an exhaustive list of the proposed approaches.

    Conditional Random Fields for Integrating Local Discriminative Classifiers
