
    Functional approach for excess mass estimation in the density model

    We consider a multivariate density model in which we estimate the excess mass of the unknown probability density f at a given level ν > 0 from n i.i.d. observed random variables. This problem has several applications, such as multimodality testing, density contour clustering, anomaly detection, and classification. For the first time in the literature, we estimate the excess mass as an integrated functional of the unknown density f. We suggest an estimator and evaluate its rate of convergence, when f belongs to general Besov smoothness classes, for several risk measures. Particular care is devoted to the implementation and numerical study of the procedure, which turns out to improve on the plug-in estimator of the excess mass.
    Comment: Published in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/), http://dx.doi.org/10.1214/07-EJS079, by the Institute of Mathematical Statistics (http://www.imstat.org).
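
    The quantity being estimated is the excess mass at level ν, E(ν) = ∫ max(f(x) − ν, 0) dx. As a point of comparison for the paper's integrated-functional approach, here is a minimal one-dimensional sketch of the plug-in baseline it improves on: estimate f by a kernel density estimate, then integrate the positive part over a grid. The bandwidth, grid, and integration range are illustrative assumptions, not the paper's tuning.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

def plugin_excess_mass(sample, nu, grid_size=2048):
    """Plug-in estimate of E(nu) = integral of max(f(x) - nu, 0) dx (1-d case)."""
    kde = gaussian_kde(sample)                       # kernel estimate f_hat of f
    lo, hi = sample.min() - 3.0, sample.max() + 3.0  # crude integration range
    grid = np.linspace(lo, hi, grid_size)
    f_hat = kde(grid)
    return trapezoid(np.clip(f_hat - nu, 0.0, None), grid)

# Toy usage: bimodal sample, excess mass at level nu = 0.1.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
print(plugin_excess_mass(x, nu=0.1))
```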

    Sloshing in the LNG shipping industry: risk modelling through multivariate heavy-tail analysis

    In the liquefied natural gas (LNG) shipping industry, the phenomenon of sloshing can lead to the occurrence of very high pressures in the tanks of the vessel. Modelling or estimating the probability of the simultaneous occurrence of such extreme pressures is now crucial from the risk assessment point of view. In this paper, heavy-tail modelling, widely used as a conservative approach to risk assessment and corresponding to a worst-case risk analysis, is applied to the study of sloshing. Multivariate heavy-tailed distributions are considered, with sloshing pressures investigated by means of small-scale replica tanks instrumented with d > 1 sensors. When attempting to fit such nonparametric statistical models, one naturally faces computational issues inherent in the curse of dimensionality. The primary purpose of this article is to overcome this barrier by introducing a novel methodology. For d-dimensional heavy-tailed distributions, the structure of extremal dependence is entirely characterised by the angular measure, a positive measure on the intersection of a sphere with the positive orthant in R^d. As d increases, the mutual extremal dependence between variables becomes difficult to assess. Based on a spectral clustering approach, we show here how a low-dimensional approximation to the angular measure may be found. The nonparametric method proposed for modelling sloshing has been successfully applied to pressure data. The parsimonious representation thus obtained proves very convenient for the simulation of multivariate heavy-tailed distributions, allowing for the implementation of Monte-Carlo simulation schemes to estimate the probability of failure. Besides confirming its performance on artificial data, the methodology has been implemented on a real data set specifically collected for the risk assessment of sloshing in the LNG shipping industry.
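
    To make the pipeline concrete, the sketch below follows the generic recipe the abstract outlines, not the paper's exact procedure: standardize margins to unit Pareto via ranks, keep the observations with the largest radii, project them onto the simplex to obtain angular components, and run spectral clustering on those angles. The threshold quantile, cluster count, and toy data are all assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def angular_extremes(X, quantile=0.95):
    """Angular components of the most extreme observations of X (all coords > 0)."""
    n, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    pareto = 1.0 / (1.0 - ranks / (n + 1.0))       # standard Pareto margins
    radius = pareto.sum(axis=1)                     # L1 norm as the radius
    keep = radius > np.quantile(radius, quantile)   # keep the largest radii
    return pareto[keep] / radius[keep, None]        # projection onto the simplex

rng = np.random.default_rng(1)
X = rng.pareto(2.0, size=(5000, 4))                 # toy heavy-tailed data
angles = angular_extremes(X)
labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                            random_state=0).fit_predict(angles)
print(np.bincount(labels))                          # cluster sizes among extremes
```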

    Statistical learning for wind power: a modeling and stability study towards forecasting

    We focus on wind power modeling using machine learning techniques. We show, on real data provided by the wind energy company Maïa Eolis, that parametric models, even those closely following the physical equation relating wind production to wind speed, are outperformed by intelligent learning algorithms. In particular, the CART-Bagging algorithm gives very stable and promising results. Besides, as a step towards forecasting, we quantify the impact of using degraded wind measurements on performance. We also show on this application that the default methodology for selecting a subset of predictors provided in the standard random forest package can be refined, especially when one variable among the predictors has a major impact.
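
    As a rough illustration of the CART-Bagging approach singled out in the abstract, the sketch below bags CART regression trees on a synthetic wind-to-power data set. The idealized cubic power curve and the feature set are assumptions, since the Maïa Eolis data are not public.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
speed = rng.uniform(0, 25, 2000)            # wind speed (m/s)
direction = rng.uniform(0, 360, 2000)       # wind direction (degrees)
X = np.column_stack([speed, direction])
# Idealized power curve: cubic below a rated speed of 12 m/s, capped above.
y = np.clip(speed, 0, 12) ** 3 / 12 ** 3 + rng.normal(0, 0.05, 2000)

# CART-Bagging: bootstrap-aggregated CART regression trees.
cart_bagging = BaggingRegressor(estimator=DecisionTreeRegressor(),
                                n_estimators=100, random_state=0)
print(cross_val_score(cart_bagging, X, y, cv=5, scoring="r2").mean())
```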

    A clusterwise supervised learning procedure based on aggregation of distances

    Nowadays, many machine learning procedures are available off the shelf and can easily be used to calibrate predictive models on supervised data. However, when the input data consist of more than one unknown cluster, and when different underlying predictive models exist, fitting a model is a more challenging task. In this paper, we propose a three-step procedure, called KFC, to solve this problem automatically by aggregating different models adaptively on the data. The first step aims to capture the clustering structure of the input data, which may be characterized by several statistical distributions; it provides several partitions, one for each assumption on the distributions. For each partition, the second step fits a specific predictive model based on the data in each cluster. The overall model is then computed by a consensual aggregation of the models corresponding to the different partitions. A comparison of performances on simulated and real data demonstrates the excellent behaviour of our method in a large variety of prediction problems.
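
    Below is a hedged sketch of the three-step shape described above: cluster the inputs under several distributional assumptions, fit one model per cluster, and aggregate across partitions. The actual KFC distribution families, per-cluster learners, and consensus rule may differ from the Gaussian mixtures, linear models, and plain averaging assumed here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

def kfc_fit_predict(X, y, X_new, cluster_counts=(2, 3, 4)):
    preds = []
    for k in cluster_counts:                        # step 1: one partition per assumption
        gm = GaussianMixture(n_components=k, random_state=0).fit(X)
        labels = gm.predict(X)
        fallback = LinearRegression().fit(X, y)     # guard against empty clusters
        models = {c: LinearRegression().fit(X[labels == c], y[labels == c])
                  for c in np.unique(labels)}       # step 2: one model per cluster
        new_labels = gm.predict(X_new)
        preds.append(np.array([models.get(c, fallback).predict(x[None, :])[0]
                               for c, x in zip(new_labels, X_new)]))
    return np.mean(preds, axis=0)                   # step 3: consensual aggregation (mean)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (200, 1)), rng.normal(2, 1, (200, 1))])
y = np.where(X[:, 0] < 0, 2 * X[:, 0], -X[:, 0]) + rng.normal(0, 0.1, 400)
print(kfc_fit_predict(X, y, np.array([[-2.0], [2.0]])))
```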

    Grouping Strategies and Thresholding for High Dimensional Linear Models

    The estimation problem in a high-dimensional regression model with structured sparsity is investigated. An algorithm using a two-step block thresholding procedure called GR-LOL is provided. Convergence rates are derived: they depend on simple coherence-type indices of the Gram matrix, easily checkable on the data, as well as on sparsity assumptions on the model parameters, measured by a combination of ℓ1 within-block and ℓq (q < 1) between-block norms. The simplicity of the coherence indicator suggests ways to optimize the rates of convergence when the group structure is not naturally given by the problem and is unknown. In such a case, an auto-driven procedure is provided to determine the regressor groups (their number and contents). An intensive practical study compares our grouping methods with the standard LOL algorithm. We show that grouping rarely deteriorates the results but can improve them very significantly. GR-LOL is also compared with group-Lasso procedures and exhibits a very encouraging behaviour. The results are quite impressive, especially when the GR-LOL algorithm is combined with a grouping pre-processing step.
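
    To fix ideas, here is a generic two-step block-thresholding rule in the spirit of the description above: select blocks whose ℓ2 norm is large, then hard-threshold individual coefficients inside the surviving blocks. The naive correlation-based coefficient estimate, the known group structure, and both thresholds are illustrative assumptions, not GR-LOL's calibrated choices.

```python
import numpy as np

def block_threshold(beta_hat, blocks, block_thresh, coord_thresh):
    """Keep blocks with large l2 norm, then hard-threshold coordinates inside."""
    out = np.zeros_like(beta_hat)
    for idx in blocks:
        b = beta_hat[idx]
        if np.linalg.norm(b) > block_thresh:                        # step 1: between blocks
            out[idx] = np.where(np.abs(b) > coord_thresh, b, 0.0)   # step 2: within blocks
    return out

rng = np.random.default_rng(4)
n, p = 500, 60
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 3.0                        # one active block of 5 coefficients
y = X @ beta + rng.normal(size=n)
beta_hat = X.T @ y / n                                    # naive correlation-based estimate
blocks = [np.arange(i, i + 5) for i in range(0, p, 5)]    # group structure assumed known
print(np.nonzero(block_threshold(beta_hat, blocks, 2.0, 0.5))[0])
```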

    To tree or not to tree? Assessing the impact of smoothing the decision boundaries

    When analyzing a dataset, it can be useful to assess how smooth the decision boundaries need to be for a model to fit the data well. This paper addresses the question by quantifying how much the 'rigid' decision boundaries, produced by an algorithm that naturally finds such solutions, should be relaxed to obtain a performance improvement. The approach we propose starts with the rigid decision boundaries of a seed Decision Tree (seed DT), which is used to initialize a Neural DT (NDT). The initial boundaries are challenged by relaxing them progressively while training the NDT. During this process, we measure the NDT's performance and its decision agreement with the seed DT. We show how these two measures can help the user determine how expressive the model should be, before exploring it further via model selection. The validity of our approach is demonstrated with experiments on simulated and benchmark datasets.
    Comment: 12 pages, 3 figures, 3 tables. arXiv admin note: text overlap with arXiv:2006.1145
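
    The sketch below is a one-gate caricature of the smoothing idea, not the paper's NDT architecture: the root split of a depth-1 seed CART tree is copied into a sigmoid gate, relaxed by a few gradient steps on the logistic loss, and the relaxed decision is then compared with the seed tree on agreement and accuracy. The temperature, toy data, and training schedule are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.3 * np.sin(3 * X[:, 1]) > 0).astype(int)  # mildly curved boundary

seed_dt = DecisionTreeClassifier(max_depth=1).fit(X, y)
feat = seed_dt.tree_.feature[0]       # root split feature of the seed tree
thresh = seed_dt.tree_.threshold[0]   # root split threshold of the seed tree

# Relax the hard split into a sigmoid gate on w.x + b, initialized as a
# soft copy of the seed split, then train it with logistic-loss gradients.
temperature = 0.1                     # smaller = closer to the hard split
w = np.zeros(2); w[feat] = 1.0 / temperature; b = -thresh / temperature
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)   # gradient of the logistic loss in w
    b -= 0.5 * np.mean(p - y)             # gradient of the logistic loss in b
relaxed = ((X @ w + b) > 0).astype(int)

print("agreement with seed DT:", np.mean(relaxed == seed_dt.predict(X)))
print("accuracy: seed", seed_dt.score(X, y), "relaxed", np.mean(relaxed == y))
```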

    A Meta-Generation framework for Industrial System Generation

    Generative design is an increasingly important tool in the industrial world. It allows designers and engineers to easily explore vast ranges of design options, providing a cheaper and faster alternative to trial-and-error approaches. Thanks to the flexibility they offer, Deep Generative Models are gaining popularity among generative design technologies. However, developing and evaluating these models can be challenging: the field lacks accessible benchmarks with which to objectively evaluate and compare different Deep Generative Model architectures. Moreover, vanilla Deep Generative Models appear unable to accurately generate multi-component industrial systems that are controlled by latent design constraints. To address these challenges, we propose an industry-inspired use case that incorporates actual industrial system characteristics; it can be generated quickly and used as a benchmark. We also propose a Meta-VAE capable of producing multi-component industrial systems and showcase its application on the proposed use case.
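
    For concreteness, here is a minimal vanilla-VAE sketch in PyTorch of the kind of building block the abstract discusses; the Meta-VAE itself coordinates several component generators and is more elaborate. All dimensions, the stand-in data, and the training loop are illustrative assumptions.

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=8, z_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU())
        self.mu = nn.Linear(32, z_dim)       # encoder mean head
        self.logvar = nn.Linear(32, z_dim)   # encoder log-variance head
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

x = torch.randn(256, 8)              # stand-in for flattened system descriptors
vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for _ in range(200):
    recon, mu, logvar = vae(x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = nn.functional.mse_loss(recon, x) + kl   # ELBO with a Gaussian decoder
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    print(vae.dec(torch.randn(4, 2)))  # decode fresh samples from the prior
```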