
    Modelling high-dimensional categorical data using nonconvex fusion penalties

    We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved by applying the minimax concave penalty to differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case of a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package, CatReg, implementing SCOPE for linear models, along with a version for logistic regression, is available on CRAN.
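The penalty described in the abstract above can be illustrated with a minimal sketch. It uses the standard form of the minimax concave penalty (MCP) and applies it to the gaps between consecutive sorted coefficients of one categorical variable. The function names `mcp` and `fusion_penalty` and the parameter names `lam` and `gamma` are assumptions made for illustration; this is not the CatReg API, and it evaluates the penalty only, not the paper's exact minimisation algorithm.

```python
def mcp(x, lam, gamma):
    """Standard minimax concave penalty evaluated at a scalar x.

    lam (> 0) sets the penalty level and gamma (> 0) sets how quickly the
    penalty flattens out; both names are illustrative assumptions.
    """
    ax = abs(x)
    if ax <= gamma * lam:
        # Concave region: lasso-like slope near zero that tapers off.
        return lam * ax - ax * ax / (2.0 * gamma)
    # Flat region: large gaps incur a constant cost.
    return 0.5 * gamma * lam * lam


def fusion_penalty(coefs, lam, gamma):
    """Sum of MCP values over gaps between consecutive sorted coefficients.

    Sorting yields the order statistics of the level coefficients; levels
    with equal coefficients (fused levels) contribute zero.
    """
    ordered = sorted(coefs)
    return sum(mcp(b - a, lam, gamma) for a, b in zip(ordered, ordered[1:]))
```

Under these assumptions, a coefficient vector whose levels are all fused incurs no penalty, while a well-separated level incurs the same capped cost however far away it lies, which is why the penalty encourages exact clustering of coefficients without heavily shrinking distinct clusters toward each other.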

    Understanding the timing of eruption end using a machine learning approach to classification of seismic time series

    The timing and processes that govern the end of volcanic eruptions are not yet fully understood, and there currently exists no systematic definition for the end of a volcanic eruption. At present, the end of an eruption is established either by generic criteria (typically 90 days after the end of visual signals of eruption) or by criteria specific to a given volcano. We explore the application of supervised machine learning classification methods (Support Vector Machine, Logistic Regression, Random Forest and Gaussian Process Classifiers) and define a decisiveness index D to evaluate the consistency of the classifications obtained by these models. We apply these methods to seismic time series from two volcanoes chosen because they display contrasting styles of eruption: Telica (Nicaragua) and Nevado del Ruiz (Colombia). We find that, for both volcanic systems, the end-date we obtain by classification of seismic data is 2–4 months later than end-dates defined by the last occurrence of visual eruption (such as ash emission). This finding is in agreement with previous, general definitions of eruption end and is consistent across models. Our classifications of eruptive activity correspond more closely with visual activity than with database records of eruption start and end. We analyze the relative importance of the different features of seismic activity used in our models (e.g. peak event amplitude, daily event counts) and find little consistency between the two volcanic systems in terms of the most important features that determine whether activity is eruptive or non-eruptive. These initial results look promising, and our approach may offer a robust tool to help determine when an eruption has ended in the absence of visual confirmation.