5 research outputs found
Modelling high-dimensional categorical data using nonconvex fusion penalties
We propose a method for estimation in high-dimensional linear models with
nominal categorical data. Our estimator, called SCOPE, fuses levels together by
making their corresponding coefficients exactly equal. This is achieved using
the minimax concave penalty on differences between the order statistics of the
coefficients for a categorical variable, thereby clustering the coefficients.
We provide an algorithm for exact and efficient computation of the global
minimum of the resulting nonconvex objective in the case with a single variable
with potentially many levels, and use this within a block coordinate descent
procedure in the multivariate case. We show that an oracle least squares
solution that exploits the unknown level fusions is a limit point of the
coordinate descent with high probability, provided the true levels have a
certain minimum separation; these conditions are known to be minimal in the
univariate case. We demonstrate the favourable performance of SCOPE across a
range of real and simulated datasets. An R package CatReg implementing SCOPE
for linear models and also a version for logistic regression is available on
CRAN.
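
To make the penalty described in this abstract concrete, the following is a minimal sketch, not the CatReg implementation. It evaluates a minimax concave penalty (MCP) on the gaps between consecutive sorted coefficients of one categorical variable. The function names, the tuning parameters `lam` and `gamma`, and the standard MCP parameterisation used here are assumptions for illustration.

```python
def mcp(x, lam, gamma):
    """Standard MCP evaluated at a non-negative gap x (assumed parameterisation)."""
    if x <= gamma * lam:
        # Concave quadratic part: grows like lam * x near zero.
        return lam * x - x * x / (2.0 * gamma)
    # Flat part: large gaps incur a constant penalty.
    return 0.5 * gamma * lam * lam


def scope_style_penalty(coefs, lam, gamma):
    """Sum of MCP terms over differences between consecutive order statistics."""
    ordered = sorted(coefs)
    return sum(mcp(b - a, lam, gamma) for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Levels fused into two well-separated clusters pay for a single gap only.
    print(scope_style_penalty([0.0, 0.0, 5.0, 5.0], lam=1.0, gamma=3.0))
```

Because the penalty is flat beyond `gamma * lam`, well-separated clusters are not shrunk towards each other, while zero gaps (fused levels) cost nothing.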
High-dimensional regression with potential prior information on variable importance
There are a variety of settings where vague prior information may be
available on the importance of predictors in high-dimensional regression
settings. Examples include ordering on the variables offered by their empirical
variances (which is typically discarded through standardisation), the lag of
predictors when fitting autoregressive models in time series settings, or the
level of missingness of the variables. Whilst such orderings may not match the
true importance of variables, we argue that there is little to be lost, and
potentially much to be gained, by using them. We propose a simple scheme
involving fitting a sequence of models indicated by the ordering. We show that
the computational cost for fitting all models when ridge regression is used is
no more than for a single fit of ridge regression, and describe a strategy for
Lasso regression that makes use of previous fits to greatly speed up fitting
the entire sequence of models. We propose to select a final estimator by
cross-validation and provide a general result on the quality of the best
performing estimator on a test set selected from among a number M of
competing estimators in a high-dimensional linear regression setting. Our
result requires no sparsity assumptions and shows that only a log M price is
incurred compared to the unknown best estimator. We demonstrate the
effectiveness of our approach when applied to missing or corrupted data and to
time series settings. An R package is available on GitHub.
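
The scheme of fitting a sequence of models indicated by an ordering can be illustrated with a naive sketch. It is not the authors' algorithm: it refits ridge regression from scratch for every nested model rather than sharing computation, and it uses a single hold-out split in place of cross-validation. The function names, the choice of ordering by decreasing empirical variance, the absence of an intercept (centred data assumed) and the fixed `lam` are all assumptions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept; data assumed centred)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def nested_ridge_by_variance(X_train, y_train, X_val, y_val, lam=1.0):
    """Fit ridge on the first k variables of the ordering, for every k,
    and return (validation error, selected columns, coefficients) of the best."""
    # Ordering: highest empirical variance first (an assumed direction).
    order = np.argsort(-X_train.var(axis=0))
    best = None
    for k in range(1, X_train.shape[1] + 1):
        cols = order[:k]
        beta = ridge_fit(X_train[:, cols], y_train, lam)
        err = float(np.mean((y_val - X_val[:, cols] @ beta) ** 2))
        if best is None or err < best[0]:
            best = (err, cols, beta)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: columns with larger scale carry the signal.
    X = rng.normal(size=(200, 6)) * np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.2])
    y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
    err, cols, beta = nested_ridge_by_variance(X[:100], y[:100], X[100:], y[100:])
    print(len(cols), err)
```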
Bouveret Syndrome: A Systematic Review of Endoscopic Therapy and a Novel Predictive Tool to Aid in Management.
BACKGROUND AND GOALS: Bouveret syndrome is characterized by gastroduodenal obstruction caused by an impacted gallstone. Current literature recommends endoscopic therapy as the first line of intervention despite significantly lower success rates compared with surgery. The lack of treatment efficacy studies and the paucity of clinical guidelines contribute to current practices being arbitrary. The aim of this systematic review was to identify factors that predict outcomes of endoscopic therapy. Subsequently, a predictive tool was devised to predict the success of endoscopic therapy and recommendations were proposed to improve current management strategies of impacted gallstones in the upper gastrointestinal tract. METHODS: A systematic search of PubMed, Medline, Cochrane, and Scopus was performed for articles that contained the terms "Bouveret syndrome," "Bouveret's syndrome," "gallstone" AND "gastric obstruction" and "gallstone" AND "duodenal obstruction" that were published between January 1, 1950 and April 15, 2018. Articles were reviewed by 3 reviewers and raw data collated. χ² and Kolmogorov-Smirnov tests were used to test associations between predictors and endoscopic outcomes. A logistic regression model was then used to create a predictive tool which was cross-validated. RESULTS: Failure of endoscopic therapy is associated with increasing gallstone length (P<0.0001) and impaction in the distal duodenum (P<0.05). Using multiple endoscopic modalities is associated with better success rates (P<0.05). The novel predictive tool predicted success of endoscopic therapy with an area under the receiver operating characteristic score of 0.86 (95% confidence interval: 0.79-0.94). CONCLUSION: In Bouveret syndrome, a selective approach to endoscopic therapy can expedite definitive treatment and improve current management strategies.
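
As a reading aid for the reported area under the receiver operating characteristic curve, the sketch below computes that quantity for any set of binary outcomes and predicted scores. It is a generic illustration, not the study's predictive tool. The function name `roc_auc` and the toy inputs are assumptions, and the numbers are not patient data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen success (label 1)
    receives a higher score than a randomly chosen failure (label 0);
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy scores for illustration only.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to scores that rank successes and failures no better than chance, and 1.0 to a perfect ranking.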
Understanding the timing of eruption end using a machine learning approach to classification of seismic time series
The timing and processes that govern the end of volcanic eruptions are not yet fully understood, and there currently exists no systematic definition for the end of a volcanic eruption. Currently, the end of an eruption is established either by generic criteria (typically 90 days after the end of visual signals of eruption) or criteria specific to a given volcano. We explore the application of supervised machine learning classification methods: Support Vector Machine, Logistic Regression, Random Forest and Gaussian Process Classifiers and define a decisiveness index D to evaluate the consistency of the classifications obtained by these models. We apply these methods to seismic time series from two volcanoes chosen because they display contrasting styles of eruption: Telica (Nicaragua) and Nevado del Ruiz (Colombia). We find that, for both volcanic systems, the end-date we obtain by classification of seismic data is 2–4 months later than end-dates defined by the last occurrence of visual eruption (such as ash emission). This finding is in agreement with previous, general definitions of eruption end and is consistent across models. Our classifications have a higher correspondence of eruptive activity with visual activity than with database records of eruption start and end. We analyze the relative importance of the different features of seismic activity used in our models (e.g. peak event amplitude, daily event counts) and find little consistency between the two volcanic systems in terms of the most important features which determine whether activity is eruptive or non-eruptive. These initial results look promising and our approach may offer a robust tool to help determine when an eruption has ended in the absence of visual confirmation.