5 research outputs found

Modelling high-dimensional categorical data using nonconvex fusion penalties

Author: Shah RD
Stokell BG
Tibshirani RJ
Publication venue: Journal of the Royal Statistical Society. Series B: Statistical Methodology
Publication date: 28/07/2021

We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN
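The penalty described in this abstract can be illustrated with a minimal sketch. It assumes the standard form of the minimax concave penalty (MCP) with tuning parameters `lam` and `gamma`; the function names are hypothetical and are not taken from the CatReg package. It only evaluates the penalty for one categorical variable and does not implement the paper's optimisation algorithm.

```python
def mcp(x, lam, gamma):
    """Standard MCP value at |x|: quadratic near zero, constant beyond gamma * lam."""
    x = abs(x)
    if x <= gamma * lam:
        return lam * x - x * x / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def fusion_penalty(coefs, lam, gamma):
    """Sum of MCP values over gaps between consecutive order statistics
    of the coefficients of a single categorical variable."""
    ordered = sorted(coefs)
    return sum(mcp(b - a, lam, gamma) for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Illustrative coefficients only: two tight clusters versus the same
    # clusters with levels fused exactly.
    print(fusion_penalty([0.0, 0.1, 5.0, 5.1], lam=1.0, gamma=3.0))
    print(fusion_penalty([0.0, 0.0, 5.0, 5.0], lam=1.0, gamma=3.0))
```

Because the MCP is flat for large gaps, well-separated clusters pay a bounded cost, while small gaps within a cluster are penalised and can be driven to exactly zero, which is the fusing behaviour the abstract describes.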

arXiv.org e-Print Archive

Apollo (Cambridge)

High-dimensional regression with potential prior information on variable importance

Author: Shah RD
Stokell BG
Publication venue: Statistics and Computing
Publication date: 29/06/2022

There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on GitHub
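The scheme of fitting a sequence of models indicated by an ordering and then picking one on held-out data can be sketched as follows. This is a naive illustration under stated assumptions: each nested ridge model is refitted from scratch (it does not reproduce the efficient computation the abstract refers to), a single hold-out split stands in for cross-validation, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge(X, y, cols, lam):
    """Ridge coefficients (no intercept) using only the columns in `cols`."""
    G = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
          for b in cols] for a in cols]
    r = [sum(row[a] * yi for row, yi in zip(X, y)) for a in cols]
    return solve(G, r)


def mse(X, y, cols, beta):
    """Mean squared prediction error of the model restricted to `cols`."""
    preds = [sum(row[c] * bc for c, bc in zip(cols, beta)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def select_along_ordering(X_tr, y_tr, X_val, y_val, ordering, lam=1.0):
    """Fit the nested models ordering[:1], ordering[:2], ... and return
    (validation error, columns, coefficients) of the best one."""
    best = None
    for m in range(1, len(ordering) + 1):
        cols = ordering[:m]
        beta = ridge(X_tr, y_tr, cols, lam)
        err = mse(X_val, y_val, cols, beta)
        if best is None or err < best[0]:
            best = (err, cols, beta)
    return best
```

With $M$ candidate models in the sequence, the selection step is a minimum over $M$ validation errors, which is the setting in which the abstract's $\log M$ price applies.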

Apollo (Cambridge)

Bouveret Syndrome: A Systematic Review of Endoscopic Therapy and a Novel Predictive Tool to Aid in Management.

Author: Al-Naeeb Y
Lucarelli P
Ong J
Ong S
Rouhani FJ
Shankar A
Stokell BG
Swift C
Publication venue: J Clin Gastroenterol
Publication date: 01/04/2019

BACKGROUND AND GOALS: Bouveret syndrome is characterized by gastroduodenal obstruction caused by an impacted gallstone. Current literature recommends endoscopic therapy as the first line of intervention despite significantly lower success rates compared with surgery. The lack of treatment efficacy studies and the paucity of clinical guidelines contribute to current practices being arbitrary. The aim of this systematic review was to identify factors that predict outcomes of endoscopic therapy. Subsequently, a predictive tool was devised to predict the success of endoscopic therapy and recommendations were proposed to improve current management strategies of impacted gallstones in the upper gastrointestinal tract. METHODS: A systematic search of PubMed, Medline, Cochrane, and Scopus was performed for articles that contained the terms "Bouveret syndrome," "Bouveret's syndrome," "gallstone" AND "gastric obstruction" and "gallstone" AND "duodenal obstruction" that were published between January 1, 1950 and April 15, 2018. Articles were reviewed by 3 reviewers and raw data collated. χ² and Kolmogorov-Smirnov tests were used to test associations between predictors and endoscopic outcomes. A logistic regression model was then used to create a predictive tool which was cross-validated. RESULTS: Failure of endoscopic therapy is associated with increasing gallstone length (P<0.0001) and impaction in the distal duodenum (P<0.05). Using multiple endoscopic modalities is associated with better success rates (P<0.05). The novel predictive tool predicted success of endoscopic therapy with an area under the receiver operating characteristic score of 0.86 (95% confidence interval: 0.79-0.94). CONCLUSION: In Bouveret syndrome, a selective approach to endoscopic therapy can expedite definitive treatment and improve current management strategies
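The area under the receiver operating characteristic curve reported in this abstract can be read as the probability that a randomly chosen success receives a higher predicted score than a randomly chosen failure. The sketch below computes that quantity directly from pairs. The function name is hypothetical, and the scores in the usage example are toy values for illustration only; they are not data from the study.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy predicted probabilities and outcomes (1 = success, 0 = failure).
    print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to scores that carry no ranking information, and 1.0 to scores that separate the two outcome groups perfectly.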

Apollo (Cambridge)

CUED - Cambridge University Engineering Department

Understanding the timing of eruption end using a machine learning approach to classification of seismic time series

Author: Clifton D
Londoño JM
Manley G
Mather T
Pyle D
Rodgers M
Roman DC
Stokell BG
Thompson G
Publication venue: Elsevier BV
Publication date: 01/01/2020

The timing and processes that govern the end of volcanic eruptions are not yet fully understood, and there currently exists no systematic definition for the end of a volcanic eruption. Currently, end of eruption is established either by generic criteria (typically 90 days after the end of visual signals of eruption) or criteria specific to a given volcano. We explore the application of supervised machine learning classification methods: Support Vector Machine, Logistic Regression, Random Forest and Gaussian Process Classifiers and define a decisiveness index D to evaluate the consistency of the classifications obtained by these models. We apply these methods to seismic time series from two volcanoes chosen because they display contrasting styles of eruption: Telica (Nicaragua) and Nevado del Ruiz (Colombia). We find that, for both volcanic systems, the end-date we obtain by classification of seismic data is 2–4 months later than end-dates defined by the last occurrence of visual eruption (such as ash emission). This finding is in agreement with previous, general definitions of eruption end and is consistent across models. Our classifications have a higher correspondence of eruptive activity with visual activity than with database records of eruption start and end. We analyze the relative importance of the different features of seismic activity used in our models (e.g. peak event amplitude, daily event counts) and find little consistency between the two volcanic systems in terms of the most important features which determine whether activity is eruptive or non-eruptive. These initial results look promising and our approach may offer a robust tool to help determine when an eruption has ended in the absence of visual confirmation

Oxford University Research Archive

Overexpression of citrate operon in Herbaspirillum seropedicae Z67 enhances organic acid secretion, mineral phosphate solubilization and growth promotion of Oryza sativa

Author: A Förster
A Peix
A Unge
A Valverde
A Vikram
AA Belimov
AB Buch
AD Buch
AJ Wolfe
B Boesten
BG Hall
BN Ames
C Kumar
CHSG Meneses
CSL Vicente
DJ Stokell
E Elkoca
EK James
F Rojo
FL Olivares
FO Pedrosa
G Archana
G. Archana
G. Naresh Kumar
GL Peterson
J Sambrook
J Wagh
JI Baldani
Jitendra Wagh
JK Ladha
JS Lolkema
K Matsuno
K Walsh
LDB Roncato-Maccari
M Aoshima
M Dijkstra
M Papagianni
MS Khan
OB Weber
P Gyaneshwar
PA Serre
PH Viollier
Praveena Bhandari
R Chabot
RI Pikovskaya
S Anastassiadis
S Srivastava
SJ Park
Sonal Shah
SS Pal
TSD Radwan
U Sauer
VLD Baldani
VN Tiwari
W Fang
Publication venue: Springer Science and Business Media LLC

Crossref