Generalized additive and index models with shape constraints
We study generalised additive models, with shape restrictions (e.g.
monotonicity, convexity, concavity) imposed on each component of the additive
prediction function. We show that this framework facilitates a nonparametric
estimator of each additive component, obtained by maximising the likelihood.
The procedure is free of tuning parameters and under mild conditions is proved
to be uniformly consistent on compact intervals. More generally, our
methodology can be applied to generalised additive index models. Here again,
the procedure can be justified on theoretical grounds and, like the original
algorithm, possesses highly competitive finite-sample performance. Practical
utility is illustrated through the use of these methods in the analysis of two
real datasets. Our algorithms are publicly available in the \texttt{R} package
\textbf{scar}, short for \textbf{s}hape-\textbf{c}onstrained \textbf{a}dditive
\textbf{r}egression.
Both authors are supported by the second author’s Engineering and Physical Sciences Research Council Fellowship EP/J017213/1. This is the final version of the article. It first appeared from Wiley via http://dx.doi.org/10.1111/rssb.1213
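In the monotone Gaussian case, maximising the likelihood for a single component reduces to isotonic least squares, which the classical pool-adjacent-violators algorithm (PAVA) solves exactly. The sketch below is only this building block, not the \texttt{scar} implementation (which handles several components, other shape constraints and general link functions).

```python
# Pool-adjacent-violators algorithm (PAVA): the exact minimiser of
# sum_i (y_i - f_i)^2 over non-decreasing sequences f. A one-component,
# Gaussian-likelihood sketch, not the scar package itself.

def pava(y):
    blocks = []  # each block is [running sum, count]
    for v in y:
        blocks.append([float(v), 1])
        # merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

print(pava([3, 1, 2, 4]))  # → [2.0, 2.0, 2.0, 4.0]
```

For a full additive fit one would cycle such cone projections over the components (backfitting on partial residuals), with convexity or concavity handled by analogous projections.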
Variable selection with error control: Another look at stability selection
Stability Selection was recently introduced by Meinshausen and Bühlmann
(2010) as a very general technique designed to improve the performance of a
variable selection algorithm. It is based on aggregating the results of
applying a selection procedure to subsamples of the data. We introduce a
variant, called Complementary Pairs Stability Selection (CPSS), and derive
bounds both on the expected number of variables included by CPSS that have low
selection probability under the original procedure, and on the expected number
of high selection probability variables that are excluded. These results
require no assumptions (e.g. exchangeability) on the underlying model or on the
quality of the original selection procedure. Under reasonable shape
restrictions, the bounds can be further tightened, yielding improved error
control, and therefore increasing the applicability of the methodology.
This is the accepted manuscript version. The final published version is available from Wiley at http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.01034.x/abstract
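The subsampling scheme itself is easy to sketch: draw B random splits of {1, ..., n} into complementary halves, run the base selector on each half, and keep the variables whose empirical selection frequency clears a threshold tau. Everything below beyond that scheme (the toy selector interface, the default values) is illustrative, not the authors' code.

```python
import random

def cpss(n, p, select, B=50, tau=0.6, seed=0):
    """Complementary Pairs Stability Selection (sketch).

    select(idx) -> set of selected variable indices when the base
    procedure is run on the subsample with row indices idx.
    """
    rng = random.Random(seed)
    counts = [0] * p
    for _ in range(B):
        idx = list(range(n))
        rng.shuffle(idx)
        half = n // 2
        for part in (idx[:half], idx[half:]):  # a complementary pair
            for j in select(part):
                counts[j] += 1
    freq = [c / (2 * B) for c in counts]  # empirical selection frequency
    return {j for j in range(p) if freq[j] >= tau}, freq
```

Here `select` can be any base procedure, e.g. the lasso fitted to that subsample; the error-control bounds of the paper concern how often CPSS retains variables that the base procedure itself rarely selects.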
Nonparametric independence testing via mutual information
We propose a test of independence of two multivariate random vectors, given a
sample from the underlying population. Our approach, which we call MINT, is
based on the estimation of mutual information, whose decomposition into joint
and marginal entropies facilitates the use of recently-developed efficient
entropy estimators derived from nearest neighbour distances. The proposed
critical values, which may be obtained from simulation (in the case where one
marginal is known) or resampling, guarantee that the test has nominal size, and
we provide local power analyses, uniformly over classes of densities whose
mutual information satisfies a lower bound. Our ideas may be extended to
provide new goodness-of-fit tests of normal linear models based on assessing
the independence of our vector of covariates and an appropriately-defined
notion of an error vector. The theory is supported by numerical studies on both
simulated and real data.
EPSRC; Leverhulme Trust; SIMS fun
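A stripped-down version of the MINT recipe: estimate I(X;Y) = H(X) + H(Y) - H(X,Y) with the Kozachenko-Leonenko 1-nearest-neighbour entropy estimator and calibrate by permuting Y. The authors' estimators and critical values are considerably more refined (e.g. weighted, efficient versions); treat this as an illustration of the entropy decomposition, with all defaults made up.

```python
import math, random

def _digamma(n):
    # exact for positive integers: psi(n) = -gamma + H_{n-1}
    return -0.5772156649015329 + sum(1.0 / k for k in range(1, n))

def kl_entropy(pts):
    """Kozachenko-Leonenko 1-NN entropy estimate; pts is a list of tuples."""
    n, d = len(pts), len(pts[0])
    # log volume of the d-dimensional unit ball
    log_vd = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    s = 0.0
    for i, pt in enumerate(pts):
        rho = min(math.dist(pt, q) for j, q in enumerate(pts) if j != i)
        s += d * math.log(max(rho, 1e-12))
    return _digamma(n) - _digamma(1) + log_vd + s / n

def mint_pvalue(x, y, n_perm=99, seed=0):
    """Permutation p-value for H0: X independent of Y."""
    rng = random.Random(seed)
    def mi(xs, ys):
        joint = [a + b for a, b in zip(xs, ys)]  # concatenate coordinates
        return kl_entropy(xs) + kl_entropy(ys) - kl_entropy(joint)
    obs = mi(x, y)
    yy, hits = list(y), 0
    for _ in range(n_perm):
        rng.shuffle(yy)
        if mi(x, yy) >= obs:
            hits += 1
    return (1 + hits) / (1 + n_perm)
```

With strongly dependent samples the estimated mutual information of the observed pairing dominates that of the permuted pairings, so the p-value is small.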
Sparse principal component analysis via axis-aligned random projections
We introduce a new method for sparse principal component analysis, based on
the aggregation of eigenvector information from carefully-selected axis-aligned
random projections of the sample covariance matrix. Unlike most alternative
approaches, our algorithm is non-iterative, so is not vulnerable to a bad
choice of initialisation. We provide theoretical guarantees under which our
principal subspace estimator can attain the minimax optimal rate of convergence
in polynomial time. In addition, our theory provides a more refined
understanding of the statistical and computational trade-off in the problem of
sparse principal component estimation, revealing a subtle interplay between the
effective sample size and the number of random projections that are required to
achieve the minimax optimal rate. Numerical studies provide further insight
into the procedure and confirm its highly competitive finite-sample
performance.
The research of the first and third authors was supported by an Engineering and Physical Sciences Research Council (EPSRC) grant EP/N014588/1 for the Centre for Mathematical and Statistical Analysis of Multimodal Clinical Imaging. The second and third authors were supported by EPSRC Fellowships EP/J017213/1 and EP/P031447/1, and grant RG81761 from the Leverhulme Trust.
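The core idea in miniature: restrict the sample covariance matrix to many random axis-aligned coordinate subsets, keep the restrictions with the largest leading eigenvalue, aggregate the absolute entries of their leading eigenvectors into a per-coordinate score, and re-estimate on the highest-scoring coordinates. The selection rule and all defaults below are simplifications of the paper's algorithm, not a faithful implementation.

```python
import numpy as np

def sparse_pc(S, k, n_proj=200, n_keep=20, seed=0):
    """Sketch of sparse PCA via axis-aligned random projections.

    S: p x p sample covariance matrix; k: target sparsity level.
    """
    rng = np.random.default_rng(seed)
    p = S.shape[0]
    projections = []
    for _ in range(n_proj):
        A = rng.choice(p, size=k, replace=False)  # random axis-aligned projection
        w, V = np.linalg.eigh(S[np.ix_(A, A)])
        projections.append((w[-1], A, V[:, -1]))  # leading eigenpair of restriction
    projections.sort(key=lambda t: -t[0])
    score = np.zeros(p)
    for lam, A, v in projections[:n_keep]:        # carefully-selected projections
        score[A] += np.abs(v)                     # aggregate eigenvector information
    support = np.argsort(score)[-k:]              # estimated sparse support
    w, V = np.linalg.eigh(S[np.ix_(support, support)])
    u = np.zeros(p)
    u[support] = V[:, -1]
    return u
```

Note the procedure is non-iterative: no initial estimate of the principal component is ever required.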
The conditional permutation test for independence while controlling for confounders
We propose a general new method, the conditional permutation test, for
testing the conditional independence of variables $X$ and $Y$ given a
potentially high-dimensional random vector $Z$ that may contain confounding
factors. The proposed test permutes entries of $X$ non-uniformly, so as to
respect the existing dependence between $X$ and $Z$ and thus account for the
presence of these confounders. Like the conditional randomization test of
Cand\`es et al. (2018), our test relies on the availability of an approximation
to the distribution of $X$ given $Z$. While the test of Cand\`es et al. (2018)
uses this estimate to draw new $X$ values, for our test we use this
approximation to design an appropriate non-uniform distribution on permutations
of the $X$ values already seen in the true data. We provide an efficient Markov
chain Monte Carlo sampler for the implementation of our method, and establish
bounds on the Type I error in terms of the error in the approximation of the
conditional distribution of $X$ given $Z$, finding that, for the worst-case
test statistic, the inflation in Type I error of the conditional permutation
test is no larger than that of the conditional randomization test. We validate
these theoretical results with experiments on simulated data and on the Capital
Bikeshare data set.
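A minimal sketch under a strong simplifying assumption: the conditional law of X given Z = z is taken to be exactly N(z, 1), whereas in practice it must be estimated. Permutations are sampled with a simple Metropolis chain over pairwise swaps whose acceptance ratio involves only the assumed conditional density; the paper's sampler and theory are more sophisticated, and `abscorr` below is just one illustrative test statistic.

```python
import math, random

def cpt_pvalue(x, y, z, stat, n_perm=100, n_swaps=200, seed=0):
    """Conditional permutation test sketch, assuming X | Z=z ~ N(z, 1)."""
    rng = random.Random(seed)
    n = len(x)
    def logdens(xi, zi):          # log N(zi, 1) density, up to a constant
        return -0.5 * (xi - zi) ** 2
    obs = stat(x, y)
    perm = list(range(n))         # perm[i] = index of the x value at position i
    hits = 0
    for _ in range(n_perm):
        for _ in range(n_swaps):  # Metropolis over transpositions
            i, j = rng.randrange(n), rng.randrange(n)
            cur = logdens(x[perm[i]], z[i]) + logdens(x[perm[j]], z[j])
            new = logdens(x[perm[j]], z[i]) + logdens(x[perm[i]], z[j])
            if math.log(rng.random() + 1e-300) < new - cur:
                perm[i], perm[j] = perm[j], perm[i]
        xp = [x[m] for m in perm]
        if stat(xp, y) >= obs:
            hits += 1
    return (1 + hits) / (1 + n_perm)

def abscorr(a, b):
    """|Pearson correlation|, used here as the test statistic."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    ca = [u - ma for u in a]
    cb = [u - mb for u in b]
    num = sum(u * v for u, v in zip(ca, cb))
    den = (sum(u * u for u in ca) * sum(v * v for v in cb)) ** 0.5
    return abs(num / den)
```

Because the swaps favour exchanging x values between positions with similar z, the permuted copies retain the x-z dependence, so any residual x-y association beyond what z explains drives the p-value down.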
High-dimensional change point estimation via sparse projection
Changepoints are a very common feature of Big Data that arrive in the form of a data stream. In this paper, we study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates in order to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called 'inspect' for estimation of the changepoints: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimisation problem derived from the CUSUM transformation of the time series. We then apply an existing univariate changepoint estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated changepoints and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data generating mechanisms. Software implementing the methodology is available in the R package 'InspectChangepoint'.
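For a single changepoint the pipeline can be caricatured in a few lines: form the CUSUM transformation of the p x n data matrix, take a leading left singular vector as the projection direction, and locate the peak of the projected CUSUM series. The paper obtains the direction from a convex relaxation of this SVD step with a sparsity-inducing penalty, and handles multiple changepoints recursively; both refinements are omitted in this sketch.

```python
import numpy as np

def cusum_transform(X):
    """CUSUM transformation of a p x n matrix; returns a p x (n-1) matrix."""
    p, n = X.shape
    cs = np.cumsum(X, axis=1)
    total = cs[:, -1:]
    t = np.arange(1, n)
    left = cs[:, :-1] / t                    # mean of the first t observations
    right = (total - cs[:, :-1]) / (n - t)   # mean of the remaining n - t
    return np.sqrt(t * (n - t) / n) * (left - right)

def inspect_single(X):
    """Single-changepoint sketch of 'inspect' (no convex relaxation)."""
    T = cusum_transform(X)
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    proj = np.abs(U[:, 0] @ T)     # project onto leading left singular vector
    return int(np.argmax(proj)) + 1  # last time index of the first segment
```

Projecting first lets the procedure aggregate many small per-coordinate shifts into one detectable univariate change.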
Isotonic regression in general dimensions
We study the least squares regression function estimator over the class of real-valued functions on $[0,1]^d$ that are increasing in each coordinate. For uniformly bounded signals and with a fixed, cubic lattice design, we establish that the estimator achieves the minimax rate of order $n^{-\min\{2/(d+2),\,1/d\}}$ in the empirical $L_2$ loss, up to poly-logarithmic factors. Further, we prove a sharp oracle inequality, which reveals in particular that when the true regression function is piecewise constant on $k$ hyperrectangles, the least squares estimator enjoys a faster, adaptive rate of convergence of $(k/n)^{\min\{1,\,2/d\}}$, again up to poly-logarithmic factors. Previous results are confined to the case $d \leq 2$. Finally, we establish corresponding bounds (which are new even in the case $d = 2$) in the more challenging random design setting. There are two surprising features of these results: first, they demonstrate that it is possible for a global empirical risk minimisation procedure to be rate optimal up to poly-logarithmic factors even when the corresponding entropy integral for the function class diverges rapidly; second, they indicate that the adaptation rate for shape-constrained estimators can be strictly worse than the parametric rate.
The research of the first author is supported in part by NSF Grant DMS-1566514. The research of the second and fourth authors is supported by EPSRC Fellowship EP/J017213/1 and a grant from the Leverhulme Trust RG81761.
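In $d = 2$ the least squares isotonic fit (the projection onto matrices nondecreasing along both rows and columns) can be computed, in the limit, by Dykstra's alternating projections between the two single-direction monotone cones, each handled by the pool-adjacent-violators algorithm. This is a standard computational route, unrelated to the paper's theory, and the iteration count below is arbitrary.

```python
import numpy as np

def pava(y):
    """1-D isotonic least squares via pool-adjacent-violators."""
    blocks = []  # [running sum, count] per block
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def _rows_iso(M):
    return np.array([pava(row) for row in M])

def iso2d(Y, iters=100):
    """Bimonotone least squares fit via Dykstra's alternating projections."""
    Y = np.asarray(Y, dtype=float)
    x, p, q = Y.copy(), np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        y1 = _rows_iso(x + p)        # project onto 'rows nondecreasing'
        p = x + p - y1               # Dykstra correction for the first cone
        x = _rows_iso((y1 + q).T).T  # project onto 'columns nondecreasing'
        q = y1 + q - x               # correction for the second cone
    return x
```

For the completely antitonic input [[2, 1], [1, 0]] the projection onto the bimonotone cone is the constant matrix of ones, which the iteration reaches immediately.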
Comments on: High-dimensional simultaneous inference with the bootstrap
We congratulate the authors on their stimulating contribution to the burgeoning high-dimensional inference literature. The bootstrap offers such an attractive methodology in these settings, but it is well-known that its naive application in the context of shrinkage/superefficiency is fraught with danger (e.g. Samworth, 2003; Chatterjee and Lahiri, 2011). The authors show how these perils can be elegantly sidestepped by working with de-biased, or de-sparsified, versions of estimators.
EPSRC (EP/J017213/1); Leverhulme Trust (PLP-2014-353)
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, an approach known as an ensemble method, can give a substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forests and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forests and support vector machines.
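A toy sketch of the ESkNN recipe: train kNN classifiers on random feature subsets, rank them by held-out accuracy, then grow the ensemble greedily from the best model, keeping a member only if it does not hurt majority-vote accuracy. For brevity the same validation set is reused for both the ranking and the combination step, unlike the two-stage data usage described above; the tiny kNN implementation and all sizes are illustrative.

```python
import random

def knn_predict(train, labels, feats, x, k=3):
    """Majority vote among the k nearest training points over features feats."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((train[i][f] - x[f]) ** 2 for f in feats))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def esknn(train, ytr, val, yval, p, n_models=30, m=2, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):  # kNN classifiers on random feature subsets
        feats = rng.sample(range(p), m)
        acc = sum(knn_predict(train, ytr, feats, x) == y
                  for x, y in zip(val, yval)) / len(val)
        models.append((acc, feats))
    models.sort(key=lambda t: -t[0])   # rank by individual accuracy
    ensemble, best_acc = [], 0.0
    for acc, feats in models:          # greedy sequential combination
        trial = ensemble + [feats]
        preds = []
        for x in val:
            votes = [knn_predict(train, ytr, f, x) for f in trial]
            preds.append(max(set(votes), key=votes.count))
        trial_acc = sum(pr == y for pr, y in zip(preds, yval)) / len(yval)
        if trial_acc >= best_acc:      # keep a member only if it does not hurt
            ensemble, best_acc = trial, trial_acc
    return ensemble, best_acc
```

Subsets that miss the informative features score poorly and are filtered out, which is the mechanism by which the ensemble resists non-informative features.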
A multiple myeloma classification system that associates normal B-cell subset phenotypes with prognosis.
Despite the recent progress in treatment of multiple myeloma (MM), it is still an incurable malignant disease, and we are therefore in need of new risk stratification tools that can help us to understand the disease and optimize therapy. Here we propose a new subtyping of myeloma plasma cells (PCs) from diagnostic samples, assigned by normal B-cell subset associated gene signatures (BAGS). For this purpose, we combined fluorescence-activated cell sorting and gene expression profiles from normal bone marrow (BM) Pre-BI, Pre-BII, immature, naïve, memory, and PC subsets to generate BAGS for assignment of normal BM subtypes in diagnostic samples. The impact of the subtypes was analyzed in 8 available data sets from 1772 patients' myeloma PC samples. The resulting tumor assignments in available clinical data sets exhibited similar BAGS subtype frequencies in 4 cohorts from de novo MM patients across 1296 individual cases. The BAGS subtypes were significantly associated with progression-free and overall survival in a meta-analysis of 916 patients from 3 prospective clinical trials. The major impact was observed within the Pre-BII and memory subtypes, which had a significantly inferior prognosis compared with other subtypes. A multiple Cox proportional hazard analysis documented that BAGS subtypes added significant, independent prognostic information to the translocations and cyclin D classification. BAGS subtype analysis of patient cases identified transcriptional differences, including a number of differentially spliced genes. We identified subtype differences in myeloma at diagnosis, with prognostic impact and predictive potential, supporting an acquired B-cell trait and phenotypic plasticity as a pathogenetic hallmark of MM.