Banking the unbanked: the Mzansi intervention in South Africa
Purpose
This paper aims to understand households' latent behavioural decision making in accessing financial services. In this analysis we look at the determinants of consumers' choice of the pre-entry Mzansi account in South Africa.
Design/methodology/approach
We use 102 variables, grouped into the following categories: basic literacy, understanding financial terms, targets for financial advice, desired financial education and financial perception. Employing a computationally efficient variable selection algorithm, we study which variables can satisfactorily explain the choice of a Mzansi account.
Findings
The Mzansi intervention is appealing to individuals with basic but insufficient financial education. Aspirations appear to be very influential in revealing the choice of financial services, and to this end Mzansi is perceived as a pre-entry account that does not meet the aspirations of individuals aiming to climb the financial services ladder. We find that Mzansi holders view the account mainly as a vehicle for receiving payments, but on the other hand are debt-averse and inclined to save. Hence, although there is at present no concrete evidence that the Mzansi intervention increases access to finance via diversification (i.e. by recruiting customers into higher-level accounts and services), our analysis shows that this is very likely to be the case.
Originality/value
The issue of demand-side constraints on access to finance has been largely ignored in the theoretical and empirical literature. This paper undertakes some preliminary steps in addressing this gap.
Computationally Efficient Methods for High-Dimensional Statistical Problems
With the ever-increasing amount of computational power available, the horizon of statistical problems that can be tackled broadens accordingly. However, many practitioners have only an ordinary personal computer on which to do their work. The need for computationally efficient methodology is as pressing as ever, and some questions still lack a confident answer for a practitioner working under tight computational constraints. This thesis develops methods for three such problems. The first, introductory, chapter provides an overview of the area and an accessible preamble to the problems these methods address.
In the second chapter we address the problem of modelling a high-dimensional linear regression with categorical predictor variables.
The natural sparsity assumption in this setting is on the number of unique values the coefficients within each categorical variable can take. With this assumption, we introduce a new form of penalty function for tackling this problem. While the number of possible combinations of levels can grow extremely fast with the number of levels, the unique structure of the method enables fast optimisation for this problem. A novel and intricate dynamic programming algorithm computes the exact global optimum over each variable, and is embedded within a block coordinate descent algorithm. This allows such models to be fitted quickly on a laptop computer in a memory-efficient manner. The scaling requirements sufficient for this method to recover the correct groups cannot be relaxed for any estimator; this strong performance is validated by a range of experiments using both simulated and real data.
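As a loose illustration of the kind of per-variable subproblem solved exactly, the sketch below (hypothetical name `fuse_levels`, not the thesis's penalty or algorithm) assumes an orthogonal design so that fusing the levels of one categorical variable into k coefficient values reduces to optimally clustering the per-level means, which a dynamic programme over sorted levels solves exactly.

```python
import numpy as np

def fuse_levels(level_means, level_counts, k):
    """Exact DP for a simplified one-variable subproblem: choose k unique
    coefficient values for the levels of a categorical variable, minimising
    the weighted squared error to the per-level means. After sorting, optimal
    clusters are contiguous, so a DP over split points finds the optimum."""
    order = np.argsort(level_means)
    m, w = level_means[order], level_counts[order]
    n = len(m)
    cw = np.concatenate(([0.0], np.cumsum(w)))
    cwm = np.concatenate(([0.0], np.cumsum(w * m)))
    cwm2 = np.concatenate(([0.0], np.cumsum(w * m ** 2)))

    def cost(i, j):  # weighted SSE of merging sorted levels i..j into one value
        W = cw[j + 1] - cw[i]
        S = cwm[j + 1] - cwm[i]
        return (cwm2[j + 1] - cwm2[i]) - S ** 2 / W

    D = np.full((k + 1, n), np.inf)   # D[c][j]: best cost for levels 0..j with c values
    back = np.zeros((k + 1, n), dtype=int)
    for j in range(n):
        D[1][j] = cost(0, j)
    for c in range(2, k + 1):
        for j in range(c - 1, n):
            for i in range(c - 1, j + 1):
                val = D[c - 1][i - 1] + cost(i, j)
                if val < D[c][j]:
                    D[c][j], back[c][j] = val, i

    fitted = np.empty(n)              # recover the fused value for each level
    j, c = n - 1, k
    while c >= 1:
        i = back[c][j] if c > 1 else 0
        fitted[i:j + 1] = (cwm[j + 1] - cwm[i]) / (cw[j + 1] - cw[i])
        j, c = i - 1, c - 1
    out = np.empty(n)
    out[order] = fitted
    return out

# toy usage: six levels whose means form roughly two groups
means = np.array([0.1, 0.0, 0.2, 1.9, 2.1, 2.0])
counts = np.array([10, 12, 8, 9, 11, 10], dtype=float)
print(fuse_levels(means, counts, k=2))
```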
In the third chapter we explore the possibility that a practitioner has some a priori belief as to which variables are most likely to be important, expressed in the form of a permutation of the columns. Our approach takes this ordering and efficiently computes a grid of solution paths by sequentially removing groups of variables without unnecessary recomputation of coefficients. Typical examples of such orderings include the column norms in the (unscaled) design matrix, or the recentness of observations in time series data. This procedure, combined with selecting the size of the support set by validation on a test set, has performance similar to that of fitting the oracular submodel.
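A minimal sketch of the ordering-plus-validation idea, under assumptions not taken from the abstract (synthetic data, naive refitting of each nested submodel rather than the efficient path computation described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.normal(size=(n, p))
X[:, :s] *= 2.0                      # give the important columns larger norms
beta = np.zeros(p); beta[:s] = 2.0
y = X @ beta + rng.normal(size=n)

X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]
order = np.argsort(-np.linalg.norm(X_tr, axis=0))   # a priori ordering by column norm

best_err, best_k = np.inf, 0
for k in range(1, p + 1):
    S = order[:k]                                    # first k variables in the ordering
    coef, *_ = np.linalg.lstsq(X_tr[:, S], y_tr, rcond=None)
    err = np.mean((y_val - X_val[:, S] @ coef) ** 2)
    if err < best_err:
        best_err, best_k = err, k
print("selected support size:", best_k)
```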
The fourth chapter concerns the efficient estimation of conditional independence graphs in Gaussian graphical models. Neighbourhood selection is practical, popular, and enjoys good performance, but in large-scale settings it can still have computational demands exceeding the resources available to many practitioners. Screening approaches promise large improvements in speed with only a small price to pay in terms of resulting estimation performance. Although it is well known that nodes adjacent in the conditional independence graph may be uncorrelated, a minimum absolute correlation between adjacent nodes is often tacitly or explicitly assumed in order for screening procedures to be effective. We make use of recent work in covariance estimation and high-dimensional screening of variables to develop a fast, two-stage screening procedure specifically for use within neighbourhood selection that avoids this restrictive assumption. Provided that a weaker version of a minimum edge strength requirement holds over most of the graph, the performance of the post-screening nodewise regressions is not compromised, while being substantially faster than the full procedure. This method is robust to the presence of latent confounders, as well as other scenarios that typically impede the screening of variables. Experiments show that our approach strikes a favourable balance between edge detection and computational efficiency.
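For orientation only, here is a plain nodewise-regression sketch with a simple marginal-correlation screen (hypothetical `neighbourhood_selection`); this is not the thesis's two-stage procedure, which is designed precisely to avoid relying on a minimum marginal correlation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def neighbourhood_selection(X, screen_size=20):
    """Nodewise regression for a Gaussian graphical model: each node is
    lasso-regressed on the others, after a marginal-correlation screen
    shrinks the candidate neighbour set."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)
    C = np.abs(Xs.T @ Xs / n)                         # absolute correlations
    edges = set()
    for j in range(p):
        others = np.delete(np.arange(p), j)
        cand = others[np.argsort(-C[j, others])[:screen_size]]   # screened candidates
        coef = LassoCV(cv=5).fit(Xs[:, cand], Xs[:, j]).coef_
        for k in cand[np.abs(coef) > 1e-8]:
            edges.add((min(j, k), max(j, k)))          # "OR" rule for symmetrisation
    return edges
```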
Sparse reduced-rank regression for imaging genetics studies: models and applications
We present a novel statistical technique: the sparse reduced-rank regression (sRRR) model, a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR possesses better power to detect the deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step prior to the imaging genetics study. Finally, we present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association to the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. To conclude, we illustrate the application perspective of the proposed statistical models in three real imaging genetics datasets and highlight some potential associations.
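A toy, rank-one sketch of the alternating idea behind sRRR (hypothetical `srrr_rank1`; the paper's estimator additionally handles higher ranks, structured penalties and re-sampling-based model selection). The lasso step selects a sparse set of genetic markers for the current phenotype loading, and the loading is then refreshed against the fitted genetic scores.

```python
import numpy as np
from sklearn.linear_model import Lasso

def srrr_rank1(X, Y, alpha=0.1, n_iter=50):
    """Rank-1 sparse reduced-rank regression by alternating minimisation:
    Y ~ X b v' with b sparse (lasso step) and v a unit-norm phenotype loading."""
    v = np.linalg.svd(Y, full_matrices=False)[2][0]   # initial phenotype loading
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # sparse regression of the projected phenotype Y v on the genotypes X
        b = Lasso(alpha=alpha, max_iter=5000).fit(X, Y @ v).coef_
        f = X @ b
        if np.allclose(f, 0):
            break
        v_new = Y.T @ f
        v = v_new / np.linalg.norm(v_new)             # re-estimate the loading
    return b, v
```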
Shrinkage methods for variable selection and prediction with applications to genetic data
Identifying genotypes using genetic material was at first a painstaking laboratory task. In the decades since the first gene was sequenced, techniques have progressed through milestones requiring massive international collaboration. Today's genotype sequencing facilities use high-throughput technology to sequence entire genomes within days. Despite these technological improvements, and the resultant volume of genetic data, the identification of meaningful genotype-phenotype associations has not been as straightforward as was anticipated in the pre-genome era. The genetic architecture of many common diseases is complex, and heritability often cannot be explained when simple statistical tests are used.
This thesis addresses a clinically important problem in statistical genetics: that of predicting disease risk based on genotype information. First, we review progress and current limitations in genetic risk prediction. We then introduce penalised regression. This thesis focusses on ridge regression, a penalised regression approach that has shown promise in risk prediction for high-dimensional data. The choice of the ridge parameter, which controls the amount of penalisation in ridge regression, has not been addressed in the literature with the specific aim of analysing genetic data. We present a method for automatically choosing the ridge parameter based on genome-wide SNP data. Software implementing the method is available to the community. We evaluate the method using simulation studies and a real data example.
A ridge regression model does not indicate the strength of association of individual variants with the outcome, a property that is often of interest to geneticists. To this end we extend a previously proposed test of significance in ridge regression models to high-dimensional data and to the logistic model, which commonly occurs in the biomedical context. This test is evaluated by comparison to a permutation test, which we view as a benchmark. The test is integrated into the software package mentioned above.
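As a rough stand-in for the pipeline described above (the abstract's automatic, SNP-specific choice of the ridge parameter is not reproduced here), a generic cross-validated ridge fit on simulated genotype-like data might look like this:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 300, 2000                                       # more SNPs than individuals
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)    # 0/1/2 genotype coding
beta = np.zeros(p)
beta[rng.choice(p, 40, replace=False)] = rng.normal(0, 0.3, 40)
y = X @ beta + rng.normal(size=n)                      # continuous trait

alphas = np.logspace(0, 4, 30)                         # grid for the ridge parameter
model = RidgeCV(alphas=alphas).fit(X, y)
print("chosen ridge parameter:", model.alpha_)
print("CV r^2:", cross_val_score(RidgeCV(alphas=alphas), X, y, cv=5).mean())
```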
High Dimensional Statistical Modelling with Limited Information
Modern scientific experiments often rely on different statistical tools, regularisation being one of them. Regularisation methods are usually used to avoid overfitting, but we may also want to use them for variable selection, especially when the number of modelling parameters is higher than the total number of observations. However, performing variable selection can often be difficult under limited information, and we may end up with a misspecified model. To overcome this issue, we propose a robust variable selection routine using a Bayesian hierarchical model.
We adapt the framework of Narisetty and He to propose a novel spike and slab prior specification for the regression coefficients. We take inspiration from the imprecise beta model and use a set of beta distributions to specify the prior expectation of the selection probability. We perform a robust Bayesian analysis over this set of distributions in order to incorporate expert opinion in an efficient manner.
We also discuss novel results on likelihood-based approaches for variable selection. We exploit the framework of the adaptive LASSO to propose sensitivity analyses of LASSO-type problems. The sensitivity analysis also gives us a novel non-deterministic classifier for high-dimensional problems, which we illustrate using real datasets.
Finally, we illustrate our novel robust Bayesian variable selection method using synthetic and real-world data. We show the importance of prior elicitation in variable selection as well as model fitting, and compare our method with other Bayesian approaches for variable selection.
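For readers unfamiliar with the adaptive LASSO machinery mentioned above, here is a minimal sketch of the standard column-rescaling formulation (background only; it is not the thesis's sensitivity analysis or Bayesian routine):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def adaptive_lasso(X, y, alpha=0.05, gamma=1.0):
    """Standard adaptive lasso: an initial ridge fit defines data-driven weights
    1/|beta_init|^gamma, implemented by rescaling columns before an ordinary
    lasso fit and mapping the coefficients back."""
    beta_init = Ridge(alpha=1.0).fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)      # penalty weights
    Xw = X / w                                          # column rescaling trick
    fit = Lasso(alpha=alpha, max_iter=10000).fit(Xw, y)
    return fit.coef_ / w                                # back to the original scale

# toy check: two true signals among twenty predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
print(np.nonzero(np.abs(adaptive_lasso(X, y)) > 1e-6)[0])
```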
Randomised and L1-penalty approaches to segmentation in time series and regression models
It is a common approach in statistics to assume that the parameters of a stochastic model change. The simplest model involves parameters that can be exactly or approximately piecewise constant. In such a model, the aim is the a posteriori detection of the number and location in time of the changes in the parameters. This thesis develops segmentation methods for non-stationary time series and regression models using randomised methods or methods that involve L1 penalties, which force the coefficients in a regression model to be exactly zero. Randomised techniques are not commonly found in nonparametric statistics, whereas L1 methods draw heavily from the variable selection literature. Considering these two categories together, apart from other contributions, enables a comparison between them by pointing out strengths and weaknesses. This is achieved by organising the thesis into three main parts.
First, we propose a new technique for detecting the number and locations of the change-points in the second-order structure of a time series. The core of the segmentation procedure is the Wild Binary Segmentation method (WBS) of Fryzlewicz (2014), a technique which involves a certain randomised mechanism. The advantage of WBS over the standard Binary Segmentation lies in its localisation feature, thanks to which it works in cases where the spacings between change-points are short. Our main change-point detection statistic is the wavelet periodogram which allows a rigorous estimation of the local autocovariance of a piecewise-stationary process. We provide a proof of consistency and examine the performance of the method on simulated and real data sets.
Second, we study the fused lasso estimator which, in its simplest form, deals with the estimation of a piecewise constant function contaminated with Gaussian noise (Friedman et al., 2007). We show a fast way of implementing the solution path algorithm of Tibshirani and Taylor (2011) and we make a connection between their algorithm and the taut-string method of Davies and Kovac (2001). In addition, a theoretical result and a simulation study indicate that the fused lasso estimator is suboptimal in detecting the location of a change-point.
Finally, we propose a method to estimate regression models in which the coefficients vary with respect to some covariate such as time. In particular, we present a path algorithm based on Tibshirani and Taylor (2011) and the fused lasso method of Tibshirani et al. (2005). Thanks to the adaptability of the fused lasso penalty, our proposed method goes beyond the estimation of piecewise constant models to models where the underlying coefficient function can be piecewise linear, quadratic or cubic. Our simulation studies show that in most cases the method outperforms smoothing splines, a common approach in estimating this class of models.
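As background to the segmentation theme, a minimal sketch of plain binary segmentation for changes in mean is given below; the thesis's WBS-based procedure instead draws random subintervals and targets changes in second-order structure via the wavelet periodogram, so this is only a simplified relative.

```python
import numpy as np

def cusum(x, s, e):
    """CUSUM statistics for a change in mean of x[s:e] at each candidate split."""
    n = e - s
    t = np.arange(1, n)
    left = np.cumsum(x[s:e])[:-1]
    total = x[s:e].sum()
    return np.abs(np.sqrt((n - t) / (n * t)) * left
                  - np.sqrt(t / (n * (n - t))) * (total - left))

def binary_segmentation(x, s=0, e=None, thresh=4.0, min_len=5, out=None):
    """Split at the maximal CUSUM if it exceeds the threshold, then recurse on
    the two halves. In practice the threshold is calibrated to the noise level
    (e.g. proportional to sigma * sqrt(2 log n))."""
    if out is None:
        out = []
    if e is None:
        e = len(x)
    if e - s < 2 * min_len:
        return out
    stat = cusum(x, s, e)
    b = int(np.argmax(stat))
    if stat[b] > thresh:
        cp = s + b + 1
        out.append(cp)
        binary_segmentation(x, s, cp, thresh, min_len, out)
        binary_segmentation(x, cp, e, thresh, min_len, out)
    return sorted(out)

# toy example: mean shifts at 100 and 200
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100), rng.normal(0, 1, 100)])
print(binary_segmentation(x))
```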
Sparse generalised principal component analysis
In this paper, we develop a sparse method for unsupervised dimension reduction for data from an exponential-family distribution. Our idea extends previous work on Generalised Principal Component Analysis by adding L1 and SCAD penalties to introduce sparsity. We demonstrate the significance and advantages of our method with synthetic and real data examples. We focus on the application to text data, which is high-dimensional and non-Gaussian by nature, and discuss the potential advantages of our methodology in achieving dimension reduction.
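A Gaussian-data sketch of the basic idea, using an L1 (soft-thresholding) step inside power iteration; the paper's method generalises this to exponential-family data and also considers SCAD penalties, so the hypothetical `sparse_pc` below is illustrative only.

```python
import numpy as np

def sparse_pc(X, penalty=0.1, n_iter=100):
    """First sparse principal loading via penalised power iteration:
    multiply by the sample covariance, soft-threshold, renormalise."""
    Xc = X - X.mean(0)
    S = Xc.T @ Xc / len(Xc)              # sample covariance
    v = np.linalg.eigh(S)[1][:, -1]      # ordinary leading eigenvector as start
    for _ in range(n_iter):
        u = S @ v
        u = np.sign(u) * np.maximum(np.abs(u) - penalty, 0.0)   # soft-threshold
        if np.allclose(u, 0):
            break
        v = u / np.linalg.norm(u)
    return v

# toy data whose leading component loads on only the first five variables
rng = np.random.default_rng(4)
scores = rng.normal(size=(500, 1))
loading = np.r_[np.ones(5), np.zeros(45)][None, :]
X = scores @ loading + 0.3 * rng.normal(size=(500, 50))
print(np.round(sparse_pc(X, penalty=0.5), 2))
```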