
    Banking the unbanked: the Mzansi intervention in South Africa

    Purpose: This paper aims to understand households' latent decision making in accessing financial services. In this analysis we look at the determinants of the choice of the pre-entry Mzansi account by consumers in South Africa.
    Design/methodology/approach: We use 102 variables, grouped in the following categories: basic literacy, understanding financial terms, targets for financial advice, desired financial education and financial perception. Employing a computationally efficient variable selection algorithm, we study which variables can satisfactorily explain the choice of a Mzansi account.
    Findings: The Mzansi intervention appeals to individuals with basic but insufficient financial education. Aspirations appear to be very influential in the choice of financial services, and Mzansi is perceived as a pre-entry account that does not meet the aspirations of individuals aiming to climb the financial services ladder. We find that Mzansi holders view the account mainly as a vehicle for receiving payments, but are at the same time debt-averse and inclined to save. Hence, although there is at present no concrete evidence that the Mzansi intervention increases access to finance via diversification (i.e. by recruiting customers into higher-level accounts and services), our analysis shows that this is very likely to be the case.
    Originality/value: The issue of demand-side constraints on access to finance has been largely ignored in the theoretical and empirical literature. This paper takes some preliminary steps towards addressing this gap.
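    The abstract does not specify the variable selection algorithm, so the sketch below is only a plausible stand-in: an L1-penalised logistic regression that selects, from a pool of survey indicators, those that best explain account uptake. All names and data here are invented for illustration.

        # Hedged sketch: L1-penalised logistic regression as a computationally
        # efficient variable-selection device. The 102 "survey indicators" and
        # the response are simulated, not the paper's data.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        n, p = 1000, 102                                   # respondents, candidate variables
        X = rng.integers(0, 2, size=(n, p)).astype(float)  # binary survey indicators
        beta = np.zeros(p)
        beta[:5] = 1.5                                     # assume a few truly relevant variables
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta - 2.0)))).astype(int)

        Xs = StandardScaler().fit_transform(X)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)
        print("variables retained:", np.flatnonzero(model.coef_.ravel()))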

    Sparse reduced-rank regression for imaging genetics studies: models and applications

    We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model, a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR has greater power to detect deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step prior to the imaging genetics study. In addition, we present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association to the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. Finally, we illustrate the practical application of the proposed statistical models on three real imaging genetics datasets and highlight some potential associations.
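    The abstract does not give the objective function, but a penalised reduced-rank criterion of the kind described is often written as follows (notation assumed here, not taken from the paper): with Y the n x q matrix of imaging phenotypes and X the n x p genotype matrix, the rank-r coefficient matrix is factorised and a row-wise penalty drives whole markers out of the model.

        % Assumed notation for a sparse reduced-rank regression objective
        \min_{\mathbf{B} \in \mathbb{R}^{p \times r},\; \mathbf{A} \in \mathbb{R}^{q \times r}}
          \big\| \mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{A}^{\top} \big\|_F^2
          + \lambda \sum_{j=1}^{p} \big\| \mathbf{b}_j \big\|_2
        \quad \text{subject to } \mathbf{A}^{\top}\mathbf{A} = \mathbf{I}_r,

    where b_j is the j-th row of B; the group (L2) penalty sets entire rows to zero, deselecting the corresponding genetic markers.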

    Shrinkage methods for variable selection and prediction with applications to genetic data

    Identifying genotypes using genetic material was at first a painstaking laboratory task. In the decades since the first gene was sequenced, techniques have progressed through milestones requiring massive international collaboration. Today's genotype sequencing facilities use high-throughput technology to sequence entire genomes within days. Despite these technological improvements, and the resultant volume of genetic data, the identification of meaningful genotype-phenotype associations has not been as straightforward as was anticipated in the pre-genome era. The genetic architecture of many common diseases is complex, and heritability often cannot be explained when simple statistical tests are used. This thesis addresses a clinically important problem in statistical genetics: that of predicting disease risk based on genotype information. First, we review progress and current limitations in genetic risk prediction. We then introduce penalised regression. This thesis focusses on ridge regression, a penalised regression approach that has shown promise in risk prediction for high-dimensional data. The choice of the ridge parameter, which controls the amount of penalisation in ridge regression, has not been addressed in the literature with the specific aim of analysing genetic data. We present a method for automatically choosing the ridge parameter based on genome-wide SNP data. Software implementing the method is available to the community. We evaluate the method using simulation studies and a real data example. A ridge regression model does not indicate the strength of association of individual variants with the outcome, a property that is often of interest to geneticists. To this end we extend a previously proposed test of significance in ridge regression models to high-dimensional data and to the logistic model, which commonly occurs in the biomedical context. The test is evaluated by comparison with a permutation test, which we view as a benchmark, and is integrated into the software package mentioned above.
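    The thesis's automatic ridge-parameter rule and its accompanying software are not named in the abstract, so the sketch below substitutes a generic alternative: ridge regression on simulated SNP-style genotypes, with the penalty chosen by cross-validation. Everything here is simulated and illustrative.

        # Generic stand-in, not the thesis's method: ridge regression on
        # simulated genotype data, penalty chosen by cross-validation.
        import numpy as np
        from sklearn.linear_model import RidgeCV

        rng = np.random.default_rng(1)
        n, p = 200, 5000                                     # fewer individuals than SNPs
        X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
        beta = np.zeros(p)
        causal = rng.choice(p, size=50, replace=False)       # assume 50 causal SNPs
        beta[causal] = rng.normal(0.0, 0.3, size=50)
        y = X @ beta + rng.normal(size=n)

        model = RidgeCV(alphas=np.logspace(-2, 4, 25)).fit(X, y)
        print("chosen ridge parameter:", model.alpha_)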

    High Dimensional Statistical Modelling with Limited Information

    Modern scientific experiments often rely on different statistical tools, regularisation being one of them. Regularisation methods are usually used to avoid overfitting, but we may also want to use them for variable selection, especially when the number of model parameters is larger than the number of observations. However, variable selection can be difficult under limited information, and we may end up with a misspecified model. To overcome this issue, we propose a robust variable selection routine using a Bayesian hierarchical model. We adapt the framework of Narisetty and He to propose a novel spike-and-slab prior specification for the regression coefficients. Taking inspiration from the imprecise beta model, we use a set of beta distributions to specify the prior expectation of the selection probability, and we perform a robust Bayesian analysis over this set of distributions in order to incorporate expert opinion in an efficient manner. We also present novel results on likelihood-based approaches to variable selection: exploiting the framework of the adaptive LASSO, we propose sensitivity analyses of LASSO-type problems. The sensitivity analysis also yields a novel non-deterministic classifier for high-dimensional problems, which we illustrate using real datasets. Finally, we illustrate our robust Bayesian variable selection method on synthetic and real-world data. We show the importance of prior elicitation in variable selection as well as model fitting, and compare our method with other Bayesian approaches to variable selection.
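    The thesis's robust Bayesian routine is not reproduced here, but the adaptive-LASSO machinery underlying its sensitivity analysis has a standard two-step form, sketched below on simulated data: fit an initial estimate, then solve a lasso on covariates rescaled by adaptive weights (the rescaling trick is a known reformulation of the weighted L1 penalty).

        # Sketch of the standard adaptive LASSO, not the thesis's method:
        # a weighted L1 penalty implemented by rescaling each column of X
        # by the inverse of an initial coefficient estimate.
        import numpy as np
        from sklearn.linear_model import Lasso, Ridge

        rng = np.random.default_rng(2)
        n, p = 150, 60
        X = rng.normal(size=(n, p))
        beta = np.zeros(p)
        beta[:4] = [3.0, -2.0, 1.5, 1.0]                # a few true signals
        y = X @ beta + rng.normal(size=n)

        b_init = Ridge(alpha=1.0).fit(X, y).coef_       # step 1: initial estimate
        w = 1.0 / (np.abs(b_init) + 1e-6)               # adaptive weights

        fit = Lasso(alpha=0.1).fit(X / w, y)            # step 2: lasso on rescaled X
        beta_hat = fit.coef_ / w                        # undo the rescaling
        print("selected variables:", np.flatnonzero(np.abs(beta_hat) > 1e-8))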

    Randomised and L1-penalty approaches to segmentation in time series and regression models

    It is a common approach in statistics to assume that the parameters of a stochastic model change over time. The simplest such model involves parameters that are exactly or approximately piecewise constant, and the aim is then the a posteriori detection of the number and locations in time of the changes in the parameters. This thesis develops segmentation methods for non-stationary time series and regression models using randomised methods or methods that involve L1 penalties, which force the coefficients in a regression model to be exactly zero. Randomised techniques are not commonly found in nonparametric statistics, whereas L1 methods draw heavily from the variable selection literature. Considering these two categories together, among other contributions, enables a comparison between them that points out their strengths and weaknesses. This is achieved by organising the thesis into three main parts. First, we propose a new technique for detecting the number and locations of the change-points in the second-order structure of a time series. The core of the segmentation procedure is the Wild Binary Segmentation (WBS) method of Fryzlewicz (2014), a technique which involves a certain randomised mechanism. The advantage of WBS over standard Binary Segmentation lies in its localisation feature, thanks to which it works in cases where the spacings between change-points are short. Our main change-point detection statistic is the wavelet periodogram, which allows a rigorous estimation of the local autocovariance of a piecewise-stationary process. We provide a proof of consistency and examine the performance of the method on simulated and real data sets. Second, we study the fused lasso estimator which, in its simplest form, deals with the estimation of a piecewise constant function contaminated with Gaussian noise (Friedman et al. (2007)). We show a fast way of implementing the solution path algorithm of Tibshirani and Taylor (2011) and we make a connection between their algorithm and the taut-string method of Davies and Kovac (2001). In addition, a theoretical result and a simulation study indicate that the fused lasso estimator is suboptimal in detecting the location of a change-point. Finally, we propose a method to estimate regression models in which the coefficients vary with respect to some covariate such as time. In particular, we present a path algorithm based on Tibshirani and Taylor (2011) and the fused lasso method of Tibshirani et al. (2005). Thanks to the adaptability of the fused lasso penalty, our proposed method goes beyond the estimation of piecewise constant models to models where the underlying coefficient function can be piecewise linear, quadratic or cubic. Our simulation studies show that in most cases the method outperforms smoothing splines, a common approach to estimating this class of models.
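    As a toy illustration of the randomised mechanism behind WBS (simplified here to detecting changes in mean, whereas the thesis works with the wavelet periodogram and second-order structure), the sketch below draws random sub-intervals, maximises a CUSUM statistic over them, and recurses on either side of each accepted change-point. The threshold and interval count are illustrative choices, not values from the thesis.

        # Toy wild-binary-segmentation sketch for changes in mean.
        import numpy as np

        def cusum(x, s, e, b):
            """CUSUM statistic for a split of x[s:e+1] after position b."""
            left, right = x[s:b + 1], x[b + 1:e + 1]
            n = e - s + 1
            return np.sqrt(len(left) * len(right) / n) * abs(left.mean() - right.mean())

        def wbs(x, s, e, threshold, n_intervals=500, rng=None, found=None):
            rng = np.random.default_rng(3) if rng is None else rng
            found = [] if found is None else found
            if e - s < 2:
                return found
            best_stat, best_b = 0.0, None
            for _ in range(n_intervals):                   # random sub-intervals of [s, e]
                l = rng.integers(s, e)
                r = rng.integers(l + 1, e + 1)
                for b in range(l, r):                      # maximise CUSUM within [l, r]
                    stat = cusum(x, l, r, b)
                    if stat > best_stat:
                        best_stat, best_b = stat, b
            if best_b is not None and best_stat > threshold:
                found.append(best_b)                       # accept, then recurse on both sides
                wbs(x, s, best_b, threshold, n_intervals, rng, found)
                wbs(x, best_b + 1, e, threshold, n_intervals, rng, found)
            return found

        rng = np.random.default_rng(4)
        y = np.concatenate([np.zeros(100), np.full(100, 2.0), np.full(100, -1.0)]) + rng.normal(size=300)
        print(sorted(wbs(y, 0, len(y) - 1, threshold=4.0)))  # true changes after indices 99 and 199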

    Sparse generalised principal component analysis

    In this paper, we develop a sparse method for unsupervised dimension reduction for data from an exponential-family distribution. Our approach extends previous work on generalised principal component analysis by adding L1 and SCAD penalties to introduce sparsity. We demonstrate the significance and advantages of our method with synthetic and real data examples. We focus on the application to text data, which is high-dimensional and non-Gaussian by nature, and discuss the potential advantages of our methodology in achieving dimension reduction.
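    The paper's exact formulation is not given in the abstract; one common way to write a penalised exponential-family PCA criterion of this kind (notation assumed here) is to factorise the natural-parameter matrix as a low-rank product and penalise the loadings:

        % Assumed notation for a sparse generalised PCA objective
        \min_{\mathbf{U} \in \mathbb{R}^{n \times r},\; \mathbf{V} \in \mathbb{R}^{p \times r}}
          \; -\sum_{i=1}^{n} \sum_{j=1}^{p} \Big[ x_{ij}\,\theta_{ij} - b(\theta_{ij}) \Big]
          + \sum_{j=1}^{p} \sum_{k=1}^{r} P_{\lambda}(v_{jk}),
        \qquad \boldsymbol{\Theta} = (\theta_{ij}) = \mathbf{U}\mathbf{V}^{\top},

    where b(.) is the exponential-family log-partition function and P_lambda is an L1 (lasso) or SCAD penalty that zeroes out individual loadings.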