
    A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

    Big data comes in many types, shapes, forms, and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics, and social science are bombarded by ever-increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category of the bigness taxonomy it falls into. Large p, small n data sets, for instance, require a different set of tools from the large n, small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, and Sequentialization. It is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that no universally superior method outperforms all other methods on all categories of bigness. It is also important to stress that simplicity, in the sense of Ockham's razor and its non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets. (Comment: 18 pages, 2 figures, 3 tables.)
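    As a hedged illustration (this is not code from the paper), the sketch below sets up a large p, small n regression, one of the bigness categories above, and applies the Penalization/Regularization tool via scikit-learn's Lasso. The dimensions, noise level, and penalty value are arbitrary choices for the example.

```python
# Minimal sketch (not code from the paper): L1 regularization on a
# "large p, small n" data set, one of the bigness categories above.
# Dimensions, noise level, and the penalty alpha are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 500                              # far more predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]      # only 5 predictors are truly active
y = X @ beta + 0.1 * rng.standard_normal(n)

# Ordinary least squares is ill-posed when p > n; the L1 penalty
# (LASSO) instead selects a sparse subset of predictors.
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(fit.coef_))
```

    Swapping Lasso for Ridge or ElasticNet exercises other entries in the same Penalization family; the no free lunch point above is exactly that no one of these choices dominates across all bigness categories.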

    Beta Induced Sparsity Algorithm

    We propose a novel technique that exploits some interesting properties of the Beta distribution to derive a sparse solution to the traditional general linear regression problem under the Gaussian noise assumption. Our proposed technique provides a theoretically, conceptually, and computationally better alternative to both the LASSO and the relevance vector machine, in the sense that it is centered around an objective function that is convex and easy to interpret. We demonstrate the strength of our proposed technique through examples, and we provide a theoretical proof of the merits of our method.
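    The abstract does not state the Beta-based objective itself, so only the setting and the baseline it is compared against can be written down here: the Gaussian-noise linear model and the convex L1-penalized (LASSO) criterion. This is the standard formulation, not the paper's own objective.

```latex
% Baseline setting from the abstract (standard notation, not the paper's
% Beta-based objective): linear regression under Gaussian noise, solved
% sparsely by the L1-penalized (LASSO) criterion.
\[
  y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n),
\]
\[
  \hat{\beta}_{\mathrm{LASSO}} \in \arg\min_{\beta \in \mathbb{R}^p}
    \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
  \qquad \lambda > 0.
\]
```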

    Bayesian Computation of the Intrinsic Structure of Factor Analytic Models

    The study of factor analytic models often has to address two important issues: (a) the determination of the “optimum” number of factors and (b) the derivation of a unique simple structure whose interpretation is easy and straightforward. The classical approach deals with these two tasks separately, and sometimes resorts to ad hoc methods. This paper proposes a Bayesian approach to these two important issues, and adapts ideas from stochastic geometry and Bayesian finite mixture modelling to construct an ergodic Markov chain having the posterior distribution of the complete collection of parameters (including the number of factors) as its equilibrium distribution. The proposed method uses an Automatic Relevance Determination (ARD) prior as the device for achieving the desired simple structure. A Gibbs sampler updating scheme is then combined with the simulation of a continuous-time birth-and-death point process to produce a sampling scheme that efficiently explores the posterior distribution of interest. The MCMC sample path obtained from the simulated posterior then provides a flexible ingredient for most of the inferential tasks of interest. Illustrations on both artificial and real tasks are provided, while major difficulties and challenges are discussed, along with ideas for future improvements.
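    The abstract does not spell out the sampler, but the ARD mechanism it names has a well-known conjugate form. The sketch below is a minimal illustration of that mechanism only, not the paper's birth-and-death sampler: each loading column gets its own Gamma-distributed precision, and columns the data do not support are shrunk toward zero. The hyperparameters a0, b0 and the toy loading matrix are assumptions made for the example.

```python
# Minimal sketch (assumptions flagged; not the paper's birth-and-death
# sampler): one conjugate Gibbs update for the ARD precisions that prune
# redundant factors. Each loading column Lambda[:, j] has N(0, 1/alpha_j)
# entries with prior alpha_j ~ Gamma(a0, b0); its conditional posterior
# is Gamma(a0 + p/2, b0 + ||Lambda[:, j]||^2 / 2).
import numpy as np

rng = np.random.default_rng(1)
p, k = 10, 4                          # 10 observed variables, 4 candidate factors
Lambda = rng.standard_normal((p, k))
Lambda[:, 2:] *= 0.01                 # pretend the last two factors are spurious
a0, b0 = 1e-2, 1e-2                   # vague Gamma hyperparameters (assumed)

shape = a0 + p / 2.0
rate = b0 + 0.5 * np.sum(Lambda**2, axis=0)
alpha = rng.gamma(shape, 1.0 / rate)  # NumPy parameterizes Gamma by scale = 1/rate

print("sampled ARD precisions:", np.round(alpha, 1))
# A large alpha_j shrinks column j toward zero, effectively pruning factor j;
# the continuous-time birth-and-death process described in the abstract is
# what adds or removes whole factors to sample the number of factors itself.
```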