
    Randomized Parallel Selection

    We show that selection on an input of size N can be performed on a P-node hypercube (P = N/log N) in time O(N/P) with high probability, provided each node can process all of its incident edges in one unit of time (this model is called the parallel model and has been assumed by previous researchers, e.g., [17]). This result is important in view of a lower bound of Plaxton that implies selection takes Ω((N/P) log log P + log P) time on a P-node hypercube if each node can process only one edge at a time (this model is referred to as the sequential model).
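The high-probability bound rests on randomized selection. A minimal sequential sketch of the underlying idea (random pivot, partition, recurse on the side containing the target rank) runs in expected O(N) time; the paper's contribution is achieving the analogous O(N/P) bound on the hypercube, which this sketch does not attempt to reproduce:

```python
import random

def randomized_select(a, k):
    """Return the k-th smallest element (1-indexed) of list a.

    Sequential randomized selection: choose a random pivot, partition
    the input around it, and recurse on the part containing rank k.
    """
    while True:
        pivot = random.choice(a)
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        if k <= len(lo):
            a = lo                      # target rank lies below the pivot
        elif k <= len(lo) + len(eq):
            return pivot                # pivot itself has the target rank
        else:
            k -= len(lo) + len(eq)      # discard the low part and recurse
            a = [x for x in a if x > pivot]
```

In the parallel setting, the partition step is what the hypercube executes collectively, with each node handling its local share of the input.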

    Robust Model-free Variable Screening, Double-parallel Monte Carlo and Average Bayesian Information Criterion

    Big data analysis and high-dimensional data analysis are two popular and challenging topics in current statistical research. They bring many opportunities as well as many challenges. For big data, traditional methods are generally not efficient enough, from both a time and a space perspective. For high-dimensional data, most traditional methods cannot be implemented at all, let alone maintain their desirable properties, such as consistency. In this dissertation, three new strategies are proposed to address these issues. HZSIS is a robust model-free variable screening method that possesses the sure screening property under the ultrahigh-dimensional setting. It is based on the nonparanormal transformation and Henze-Zirkler's test. The numerical results indicate that, compared to existing methods, the proposed method is more robust to data generated from heavy-tailed distributions and/or complex models with interaction variables. Double Parallel Monte Carlo is a simple, practical and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full-data posterior. To further speed up computation, the algorithm employs the population stochastic approximation Monte Carlo (Pop-SAMC) algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since the proposed algorithm involves two levels of parallelism, data parallelism and simulation parallelism, it is coined "Double Parallel Monte Carlo". The validity of the proposed algorithm is justified both mathematically and numerically. The Average Bayesian Information Criterion (ABIC) and its high-dimensional variant, the Average Extended Bayesian Information Criterion (AEBIC), lead to an innovative way to use posterior samples to conduct model selection. The consistency of this method is established for the high-dimensional generalized linear model under some sparsity and regularity conditions. The numerical results also indicate that, when the sample size is large enough, this method can accurately select the smallest true model with high probability.
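The divide-and-aggregate idea behind Double Parallel Monte Carlo can be illustrated with a toy conjugate Gaussian model. The simple averaging rule below is an illustrative choice that happens to be exact for this toy (flat prior, unit-variance normal likelihood, equal-size subsets); the dissertation's actual aggregation method and the Pop-SAMC sampler are not reproduced here:

```python
import random

def subset_posterior_draw(data_subset):
    # Flat prior with N(theta, 1) likelihood on m points:
    # the subset posterior is N(subset mean, 1/m).
    m = len(data_subset)
    mean = sum(data_subset) / m
    return random.gauss(mean, (1.0 / m) ** 0.5)

def double_parallel_draw(data, k):
    # "Data parallel": split the dataset into k subsets; each subset
    # posterior could itself be sampled by a parallel MCMC sampler
    # ("simulation parallel").
    chunks = [data[i::k] for i in range(k)]
    draws = [subset_posterior_draw(c) for c in chunks]
    # For this Gaussian toy, averaging one draw per equal-size subset
    # reproduces the full-data posterior N(mean(data), 1/N) exactly.
    return sum(draws) / k
```

In realistic non-conjugate models the recombination step is the hard part, which is why the subset posteriors must be aggregated more carefully than by plain averaging.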

    Outage Performance Analysis of Multicarrier Relay Selection for Cooperative Networks

    In this paper, we analyze the outage performance of two multicarrier relay selection schemes, i.e., bulk and per-subcarrier selection, for two-hop orthogonal frequency-division multiplexing (OFDM) systems. To provide a comprehensive analysis, three forwarding protocols are considered: decode-and-forward (DF), fixed-gain (FG) amplify-and-forward (AF), and variable-gain (VG) AF relaying. We obtain closed-form approximations for the outage probability and closed-form expressions for the asymptotic outage probability in the high signal-to-noise ratio (SNR) region for all cases. Our analysis is verified by Monte Carlo simulations, and provides an analytical framework for multicarrier systems with relay selection.
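The bulk vs. per-subcarrier distinction can be made concrete with a small Monte Carlo sketch. The Rayleigh-fading DF model below (exponential channel power gains, end-to-end SNR taken as the minimum over the two hops) is a toy assumption for illustration, not the paper's exact system model or its closed-form analysis:

```python
import random

def simulate_outage(n_trials=20000, n_relays=4, n_sub=8,
                    avg_snr=10.0, threshold=1.0):
    """Compare outage rates of bulk vs. per-subcarrier DF relay selection.

    Bulk selection commits to one relay for all subcarriers (here: the
    relay maximizing its worst subcarrier SNR); per-subcarrier selection
    picks the best relay independently on each subcarrier. An outage is
    declared if any subcarrier falls below the SNR threshold.
    """
    bulk_out = ps_out = 0
    for _ in range(n_trials):
        # gains[r][s]: end-to-end SNR of relay r on subcarrier s,
        # min of two independent Rayleigh-faded hops under DF.
        gains = [[min(random.expovariate(1 / avg_snr),
                      random.expovariate(1 / avg_snr))
                  for _ in range(n_sub)] for _ in range(n_relays)]
        bulk = max(gains, key=min)          # one relay for all subcarriers
        if min(bulk) < threshold:
            bulk_out += 1
        per = [max(g[s] for g in gains) for s in range(n_sub)]
        if min(per) < threshold:
            ps_out += 1
    return bulk_out / n_trials, ps_out / n_trials
```

Per-subcarrier selection can never do worse than bulk selection in this model, since on every subcarrier it picks at least as good a relay; the price is higher signaling overhead.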

    REC: Fast sparse regression-based multicategory classification

    Recent advances in technology enable researchers to gather and store enormous data sets with ultrahigh dimensionality. In bioinformatics, microarray and next-generation sequencing technologies can produce data with tens of thousands of predictors, such as biomarkers. On the other hand, the corresponding sample sizes are often limited. For classification problems, to predict new observations with high accuracy, and to better understand the effect of predictors on classification, it is desirable, and often necessary, to train the classifier with variable selection. In the literature, sparse regularized classification techniques have been popular due to their ability to perform simultaneous classification and variable selection. Despite their success, such sparse penalized methods may have low computational speed when the dimension of the problem is ultrahigh. To overcome this challenge, we propose a new sparse REgression based multicategory Classifier (REC). Our method uses a simplex to represent the different categories of the classification problem. A major advantage of REC is that the optimization can be decoupled into smaller independent sparse penalized regression problems, which can hence be solved using parallel computing. Consequently, REC enjoys extraordinarily fast computational speed. Moreover, REC is able to provide class conditional probability estimation. Simulated examples and applications to microarray and next-generation sequencing data suggest that REC is very competitive when compared to several existing methods.
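The simplex encoding that makes the decoupling possible places the K categories at the vertices of a regular simplex in R^(K-1), so every pair of classes is equidistant and each coordinate can be fit by its own sparse regression. The construction below is a standard one used in simplex-coding classifiers; the abstract does not spell out REC's exact encoding, so treat this as an assumed, illustrative choice:

```python
import math

def simplex_vertices(k):
    """Vertices of a regular simplex in R^(k-1) encoding k classes.

    Produces unit-norm vectors with pairwise inner product -1/(k-1),
    so all pairs of class codes are the same distance apart.
    """
    d = k - 1
    w = [[1 / math.sqrt(d)] * d]            # first vertex
    for j in range(1, k):
        v = [-(1 + math.sqrt(k)) / d ** 1.5] * d
        v[j - 1] += math.sqrt(k / d)        # shift along coordinate j-1
        w.append(v)
    return w
```

With labels mapped to these vertices, a multicategory problem becomes K-1 independent sparse regressions on the vertex coordinates, which is what allows the embarrassingly parallel solve the abstract describes.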

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
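The key communication trick is that only per-partition p-values, not raw data, cross partition boundaries, after which a meta-analysis rule merges them. Fisher's method is sketched below as one such rule (an illustrative choice; the abstract does not state PFBP's exact combining statistic), together with a toy Early Dropping filter built on top of it:

```python
import math

def fisher_combine(p_values):
    """Combine independent per-partition p-values via Fisher's method.

    Under H0, -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom; for even degrees of freedom its survival
    function has the closed form used below.
    """
    stat = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)
    term, s = 1.0, 0.0
    for i in range(k):                  # sum_{i=0}^{k-1} (stat/2)^i / i!
        s += term
        term *= (stat / 2) / (i + 1)
    return math.exp(-stat / 2) * s

def early_drop(local_pvals_by_feature, alpha=0.05):
    # Toy Early Dropping: keep only features whose combined p-value
    # stays below alpha; the rest leave consideration in later iterations.
    return {f for f, ps in local_pvals_by_feature.items()
            if fisher_combine(ps) <= alpha}
```

Because each partition contributes only a scalar per feature and test, the communication cost per iteration is independent of the number of samples in a partition.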