
    Approximating Partial Likelihood Estimators via Optimal Subsampling

    With the growing availability of large-scale biomedical data, it is often time-consuming or infeasible to directly perform traditional statistical analysis with the relatively limited computing resources at hand. We propose a fast and stable subsampling method to effectively approximate the full-data maximum partial likelihood estimator in Cox's model, which reduces the computational burden when analyzing massive survival data. We establish consistency and asymptotic normality of a general subsample-based estimator. The optimal subsampling probabilities, with explicit expressions, are determined by minimizing the trace of the asymptotic variance-covariance matrix of a linearly transformed parameter estimator. We propose a two-step subsampling algorithm for practical implementation, which achieves a significant reduction in computing time compared to the full-data method. The asymptotic properties of the resulting two-step subsample-based estimator are established. In addition, a subsampling-based Breslow-type estimator for the cumulative baseline hazard function and a subsample-based estimate of the survival function are presented. Extensive experiments are conducted to assess the proposed subsampling strategy. Finally, we provide an illustrative example using a large-scale lymphoma cancer dataset from the Surveillance, Epidemiology, and End Results Program.
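    To make the two-step idea concrete, here is a minimal sketch in Python: fit a pilot estimator on a small uniform subsample, draw a second subsample with non-uniform probabilities, and fit a weighted Cox model with inverse-probability weights. The importance score below is an illustrative stand-in, not the paper's optimal probabilities (which minimize the trace of the asymptotic variance-covariance matrix); the lifelines library and the synthetic data are assumptions for the example.

```python
# Sketch of a two-step subsampling workflow for Cox regression.
# Step 1: pilot fit on a uniform subsample; Step 2: weighted fit
# on a probability-proportional subsample.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, d = 100_000, 5
X = rng.normal(size=(n, d))
beta = np.linspace(0.2, 1.0, d)
T = rng.exponential(scale=np.exp(-X @ beta))          # event times
C = rng.exponential(scale=2 * np.median(T), size=n)   # censoring times
df = pd.DataFrame(X, columns=[f"x{j}" for j in range(d)])
df["time"] = np.minimum(T, C)
df["event"] = (T <= C).astype(int)

# Step 1: uniform pilot subsample of size r0.
r0 = 2_000
pilot_idx = rng.choice(n, size=r0, replace=False)
pilot = CoxPHFitter().fit(df.iloc[pilot_idx], "time", "event")

# Crude importance score (a stand-in for the optimal probabilities).
score = np.linalg.norm(X, axis=1) * (1 + np.abs(X @ pilot.params_.values))
probs = score / score.sum()

# Step 2: subsample of size r with inverse-probability weights,
# which keep the weighted estimator consistent for the full-data one.
r = 10_000
idx = rng.choice(n, size=r, replace=True, p=probs)
sub = df.iloc[idx].reset_index(drop=True)
sub["w"] = 1.0 / (r * probs[idx])
cph = CoxPHFitter().fit(sub, "time", "event", weights_col="w", robust=True)
print(cph.params_)
```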

    Fast Inference for Quantile Regression with Tens of Millions of Observations

    Big data analytics has opened new avenues in economic research, but the challenge of analyzing datasets with tens of millions of observations is substantial. Conventional econometric methods based on extreme estimators require large amounts of computing resources and memory, which are often not readily available. In this paper, we focus on linear quantile regression applied to "ultra-large" datasets, such as U.S. decennial censuses. A fast inference framework is presented, utilizing stochastic sub-gradient descent (S-subGD) updates. The inference procedure handles cross-sectional data sequentially: (i) updating the parameter estimate with each incoming "new observation", (ii) aggregating it as a Polyak-Ruppert average, and (iii) computing a pivotal statistic for inference using only the solution path. The methodology draws from time series regression to create an asymptotically pivotal statistic through random scaling. Our proposed test statistic is calculated in a fully online fashion, and critical values are obtained without resampling. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method generates new insights, surpassing current inference methods in computation. Our method specifically reveals trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
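    As a rough illustration of the S-subGD update and the Polyak-Ruppert average (the random-scaling pivotal statistic is omitted here), the following Python sketch runs online quantile regression on a synthetic stream. The step-size rule and the data-generating process are assumptions for the example, not the paper's specification.

```python
# Online quantile regression: one sub-gradient step per observation,
# plus a running Polyak-Ruppert average of the iterates.
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 100_000, 10, 0.5          # stream length, regressors, quantile
beta_true = rng.normal(size=d)

beta = np.zeros(d)        # current iterate
beta_bar = np.zeros(d)    # Polyak-Ruppert average
for t in range(1, n + 1):
    x = rng.normal(size=d)
    y = x @ beta_true + rng.standard_t(df=3)      # heavy-tailed noise
    # Sub-gradient of the check loss rho_tau(y - x'beta) w.r.t. beta:
    g = (tau - float(y - x @ beta < 0)) * (-x)
    beta -= t ** -0.501 * g                        # slowly decaying step
    beta_bar += (beta - beta_bar) / t              # online averaging

print(np.round(beta_bar - beta_true, 3))           # should be near zero
```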

    New Efficient Approach to Solve Big Data Systems Using Parallel Gauss–Seidel Algorithms

    In order to perform big-data analytics, regression involving large matrices is often necessary. In particular, large-scale regression problems are encountered when one wishes to extract semantic patterns for knowledge discovery and data mining. When a large matrix can be processed in its factorized form, advantages arise in terms of computation, implementation, and data compression. In this work, we propose two new parallel iterative algorithms, as extensions of the Gauss–Seidel algorithm (GSA), to solve regression problems involving many variables. A convergence study of the proposed iterative algorithms in terms of error bounds is also performed, and the required computational resources, namely time and memory complexities, are evaluated to benchmark the efficiency of the proposed new algorithms. Finally, numerical results from both Monte Carlo simulations and real-world datasets are presented to demonstrate the striking effectiveness of our proposed new methods.
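    For reference, the serial Gauss–Seidel baseline that the paper's parallel variants extend can be sketched in a few lines of Python, applied to the normal equations A beta = b with A = X'X and b = X'y (symmetric positive definite, so the iteration converges). The data and iteration count are illustrative assumptions; the paper's parallel algorithms are not reproduced here.

```python
# Plain (serial) Gauss-Seidel for the least-squares normal equations.
import numpy as np

def gauss_seidel(A, b, iters=200):
    """Solve A x = b by cyclic coordinate updates."""
    x = np.zeros_like(b)
    for _ in range(iters):
        for i in range(len(b)):
            # Update x[i] using the latest values of the other coords.
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 8))
y = X @ np.arange(1.0, 9.0) + rng.normal(size=5_000)
A, b = X.T @ X, X.T @ y       # SPD system, so Gauss-Seidel converges
print(np.round(gauss_seidel(A, b), 2))   # roughly 1..8
```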

    Supervised Learning for Complex Data

    Supervised learning problems are commonly seen in a wide range of scientific fields such as medicine and neuroscience. Given data with predictors and responses, an important goal of supervised learning is to find the underlying relationship between predictors and responses for future prediction. In this dissertation, we propose three new supervised learning approaches for the analysis of complex data. The first two projects focus on block-wise missing multi-modal data, which contain samples with different modalities. In the first project, we study regression problems with multiple responses. We propose a new penalized method to predict multiple correlated responses jointly, using not only the information from block-wise missing predictors but also the correlation information among the responses. In the second project, we study regression problems with censored outcomes. We propose a penalized Buckley-James method that can simultaneously handle block-wise missing covariates and censored outcomes. In the third project, we analyze data streams under reproducing kernel Hilbert spaces. Specifically, we develop a new supervised learning method that learns the underlying, possibly non-stationary, model with limited storage space. We use a shrinkage parameter and a data sparsity constraint to balance the bias-variance tradeoff, and use random feature approximation to control the storage space.
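    For the third project's ingredients, a bare-bones illustration: random Fourier features keep the model size fixed regardless of stream length, while a ridge-style shrinkage term regularizes the online updates. This is a generic sketch under assumed settings (Gaussian kernel, SGD step size, synthetic stream), not the dissertation's actual method, which additionally handles non-stationarity and a data sparsity constraint.

```python
# Streaming kernel regression via random Fourier features: memory
# stays O(D) no matter how long the stream runs.
import numpy as np

rng = np.random.default_rng(3)
d, D = 2, 200                      # input dim, number of random features
W = rng.normal(size=(D, d))        # frequencies for an RBF kernel
b = rng.uniform(0, 2 * np.pi, D)

def phi(x):
    """Random Fourier feature map approximating the Gaussian kernel."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

theta = np.zeros(D)
lam = 1e-3                          # shrinkage (ridge) parameter
for t in range(1, 50_001):          # process the stream one point at a time
    x = rng.normal(size=d)
    y = np.sin(x[0]) + 0.1 * rng.normal()
    z = phi(x)
    err = z @ theta - y
    theta -= (0.5 / np.sqrt(t)) * (err * z + lam * theta)  # SGD step

x_test = np.array([0.5, -1.0])
print(phi(x_test) @ theta, np.sin(0.5))   # prediction vs. truth
```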

    Forecasting Cinema Attendance at the Movie Show Level: Evidence from Poland

    Background: Cinema programmes are set in advance (usually at a weekly frequency), which motivates us to investigate short-term forecasting of attendance. In the literature on the cinema industry, attendance forecasting has received less research attention than modelling the aggregate performance of movies. Furthermore, unlike most existing studies, we use data on attendance at the individual show level (179,103 shows) rather than aggregate box office sales.
    Objectives: In this paper, we evaluate short-term forecasting models of cinema attendance. The main purpose of the study is to find the factors that are useful in forecasting cinema attendance at the individual show level (i.e., the number of tickets sold for a particular movie, time, and cinema).
    Methods/Approach: We apply several linear regression models, estimated on each recursive sample, to produce one-week-ahead forecasts of attendance. We then rank the models based on their out-of-sample fit.
    Results: The results show that the best-performing models are those that include cinema- and region-specific variables, in addition to movie parameters (e.g., genre, age classification) or title popularity.
    Conclusions: Regression models using a wide set of variables (cinema- and region-specific variables, movie features, title popularity) may be successfully applied to predict attendance at individual cinema shows in Poland.
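    The recursive (expanding-window) evaluation scheme can be sketched generically in Python: refit a linear model each week on all data observed so far, forecast the next week's shows, and score the forecasts out of sample. All variable names and the synthetic data below are illustrative assumptions, not the paper's dataset or model set.

```python
# Expanding-window, one-week-ahead forecast evaluation for a
# show-level linear regression.
import numpy as np

rng = np.random.default_rng(4)
weeks, shows_per_week, k = 52, 100, 6
X = rng.normal(size=(weeks * shows_per_week, k))   # show-level features
y = X @ rng.normal(size=k) + rng.normal(size=len(X))
week = np.repeat(np.arange(weeks), shows_per_week)

errors = []
for w in range(26, weeks):                 # start after half a year
    train, test = week < w, week == w      # all past weeks vs. next week
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    errors.append(np.mean((X[test] @ beta - y[test]) ** 2))
print("mean out-of-sample MSE:", np.mean(errors))
```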