Optimal Cox Regression Subsampling Procedure with Rare Events
Massive survival datasets are becoming increasingly prevalent with the development of the healthcare industry. Such datasets pose computational challenges unprecedented in traditional survival-analysis use cases. A popular way of coping with massive datasets is to downsample them to a more manageable size, so that the required computational resources can be afforded by the researcher. Cox proportional hazards regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the setting of right-censored and possibly left-truncated data with rare events, in which the observed failure times constitute only a small portion of the overall sample. We propose subsampling-based Cox regression estimators that approximate their full-data partial-likelihood-based counterparts by assigning optimal sampling probabilities to the censored observations and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate their finite-sample performance. We further apply our procedure to UK Biobank data on genetic and environmental risk factors for colorectal cancer.
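To make the scheme concrete, here is a minimal sketch, not the authors' implementation, of a subsampled, inverse-probability-weighted Cox fit in Python using lifelines: all observed failures are kept, censored observations are drawn with sampling probabilities (uniform here, as a stand-in for the paper's optimal probabilities), and each sampled record is weighted by the inverse of its inclusion probability. The simulated data, subsample size, and covariates are illustrative.

# Sketch of subsampled, inverse-probability-weighted Cox regression:
# keep every failure, subsample the censored observations, and weight
# them by 1/pi_i in the weighted partial likelihood.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=(n, 2))
time = rng.exponential(scale=np.exp(-0.5 * x[:, 0]))
cens = rng.exponential(scale=0.05, size=n)       # heavy censoring -> rare events
event = (time <= cens).astype(int)
obs = np.minimum(time, cens)

df = pd.DataFrame({"T": obs, "E": event, "x1": x[:, 0], "x2": x[:, 1]})

failures = df[df.E == 1]                         # all failures enter the analysis
censored = df[df.E == 0]
q = 5 * len(failures)                            # censored subsample size (a choice)
pi = np.full(len(censored), q / len(censored))   # placeholder for optimal probabilities
keep = rng.random(len(censored)) < pi
sub = pd.concat([failures.assign(w=1.0),
                 censored[keep].assign(w=1.0 / pi[keep])])

cph = CoxPHFitter()
cph.fit(sub, duration_col="T", event_col="E", weights_col="w", robust=True)
print(cph.summary[["coef", "se(coef)"]])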
Attribute dependency data analysis for massive datasets by fuzzy transforms
We present a numerical attribute dependency method for massive datasets based on the concepts of the direct and inverse fuzzy transform. In a previous work, we used these concepts for numerical attribute dependency in data analysis: there, the multi-dimensional inverse fuzzy transform was useful for approximating a regression function. Here we extend this method to massive datasets, to which the previous method could not be applied due to its high memory requirements. Our method is evaluated on a large dataset formed from 402,678 census sections of the Italian regions, provided by the Italian National Statistical Institute (ISTAT) in 2011. Comparative tests with two well-known regression methods, support vector regression and the multilayer perceptron, show that the proposed algorithm achieves performance comparable to both. Moreover, our method requires fewer parameters than either of the two cited algorithms.
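For readers unfamiliar with the underlying tool, the following is a minimal one-dimensional sketch of the direct and inverse fuzzy (F-)transform with a uniform triangular partition; the paper's method extends this to the multi-dimensional case for attribute dependency. The node count and simulated data are illustrative assumptions.

# Direct and inverse F-transform in 1D with a uniform triangular
# fuzzy partition: a handful of components F_k summarise a large sample.
import numpy as np

def triangular_basis(x, nodes):
    """A_k(x) for a uniform Ruspini partition: triangles centred at nodes."""
    h = nodes[1] - nodes[0]
    return np.clip(1.0 - np.abs(x[:, None] - nodes[None, :]) / h, 0.0, 1.0)

def direct_ft(x, y, nodes):
    A = triangular_basis(x, nodes)
    return (A * y[:, None]).sum(axis=0) / A.sum(axis=0)   # components F_k

def inverse_ft(xq, F, nodes):
    return triangular_basis(xq, nodes) @ F                # f_hat at query points

# usage: approximate a noisy regression function from 200,000 points
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200_000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
nodes = np.linspace(0, 1, 21)
F = direct_ft(x, y, nodes)          # 21 numbers summarise 200,000 points
print(inverse_ft(np.linspace(0, 1, 5), F, nodes))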
Computational Approach to Identifying Universal Macrophage Biomarkers.
Macrophages engulf and digest microbes, cellular debris, and various disease-associated cells throughout the body. Understanding the dynamics of macrophage gene expression is crucial for studying human diseases. As both bulk RNAseq and single-cell RNAseq datasets become more numerous and complex, identifying a universal and reliable macrophage cell marker becomes paramount. Traditional approaches have relied upon tissue-specific expression patterns. To identify universal biomarkers of macrophages, we used a previously published computational approach called BECC (Boolean Equivalent Correlated Clusters) that was originally used to identify conserved cell cycle genes. We performed BECC analysis using the known macrophage marker CD14 as a seed gene. The main idea behind BECC is that it uses a massive database of public gene expression datasets to establish robust co-expression patterns, identified using a combination of correlation, linear regression, and Boolean equivalences. Our analysis identified and validated FCER1G and TYROBP as novel universal biomarkers for macrophages in human and mouse tissues.
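The sketch below conveys the flavor of a Boolean-equivalence test between a seed gene and a candidate: discretise each gene to low/high and call the pair equivalent when the discordant quadrants (high, low) and (low, high) are nearly empty and the continuous correlation is strong. It is an illustrative toy, not the published BECC implementation; the median threshold stands in for BECC's StepMiner-style thresholds, and the cutoffs are assumptions.

# Toy Boolean-equivalence check between a seed gene (e.g. CD14)
# and a candidate gene across expression samples.
import numpy as np

def boolean_equivalent(seed, cand, thr_frac=0.05, min_corr=0.6):
    t1, t2 = np.median(seed), np.median(cand)   # placeholder thresholds
    hi1, hi2 = seed > t1, cand > t2
    discordant = (np.sum(hi1 & ~hi2) + np.sum(~hi1 & hi2)) / len(seed)
    corr = np.corrcoef(seed, cand)[0, 1]
    return discordant < thr_frac and corr > min_corr

rng = np.random.default_rng(2)
cd14 = rng.lognormal(mean=1.0, size=500)
coexpr = cd14 * rng.lognormal(sigma=0.05, size=500)   # tightly co-expressed candidate
unrelated = rng.lognormal(mean=1.0, size=500)
print(boolean_equivalent(cd14, coexpr))      # True: equivalent pattern
print(boolean_equivalent(cd14, unrelated))   # False: discordant quadrants filled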
Estimation of regression-based model with bulk noisy data
Bulk noise corrupts contributed data transmitted over communication networks with a tremendously low signal-to-noise ratio. An information-theoretic method for correcting massive noise in individual records is widely discussed. One practical application of this approach to bulk-noise estimation is analyzed using intelligent automation and machine-learning tools, covering the cases in which bulk noise is present or absent. A regression-based model is employed for the investigation and experiments, and an estimation procedure for the practical case of bulk noisy datasets is proposed. The proposed method applies a slice-and-dice technique to partition a body of data into smaller portions so that the estimation can be carried out. The average error, correlation, absolute error, and mean square error are computed to validate the estimation. Results from Massive Online Analysis (MOA) will be verified with data collected in the following period. In many cases, prediction with bulk noisy data through MOA simulation reveals that random imputation minimizes the average error.
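A hedged sketch of the slice-and-dice idea with random imputation follows; the slice count, noise flag, and regression form are illustrative assumptions rather than the paper's exact procedure, and the printed metrics are the four named in the abstract.

# Slice-and-dice regression on bulk-noisy data: impute corrupted entries
# by random draws from clean values, fit a regression per slice, and
# validate with average error, absolute error, MSE, and correlation.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=n)
noisy = rng.random(n) < 0.1                    # flag 10% of x as bulk-noise corrupted
x[noisy] = rng.normal(scale=50.0, size=noisy.sum())

# random imputation: overwrite flagged entries with draws from clean ones
x[noisy] = rng.choice(x[~noisy], size=noisy.sum())

preds = np.empty(n)
for sl in np.array_split(np.arange(n), 10):    # slice-and-dice into 10 portions
    X = np.column_stack([x[sl], np.ones(len(sl))])
    beta, *_ = lstsq(X, y[sl], rcond=None)
    preds[sl] = X @ beta

err = preds - y
print("average error:", err.mean())
print("absolute error:", np.abs(err).mean())
print("mean square error:", (err ** 2).mean())
print("correlation:", np.corrcoef(preds, y)[0, 1])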
Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments
With continued advances in Geographic Information Systems and related
computational technologies, statisticians are often required to analyze very
large spatial datasets. This has generated substantial interest over the last
decade, already too vast to be summarized here, in scalable methodologies for
analyzing large spatial datasets. Scalable spatial process models have been
found especially attractive due to their richness and flexibility and,
particularly so in the Bayesian paradigm, due to their presence in hierarchical
model settings. However, the vast majority of research articles in this
domain have been geared toward innovative theory or more complex model
development. Very limited attention has been accorded to approaches for
easily implementable, scalable hierarchical models for the practicing
scientist or spatial analyst. This article is submitted to the Practice
section of the journal with the aim of developing massively scalable
Bayesian approaches that can rapidly deliver Bayesian inference on spatial
processes that is practically indistinguishable from inference obtained
using more expensive alternatives. A key emphasis is on implementation
within very standard (modest) computing environments (e.g., a standard
desktop or laptop) using easily available statistical software packages,
without requiring message-passing interfaces or parallel programming
paradigms. Key insights are offered regarding assumptions and
approximations concerning practical efficiency.
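As a flavor of how such scalability is achieved, the sketch below implements a nearest-neighbour (Vecchia-type) Gaussian-process log-likelihood of the kind that underlies scalable spatial process models such as the NNGP: each observation is conditioned on at most m previously ordered neighbours, reducing the O(n^3) likelihood cost to roughly O(n m^3). The exponential covariance, parameter values, and coordinate ordering are illustrative assumptions; the article itself works with full hierarchical Bayesian models in standard statistical software.

# Nearest-neighbour (Vecchia-type) GP log-likelihood sketch.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import norm

def exp_cov(d, sigma2=1.0, phi=5.0):
    return sigma2 * np.exp(-phi * d)

def nn_gp_loglik(coords, y, m=10, sigma2=1.0, phi=5.0, tau2=0.1):
    order = np.argsort(coords[:, 0])        # simple coordinate ordering
    coords, y = coords[order], y[order]
    ll = norm.logpdf(y[0], scale=np.sqrt(sigma2 + tau2))
    for i in range(1, len(y)):
        # condition y_i on at most m earlier neighbours only
        nb = np.atleast_1d(cKDTree(coords[:i]).query(coords[i], k=min(m, i))[1])
        d_nn = np.linalg.norm(coords[nb, None] - coords[nb], axis=-1)
        C_nn = exp_cov(d_nn, sigma2, phi) + tau2 * np.eye(len(nb))
        c_in = exp_cov(np.linalg.norm(coords[nb] - coords[i], axis=1), sigma2, phi)
        w = np.linalg.solve(C_nn, c_in)
        mu = w @ y[nb]                      # conditional mean given neighbours
        var = sigma2 + tau2 - w @ c_in      # conditional variance
        ll += norm.logpdf(y[i], loc=mu, scale=np.sqrt(var))
    return ll

rng = np.random.default_rng(4)
coords = rng.uniform(size=(2_000, 2))
y = rng.normal(size=2_000)                  # stand-in for observed spatial data
print(nn_gp_loglik(coords, y))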