Optimal Cox Regression Subsampling Procedure with Rare Events
Massive survival datasets are becoming increasingly prevalent with the development of the healthcare industry. Such datasets pose computational challenges unprecedented in traditional survival-analysis use cases. A popular way of coping with massive datasets is to downsample them to a more manageable size, so that the required computational resources can be afforded by the researcher. Cox proportional hazards regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the setting of right-censored and possibly left-truncated data with rare events, in which the observed failure times constitute only a small portion of the overall sample. We propose subsampling-based Cox regression estimators that approximate their full-data partial-likelihood-based counterparts by assigning optimal sampling probabilities to the censored observations and including all observed failures in the analysis. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate their finite-sample performance. We further apply our procedure to UK Biobank data on genetic and environmental risk factors for colorectal cancer.
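To make the scheme concrete, here is a minimal sketch, not the authors' implementation, of a subsampled, inverse-probability-weighted Cox fit in Python using lifelines: all observed failures are kept, censored observations are drawn with sampling probabilities (uniform here, as a stand-in for the paper's optimal probabilities), and each sampled record is weighted by the inverse of its inclusion probability. The simulated data, subsample size, and covariates are illustrative.

# Sketch of subsampled, inverse-probability-weighted Cox regression:
# keep every failure, subsample the censored observations, and weight
# them by 1/pi_i in the weighted partial likelihood.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=(n, 2))
time = rng.exponential(scale=np.exp(-0.5 * x[:, 0]))
cens = rng.exponential(scale=0.05, size=n)       # heavy censoring -> rare events
event = (time <= cens).astype(int)
obs = np.minimum(time, cens)

df = pd.DataFrame({"T": obs, "E": event, "x1": x[:, 0], "x2": x[:, 1]})

failures = df[df.E == 1]                         # all failures enter the analysis
censored = df[df.E == 0]
q = 5 * len(failures)                            # censored subsample size (a choice)
pi = np.full(len(censored), q / len(censored))   # placeholder for optimal probabilities
keep = rng.random(len(censored)) < pi
sub = pd.concat([failures.assign(w=1.0),
                 censored[keep].assign(w=1.0 / pi[keep])])

cph = CoxPHFitter()
cph.fit(sub, duration_col="T", event_col="E", weights_col="w", robust=True)
print(cph.summary[["coef", "se(coef)"]])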
Attribute dependency data analysis for massive datasets by fuzzy transforms
We present a numerical attribute dependency method for massive datasets based on the concepts of the direct and inverse fuzzy transform. In a previous work, we used these concepts for numerical attribute dependency in data analysis: there, the multi-dimensional inverse fuzzy transform was useful for approximating a regression function. Here we extend this method to massive datasets, to which the previous method could not be applied due to its high memory requirements. Our method is evaluated on a large dataset formed from 402,678 census sections of the Italian regions, provided by the Italian National Statistical Institute (ISTAT) in 2011. Comparative tests with two well-known regression methods, support vector regression and the multilayer perceptron, show that the proposed algorithm achieves performance comparable to both. Moreover, our method requires fewer parameters than either of the two cited algorithms.
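For readers unfamiliar with the underlying tool, the following is a minimal one-dimensional sketch of the direct and inverse fuzzy (F-)transform with a uniform triangular partition; the paper's method extends this to the multi-dimensional case for attribute dependency. The node count and simulated data are illustrative assumptions.

# Direct and inverse F-transform in 1D with a uniform triangular
# fuzzy partition: a handful of components F_k summarise a large sample.
import numpy as np

def triangular_basis(x, nodes):
    """A_k(x) for a uniform Ruspini partition: triangles centred at nodes."""
    h = nodes[1] - nodes[0]
    return np.clip(1.0 - np.abs(x[:, None] - nodes[None, :]) / h, 0.0, 1.0)

def direct_ft(x, y, nodes):
    A = triangular_basis(x, nodes)
    return (A * y[:, None]).sum(axis=0) / A.sum(axis=0)   # components F_k

def inverse_ft(xq, F, nodes):
    return triangular_basis(xq, nodes) @ F                # f_hat at query points

# usage: approximate a noisy regression function from 200,000 points
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200_000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
nodes = np.linspace(0, 1, 21)
F = direct_ft(x, y, nodes)          # 21 numbers summarise 200,000 points
print(inverse_ft(np.linspace(0, 1, 5), F, nodes))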
Computational Approach to Identifying Universal Macrophage Biomarkers.
Macrophages engulf and digest microbes, cellular debris, and various disease-associated cells throughout the body. Understanding the dynamics of macrophage gene expression is crucial for studying human diseases. As both bulk RNAseq and single-cell RNAseq datasets become more numerous and complex, identifying a universal and reliable macrophage cell marker becomes paramount. Traditional approaches have relied upon tissue-specific expression patterns. To identify universal biomarkers of macrophages, we used a previously published computational approach called BECC (Boolean Equivalent Correlated Clusters) that was originally used to identify conserved cell cycle genes. We performed BECC analysis using the known macrophage marker CD14 as a seed gene. The main idea behind BECC is that it uses a massive database of public gene expression datasets to establish robust co-expression patterns, identified using a combination of correlation, linear regression, and Boolean equivalences. Our analysis identified and validated FCER1G and TYROBP as novel universal biomarkers for macrophages in human and mouse tissues.
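The sketch below conveys the flavor of a Boolean-equivalence test between a seed gene and a candidate: discretise each gene to low/high and call the pair equivalent when the discordant quadrants (high, low) and (low, high) are nearly empty and the continuous correlation is strong. It is an illustrative toy, not the published BECC implementation; the median threshold stands in for BECC's StepMiner-style thresholds, and the cutoffs are assumptions.

# Toy Boolean-equivalence check between a seed gene (e.g. CD14)
# and a candidate gene across expression samples.
import numpy as np

def boolean_equivalent(seed, cand, thr_frac=0.05, min_corr=0.6):
    t1, t2 = np.median(seed), np.median(cand)   # placeholder thresholds
    hi1, hi2 = seed > t1, cand > t2
    discordant = (np.sum(hi1 & ~hi2) + np.sum(~hi1 & hi2)) / len(seed)
    corr = np.corrcoef(seed, cand)[0, 1]
    return discordant < thr_frac and corr > min_corr

rng = np.random.default_rng(2)
cd14 = rng.lognormal(mean=1.0, size=500)
coexpr = cd14 * rng.lognormal(sigma=0.05, size=500)   # tightly co-expressed candidate
unrelated = rng.lognormal(mean=1.0, size=500)
print(boolean_equivalent(cd14, coexpr))      # True: equivalent pattern
print(boolean_equivalent(cd14, unrelated))   # False: discordant quadrants filled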
Estimation of regression-based model with bulk noisy data
Bulk noise corrupts contributed data transmitted over communication networks with a tremendously low signal-to-noise ratio. An information-theoretic method for correcting massive noise in individual records is widely discussed. One practical application of this approach to bulk-noise estimation is analyzed using intelligent automation and machine-learning tools, covering the cases in which bulk noise is present or absent. A regression-based model is employed for the investigation and experiments, and an estimation procedure for the practical case of bulk noisy datasets is proposed. The proposed method applies a slice-and-dice technique to partition a body of data into smaller portions so that the estimation can be carried out. The average error, correlation, absolute error, and mean square error are computed to validate the estimation. Results from Massive Online Analysis (MOA) will be verified with data collected in the following period. In many cases, prediction with bulk noisy data through MOA simulation reveals that random imputation minimizes the average error.
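A hedged sketch of the slice-and-dice idea with random imputation follows; the slice count, noise flag, and regression form are illustrative assumptions rather than the paper's exact procedure, and the printed metrics are the four named in the abstract.

# Slice-and-dice regression on bulk-noisy data: impute corrupted entries
# by random draws from clean values, fit a regression per slice, and
# validate with average error, absolute error, MSE, and correlation.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=n)
noisy = rng.random(n) < 0.1                    # flag 10% of x as bulk-noise corrupted
x[noisy] = rng.normal(scale=50.0, size=noisy.sum())

# random imputation: overwrite flagged entries with draws from clean ones
x[noisy] = rng.choice(x[~noisy], size=noisy.sum())

preds = np.empty(n)
for sl in np.array_split(np.arange(n), 10):    # slice-and-dice into 10 portions
    X = np.column_stack([x[sl], np.ones(len(sl))])
    beta, *_ = lstsq(X, y[sl], rcond=None)
    preds[sl] = X @ beta

err = preds - y
print("average error:", err.mean())
print("absolute error:", np.abs(err).mean())
print("mean square error:", (err ** 2).mean())
print("correlation:", np.corrcoef(preds, y)[0, 1])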
Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments
With continued advances in Geographic Information Systems and related
computational technologies, statisticians are often required to analyze very
large spatial datasets. This has generated substantial interest over the last
decade, already too vast to be summarized here, in scalable methodologies for
analyzing large spatial datasets. Scalable spatial process models have been
found especially attractive due to their richness and flexibility and,
particularly so in the Bayesian paradigm, due to their presence in hierarchical
model settings. However, the vast majority of research articles in this
domain have been geared toward innovative theory or more complex model
development. Very limited attention has been accorded to approaches for
easily implementable, scalable hierarchical models for the practicing
scientist or spatial analyst. This article is submitted to the Practice
section of the journal with the aim of developing massively scalable
Bayesian approaches that can rapidly deliver Bayesian inference on spatial
processes that is practically indistinguishable from inference obtained
using more expensive alternatives. A key emphasis is on implementation
within very standard (modest) computing environments (e.g., a standard
desktop or laptop) using easily available statistical software packages,
without requiring message-passing interfaces or parallel programming
paradigms. Key insights are offered regarding assumptions and
approximations concerning practical efficiency.
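As a flavor of how such scalability is achieved, the sketch below implements a nearest-neighbour (Vecchia-type) Gaussian-process log-likelihood of the kind that underlies scalable spatial process models such as the NNGP: each observation is conditioned on at most m previously ordered neighbours, reducing the O(n^3) likelihood cost to roughly O(n m^3). The exponential covariance, parameter values, and coordinate ordering are illustrative assumptions; the article itself works with full hierarchical Bayesian models in standard statistical software.

# Nearest-neighbour (Vecchia-type) GP log-likelihood sketch.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import norm

def exp_cov(d, sigma2=1.0, phi=5.0):
    return sigma2 * np.exp(-phi * d)

def nn_gp_loglik(coords, y, m=10, sigma2=1.0, phi=5.0, tau2=0.1):
    order = np.argsort(coords[:, 0])        # simple coordinate ordering
    coords, y = coords[order], y[order]
    ll = norm.logpdf(y[0], scale=np.sqrt(sigma2 + tau2))
    for i in range(1, len(y)):
        # condition y_i on at most m earlier neighbours only
        nb = np.atleast_1d(cKDTree(coords[:i]).query(coords[i], k=min(m, i))[1])
        d_nn = np.linalg.norm(coords[nb, None] - coords[nb], axis=-1)
        C_nn = exp_cov(d_nn, sigma2, phi) + tau2 * np.eye(len(nb))
        c_in = exp_cov(np.linalg.norm(coords[nb] - coords[i], axis=1), sigma2, phi)
        w = np.linalg.solve(C_nn, c_in)
        mu = w @ y[nb]                      # conditional mean given neighbours
        var = sigma2 + tau2 - w @ c_in      # conditional variance
        ll += norm.logpdf(y[i], loc=mu, scale=np.sqrt(var))
    return ll

rng = np.random.default_rng(4)
coords = rng.uniform(size=(2_000, 2))
y = rng.normal(size=2_000)                  # stand-in for observed spatial data
print(nn_gp_loglik(coords, y))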