A memory-based method to select the number of relevant components in Principal Component Analysis
We propose a new data-driven method to select the optimal number of relevant
components in Principal Component Analysis (PCA). This new method applies to
correlation matrices whose time autocorrelation function decays more slowly
than an exponential, giving rise to long-memory effects. In comparison with
other methods available in the literature, our procedure does not rely
on subjective evaluations and is computationally inexpensive. The underlying
basic idea is to use a suitable factor model to analyse the residual memory
after sequentially removing more and more components, and stopping the process
when the maximum amount of memory has been accounted for by the retained
components. We validate our methodology on both synthetic and real financial
data, and in all cases find a clear answer, obtained at low computational
cost and entirely compatible with available heuristic criteria such as cumulative
variance and cross-validation. Comment: 29 pages, published
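The cumulative-variance benchmark mentioned in the abstract can be sketched in a few lines; the synthetic factor setup and the 90% threshold below are illustrative assumptions, not the paper's memory-based criterion (which instead tests the residual memory left after removing each component).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 latent factors driving 10 observed series.
T, N, K = 2000, 10, 3
factors = rng.standard_normal((T, K))
loadings = rng.standard_normal((K, N))
X = factors @ loadings + 0.5 * rng.standard_normal((T, N))

# Eigen-decomposition of the correlation matrix, largest eigenvalues first.
C = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Cumulative-variance heuristic: keep the smallest k components whose
# eigenvalues account for at least 90% of the total variance.
cumvar = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cumvar, 0.90) + 1)
print(k, cumvar[:k])
```

In the paper's procedure the fixed 90% threshold would be replaced by a stopping rule on the autocorrelation of the residuals after each component is removed.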
Microdata Disclosure by Resampling: Empirical Findings for Business Survey Data
A problem statistical offices and research institutes face when releasing micro-data is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, i.e., the information loss is potentially high. In this paper I discuss an alternative technique for creating scientific-use files which reproduces the characteristics of the original data quite well. It is based on Fienberg's (1997 and 1994) [5], [6] idea to estimate and resample from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates datasets - the resamples - which have the same characteristics as the original survey data. In this paper I present some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and compare resampling with a common method of disclosure control, i.e. disturbance with multiplicative error, concerning confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be reproduced better by unweighted resampling. Parameter estimates can be reproduced quite well if (a) the resampling procedure uses the correlation structure of the original data as a scale and (b) the data are multiplicatively perturbed and a correction term is used. On average, anonymized data with multiplicatively perturbed values protect better against re-identification than the various resampling methods used.
Keywords: resampling, multiplicative data perturbation, Monte Carlo studies, business survey data
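The two anonymization schemes being compared can be illustrated on a toy variable; the lognormal "turnover" distribution and the noise scale below are assumptions for illustration, not the MIP data or the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed business-survey variable (e.g. firm turnover).
original = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

# (a) Unweighted resampling: draw with replacement from the empirical
# distribution of the original records (a bootstrap-style synthetic file).
synthetic = rng.choice(original, size=original.size, replace=True)

# (b) Multiplicative perturbation: multiply each record by noise with unit
# mean, so expectations are preserved without any further correction term.
noise = rng.normal(loc=1.0, scale=0.1, size=original.size)
perturbed = original * noise

print(original.mean(), synthetic.mean(), perturbed.mean())
```

Both files keep the mean of the original variable; the trade-off studied in the paper is between such analytic validity and the re-identification risk each scheme leaves.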
Decentralized learning with budgeted network load using Gaussian copulas and classifier ensembles
We examine a network of learners which address the same classification task
but must learn from different data sets. The learners cannot share data but
instead share their models. Models are shared only once so as to limit the
network load. We introduce DELCO (standing for Decentralized Ensemble
Learning with COpulas), a new approach for aggregating the predictions of
the classifiers trained by each learner. The proposed method aggregates the
base classifiers using a probabilistic model relying on Gaussian copulas.
Experiments on logistic regression ensembles demonstrate competitive accuracy and
increased robustness when the base classifiers are dependent. A companion Python
implementation can be downloaded at https://github.com/john-klein/DELC
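A rough sketch of copula-based score fusion, assuming two simulated base classifiers with correlated outputs; the score model and the fusion rule below are simplifications for illustration, not the DELCO algorithm itself.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Two hypothetical base classifiers emitting correlated scores for the
# positive class (stand-ins for the locally trained models).
n = 500
y = rng.integers(0, 2, size=n)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
scores = norm.cdf(latent + y[:, None] * 1.5 - 0.75)   # shape (n, 2), in (0, 1)

def gaussian_copula_density(u, R):
    """Gaussian copula density with correlation matrix R, evaluated row-wise."""
    z = norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
    M = np.linalg.inv(R) - np.eye(len(R))
    quad = np.einsum('ij,jk,ik->i', z, M, z)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R))

# Fuse the two scores: the copula density reweights the naive product rule,
# which would be exact only if the classifiers were independent.
R = np.corrcoef(scores, rowvar=False)
fused = scores.prod(axis=1) * gaussian_copula_density(scores, R)
print(fused[y == 1].mean(), fused[y == 0].mean())
```

With the identity correlation matrix the copula density is 1 everywhere and the rule collapses to the independence baseline, which is what modeling the dependence is meant to improve on.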
Statistics of the seasonal cycle of the 1951-2000 surface temperature records in Italy
We present an analysis of the seasonal cycle of the last 50 years of surface
temperature records in Italy. We consider two data sets which synthesize the
surface temperature fields of Northern and Southern Italy. Such data sets
consist of records of daily maximum and minimum temperature. We compute the
best estimate of the seasonal cycle of the variables considered by adopting the
cyclograms' technique. We observe that in general the minimum temperature cycle
lags behind the maximum temperature cycle, and that the cycles of the Southern
Italy temperature records lag behind the corresponding cycles for
Northern Italy. All seasonal cycles lag considerably behind the solar cycle.
The amplitude and phase of the seasonal cycles do not show any statistically
significant trend in the time interval considered. Comment: 30 pages, 6 figures, submitted to IJ
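The amplitude and phase lag of a seasonal cycle can be estimated by a least-squares fit of the first annual harmonic; the paper uses the cyclograms' technique, so the sketch below, on synthetic data with an assumed 30-day lag behind the solar forcing, is only a simpler stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic daily temperature record: annual cycle peaking 30 days after
# t = 0, plus weather noise (illustrative, not the Italian records).
days = np.arange(365 * 10)
omega = 2 * np.pi / 365.25
temp = 15 + 10.0 * np.cos(omega * (days - 30.0)) \
       + 2.0 * rng.standard_normal(days.size)

# Least-squares fit of the first harmonic: T(t) ~ m + a*cos(wt) + b*sin(wt).
A = np.column_stack([np.ones_like(days, dtype=float),
                     np.cos(omega * days), np.sin(omega * days)])
m, a, b = np.linalg.lstsq(A, temp, rcond=None)[0]

amplitude = np.hypot(a, b)            # size of the seasonal cycle
phase_lag = np.arctan2(b, a) / omega  # days by which the cycle lags t = 0
print(amplitude, phase_lag)
```

Comparing the fitted phases of two such records (e.g. minimum vs. maximum temperature) gives the kind of lag statement made in the abstract.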
Segmentation of Fault Networks Determined from Spatial Clustering of Earthquakes
We present a new method of data clustering applied to earthquake catalogs,
with the goal of reconstructing the seismically active part of fault networks.
We first use an original method to separate clustered events from uncorrelated
seismicity using the distribution of volumes of tetrahedra defined by closest
neighbor events in the original and randomized seismic catalogs. The spatial
disorder of the complex geometry of fault networks is then taken into account
by defining faults as probabilistic anisotropic kernels, whose structures are
motivated by properties of discontinuous tectonic deformation and previous
empirical observations of the geometry of faults and of earthquake clusters at
many spatial and temporal scales. Combining this a priori knowledge with
information theoretical arguments, we propose the Gaussian mixture approach
implemented in an Expectation-Maximization (EM) procedure. A cross-validation
scheme is then used to determine the number of kernels that provides an
optimal data clustering of the catalog. This
three-step approach is applied to a high-quality relocated catalog of the
seismicity following the 1986 Mount Lewis event in California, and
reveals that events cluster along planar patches of about 2 km, i.e.
comparable to the size of the main event. The finite thickness of those
clusters (about 290 m) suggests that events do not occur on well-defined
Euclidean fault core surfaces, but rather that the damage zone surrounding
faults may be seismically active at depth. Finally, we propose a connection
between our methodology and multi-scale spatial analysis, based on the
derivation of a spatial fractal dimension of about 1.8 for the set of hypocenters
in the Mount Lewis area, consistent with recent observations on relocated
catalogs.