A memory-based method to select the number of relevant components in Principal Component Analysis
We propose a new data-driven method to select the optimal number of relevant
components in Principal Component Analysis (PCA). This new method applies to
correlation matrices whose time autocorrelation function decays more slowly
than an exponential, giving rise to long-memory effects. In comparison with
other methods available in the literature, our procedure does not rely
on subjective evaluations and is computationally inexpensive. The underlying
basic idea is to use a suitable factor model to analyse the residual memory
after sequentially removing more and more components, and stopping the process
when the maximum amount of memory has been accounted for by the retained
components. We validate our methodology on both synthetic and real financial
data, and in all cases find a clear answer, obtained at low computational
cost and entirely compatible with available heuristic criteria such as cumulative
variance and cross-validation. Comment: 29 pages, published
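The cumulative-variance benchmark mentioned in the abstract can be sketched in a few lines; the synthetic factor setup and the 90% threshold below are illustrative assumptions, not the paper's memory-based criterion (which instead tests the residual memory left after removing each component).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 latent factors driving 10 observed series.
T, N, K = 2000, 10, 3
factors = rng.standard_normal((T, K))
loadings = rng.standard_normal((K, N))
X = factors @ loadings + 0.5 * rng.standard_normal((T, N))

# Eigen-decomposition of the correlation matrix, largest eigenvalues first.
C = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Cumulative-variance heuristic: keep the smallest k components whose
# eigenvalues account for at least 90% of the total variance.
cumvar = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cumvar, 0.90) + 1)
print(k, cumvar[:k])
```

In the paper's procedure the fixed 90% threshold would be replaced by a stopping rule on the autocorrelation of the residuals after each component is removed.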
Microdata Disclosure by Resampling: Empirical Findings for Business Survey Data
A problem statistical offices and research institutes face when releasing micro-data is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, i.e., the information loss is potentially high. In this paper I discuss an alternative technique for creating scientific-use files which reproduces the characteristics of the original data quite well. It is based on Fienberg's (1997 and 1994) [5], [6] idea to estimate and resample from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates datasets - the resamples - which have the same characteristics as the original survey data. In this paper I present some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and compare resampling with a common method of disclosure control, i.e. disturbance with multiplicative error, concerning confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be reproduced better by unweighted resampling. Parameter estimates can be reproduced quite well if (a) the resampling procedure uses the correlation structure of the original data as a scale and (b) the data are multiplicatively perturbed and a correction term is used. On average, anonymized data with multiplicatively perturbed values protect better against re-identification than the various resampling methods used.
Keywords: resampling, multiplicative data perturbation, Monte Carlo studies, business survey data
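The two anonymization schemes being compared can be illustrated on a toy variable; the lognormal "turnover" distribution and the noise scale below are assumptions for illustration, not the MIP data or the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed business-survey variable (e.g. firm turnover).
original = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

# (a) Unweighted resampling: draw with replacement from the empirical
# distribution of the original records (a bootstrap-style synthetic file).
synthetic = rng.choice(original, size=original.size, replace=True)

# (b) Multiplicative perturbation: multiply each record by noise with unit
# mean, so expectations are preserved without any further correction term.
noise = rng.normal(loc=1.0, scale=0.1, size=original.size)
perturbed = original * noise

print(original.mean(), synthetic.mean(), perturbed.mean())
```

Both files keep the mean of the original variable; the trade-off studied in the paper is between such analytic validity and the re-identification risk each scheme leaves.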
Decentralized learning with budgeted network load using Gaussian copulas and classifier ensembles
We examine a network of learners which address the same classification task
but must learn from different data sets. The learners cannot share data but
instead share their models. Models are shared only once so as to limit the
network load. We introduce DELCO (standing for Decentralized Ensemble
Learning with COpulas), a new approach for aggregating the predictions of
the classifiers trained by each learner. The proposed method aggregates the
base classifiers using a probabilistic model relying on Gaussian copulas.
Experiments on logistic regression ensembles demonstrate competitive accuracy and
increased robustness when the base classifiers are dependent. A companion Python
implementation can be downloaded at https://github.com/john-klein/DELC
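A rough sketch of copula-based score fusion, assuming two simulated base classifiers with correlated outputs; the score model and the fusion rule below are simplifications for illustration, not the DELCO algorithm itself.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Two hypothetical base classifiers emitting correlated scores for the
# positive class (stand-ins for the locally trained models).
n = 500
y = rng.integers(0, 2, size=n)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
scores = norm.cdf(latent + y[:, None] * 1.5 - 0.75)   # shape (n, 2), in (0, 1)

def gaussian_copula_density(u, R):
    """Gaussian copula density with correlation matrix R, evaluated row-wise."""
    z = norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
    M = np.linalg.inv(R) - np.eye(len(R))
    quad = np.einsum('ij,jk,ik->i', z, M, z)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R))

# Fuse the two scores: the copula density reweights the naive product rule,
# which would be exact only if the classifiers were independent.
R = np.corrcoef(scores, rowvar=False)
fused = scores.prod(axis=1) * gaussian_copula_density(scores, R)
print(fused[y == 1].mean(), fused[y == 0].mean())
```

With the identity correlation matrix the copula density is 1 everywhere and the rule collapses to the independence baseline, which is what modeling the dependence is meant to improve on.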
Statistics of the seasonal cycle of the 1951-2000 surface temperature records in Italy
We present an analysis of the seasonal cycle of the last 50 years of surface
temperature records in Italy. We consider two data sets which synthesize the
surface temperature fields of Northern and Southern Italy. Such data sets
consist of records of daily maximum and minimum temperature. We compute the
best estimate of the seasonal cycle of the variables considered by adopting the
cyclograms' technique. We observe that in general the minimum temperature cycle
lags behind the maximum temperature cycle, and that the cycles of the Southern
Italy temperature records lag behind the corresponding cycles for
Northern Italy. All seasonal cycles lag considerably behind the solar cycle.
The amplitude and phase of the seasonal cycles do not show any statistically
significant trend in the time interval considered. Comment: 30 pages, 6 figures, submitted to IJ
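The amplitude and phase lag of a seasonal cycle can be estimated by a least-squares fit of the first annual harmonic; the paper uses the cyclograms' technique, so the sketch below, on synthetic data with an assumed 30-day lag behind the solar forcing, is only a simpler stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic daily temperature record: annual cycle peaking 30 days after
# t = 0, plus weather noise (illustrative, not the Italian records).
days = np.arange(365 * 10)
omega = 2 * np.pi / 365.25
temp = 15 + 10.0 * np.cos(omega * (days - 30.0)) \
       + 2.0 * rng.standard_normal(days.size)

# Least-squares fit of the first harmonic: T(t) ~ m + a*cos(wt) + b*sin(wt).
A = np.column_stack([np.ones_like(days, dtype=float),
                     np.cos(omega * days), np.sin(omega * days)])
m, a, b = np.linalg.lstsq(A, temp, rcond=None)[0]

amplitude = np.hypot(a, b)            # size of the seasonal cycle
phase_lag = np.arctan2(b, a) / omega  # days by which the cycle lags t = 0
print(amplitude, phase_lag)
```

Comparing the fitted phases of two such records (e.g. minimum vs. maximum temperature) gives the kind of lag statement made in the abstract.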
Segmentation of Fault Networks Determined from Spatial Clustering of Earthquakes
We present a new method of data clustering applied to earthquake catalogs,
with the goal of reconstructing the seismically active part of fault networks.
We first use an original method to separate clustered events from uncorrelated
seismicity using the distribution of volumes of tetrahedra defined by closest
neighbor events in the original and randomized seismic catalogs. The spatial
disorder of the complex geometry of fault networks is then taken into account
by defining faults as probabilistic anisotropic kernels, whose structures are
motivated by properties of discontinuous tectonic deformation and previous
empirical observations of the geometry of faults and of earthquake clusters at
many spatial and temporal scales. Combining this a priori knowledge with
information theoretical arguments, we propose the Gaussian mixture approach
implemented in an Expectation-Maximization (EM) procedure. A cross-validation
scheme is then used to determine the number of kernels that provides an
optimal data clustering of the catalog. This
three-step approach is applied to a high-quality relocated catalog of the
seismicity following the 1986 Mount Lewis event in California, and
reveals that events cluster along planar patches of about 2 km, i.e.
comparable to the size of the main event. The finite thickness of those
clusters (about 290 m) suggests that events do not occur on well-defined
Euclidean fault core surfaces, but rather that the damage zone surrounding
faults may be seismically active at depth. Finally, we propose a connection
between our methodology and multi-scale spatial analysis, based on the
derivation of a spatial fractal dimension of about 1.8 for the set of hypocenters
in the Mount Lewis area, consistent with recent observations on relocated
catalogs.