Approaches for Outlier Detection in Sparse High-Dimensional Regression Models
Modern regression studies often encompass a very large number of potential predictors,
possibly larger than the sample size, and sometimes growing with the sample
size itself. This increases the chances that a substantial portion of the predictors
is redundant, as well as the risk of data contamination. Tackling these problems is
of utmost importance to facilitate scientific discoveries, since model estimates are
highly sensitive both to the choice of predictors and to the presence of outliers. In
this thesis, we contribute to this area by considering the problem of robust model selection
in a variety of settings, where outliers may arise both in the response and
the predictors. Our proposals simplify model interpretation, guarantee predictive
performance, and allow us to study and control the influence of outlying cases on
the fit.
First, we consider the co-occurrence of multiple mean-shift and variance-inflation
outliers in low-dimensional linear models. We rely on robust estimation techniques
to identify outliers of each type, exclude mean-shift outliers, and use restricted
maximum likelihood estimation to down-weight and accommodate variance-inflation
outliers into the model fit. Second, we extend our setting to high-dimensional linear
models. We show that mean-shift and variance-inflation outliers can be modeled as
additional fixed and random components, respectively, and evaluated independently.
Specifically, we perform feature selection and mean-shift outlier detection through
a robust class of nonconcave penalization methods, and variance-inflation outlier
detection through the penalization of the restricted posterior mode. The resulting
approach satisfies a robust oracle property for feature selection in the presence of
data contamination – which allows the number of features to exponentially increase
with the sample size – and detects truly outlying cases of each type with asymptotic
probability one. This provides an optimal trade-off between a high breakdown point
and efficiency. Third, focusing on high-dimensional linear models affected by meanshift
outliers, we develop a general framework in which L0-constraints coupled with
mixed-integer programming techniques are used to perform simultaneous feature
selection and outlier detection with provably optimal guarantees. In particular,
we provide necessary and sufficient conditions for a robustly strong oracle property,
where again the number of features can increase exponentially with the sample size,
and prove optimality for parameter estimation and the resulting breakdown point.
Finally, we consider generalized linear models and rely on logistic slippage to perform
outlier detection and removal in binary classification. Here we use L0-constraints
and mixed-integer conic programming techniques to solve the underlying double
combinatorial problem of feature selection and outlier detection, and the framework
allows us again to pursue optimality guarantees.
For all the proposed approaches, we also provide computationally lean heuristic
algorithms, tuning procedures, and diagnostic tools which help to guide the analysis.
We consider several real-world applications, including the study of the relationships
between childhood obesity and the human microbiome, and of the main drivers of
honey bee loss. All methods developed and data used, as well as the source code to
replicate our analyses, are publicly available.
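As a rough illustration of the simultaneous feature selection and outlier detection described in the abstract above, the following is a minimal sketch and not the thesis' own method: it swaps mixed-integer programming for exhaustive enumeration, which is only feasible for tiny problems. The function names `l0_select` and `make_toy_data`, the cardinality parameters, and the synthetic data are all assumptions introduced here for illustration.

```python
from itertools import combinations

import numpy as np


def l0_select(X, y, k_features, k_outliers):
    """Brute-force L0-constrained trimmed least squares (illustrative only).

    Tries every subset of k_features columns and every set of k_outliers
    cases to exclude, and keeps the pair with the smallest residual sum
    of squares on the remaining cases.
    """
    n, p = X.shape
    best_rss, best_feats, best_outs = np.inf, None, None
    for feats in combinations(range(p), k_features):
        Xs = X[:, list(feats)]
        for outs in combinations(range(n), k_outliers):
            keep = np.ones(n, dtype=bool)
            keep[list(outs)] = False  # drop the candidate outliers
            beta, *_ = np.linalg.lstsq(Xs[keep], y[keep], rcond=None)
            rss = float(np.sum((y[keep] - Xs[keep] @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_feats, best_outs = rss, feats, outs
    return best_feats, best_outs, best_rss


def make_toy_data(seed=0):
    """Hypothetical data: 2 active features out of 4, 2 mean-shift outliers."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(12, 4))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.01 * rng.normal(size=12)
    y[[3, 7]] += 10.0  # mean-shift contamination in the response
    return X, y


X, y = make_toy_data()
feats, outs, rss = l0_select(X, y, k_features=2, k_outliers=2)
print("selected features:", feats)
print("flagged cases:", outs)
```

The search space grows combinatorially in both the number of features and the number of cases, which is why the abstract turns to mixed-integer programming and heuristic algorithms rather than enumeration.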
Measurement of the cosmic ray spectrum above eV using inclined events detected with the Pierre Auger Observatory
A measurement of the cosmic-ray spectrum for energies exceeding
eV is presented, which is based on the analysis of showers
with zenith angles greater than detected with the Pierre Auger
Observatory between 1 January 2004 and 31 December 2013. The measured spectrum
confirms a flux suppression at the highest energies. Above
eV, the "ankle", the flux can be described by a power law with
index followed by
a smooth suppression region. For the energy () at which the
spectral flux has fallen to one-half of its extrapolated value in the absence
of suppression, we find
eV.
Comment: Replaced with published version. Added journal reference and DOI.
Energy Estimation of Cosmic Rays with the Engineering Radio Array of the Pierre Auger Observatory
The Auger Engineering Radio Array (AERA) is part of the Pierre Auger
Observatory and is used to detect the radio emission of cosmic-ray air showers.
These observations are compared to the data of the surface detector stations of
the Observatory, which provide well-calibrated information on the cosmic-ray
energies and arrival directions. The response of the radio stations in the 30
to 80 MHz regime has been thoroughly calibrated to enable the reconstruction of
the incoming electric field. For the latter, the energy deposit per area is
determined from the radio pulses at each observer position and is interpolated
using a two-dimensional function that takes into account signal asymmetries due
to interference between the geomagnetic and charge-excess emission components.
The spatial integral over the signal distribution gives a direct measurement of
the energy transferred from the primary cosmic ray into radio emission in the
AERA frequency range. We measure 15.8 MeV of radiation energy for a 1 EeV air
shower arriving perpendicularly to the geomagnetic field. This radiation energy
-- corrected for geometrical effects -- is used as a cosmic-ray energy
estimator. Performing an absolute energy calibration against the
surface-detector information, we observe that this radio-energy estimator
scales quadratically with the cosmic-ray energy as expected for coherent
emission. We find an energy resolution of the radio reconstruction of 22% for
the data set and 17% for a high-quality subset containing only events with at
least five radio stations with signal.
Comment: Replaced with published version. Added journal reference and DOI.
Measurement of the Radiation Energy in the Radio Signal of Extensive Air Showers as a Universal Estimator of Cosmic-Ray Energy
We measure the energy emitted by extensive air showers in the form of radio
emission in the frequency range from 30 to 80 MHz. Exploiting the accurate
energy scale of the Pierre Auger Observatory, we obtain a radiation energy of
15.8 ± 0.7 (stat) ± 6.7 (sys) MeV for cosmic rays with an energy of 1 EeV
arriving perpendicularly to a geomagnetic field of 0.24 G, scaling
quadratically with the cosmic-ray energy. A comparison with predictions from
state-of-the-art first-principle calculations shows agreement with our
measurement. The radiation energy provides direct access to the calorimetric
energy in the electromagnetic cascade of extensive air showers. Comparison with
our result thus allows the direct calibration of any cosmic-ray radio detector
against the well-established energy scale of the Pierre Auger Observatory.
Comment: Replaced with published version. Added journal reference and DOI. Supplemental material in the ancillary file.
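The quadratic scaling quoted above can be turned into a back-of-the-envelope conversion. The sketch below assumes a pure power law with exponent 2, anchored at the central value of 15.8 MeV at 1 EeV for a shower arriving perpendicularly to a 0.24 G geomagnetic field. It ignores the quoted statistical and systematic uncertainties and any geometrical corrections, and the function names are hypothetical.

```python
import math

# Central value from the abstract: 15.8 MeV of radiation energy at 1 EeV,
# for perpendicular arrival to a 0.24 G geomagnetic field.
REF_RADIATION_ENERGY_MEV = 15.8
REF_COSMIC_RAY_ENERGY_EEV = 1.0


def radiation_energy_mev(cosmic_ray_energy_eev):
    """Radiation energy in MeV, assuming exact quadratic scaling."""
    ratio = cosmic_ray_energy_eev / REF_COSMIC_RAY_ENERGY_EEV
    return REF_RADIATION_ENERGY_MEV * ratio ** 2


def cosmic_ray_energy_eev(radiation_energy):
    """Invert the quadratic relation to estimate the cosmic-ray energy in EeV."""
    ratio = radiation_energy / REF_RADIATION_ENERGY_MEV
    return REF_COSMIC_RAY_ENERGY_EEV * math.sqrt(ratio)


print(radiation_energy_mev(2.0))
print(cosmic_ray_energy_eev(radiation_energy_mev(3.5)))
```

Under this assumption, doubling the cosmic-ray energy quadruples the radiation energy, so a relative error on the radiation energy translates into roughly half that relative error on the estimated cosmic-ray energy.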
Combined fit to the spectrum and composition data measured by the Pierre Auger Observatory including magnetic horizon effects
The measurements by the Pierre Auger Observatory of the energy spectrum and mass composition of cosmic rays can be interpreted assuming the presence of two extragalactic source populations, one dominating the flux at energies above a few EeV and the other below. To fit the data ignoring magnetic field effects, the high-energy population needs to accelerate a mixture of nuclei with very hard spectra, at odds with the approximate E^{-2} shape expected from diffusive shock acceleration. The presence of turbulent extragalactic magnetic fields in the region between the closest sources and the Earth can significantly modify the observed CR spectrum with respect to that emitted by the sources, reducing the flux of low-rigidity particles that reach the Earth. We here take into account this magnetic horizon effect in the combined fit of the spectrum and shower depth distributions, exploring the possibility that a spectrum for the high-energy population sources with a shape closer to E^{-2} be able to explain the observations.
- …