Sparsity with sign-coherent groups of variables via the cooperative-Lasso
We consider the problems of estimation and selection of parameters endowed
with a known group structure, when the groups are assumed to be sign-coherent,
that is, gathering either nonnegative, nonpositive or null parameters. To
tackle this problem, we propose the cooperative-Lasso penalty. We derive the
optimality conditions defining the cooperative-Lasso estimate for generalized
linear models, and propose an efficient active set algorithm suited to
high-dimensional problems. We study the asymptotic consistency of the estimator
in the linear regression setup and derive its irrepresentable conditions, which
are milder than the ones of the group-Lasso regarding the matching of groups
with the sparsity pattern of the true parameters. We also address the problem
of model selection in linear regression by deriving an approximation of the
degrees of freedom of the cooperative-Lasso estimator. Simulations comparing
the proposed estimator to the group and sparse group-Lasso comply with our
theoretical results, showing consistent improvements in support recovery for
sign-coherent groups. We finally propose two examples illustrating the wide
applicability of the cooperative-Lasso: first to the processing of ordinal
variables, where the penalty acts as a monotonicity prior; second to the
processing of genomic data, where the set of differentially expressed probes is
enriched by incorporating all the probes of the microarray that are related to
the corresponding genes.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS520 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
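To make the idea of sign coherence concrete, the following minimal sketch evaluates a penalty of the cooperative-Lasso type, assuming it takes the form of a sum over groups of the Euclidean norm of the positive part plus the Euclidean norm of the negative part of each group's coefficients. The abstract does not state the formula, so this form, the unit group weights, and the function names `coop_penalty` and `group_penalty` are assumptions made for illustration.

```python
import numpy as np

def coop_penalty(beta, groups):
    """Assumed cooperative-Lasso-type penalty with unit group weights:
    sum over groups of ||positive part||_2 + ||negative part||_2."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = 0.0
    for g in np.unique(groups):
        b = beta[groups == g]
        pos = np.maximum(b, 0.0)   # positive part of the group's coefficients
        neg = np.maximum(-b, 0.0)  # negative part (stored as nonnegative values)
        total += np.linalg.norm(pos) + np.linalg.norm(neg)
    return float(total)

def group_penalty(beta, groups):
    """Plain group-Lasso penalty with unit weights, for comparison."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    return float(sum(np.linalg.norm(beta[groups == g]) for g in np.unique(groups)))

# Group 0 is sign-coherent, group 1 mixes signs.
beta = [3.0, 4.0, 3.0, -4.0]
groups = [0, 0, 1, 1]
print(coop_penalty(beta, groups))   # 12.0
print(group_penalty(beta, groups))  # 10.0
```

Under this assumed form, a sign-coherent group costs the same as under the group-Lasso, while a group with mixed signs costs more, which is how the penalty favors sign-coherent solutions.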
Clustering in an Object-Oriented Environment
This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface, and are compatible with existing S-PLUS functions. We will describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.
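As an illustration of a quality index computed directly from dissimilarity data, the sketch below evaluates silhouette widths for a given partition. The abstract does not name the index attached to each function, so the choice of the silhouette, the function name `silhouette_widths`, and the toy data are assumptions, and the code is Python, not S-PLUS.

```python
import numpy as np

def silhouette_widths(D, labels):
    """Silhouette width of each object from a dissimilarity matrix D
    and cluster labels. Assumes at least two clusters."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: width left at 0 by convention
        # Mean dissimilarity to the other members of the own cluster
        # (D[i, i] is 0, so dividing by n_own - 1 excludes object i).
        a = D[i, own].sum() / (n_own - 1)
        # Smallest mean dissimilarity to any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

# Toy example: four points on a line forming two tight, well-separated clusters.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
widths = silhouette_widths(D, [0, 0, 1, 1])
print(widths.mean())  # close to 1 indicates a strong clustering structure
```

Because only the dissimilarity matrix is needed, the same computation applies whether the input was dissimilarity data or objects-by-variables data converted to dissimilarities.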
Point process-based modeling of multiple debris flow landslides using INLA: an application to the 2009 Messina disaster
We develop a stochastic modeling approach based on spatial point processes of
log-Gaussian Cox type for a collection of around 5000 landslide events provoked
by a precipitation trigger in Sicily, Italy. Through the embedding into a
hierarchical Bayesian estimation framework, we can use the Integrated Nested
Laplace Approximation methodology to make inference and obtain the posterior
estimates. Several mapping units are useful to partition a given study area in
landslide prediction studies. These units hierarchically subdivide the
geographic space from the highest grid-based resolution to the stronger
morphodynamic-oriented slope units. Here we integrate both mapping units into a
single hierarchical model, by treating the landslide triggering locations as a
random point pattern. This approach diverges fundamentally from the unanimously
used presence-absence structure for areal units since we focus on modeling the
expected landslide count jointly within the two mapping units. Predicting this
landslide intensity provides more detailed and complete information as compared
to the classically used susceptibility mapping approach based on relative
probabilities. To illustrate the model's versatility, we compute absolute
probability maps of landslide occurrences and check its predictive power over
space. While the landslide community typically produces spatial predictive
models for landslides only in the sense that covariates are spatially
distributed, no actual spatial dependence has been explicitly integrated so far
for landslide susceptibility. Our novel approach features a spatial latent
effect defined at the slope unit level, allowing us to assess the spatial
influence that remains unexplained by the covariates in the model.
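The step from an expected landslide count to an absolute occurrence probability can be sketched as follows, assuming that counts are Poisson distributed given the intensity, so that the probability of at least one event in a unit with expected count Λ is 1 − exp(−Λ). The function names, the pixel-to-slope-unit lookup, and the numbers are hypothetical, and applying the formula to a point estimate of Λ is only an approximation to a full posterior computation.

```python
import numpy as np

def slope_unit_counts(pixel_expected, pixel_to_unit, n_units):
    """Aggregate expected counts from grid pixels to slope units
    (pixel_to_unit is a hypothetical integer lookup array)."""
    return np.bincount(pixel_to_unit, weights=pixel_expected, minlength=n_units)

def occurrence_probability(expected_count):
    """P(at least one event) = 1 - exp(-Lambda) for a Poisson count."""
    lam = np.asarray(expected_count, dtype=float)
    return -np.expm1(-lam)  # numerically stable form of 1 - exp(-lam)

# Four pixels, two slope units (illustrative values only).
pixel_expected = np.array([0.1, 0.2, 0.3, 0.4])
pixel_to_unit = np.array([0, 0, 1, 1])
unit_counts = slope_unit_counts(pixel_expected, pixel_to_unit, n_units=2)
print(occurrence_probability(pixel_expected))  # pixel-level probabilities
print(occurrence_probability(unit_counts))     # slope-unit-level probabilities
```

Since expected counts add across pixels, the same intensity surface yields consistent probabilities at both mapping-unit resolutions.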
Multilevel mixed-type data analysis for validating partitions of scrapie isolates
The dissertation arises from a joint study with the Department of Food Safety and Veterinary Public Health of the Istituto Superiore di Sanità. The aim is to investigate and validate the existence of distinct strains of the scrapie disease, taking into account the availability of an a priori benchmark partition formulated by researchers. Scrapie of small ruminants is caused by prions, which are unconventional infectious agents of proteinaceous nature affecting humans and animals. Due to the absence of nucleic acids, which precludes direct analysis of strain variation by molecular methods, the presence of different sheep scrapie strains is usually investigated by bioassay in laboratory rodents. Data are collected by an experimental study on scrapie conducted at the Istituto Superiore di Sanità by experimental transmission of scrapie isolates to bank voles.
We aim to discuss the validation of a given partition in a statistical classification framework using a multi-step procedure. First, we use unsupervised classification to see how alternative clustering results match researchers’ understanding of the heterogeneity of the isolates. We discuss whether and how clustering results can be exploited to extend the preliminary partition elicited by researchers. Then we motivate the subsequent partition validation based on the predictive performance of several supervised classifiers.
Our data-driven approach makes two original methodological contributions. First, we advocate the use of partition validation measures to investigate a given benchmark partition: we discuss how the data can be used to evaluate a preliminary benchmark partition and, where the statistical results warrant it, to modify it into a conclusive partition that could serve as a “gold standard” in future studies. Second, the collected data have a multilevel structure, and mixed-type data are available for each lower-level unit, so each step in the procedure is adapted to deal with multilevel mixed-type data. We extend distance-based clustering algorithms to this setting, while in supervised classification we propose a two-step approach that classifies the higher-level units starting from the lower-level observations. In this framework, we also need to define an ad hoc cross-validation algorithm.
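The abstract does not describe the two-step classifier or the cross-validation scheme in detail. The sketch below shows one plausible reading only: folds are assigned per higher-level unit so that all lower-level observations of a unit are held out together, and a unit's class is taken as the majority vote over the predictions for its observations. The function names, the majority-vote rule, and the labels are assumptions, not the dissertation's method.

```python
import numpy as np
from collections import Counter

def grouped_folds(unit_ids, n_folds, seed=0):
    """Assign a fold to every observation so that all observations
    of the same higher-level unit land in the same fold."""
    unit_ids = np.asarray(unit_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(unit_ids))
    fold_of_unit = {u: k % n_folds for k, u in enumerate(shuffled)}
    return np.array([fold_of_unit[u] for u in unit_ids])

def majority_vote(unit_ids, predicted_labels):
    """Second step: classify each unit by the most frequent
    predicted label among its lower-level observations."""
    votes = {}
    for u, y in zip(unit_ids, predicted_labels):
        votes.setdefault(u, Counter())[y] += 1
    return {u: c.most_common(1)[0][0] for u, c in votes.items()}

# Hypothetical example: observations nested in units "a", "b", "c".
units = ["a", "a", "b", "b", "b", "c"]
print(grouped_folds(units, n_folds=2))
print(majority_vote(["a", "a", "a", "b", "b"],
                    ["s1", "s2", "s1", "s2", "s2"]))
```

Holding out whole units avoids evaluating a classifier on observations from a unit it has already seen during training.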
Reviewer Integration and Performance Measurement for Malware Detection
We present and evaluate a large-scale malware detection system integrating
machine learning with expert reviewers, treating reviewers as a limited
labeling resource. We demonstrate that even in small numbers, reviewers can
vastly improve the system's ability to keep pace with evolving threats. We
conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years
and containing 1.1 million binaries with 778GB of raw feature data. Without
reviewer assistance, we achieve 72% detection at a 0.5% false positive rate,
performing comparably to the best vendors on VirusTotal. Given a budget of 80
accurate reviews daily, we improve detection to 89% and are able to detect 42%
of malicious binaries undetected upon initial submission to VirusTotal.
Additionally, we identify a previously unnoticed temporal inconsistency in the
labeling of training datasets. We compare the impact of training labels
obtained at the same time training data is first seen with training labels
obtained months later. We find that using training labels obtained well after
samples appear, and thus unavailable in practice for current training data,
inflates measured detection by almost 20 percentage points. We release our
cluster-based implementation, as well as a list of all hashes in our evaluation
and 3% of our entire dataset.
Comment: 20 pages, 11 figures, accepted at the 13th Conference on Detection
of Intrusions and Malware & Vulnerability Assessment (DIMVA 2016)
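A figure such as "72% detection at a 0.5% false positive rate" pairs a detection rate with a fixed false positive budget. The sketch below shows one way such a number can be computed from classifier scores: choose the threshold from the benign scores so that the false positive rate stays within the budget, then measure the fraction of malicious samples above it. The function name, the thresholding rule, and the synthetic scores are assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def detection_at_fpr(scores, labels, max_fpr=0.005):
    """Return (detection rate, realized FPR, threshold) when flagging
    every sample whose score is strictly above the threshold.
    labels: True for malicious, False for benign (at least one benign)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    benign_desc = np.sort(scores[~labels])[::-1]
    # Number of benign samples we can afford to flag.
    allowed = int(np.floor(max_fpr * benign_desc.size))
    # Threshold at the (allowed + 1)-th highest benign score, so at most
    # `allowed` benign samples score strictly above it.
    threshold = benign_desc[allowed] if allowed < benign_desc.size else -np.inf
    flagged = scores > threshold
    return flagged[labels].mean(), flagged[~labels].mean(), threshold

# Synthetic scores: 300 benign samples and 4 malicious ones.
benign = np.arange(300) / 1000.0
malicious = np.array([0.2, 0.2985, 0.5, 0.9])
scores = np.concatenate([benign, malicious])
labels = np.concatenate([np.zeros(300, dtype=bool), np.ones(4, dtype=bool)])
detection, fpr, threshold = detection_at_fpr(scores, labels)
print(detection, fpr, threshold)
```

Fixing the false positive rate first makes detection rates comparable across systems and across labelings obtained at different times.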