8,032 research outputs found
Stacked Penalized Logistic Regression for Selecting Views in Multi-View Learning
In biomedical research, many different types of patient data can be
collected, such as various types of omics data and medical imaging modalities.
Applying multi-view learning to these different sources of information can
increase the accuracy of medical classification models compared with
single-view procedures. However, collecting biomedical data can be expensive
and/or burdening for patients, so that it is important to reduce the amount of
required data collection. It is therefore necessary to develop multi-view
learning methods which can accurately identify those views that are most
important for prediction. In recent years, several biomedical studies have used
an approach known as multi-view stacking (MVS), where a model is trained on
each view separately and the resulting predictions are combined through
stacking. In these studies, MVS has been shown to increase classification
accuracy. However, the MVS framework can also be used for selecting a subset of
important views. To study the view selection potential of MVS, we develop a
special case called stacked penalized logistic regression (StaPLR). Compared
with existing view-selection methods, StaPLR can make use of faster
optimization algorithms and is easily parallelized. We show that nonnegativity
constraints on the parameters of the function which combines the views play an
important role in preventing unimportant views from entering the model. We
investigate the performance of StaPLR through simulations, and consider two
real data examples. We compare the performance of StaPLR with an existing view
selection method called the group lasso and observe that, in terms of view
selection, StaPLR is often more conservative and has a consistently lower false
positive rate.Comment: 26 pages, 9 figures. Accepted manuscrip
Corporate payments networks and credit risk rating
Aggregate and systemic risk in complex systems are emergent phenomena
depending on two properties: the idiosyncratic risks of the elements and the
topology of the network of interactions among them. While a significant
attention has been given to aggregate risk assessment and risk propagation once
the above two properties are given, less is known about how the risk is
distributed in the network and its relations with the topology. We study this
problem by investigating a large proprietary dataset of payments among 2.4M
Italian firms, whose credit risk rating is known. We document significant
correlations between local topological properties of a node (firm) and its
risk. Moreover we show the existence of an homophily of risk, i.e. the tendency
of firms with similar risk profile to be statistically more connected among
themselves. This effect is observed when considering both pairs of firms and
communities or hierarchies identified in the network. We leverage this
knowledge to show the predictability of the missing rating of a firm using only
the network properties of the associated node
Population Structure and Cryptic Relatedness in Genetic Association Studies
We review the problem of confounding in genetic association studies, which
arises principally because of population structure and cryptic relatedness.
Many treatments of the problem consider only a simple ``island'' model of
population structure. We take a broader approach, which views population
structure and cryptic relatedness as different aspects of a single confounder:
the unobserved pedigree defining the (often distant) relationships among the
study subjects. Kinship is therefore a central concept, and we review methods
of defining and estimating kinship coefficients, both pedigree-based and
marker-based. In this unified framework we review solutions to the problem of
population structure, including family-based study designs, genomic control,
structured association, regression control, principal components adjustment and
linear mixed models. The last solution makes the most explicit use of the
kinships among the study subjects, and has an established role in the analysis
of animal and plant breeding studies. Recent computational developments mean
that analyses of human genetic association data are beginning to benefit from
its powerful tests for association, which protect against population structure
and cryptic kinship, as well as intermediate levels of confounding by the
pedigree.Comment: Published in at http://dx.doi.org/10.1214/09-STS307 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Multiple imputation for sharing precise geographies in public use data
When releasing data to the public, data stewards are ethically and often
legally obligated to protect the confidentiality of data subjects' identities
and sensitive attributes. They also strive to release data that are informative
for a wide range of secondary analyses. Achieving both objectives is
particularly challenging when data stewards seek to release highly resolved
geographical information. We present an approach for protecting the
confidentiality of data with geographic identifiers based on multiple
imputation. The basic idea is to convert geography to latitude and longitude,
estimate a bivariate response model conditional on attributes, and simulate new
latitude and longitude values from these models. We illustrate the proposed
methods using data describing causes of death in Durham, North Carolina. In the
context of the application, we present a straightforward tool for generating
simulated geographies and attributes based on regression trees, and we present
methods for assessing disclosure risks with such simulated data.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS506 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …