A Distance-Based Test of Association Between Paired Heterogeneous Genomic Data
Due to rapid technological advances, a wide range of different measurements
can be obtained from a given biological sample including single nucleotide
polymorphisms, copy number variation, gene expression levels, DNA methylation
and proteomic profiles. Each of these distinct measurements provides the means
to characterize a certain aspect of biological diversity, and a fundamental
problem of broad interest concerns the discovery of shared patterns of
variation across different data types. Such data types are heterogeneous in the
sense that they represent measurements taken at very different scales or
described by very different data structures. We propose a distance-based
statistical test, the generalized RV (GRV) test, to assess whether there is a
common and non-random pattern of variability between paired biological
measurements obtained from the same random sample. The measurements enter the
test through distance measures which can be chosen to capture particular
aspects of the data. An approximate null distribution is proposed to compute
p-values in closed-form and without the need to perform costly Monte Carlo
permutation procedures. Compared to the classical Mantel test for association
between distance matrices, the GRV test has been found to be more powerful in a
number of simulation settings. We also report on an application of the GRV test
to detect biological pathways in which genetic variability is associated with
variation in gene expression levels in ovarian cancer samples, and present
results obtained from two independent cohorts.
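As a rough illustration of how paired measurements can enter a test through distance matrices, the sketch below computes an RV-type coefficient from two distance matrices and a Monte Carlo permutation p-value. This is the costly permutation baseline the abstract mentions, not the authors' closed-form null approximation, which the abstract does not specify. The function names and the double-centering construction (borrowed from classical multidimensional scaling) are assumptions made for this example.

```python
import math
import random


def centered_gram(dist):
    """Double-center squared distances: G = -1/2 * J D^2 J (assumed construction)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def inner(a, b):
    """Frobenius inner product; equals trace(a @ b) for symmetric matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rv_from_distances(dist_x, dist_y):
    """RV-type coefficient between two n x n distance matrices on the same samples."""
    gx, gy = centered_gram(dist_x), centered_gram(dist_y)
    return inner(gx, gy) / math.sqrt(inner(gx, gx) * inner(gy, gy))


def permutation_pvalue(dist_x, dist_y, n_perm=999, seed=0):
    """Monte Carlo p-value obtained by permuting the sample labels of dist_y."""
    rng = random.Random(seed)
    observed = rv_from_distances(dist_x, dist_y)
    idx = list(range(len(dist_x)))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = [[dist_y[i][j] for j in idx] for i in idx]
        if rv_from_distances(dist_x, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Any distance suited to the data type can be supplied for each matrix, which is the point of a distance-based formulation; only the pairing of samples across the two matrices has to match.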
Modeling heterogeneity in ranked responses by nonparametric maximum likelihood: How do Europeans get their scientific knowledge?
This paper is motivated by a Eurobarometer survey on science knowledge. As part of the survey, respondents were asked to rank sources of science information in order of importance. The official statistical analysis of these data, however, failed to use the complete ranking information. We instead propose a method which treats ranked data as a set of paired comparisons, which places the problem in the standard framework of generalized linear models and also allows respondent covariates to be incorporated. An extension is proposed to allow for heterogeneity in the ranked responses. The resulting model uses a nonparametric formulation of the random effects structure, fitted using the EM algorithm. Each mass point is multivalued, with a parameter for each item. The resultant model is equivalent to a covariate latent class model, where the latent class profiles are provided by the mass point components and the covariates act on the class profiles. This provides an alternative interpretation of the fitted model. The approach is also suitable for paired comparison data.
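The step of treating a ranking as a set of paired comparisons can be sketched as follows. The item names are hypothetical placeholders rather than the survey's actual response options, and dropping tied pairs is an assumption of this sketch, not something stated in the abstract.

```python
from itertools import combinations


def ranking_to_pairs(ranking):
    """Expand one respondent's ranking into (preferred, other) pairs.

    ranking maps each item to its rank, with 1 meaning most important.
    """
    pairs = []
    for a, b in combinations(sorted(ranking), 2):
        if ranking[a] < ranking[b]:
            pairs.append((a, b))
        elif ranking[b] < ranking[a]:
            pairs.append((b, a))
        # Tied items contribute no comparison in this sketch.
    return pairs


# Hypothetical respondent ranking three hypothetical information sources.
example = {"tv": 1, "internet": 2, "radio": 3}
example_pairs = ranking_to_pairs(example)
```

A full ranking of J items yields J(J-1)/2 such pairs, each of which can then be modelled as a binary outcome in a generalized linear model.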
Bayesian analysis of ranking data with the constrained Extended Plackett-Luce model
Multistage ranking models, including the popular Plackett-Luce distribution
(PL), rely on the assumption that the ranking process is performed
sequentially, by assigning the positions from the top to the bottom one
(forward order). A recent contribution to the ranking literature relaxed this
assumption with the addition of the discrete-valued reference order parameter,
yielding the novel Extended Plackett-Luce model (EPL). Inference on the EPL and
its generalization into a finite mixture framework was originally addressed
from the frequentist perspective. In this work, we propose the Bayesian
estimation of the EPL with order constraints on the reference order parameter.
The proposed restrictions reflect a meaningful rank assignment process. By
combining the restrictions with the data augmentation strategy and the
conjugacy of the Gamma prior distribution with the EPL, we facilitate the
construction of a tuned joint Metropolis-Hastings algorithm within Gibbs
sampling to simulate from the posterior distribution. The Bayesian approach
addresses more efficiently both inference on the additional discrete-valued
parameter and the assessment of its estimation uncertainty. The
usefulness of the proposal is illustrated with applications to simulated and
real datasets. Comment: 20 pages, 4 figures, 4 tables. arXiv admin note: substantial text
overlap with arXiv:1803.0288
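To make the role of the reference order concrete, here is a minimal sketch of a Plackett-Luce log-probability with a reference order, read directly off the description above: at stage t the ranker fills position reference_order[t-1], choosing among the items not yet placed with probability proportional to their support. The function names and argument conventions are assumptions for this example, and the sketch covers neither the order constraints nor the MCMC sampler proposed in the paper.

```python
import math


def pl_log_prob(sequence, support):
    """Log-probability of selecting items in this sequence under Plackett-Luce."""
    logp = 0.0
    for t, item in enumerate(sequence):
        # Items not yet selected at stage t compete in proportion to their support.
        remaining = sum(support[j] for j in sequence[t:])
        logp += math.log(support[item]) - math.log(remaining)
    return logp


def epl_log_prob(ordering, reference_order, support):
    """Plackett-Luce log-probability under a given reference order.

    ordering[r - 1] is the item placed at rank r;
    reference_order[t - 1] is the rank filled at stage t.
    """
    sequence = [ordering[rank - 1] for rank in reference_order]
    return pl_log_prob(sequence, support)
```

With the forward order (1, 2, ..., K) this reduces to the ordinary Plackett-Luce probability, while the backward order (K, ..., 2, 1) corresponds to assigning the bottom position first.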
A Grouping Genetic Algorithm for Joint Stratification and Sample Allocation Designs
Predicting the cheapest sample size for the optimal stratification in
multivariate survey design is a problem in cases where the population frame is
large. A solution exists that iteratively searches for the minimum sample size
necessary to meet accuracy constraints in partitions of atomic strata created
by the Cartesian product of auxiliary variables into larger strata. The optimal
stratification can be found by testing all possible partitions. However, the
number of possible partitions grows exponentially with the number of initial
strata. There are alternative ways of modelling this problem, one of the most
natural being Genetic Algorithms (GA). These evolutionary algorithms use
recombination, mutation and selection to search for optimal solutions. They
often converge on optimal or near-optimal solutions more quickly than exact
methods. We propose a new GA approach to this problem using grouping genetic
operators instead of traditional operators. The results show a significant
improvement in solution quality for similar computational effort, corresponding
to large monetary savings. Comment: 22 page
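To give a feel for why testing all possible partitions quickly becomes infeasible, the sketch below counts the ways a set of n atomic strata can be partitioned into groups, assuming every grouping is admissible. Under that assumption the count is the Bell number B(n), computed here with the Bell triangle. The function name is chosen for this example only.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], the numbers of set partitions, via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


counts = bell_numbers(10)
```

Ten atomic strata already admit 115,975 partitions, which is why heuristic search such as a genetic algorithm is attractive for realistic frames.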
MULTIPLE COMPARISONS WITH THE BEST: BAYESIAN PRECISION MEASURES OF EFFICIENCY RANKINGS
A large literature exists on measuring the allocative and technical efficiency of a set of firms. A segment of this literature uses data envelopment analysis (DEA), creating relative efficiency rankings that are nonstochastic and thus cannot be evaluated according to the precision of the rankings. A parallel literature uses econometric techniques to estimate stochastic production frontiers or distance functions, providing at least the possibility of computing the precision of the resulting efficiency rankings. Recently, Horrace and Schmidt (2000) have applied sampling theoretic statistical techniques known as multiple comparisons with control (MCC) and multiple comparisons with the best (MCB) to the issue of measuring the precision of efficiency rankings. This paper offers a Bayesian multiple comparison alternative that we argue is simpler to implement, gives the researcher increased flexibility over the type of comparison made, and provides greater, and more intuitive, information content. We demonstrate this method on technical efficiency rankings of a set of U.S. electric generating firms derived within a distance function framework. Research Methods/Statistical Methods
Sparse reduced-rank regression for imaging genetics studies: models and applications
We present a novel statistical technique, the sparse reduced rank regression (sRRR) model,
which is a strategy for multivariate modelling of high-dimensional imaging responses and
genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity
in the regression coefficients, identifying subsets of genetic markers that best explain
the variability observed in subsets of the phenotypes. To properly exploit the rich structure
present in each of the imaging and genetics domains, we additionally propose the use of
several structured penalties within the sRRR model. Using simulation procedures that accurately
reflect realistic imaging genetics data, we present detailed evaluations of the sRRR
method in comparison with the more traditional univariate linear modelling approach. In
all settings considered, we show that sRRR possesses better power to detect the deleterious
genetic variants. Moreover, using a simple genetic model, we demonstrate the potential
benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to
extracting averages over regions of interest in the brain. Since this entails the use of phenotypic
vectors of enormous dimensionality, we suggest the use of a sparse classification
model as a de-noising step, prior to the imaging genetics study. Finally, we present the
application of a data re-sampling technique within the sRRR model for model selection.
Using this approach we are able to rank the genetic markers in order of importance of association
to the phenotypes, and similarly rank the phenotypes in order of importance to
the genetic markers. Lastly, we illustrate the application of the proposed
statistical models to three real imaging genetics datasets and highlight some potential
associations.
Evaluating probabilistic forecasts with scoringRules
Probabilistic forecasts in the form of probability distributions over future
events have become popular in several fields including meteorology, hydrology,
economics, and demography. In typical applications, many alternative
statistical models and data sources can be used to produce probabilistic
forecasts. Hence, evaluating and selecting among competing methods is an
important task. The scoringRules package for R provides functionality for
comparative evaluation of probabilistic models based on proper scoring rules,
covering a wide range of situations in applied work. This paper discusses
implementation and usage details, presents case studies from meteorology and
economics, and points to the relevant background literature.
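As one concrete example of a proper scoring rule, the continuous ranked probability score (CRPS) of a Gaussian forecast has a well-known closed form. The Python sketch below implements that formula for illustration only; it is not the scoringRules R interface, and the function name and arguments are assumptions of this example.

```python
import math


def crps_normal(y, mean, sd):
    """Closed-form CRPS of a N(mean, sd^2) forecast for the observation y."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    z = (y - mean) / sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sd * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

Lower scores are better. Averaging the score over many forecast-observation pairs gives a single number on which competing forecasting methods can be compared.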