Ranking USRDS Provider-Specific SMRs from 1998-2001
Provider profiling (ranking, league tables) is prevalent in health services research. Similarly, comparing educational institutions and identifying differentially expressed genes depend on ranking. Effective ranking procedures must be structured by a hierarchical (Bayesian) model and guided by a ranking-specific loss function; however, even optimal methods can perform poorly, and estimates must be accompanied by uncertainty assessments. We use the 1998-2001 Standardized Mortality Ratio (SMR) data from the United States Renal Data System (USRDS) as a platform to identify issues and approaches. Our analyses extend Liu et al. (2004) by combining evidence over multiple years via an AR(1) model; by considering estimates that minimize errors in classifying providers above or below a percentile cutpoint in addition to those that minimize rank-based, squared-error loss; by considering ranks based on the posterior probability that a provider's SMR exceeds a threshold; by comparing these ranks to those produced by ranking MLEs and ranking P-values associated with testing whether a provider's SMR = 1; by comparing results for a parametric and a non-parametric prior; and by reporting on a suite of uncertainty measures.
Results show that MLE-based and hypothesis-test-based ranks are far from optimal; that uncertainty measures effectively calibrate performance; that in the USRDS context ranks based on single-year data perform poorly, but that performance improves substantially when using the AR(1) model; and that ranks based on posterior probabilities of exceeding a properly chosen SMR threshold are essentially identical to those produced by minimizing classification loss. These findings highlight areas requiring additional research and the need to educate stakeholders on the uses and abuses of ranks; on their proper role in science and policy; and on the absolute necessity of accompanying estimated ranks with uncertainty assessments and ensuring that these uncertainties influence decisions.
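The two kinds of rank summary named in this abstract can be illustrated with a small sketch. It assumes posterior draws of each provider's SMR are already available, for example from MCMC output of some hierarchical model; the function name `rank_summaries` and the toy numbers are hypothetical and not taken from the paper. The posterior mean rank is the standard minimizer of squared-error loss on ranks, and the exceedance probability is the share of draws in which a provider's SMR is above the chosen threshold.

```python
def rank_summaries(draws, threshold):
    """Summarise posterior draws for ranking.

    draws[s][k] is posterior draw s of the SMR for provider k
    (assumed to come from an already-fitted hierarchical model).
    Returns (mean_rank, exceed), both lists indexed by provider.
    """
    n_draws = len(draws)
    n_units = len(draws[0])
    mean_rank = [0.0] * n_units
    exceed = [0.0] * n_units
    for draw in draws:
        # Rank providers within this draw: rank 1 = smallest SMR.
        order = sorted(range(n_units), key=lambda k: draw[k])
        for position, k in enumerate(order, start=1):
            mean_rank[k] += position / n_draws
        # Accumulate the share of draws with SMR above the threshold.
        for k in range(n_units):
            if draw[k] > threshold:
                exceed[k] += 1.0 / n_draws
    return mean_rank, exceed


# Toy example: two posterior draws for two providers.
toy_draws = [[0.9, 1.2], [1.1, 1.3]]
toy_mean_rank, toy_exceed = rank_summaries(toy_draws, threshold=1.0)
```

Sorting providers by `mean_rank` gives integer ranks under squared-error loss, and sorting by `exceed` gives the threshold-based ranks. The spread of each provider's rank across draws is one simple uncertainty measure to report alongside either ranking.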
Enhancing the effectiveness of ligand-based virtual screening using data fusion
Data fusion is being increasingly used to combine the outputs of different types of sensor. This paper reviews the application of the approach to ligand-based virtual screening, where the sensors to be combined are functions that score molecules in a database on their likelihood of exhibiting some required biological activity. Much of the literature to date involves the combination of multiple similarity searches, although there is also increasing interest in the combination of multiple machine learning techniques. Both approaches are reviewed here, focusing on the extent to which fusion can improve the effectiveness of searching when compared with a single screening mechanism, and on the reasons that have been suggested for the observed performance enhancement.
RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Power and reproducibility are key to enabling refined scientific discoveries
in contemporary big data applications with general high-dimensional nonlinear
models. In this paper, we provide theoretical foundations on the power and
robustness for the model-free knockoffs procedure introduced recently in
Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the
covariate distribution is characterized by a Gaussian graphical model. We
establish that under mild regularity conditions, the power of the oracle
knockoffs procedure with known covariate distribution in high-dimensional
linear models is asymptotically one as sample size goes to infinity. When
moving away from the ideal case, we suggest the modified model-free knockoffs
method called graphical nonlinear knockoffs (RANK) to accommodate the unknown
covariate distribution. We provide theoretical justifications on the robustness
of our modified procedure by showing that the false discovery rate (FDR) is
asymptotically controlled at the target level and the power is asymptotically
one with the estimated covariate distribution. To the best of our knowledge,
this is the first formal theoretical result on the power for the knockoffs
procedure. Simulation results demonstrate that compared to existing approaches,
our method performs competitively in both FDR control and power. A real data
set is analyzed to further assess the performance of the suggested knockoffs
procedure.
Comment: 37 pages, 6 tables, 9 pages supplementary material
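The abstract does not spell out the selection step, so the following is only a sketch of the data-dependent threshold commonly used with knockoff filters (the "knockoff+" rule). It assumes the feature statistics W_j, where a large positive value is evidence for feature j, have already been computed from the original and knockoff variables. The function names and the toy statistics are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q.

    w: list of knockoff statistics W_j (assumed precomputed).
    q: target FDR level.
    Returns float('inf') if no candidate threshold qualifies.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)
        n_pos = sum(1 for x in w if x >= t)
        if (1 + n_neg) / max(n_pos, 1) <= q:
            return t
    return float("inf")


def select_features(w, q):
    """Indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics: mostly positive, with two small negative values.
toy_w = [5.0, 4.0, 3.0, -1.0, 2.5, 0.5, -0.2, 6.0]
toy_selected = select_features(toy_w, q=0.25)
```

The count of negative statistics serves as an estimate of the number of false positives among the positive ones, which is why the ratio is compared with the target level `q`.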
Two-Sided Infinite Systems of Competing Brownian Particles
Two-sided infinite systems of Brownian particles with rank-dependent
dynamics, indexed by all integers, exhibit different properties from their
one-sided infinite counterparts, indexed by positive integers, and from finite
systems. Consider the gap process, which is formed by spacings between adjacent
particles. In stark contrast with finite and one-sided infinite systems,
two-sided infinite systems can have a one- or two-parameter family of stationary
gap distributions, or the gap process can converge weakly to zero as time goes to
infinity.
Comment: 32 pages. Keywords: Competing Brownian particles, gap process, weak
convergence, stationary distribution, named particles, ranked particles,
stochastic domination, interacting particle system
A comparison of score, rank and probability-based fusion methods for video shot retrieval
It is now accepted that the most effective video shot retrieval is based on indexing and retrieving clips using multiple, parallel modalities such as text-matching, image-matching and feature matching and then combining or fusing these parallel retrieval streams in some way. In this paper we investigate a range of fusion methods for combining based on multiple visual features (colour, edge and texture), for combining based on multiple visual examples in the query and for combining multiple modalities (text and visual). Using three TRECVid collections and the TRECVid search task, we specifically compare fusion methods based on normalised score and rank that use either the average, weighted average or maximum of retrieval results from a discrete Jelinek-Mercer smoothed language model. We also compare these results with a simple probability-based combination of the language model results that assumes all features and visual examples are fully independent.
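A minimal sketch of the score-based and rank-based fusion operators named in this abstract is given below. It is not the authors' implementation: the language-model retrieval step is replaced by hand-written toy scores, the shot identifiers are made up, min-max scaling is assumed as the score normalisation, and the helper names are hypothetical.

```python
def min_max_normalise(scores):
    """Rescale one run's scores to [0, 1] (an assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def rank_scores(scores):
    """Replace scores by rank-based values: best gets 1.0, worst 1/n."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {doc: (n - i) / n for i, doc in enumerate(ordered)}


def fuse(runs, how="average", weights=None):
    """Fuse several runs (dicts doc -> value) into one ranked list.

    how="average" gives the (weighted) average, how="max" the maximum.
    Documents missing from a run contribute 0.0 for that run.
    """
    docs = set().union(*runs)
    weights = weights or [1.0] * len(runs)
    fused = {}
    for doc in docs:
        vals = [run.get(doc, 0.0) for run in runs]
        if how == "max":
            fused[doc] = max(vals)
        else:
            fused[doc] = sum(w * v for w, v in zip(weights, vals)) / sum(weights)
    # Highest fused value first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy per-feature scores for three shots (made-up numbers).
colour = {"shot1": 0.2, "shot2": 0.9, "shot3": 0.5}
edge = {"shot1": 10.0, "shot2": 4.0, "shot3": 7.0}
runs = [min_max_normalise(colour), min_max_normalise(edge)]
weighted = fuse(runs, how="average", weights=[2.0, 1.0])
```

Passing `rank_scores(colour)` and `rank_scores(edge)` to `fuse` instead of the normalised scores gives the rank-based variants of the same operators.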
Robust Tests in Genome-Wide Scans under Incomplete Linkage Disequilibrium
Under complete linkage disequilibrium (LD), robust tests often have greater
power than Pearson's chi-square test and trend tests for the analysis of
case-control genetic association studies. Robust statistics have been used in
candidate-gene and genome-wide association studies (GWAS) when the genetic
model is unknown. We consider here a more general incomplete LD model, and
examine the impact of penetrances at the marker locus when the genetic models
are defined at the disease locus. Robust statistics are then reviewed and their
efficiency and robustness are compared through simulations in GWAS of 300,000
markers under the incomplete LD model. Applications of several robust tests to
the Wellcome Trust Case-Control Consortium [Nature 447 (2007) 661--678] are
presented.
Comment: Published at http://dx.doi.org/10.1214/09-STS314 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
An Overview of Classifier Fusion Methods
A number of classifier fusion methods have been
recently developed opening an alternative approach
leading to a potential improvement in the
classification performance. As there is little theory of
information fusion itself, currently we are faced with
different methods designed for different problems and
producing different results. This paper gives an
overview of classifier fusion methods and attempts to
identify new trends that may dominate this area of
research in future. A taxonomy of fusion methods
trying to bring some order into the existing “pudding
of diversities” is also provided.