Building simulated queries for known-item topics: an analysis using six European languages
There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, many issues regarding their usage and impact on evaluation remain unaddressed, because their quality, in terms of retrieval performance, differs from that of real queries. In this paper, we focus on methods for building simulated known-item topics and assess their quality against real known-item topics. Using existing generation models as our starting point, we explore factors which may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages), we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries: for evaluation purposes, and because building models of querying behavior provides deeper insight into the querying process, so that better retrieval mechanisms can be developed to support the user.
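As a rough illustration, a simple known-item query generator in the spirit of the models discussed here picks a target document and then samples query terms from that document's term distribution; the uniform document prior and raw term-frequency weighting below are a minimal sketch, not the improved selection strategies the paper proposes:

```python
import random
from collections import Counter

def simulate_known_item_query(docs, query_len=3, seed=0):
    """Sketch of a simulated known-item topic generator: pick a target
    document, then sample query terms from its term distribution.
    (More discriminative strategies weight terms by e.g. TF-IDF
    rather than raw frequency, as explored in the paper.)"""
    rng = random.Random(seed)
    doc_id = rng.randrange(len(docs))      # uniform document prior
    counts = Counter(docs[doc_id].lower().split())
    vocab = list(counts)
    weights = [counts[t] for t in vocab]   # term-frequency selection model
    query = rng.choices(vocab, weights=weights, k=query_len)
    return doc_id, query

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "rain in spain falls on the plain"]
target, query = simulate_known_item_query(docs)
print(target, query)
```

Swapping in a different document prior or term-selection distribution is exactly the kind of factor whose influence the analysis above examines.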
Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth
There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in the resulting estimates so that it can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene-expression-based analyses routinely carried out in the lab. Yet, although hyperparameter selection is known to affect the distributions of inferred abundances (e.g., producing smooth versus sparse estimates), model selection on experimental data has been addressed informally at best.
Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis.
To our knowledge, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
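For intuition, perplexity in this setting is the exponential of the negative mean log-likelihood a model assigns to held-out fragments; the sketch below treats zero-probability fragments crudely as infinite perplexity, whereas the paper handles such corner cases more carefully:

```python
import math

def perplexity(fragment_probs):
    """Perplexity of held-out fragments under a model that assigns each
    fragment a likelihood: exp of the negative mean log-likelihood.
    Lower is better. Zero-probability fragments (a corner case the
    paper treats carefully) make the metric infinite here."""
    if any(p == 0.0 for p in fragment_probs):
        return math.inf
    n = len(fragment_probs)
    return math.exp(-sum(math.log(p) for p in fragment_probs) / n)

# A model assigning every held-out fragment probability 1/4
# has perplexity 4: it is as "surprised" as a uniform 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

This mirrors how perplexity is used for language and topic models, evaluated here on fragment sets rather than word sequences.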
Selection models with monotone weight functions in meta analysis
Publication bias, the fact that studies identified for inclusion in a meta
analysis do not represent all studies on the topic of interest, is commonly
recognized as a threat to the validity of the results of a meta analysis. One
way to explicitly model publication bias is via selection models or weighted
probability distributions. We adopt the nonparametric approach initially
introduced by Dear (1992) but impose that the weight function is monotonely
non-increasing as a function of the p-value. Since in meta analysis one
typically only has few studies or "observations", regularization of the
estimation problem seems sensible. In addition, virtually all parametric weight
functions proposed so far in the literature are in fact decreasing. We discuss
how to estimate a decreasing weight function in the above model and illustrate
the new methodology on two well-known examples. The new approach potentially
offers more insight in the selection process than other methods and is more
flexible than parametric approaches. Some basic properties of the
log-likelihood function and computation of a p-value quantifying the evidence
against the null hypothesis of a constant weight function are indicated. In
addition, we provide an approximate selection bias adjusted profile likelihood
confidence interval for the treatment effect. The corresponding software and
the datasets used to illustrate it are provided as the R package selectMeta.
This enables full reproducibility of the results in this paper.
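The monotonicity constraint on the weight function can be imposed by projecting raw per-interval weight estimates onto the set of non-increasing sequences, for instance with the pool-adjacent-violators algorithm; the following is a generic sketch of that projection step, not the paper's full penalized-likelihood procedure:

```python
def pava_decreasing(w):
    """Pool-adjacent-violators sketch: project raw per-interval weight
    estimates onto the monotonely non-increasing sequences, the shape
    constraint imposed on the selection weight function w(p).
    Violating neighbours are pooled into their (weighted) mean."""
    # Negate so that an increasing PAVA fit on `vals`
    # yields a non-increasing fit for the original values.
    vals = [-x for x in w]
    level, count = [], []
    for v in vals:
        level.append(v)
        count.append(1)
        while len(level) > 1 and level[-2] > level[-1]:
            tot = level[-1] * count[-1] + level[-2] * count[-2]
            c = count[-1] + count[-2]
            level = level[:-2] + [tot / c]
            count = count[:-2] + [c]
    out = []
    for lv, c in zip(level, count):
        out.extend([-lv] * c)
    return out

# Raw weights that violate monotonicity get pooled:
print(pava_decreasing([1.0, 0.7, 0.9, 0.4]))
```

In the actual estimation problem the pooling would be weighted by the number of studies per interval and combined with the regularization discussed above.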
Web-Shaped Model for Head Pose Estimation: an Approach for Best Exemplar Selection
Head pose estimation is a sensitive topic in video surveillance/smart ambient scenarios, since head rotations can hide or distort discriminative features of the face. Face recognition must often cope with video frames in which subjects appear in poses that make recognition all but impossible. In this respect, selecting the frames with the best face orientation allows triggering recognition only on these, thereby decreasing the possibility of errors. This paper proposes a novel approach to head pose estimation for smart cities and video surveillance scenarios, aiming at this goal. The method relies on a cascade of two models: the first predicts the positions of 68 well-known face landmarks; the second applies a web-shaped model over the detected landmarks to associate each of them with a specific face sector. The method can work on faces detected at a reasonable distance and at a resolution supported by many present-day devices. Results of experiments over classical pose estimation benchmarks, namely the Pointing'04, Biwi, and AFLW datasets, show good performance in terms of both pose estimation and computing time. Further results refer to noisy images that are typical of the addressed settings. Finally, examples demonstrate the selection of the best frames from videos captured in video surveillance conditions.
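A toy version of the web-shaped idea might centre a "web" of angular sectors on the landmark centroid and record the sector each landmark falls in; the sector count and boundaries below are illustrative assumptions, not the paper's exact parameterisation:

```python
import math

def assign_sectors(landmarks, n_sectors=8):
    """Sketch of the web-shaped idea: centre a 'web' of n_sectors
    angular slices on the landmark centroid and record which slice
    each detected landmark falls in. The resulting sector assignments
    can then serve as a pose signature. Sector count and boundary
    placement here are illustrative, not the paper's parameters."""
    cx = sum(x for x, _ in landmarks) / len(landmarks)
    cy = sum(y for _, y in landmarks) / len(landmarks)
    width = 2 * math.pi / n_sectors
    sectors = []
    for x, y in landmarks:
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        sectors.append(int(angle / width) % n_sectors)
    return sectors

# Four landmarks at the corners of a square around the centroid:
print(assign_sectors([(1, 1), (-1, 1), (-1, -1), (1, -1)], n_sectors=4))
# → [0, 1, 2, 3]
```

As the head rotates, landmarks migrate between sectors, which is what makes sector membership informative about pose.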
Topics In Time Series Analysis And Forecasting
This thesis contains new developments in various topics in time series analysis and forecasting, including model selection, estimation, forecasting, and diagnostic checking.
In the area of model selection, finite- and large-sample properties of the commonly used selection criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), are discussed. In the finite case, the study is limited to the two-sample problem. The exact probability of selection is obtained for finite samples, and the risk of each criterion is evaluated in the two-sample situation. Empirical evidence regarding these risks is given for autoregressive processes. The asymptotic distribution of ĥ is given, where ĥ is the estimate of the number of extra parameters in the model selected by the AIC criterion. This derivation is based on large-sample properties of the likelihood ratio test statistic. The asymptotic distribution of the AIC in PAR models is also discussed.
In estimation, an explicit expression for the efficiency of strongly consistent estimates for the ARMA(1,1) model is derived. Empirical efficiency and the empirical estimate are examined by simulation.
On the topic of forecasting, the asymptotic variance of the forecast error is derived for an autoregressive model of first order. In the derivation, the estimated parameter is not assumed to be independent of the data. The variance of the one-step forecast error is also derived for the fractional noise model.
In the last topic, empirical results for portmanteau test statistics are studied. It is shown that the modified portmanteau test of Ljung and Box (1980) outperforms the modified test of Li and McLeod (1981). In testing for whiteness, the modified portmanteau test is shown to have lower power than the cumulative periodogram test against both fractional noise and standard ARMA alternatives.
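As a minimal instance of the kind of order-selection problem analysed here, one can fit AR(0) and AR(1) models by least squares and compare their Gaussian AIC values (BIC is obtained by replacing 2k with k log n); this sketch is illustrative, not the thesis's derivations:

```python
import math
import random

def aic(n, rss, k):
    """Gaussian AIC for a model with k parameters: n*log(RSS/n) + 2k.
    Replacing 2*k with k*log(n) gives BIC."""
    return n * math.log(rss / n) + 2 * k

def ar_fits(x):
    """Residual sums of squares for AR(0) (mean only) and AR(1),
    fitted by least squares -- a toy version of the order-selection
    problem whose risks the thesis studies."""
    n = len(x)
    mean = sum(x) / n
    rss0 = sum((v - mean) ** 2 for v in x)
    num = sum(x[t] * x[t - 1] for t in range(1, n))
    den = sum(x[t - 1] ** 2 for t in range(1, n))
    phi = num / den                           # lag-1 least-squares slope
    rss1 = sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, n))
    return rss0, rss1, phi

rng = random.Random(42)
x = [0.0]
for _ in range(499):                          # simulate AR(1) with phi = 0.8
    x.append(0.8 * x[-1] + rng.gauss(0, 1))
rss0, rss1, phi = ar_fits(x)
n = len(x)
print(aic(n, rss0, 1), aic(n, rss1, 2))       # AR(1) attains the lower AIC
```

With data truly generated by an AR(1) process, the extra parameter earns its 2-point AIC penalty, which is the mechanism behind the selection probabilities studied in the thesis.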
Survival analysis with delayed entry in selected families with application to human longevity
In the field of aging research, family-based sampling designs are commonly used to study the lifespans of long-lived family members. However, the specific sampling procedure must be carefully taken into account in order to avoid biases. This work is motivated by the Leiden Longevity Study, a family-based cohort of long-lived siblings. Families were invited to participate in the study if at least two siblings were ‘long-lived’, meaning older than 89 years for men or older than 91 years for women. As a result, more than 400 families were included in the study and followed for around 10 years. For estimation of marker-specific survival probabilities and correlations among the lifetimes of family members, delayed entry due to the outcome-dependent sampling mechanism has to be taken into account. We consider shared frailty models to model left-truncated correlated survival data. The treatment of left truncation in shared frailty models is still an open issue, and the literature on this topic is scarce. We show that current approaches provide, in general, biased estimates, and we propose a new method that tackles this selection problem by correcting the likelihood by means of inverse probability weighting at the family level.
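To see why delayed entry matters, a Kaplan-Meier sketch with left truncation counts a subject in the risk set at age t only if that subject had already entered the study and was still under observation; this is the standard risk-set adjustment for delayed entry, not the paper's frailty-model correction:

```python
def km_left_truncated(entry, time, event):
    """Kaplan-Meier sketch with delayed entry: a subject contributes
    to the risk set at t only if entry_i < t <= time_i. Ignoring the
    entry ages (the naive estimator) overstates survival for designs
    like the Leiden Longevity Study, where subjects must already have
    reached a high age just to be sampled."""
    death_times = sorted({t for t, e in zip(time, event) if e})
    surv = 1.0
    curve = []
    for t in death_times:
        at_risk = sum(1 for en, ti in zip(entry, time) if en < t <= ti)
        deaths = sum(1 for ti, e in zip(time, event) if e and ti == t)
        if at_risk:
            surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

entry = [89, 91, 90, 92]    # ages at study entry (left truncation)
time  = [95, 93, 97, 94]    # ages at death or censoring
event = [1, 1, 0, 1]        # 1 = death observed, 0 = censored
print(km_left_truncated(entry, time, event))
```

Handling the dependence between family members' lifetimes on top of this truncation is exactly where the shared frailty model and the inverse-probability-weighting correction come in.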
Tigers on trails: occupancy modeling for cluster sampling
Occupancy modeling focuses on inference about the distribution of organisms over space, using temporal or spatial replication to allow inference about the detection process. Inference based on spatial replication strictly requires that replicates be selected randomly and with replacement, but the importance of these design requirements is not well understood. This paper focuses on an increasingly popular sampling design based on spatial replicates that are not selected randomly and that are expected to exhibit Markovian dependence. We develop two new occupancy models for data collected under this sort of design, one based on an underlying Markov model for spatial dependence and the other based on a trap-response model with Markovian detections. We then simulated data under the model for Markovian spatial dependence and fit standard occupancy models, as well as the two new models, to these data. Bias of occupancy estimates was substantial for the standard models, smaller for the new trap-response model, and negligible for the new spatial process model. We also fit these models to data from a large-scale tiger occupancy survey recently conducted in Karnataka State, southwestern India. In addition to providing evidence of a positive relationship between tiger occupancy and habitat, model selection statistics and estimates strongly supported the use of the model with Markovian spatial dependence. This new model provides another tool for the decomposition of the detection process, which is sometimes needed for proper estimation and which may also permit interesting biological inferences. In addition to designs employing spatial replication, we note the likely existence of temporal Markovian dependence in many designs using temporal replication. The models developed here will be useful, either directly or with minor extensions, for these designs as well.
We believe that these new models represent important additions to the suite of modeling tools now available for occupancy estimation in conservation monitoring. More generally, this work represents a contribution to the topic of cluster sampling for situations in which there is a need for specific modeling (e.g., reflecting dependence) for the distribution of the variable(s) of interest among subunits.
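For reference, the standard single-season occupancy model that the new models generalise assigns each site's detection history a likelihood mixing an occupancy probability psi with a per-replicate detection probability p, assuming independent replicates; a minimal sketch with a naive grid search (illustrative only, not the paper's fitting procedure):

```python
import math

def occupancy_loglik(histories, psi, p):
    """Log-likelihood of the standard single-season occupancy model
    (constant occupancy psi, constant detection p, independent
    replicates) -- the baseline that the Markovian-dependence models
    in the paper generalise. Each history is a tuple of 0/1
    detections at one site's spatial replicates."""
    ll = 0.0
    for h in histories:
        lik = psi * math.prod(p if x else 1 - p for x in h)
        if not any(h):          # all-zero history: the site may be
            lik += 1 - psi      # unoccupied, or occupied but missed
        ll += math.log(lik)
    return ll

hist = [(1, 0, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]
# Naive grid search over (psi, p) in place of a proper optimiser:
best = max((occupancy_loglik(hist, a / 20, b / 20), a / 20, b / 20)
           for a in range(1, 20) for b in range(1, 20))
print(best)
```

The Markovian spatial-dependence model replaces the independent-replicate product above with transition probabilities between adjacent replicates, which is what removes the bias reported in the simulations.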
The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015
In this paper we retrace the recent history of statistics by analyzing all
the papers published in five prestigious statistical journals since 1970,
namely: Annals of Statistics, Biometrika, Journal of the American Statistical
Association, Journal of the Royal Statistical Society, series B and Statistical
Science. The aim is to construct a kind of "taxonomy" of the statistical papers
by organizing and clustering them into main themes. In this sense, being
identified in a cluster means being important enough to stand out in the
vast and interconnected world of statistical research. Since the main
statistical research topics are naturally born, evolve, or die over time, we will
also develop a dynamic clustering strategy, where a group in a time period is
allowed to migrate or to merge into different groups in the following one.
Results show that statistics is a very dynamic and evolving science, stimulated
by the rise of new research questions and types of data.
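A toy version of such a dynamic clustering step might link each cluster in the current time period to its most similar predecessor by centroid similarity, marking unmatched clusters as newly born; the similarity measure and threshold below are illustrative assumptions, not the paper's method:

```python
def match_clusters(prev_centroids, curr_centroids, threshold=0.5):
    """Toy dynamic-clustering step: link each cluster in the current
    period to its most similar cluster (cosine similarity of
    term-frequency centroids) in the previous period, or mark it as
    newly born if nothing is similar enough. The threshold and the
    similarity choice are illustrative, not the paper's."""
    def cos(u, v):
        keys = set(u) | set(v)
        dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
        nu = sum(x * x for x in u.values()) ** 0.5
        nv = sum(x * x for x in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    links = {}
    for name, c in curr_centroids.items():
        sims = {p: cos(c, pc) for p, pc in prev_centroids.items()}
        best = max(sims, key=sims.get) if sims else None
        links[name] = best if best and sims[best] >= threshold else "new"
    return links

prev = {"bayes": {"prior": 3, "posterior": 2}, "survival": {"hazard": 4}}
curr = {"mcmc": {"posterior": 3, "prior": 2}, "ml": {"neural": 5}}
print(match_clusters(prev, curr))  # → {'mcmc': 'bayes', 'ml': 'new'}
```

Chaining such links across consecutive periods is what lets a group migrate into, or merge with, different groups over time.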
Modeling Temporal Evidence from External Collections
Newsworthy events are broadcast through multiple mediums and prompt the
crowds to produce comments on social media. In this paper, we propose to
leverage these behavioral dynamics to estimate the most relevant time periods
for an event (i.e., query). Recent advances have shown how to improve the
estimation of the temporal relevance of such topics. Our approach builds
on two major novelties. First, we mine temporal evidence from hundreds of
external sources into topic-based external collections to improve the
robustness of the detection of relevant time periods. Second, we propose a
formal retrieval model that generalizes the use of the temporal dimension
across different aspects of the retrieval process. In particular, we show that
temporal evidence of external collections can be used to (i) infer a topic's
temporal relevance, (ii) select the query expansion terms, and (iii) re-rank
the final results for improved precision. Experiments with TREC Microblog
collections show that the proposed time-aware retrieval model makes an
effective and extensive use of the temporal dimension to improve search results
over the most recent temporal models. Interestingly, we observe a strong
correlation between precision and the temporal distribution of retrieved and
relevant documents.
Comment: To appear in WSDM 201
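A crude sketch of the temporal idea: estimate a topic's temporal relevance as the distribution of publication days among the top pseudo-relevant results, then re-rank by mixing each document's text score with that temporal prior; the mixing weight and day granularity are illustrative assumptions, not the paper's formal retrieval model:

```python
from collections import Counter

def temporal_rescore(results, alpha=0.5):
    """Sketch of time-aware re-ranking: build a temporal prior from
    the publication days of the top pseudo-relevant results, then
    re-rank by a linear mix of text score and temporal prior.
    alpha and the per-day granularity are illustrative choices,
    not the paper's exact model."""
    day_counts = Counter(day for _, _, day in results)
    total = sum(day_counts.values())
    prior = {d: c / total for d, c in day_counts.items()}
    rescored = [(alpha * score + (1 - alpha) * prior[day], doc)
                for doc, score, day in results]
    return [doc for _, doc in sorted(rescored, reverse=True)]

# (doc_id, normalised text score, publication day) triples:
results = [("d1", 0.9, "2011-07-01"),
           ("d2", 0.8, "2011-07-25"),
           ("d3", 0.7, "2011-07-25"),
           ("d4", 0.6, "2011-07-25")]
print(temporal_rescore(results))  # → ['d2', 'd3', 'd4', 'd1']
```

Documents from the burst day overtake a higher-scoring but temporally isolated document, which is the behaviour that mining external collections makes more robust.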