
    Building simulated queries for known-item topics: an analysis using six European languages

    There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, there are still many unaddressed issues regarding their usage and impact on evaluation, because their quality, in terms of retrieval performance, differs from that of real queries. In this paper, we focus on methods for building simulated known-item topics and explore their quality against real known-item topics. Using existing generation models as our starting point, we explore factors which may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages), we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries, both for evaluation purposes and because building models of querying behavior provides a deeper insight into the querying process, so that better retrieval mechanisms can be developed to support the user.
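    Generation models of this kind typically work in two steps: select a target document, then sample query terms from it. A minimal sketch of that style of generator, assuming a toy collection of tokenized documents; the tf-idf-style term weighting and all names are illustrative, not the paper's exact model:

```python
import random

def simulate_known_item_query(docs, k=3, seed=0):
    """Simulate a known-item query: pick a target document, then sample
    k query terms from it, weighting terms by a tf-idf-like score so
    that discriminative terms are favored (illustrative weighting)."""
    rng = random.Random(seed)
    # document frequency of each term across the collection
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    # step 1: select the target document (uniform here; document
    # selection bias is one of the factors such models vary)
    target = rng.choice(range(len(docs)))
    doc = docs[target]
    # step 2: sample terms from the document, weighted by tf * n/df
    n = len(docs)
    weights = [doc.count(t) * (n / df[t]) for t in doc]
    query = rng.choices(doc, weights=weights, k=k)
    return target, query
```

    Because the terms are drawn from the target document itself, every simulated query is guaranteed to mention its known item, which is the defining property of a known-item topic.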

    Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

    There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. To our knowledge, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
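    As in language modeling, perplexity here is the exponentiated negative mean log-likelihood of held-out items; a minimal sketch on fragment log-probabilities, ignoring the RNA-seq corner cases the paper handles:

```python
import math

def perplexity(fragment_log_probs):
    """Perplexity of a set of held-out fragments under a fitted model:
    exp of the negative mean log-likelihood. Lower is better; a model
    assigning every fragment probability 1/V has perplexity exactly V."""
    n = len(fragment_log_probs)
    return math.exp(-sum(fragment_log_probs) / n)
```

    For example, a model that spreads probability uniformly over 8 equally likely fragments scores a perplexity of 8, matching the language-model interpretation of "effective branching factor".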

    Selection models with monotone weight functions in meta analysis

    Publication bias, the fact that studies identified for inclusion in a meta-analysis do not represent all studies on the topic of interest, is commonly recognized as a threat to the validity of the results of a meta-analysis. One way to explicitly model publication bias is via selection models or weighted probability distributions. We adopt the nonparametric approach initially introduced by Dear (1992) but impose that the weight function w is monotonically non-increasing as a function of the p-value. Since in meta-analysis one typically has only a few studies or "observations", regularization of the estimation problem seems sensible. In addition, virtually all parametric weight functions proposed so far in the literature are in fact decreasing. We discuss how to estimate a decreasing weight function in the above model and illustrate the new methodology on two well-known examples. The new approach potentially offers more insight into the selection process than other methods and is more flexible than parametric approaches. Some basic properties of the log-likelihood function and computation of a p-value quantifying the evidence against the null hypothesis of a constant weight function are indicated. In addition, we provide an approximate selection-bias-adjusted profile likelihood confidence interval for the treatment effect. The corresponding software and the datasets used to illustrate it are provided as the R package selectMeta. This enables full reproducibility of the results in this paper.
    Comment: 15 pages, 2 figures. Some minor changes according to reviewer comment
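    Estimation under a monotonicity constraint is commonly handled with pool-adjacent-violators (PAVA) steps. A minimal, self-contained PAVA sketch that projects raw weight estimates onto the set of non-increasing sequences; this is a least-squares illustration of the constraint, not the paper's log-likelihood-based estimator:

```python
def _pava_increasing(y):
    """Least-squares projection of y onto non-decreasing sequences via
    pool-adjacent-violators: merge adjacent blocks that violate order."""
    blocks = []  # each block is [mean, size]
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    out = []
    for m, n in blocks:
        out.extend([m] * n)
    return out

def pava_decreasing(y):
    """Project y onto non-increasing sequences (the monotone
    weight-function constraint): reverse, fit increasing, reverse."""
    return _pava_increasing(y[::-1])[::-1]
```

    The projection preserves the total mass of the raw estimates while pooling any p-value bins whose estimated weights would otherwise increase.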

    Web-Shaped Model for Head Pose Estimation: an Approach for Best Exemplar Selection

    Head pose estimation is a sensitive topic in video surveillance/smart ambient scenarios, since head rotations can hide or distort discriminative features of the face. Face recognition must often deal with video frames in which subjects appear in poses that make recognition nearly impossible. In this respect, selecting the frames with the best face orientation allows recognition to be triggered only on those, therefore decreasing the possibility of errors. This paper proposes a novel approach to head pose estimation for smart city and video surveillance scenarios, aiming at this goal. The method relies on a cascade of two models: the first one predicts the positions of 68 well-known face landmarks; the second one applies a web-shaped model over the detected landmarks, to associate each of them with a specific face sector. The method can work on faces detected at a reasonable distance and at a resolution supported by many current devices. Results of experiments executed over some classical pose estimation benchmarks, namely the Pointing'04, Biwi, and AFLW datasets, show good performance in terms of both pose estimation and computing time. Further results refer to noisy images that are typical of the addressed settings. Finally, examples demonstrate the selection of the best frames from videos captured in video surveillance conditions.
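    The "web" idea can be illustrated by dividing the image plane into angular sectors around a reference landmark (e.g. the nose tip) and histogramming where the other landmarks fall; the sector layout and the histogram feature below are illustrative, not the paper's exact construction:

```python
import math

def sector_of(point, center, n_sectors=8):
    """Assign a landmark to one of n_sectors angular 'web' sectors
    around a reference center point."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)  # in [0, 2*pi)
    return int(angle // (2 * math.pi / n_sectors))

def sector_histogram(landmarks, center, n_sectors=8):
    """Count landmarks per sector; this distribution shifts as the
    head rotates, so it can serve as a simple pose feature."""
    hist = [0] * n_sectors
    for p in landmarks:
        hist[sector_of(p, center, n_sectors)] += 1
    return hist
```

    A rotated head projects the landmarks asymmetrically onto the sectors, which is what makes the per-sector assignment informative about pose.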

    Topics In Time Series Analysis And Forecasting

    This thesis contains new developments in various topics in time series analysis and forecasting. These topics include: model selection, estimation, forecasting, and diagnostic checking. In the area of model selection, finite and large sample properties of the commonly used selection criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), are discussed. In the finite case, the study is limited to the two-sample problem. The exact probability of selection is obtained for finite samples. The risk of each criterion is evaluated in the two-sample situation. Empirical evidence regarding these risks is given for autoregressive processes. The asymptotic distribution of ĥ is given, where ĥ is the estimate of the number of extra parameters in the model selected by the AIC criterion. This derivation is based on large sample properties of the likelihood ratio test statistic. The asymptotic distribution of the AIC in PAR models is also discussed. In estimation, an explicit expression for the efficiency of strongly consistent estimates for the ARMA(1,1) model is derived. Empirical efficiency and the empirical estimate are examined by simulation. On the topic of forecasting, the asymptotic variance of the forecast error is derived for an autoregressive model of first order. In the derivation, the estimated parameter is not assumed to be independent of the data. The variance of the one-step forecast error is also derived for the fractional noise model. In the last topic, empirical results for portmanteau test statistics are studied. It is shown that the modified Portmanteau test of Ljung and Box (1980) outperforms the modified test of Li and McLeod (1981). In testing for whiteness, the modified Portmanteau test is shown to have lower power than the cumulative periodogram test against both fractional noise and standard ARMA alternatives.
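    The AIC/BIC trade-off in the model-selection part can be sketched for autoregressive order selection, using conditional least squares and a Gaussian likelihood with constants dropped; this is a toy illustration, not the thesis's derivations:

```python
import numpy as np

def fit_ar(x, p):
    """Fit AR(p) by conditional least squares; return the coefficient
    vector and the residual variance."""
    n = len(x)
    y = x[p:]
    # column j holds the lag-j values aligned with y
    X = np.column_stack([x[p - j : n - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid @ resid / len(y)

def select_order(x, max_p, criterion="aic"):
    """Choose the AR order minimizing AIC = n*log(sigma2) + 2p or
    BIC = n*log(sigma2) + p*log(n) (additive constants dropped)."""
    best = None
    for p in range(1, max_p + 1):
        _, s2 = fit_ar(x, p)
        n_eff = len(x) - p
        penalty = 2 * p if criterion == "aic" else p * np.log(n_eff)
        score = n_eff * np.log(s2) + penalty
        if best is None or score < best[0]:
            best = (score, p)
    return best[1]
```

    BIC's stronger penalty (p·log n versus 2p) is what drives its consistency for the true order, while AIC tends to admit extra parameters, the quantity whose asymptotic distribution the thesis studies.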

    Survival analysis with delayed entry in selected families with application to human longevity

    In the field of aging research, family-based sampling study designs are commonly used to study the lifespans of long-lived family members. However, the specific sampling procedure should be carefully taken into account in order to avoid biases. This work is motivated by the Leiden Longevity Study, a family-based cohort of long-lived siblings. Families were invited to participate in the study if at least two siblings were ‘long-lived’, where ‘long-lived’ meant being older than 89 years for men or older than 91 years for women. As a result, more than 400 families were included in the study and followed for around 10 years. For estimation of marker-specific survival probabilities and of correlations among the lifetimes of family members, delayed entry due to the outcome-dependent sampling mechanism has to be taken into account. We consider shared frailty models to model left-truncated correlated survival data. The treatment of left truncation in shared frailty models is still an open issue and the literature on this topic is scarce. We show that the current approaches provide, in general, biased estimates, and we propose a new method to tackle this selection problem by correcting the likelihood estimation by means of inverse probability weighting at the family level.
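    The family-level correction weights each family's likelihood contribution by the inverse of its probability of meeting the inclusion criterion ("at least two long-lived siblings"). A minimal sketch under an independence assumption for sibling lifetimes; the paper's shared frailty induces dependence, so this is only an illustrative baseline:

```python
def p_at_least_two(k, q):
    """P(at least 2 of k siblings exceed the age threshold), assuming
    independent lifetimes with marginal exceedance probability q
    (illustrative; a shared frailty would make siblings dependent)."""
    return 1.0 - (1.0 - q) ** k - k * q * (1.0 - q) ** (k - 1)

def ipw_loglik(family_logliks, sibship_sizes, q):
    """Inverse-probability-weighted log-likelihood: each sampled family
    is up-weighted by 1 / P(it satisfied the inclusion criterion)."""
    return sum(ll / p_at_least_two(k, q)
               for ll, k in zip(family_logliks, sibship_sizes))
```

    Small families are heavily up-weighted because they are less likely to contain two long-lived siblings by chance, which is exactly the selection effect the weighting undoes.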

    Tigers on trails: occupancy modeling for cluster sampling

    Occupancy modeling focuses on inference about the distribution of organisms over space, using temporal or spatial replication to allow inference about the detection process. Inference based on spatial replication strictly requires that replicates be selected randomly and with replacement, but the importance of these design requirements is not well understood. This paper focuses on an increasingly popular sampling design based on spatial replicates that are not selected randomly and that are expected to exhibit Markovian dependence. We develop two new occupancy models for data collected under this sort of design, one based on an underlying Markov model for spatial dependence and the other based on a trap response model with Markovian detections. We then simulated data under the model for Markovian spatial dependence and fit the data to standard occupancy models and to the two new models. Bias of occupancy estimates was substantial for the standard models, smaller for the new trap response model, and negligible for the new spatial process model. We also fit these models to data from a large-scale tiger occupancy survey recently conducted in Karnataka State, southwestern India. In addition to providing evidence of a positive relationship between tiger occupancy and habitat, model selection statistics and estimates strongly supported the use of the model with Markovian spatial dependence. This new model provides another tool for the decomposition of the detection process, which is sometimes needed for proper estimation and which may also permit interesting biological inferences. In addition to designs employing spatial replication, we note the likely existence of temporal Markovian dependence in many designs using temporal replication. The models developed here will be useful either directly, or with minor extensions, for these designs as well. 
We believe that these new models represent important additions to the suite of modeling tools now available for occupancy estimation in conservation monitoring. More generally, this work represents a contribution to the topic of cluster sampling for situations in which there is a need for specific modeling (e.g., reflecting dependence) of the distribution of the variable(s) of interest among subunits.
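    The likelihood under Markovian spatial dependence can be sketched as a forward (HMM-style) pass over local presence/absence states along successive trail segments; the parameterization below (theta0, theta1, the initial-state choice) is illustrative, not the paper's exact model:

```python
def markov_occupancy_lik(history, psi, theta0, theta1, p):
    """Likelihood of one detection history: the site is occupied with
    prob psi; given occupancy, local presence on successive segments
    follows a Markov chain (theta1 = P(present | present on previous
    segment), theta0 = P(present | absent)); detection prob is p given
    local presence, and detection is impossible when locally absent."""
    def emit(state, y):
        if state == 0:                 # locally absent
            return 1.0 if y == 0 else 0.0
        return p if y == 1 else 1.0 - p

    pi1 = theta0 / (theta0 + 1.0 - theta1)  # illustrative initial presence prob
    alpha = [(1.0 - pi1) * emit(0, history[0]), pi1 * emit(1, history[0])]
    for y in history[1:]:
        a0 = alpha[0] * (1.0 - theta0) + alpha[1] * (1.0 - theta1)
        a1 = alpha[0] * theta0 + alpha[1] * theta1
        alpha = [a0 * emit(0, y), a1 * emit(1, y)]
    lik_given_occ = alpha[0] + alpha[1]
    all_zero = all(y == 0 for y in history)
    return psi * lik_given_occ + (1.0 - psi) * (1.0 if all_zero else 0.0)
```

    Setting theta0 = theta1 = 1 collapses the chain to "always locally present", recovering the standard occupancy likelihood, which is a useful sanity check on the decomposition.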

    The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

    In this paper we retrace the recent history of statistics by analyzing all the papers published since 1970 in five prestigious statistical journals: the Annals of Statistics, Biometrika, the Journal of the American Statistical Association, the Journal of the Royal Statistical Society, Series B, and Statistical Science. The aim is to construct a kind of "taxonomy" of the statistical papers by organizing and clustering them into main themes. In this sense, being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of statistical research. Since the main statistical research topics are naturally born, evolve, or die over time, we also develop a dynamic clustering strategy, in which a group in one time period is allowed to migrate or to merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.
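    One simple way to implement the migrate-or-merge dynamics is to link clusters across adjacent time periods by the overlap of their characteristic term sets; a hypothetical Jaccard-based sketch (the paper's actual linking strategy may differ):

```python
def match_clusters(prev, curr, threshold=0.3):
    """Link each cluster in the current period to the previous-period
    cluster with the highest Jaccard overlap of characteristic terms;
    a cluster with no overlap above the threshold is treated as a
    newly born topic (link value None)."""
    links = {}
    for j, c in enumerate(curr):
        best, best_sim = None, threshold
        for i, p in enumerate(prev):
            sim = len(p & c) / len(p | c)
            if sim >= best_sim:
                best, best_sim = i, sim
        links[j] = best
    return links
```

    Several current clusters linking to one previous cluster represents a split, while the reverse mapping captures merges.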

    Modeling Temporal Evidence from External Collections

    Newsworthy events are broadcast through multiple media and prompt crowds to produce comments on social media. In this paper, we propose to leverage these behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. Our approach builds on two major novelties. First, we mine temporal evidence from hundreds of external sources into topic-based external collections to improve the robustness of the detection of relevant time periods. Second, we propose a formal retrieval model that generalizes the use of the temporal dimension across different aspects of the retrieval process. In particular, we show that temporal evidence from external collections can be used to (i) infer a topic's temporal relevance, (ii) select the query expansion terms, and (iii) re-rank the final results for improved precision. Experiments with TREC Microblog collections show that the proposed time-aware retrieval model makes an effective and extensive use of the temporal dimension to improve search results over the most recent temporal models. Interestingly, we observe a strong correlation between precision and the temporal distribution of retrieved and relevant documents.
    Comment: To appear in WSDM 201
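    The mixture of textual and temporal evidence can be sketched as a linear interpolation, with the temporal component a kernel density over salient time points mined from the external collections; the interpolation weight, the Gaussian kernel, and all names are illustrative, not the paper's estimators:

```python
import math

def temporal_score(doc_time, query_times, sigma=1.0):
    """Temporal relevance of a document: average Gaussian-kernel
    affinity between its timestamp and the query's salient time
    points (as mined from external collections)."""
    return sum(math.exp(-0.5 * ((doc_time - t) / sigma) ** 2)
               for t in query_times) / len(query_times)

def combined_score(text_score, doc_time, query_times, lam=0.7, sigma=1.0):
    """Linear mixture of textual and temporal evidence."""
    return lam * text_score + (1.0 - lam) * temporal_score(
        doc_time, query_times, sigma)
```

    The same temporal density could also gate query-expansion term selection or re-ranking, the two other uses of temporal evidence the abstract lists.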