    Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case

    There are several estimators of conditional probability from observed frequencies of features. In this paper, we propose using the lower limit of the confidence interval on the posterior distribution determined by the observed frequencies to ascertain conditional probability. In our experiments, this method outperformed other popular estimators. Comment: The 2016 International Conference on Advanced Informatics: Concepts, Theory and Application (ICAICTA2016)
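
To illustrate the general idea of a conservative (lower-bound) estimate, here is a minimal sketch. It is not the paper's implementation: the uniform Beta(1, 1) prior, the one-sided 5% level, and the function names `beta_posterior_cdf` and `conservative_estimate` are all assumptions made for this example.

```python
from math import comb


def beta_posterior_cdf(x, successes, trials):
    """CDF at x of the Beta(successes + 1, trials - successes + 1) posterior.

    Assumes a uniform prior (an assumption of this sketch). For integer
    parameters the Beta CDF equals a binomial tail sum:
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    Intended for small counts; very large counts would overflow floats.
    """
    m = trials + 1      # a + b - 1
    a = successes + 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


def conservative_estimate(successes, trials, alpha=0.05, tol=1e-10):
    """Lower alpha-quantile of the posterior, found by bisection.

    The posterior CDF is increasing in x, so we search for the point
    where it crosses alpha.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_posterior_cdf(mid, successes, trials) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Both cases have a raw relative frequency of 1.0.
    print(conservative_estimate(1, 1))
    print(conservative_estimate(100, 100))
```

Under these assumptions, a feature seen once in one trial gets a lower bound of about 0.22, while one seen 100 times in 100 trials gets about 0.97, so the infrequent case is down-weighted rather than ignored.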

    Rhetorical relations for information retrieval

    Typically, every part of most coherent texts has some plausible reason for its presence, some function that it performs for the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this so-called discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query. Empirical evaluation of different versions of our model on TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (> 10% in mean average precision over a state-of-the-art baseline).

    General Type Token Distribution

    We consider the problem of estimating the number of types in a corpus using the number of types observed in a sample of tokens from that corpus. We derive exact and asymptotic distributions for the number of observed types, conditioned upon the number of tokens and the latent type distribution. We use the asymptotic distributions to derive an estimator of the latent number of types and we validate this estimator numerically. Comment: This paper is accepted in Biometrika. 5 pages and no figure in the main paper. 3 pages and 1 figure in the supplementary material.
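
As a small illustration of the quantity being modelled, the sketch below computes the expected number of distinct types observed in a sample and checks it by simulation. It assumes tokens are drawn independently from a fixed latent type distribution; it shows only this standard expectation, not the paper's exact or asymptotic distributions or its estimator, and the function names are hypothetical.

```python
import random


def expected_observed_types(type_probs, n_tokens):
    """Expected number of distinct types among n_tokens i.i.d. draws.

    Type i is missed by every token with probability (1 - p_i) ** n_tokens,
    so it is observed with the complementary probability.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in type_probs)


def simulate_observed_types(type_probs, n_tokens, n_trials=20000, seed=0):
    """Monte Carlo average of the number of distinct types per sample."""
    rng = random.Random(seed)
    types = range(len(type_probs))
    total = 0
    for _ in range(n_trials):
        sample = rng.choices(types, weights=type_probs, k=n_tokens)
        total += len(set(sample))
    return total / n_trials


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.1, 0.1]  # illustrative latent distribution
    print(expected_observed_types(probs, 5))
    print(simulate_observed_types(probs, 5))
```

The gap between the number of observed types and the true number of types (here 4) is what an estimator of the latent number of types has to account for.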