    Fisher's exact test explains a popular metric in information retrieval

    Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure widely used in information retrieval to quantify the importance of a term of interest in one document out of many. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument suggesting that this tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness.
    Comment: 26 pages, 4 figures, 1 table, minor revision
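
    The claimed correspondence is easy to probe numerically. Below is a minimal sketch (not the authors' code) comparing a common tf-idf variant against the negative log of the hypergeometric tail probability; the particular variant, the toy counts, and the contingency framing (term occurrences in the document versus the rest of the corpus) are all assumptions for illustration.

        from math import log
        from scipy.stats import hypergeom

        def tfidf(tf, n_docs, df):
            # One common tf-idf variant: raw term frequency times
            # log inverse document frequency.
            return tf * log(n_docs / df)

        def neg_log_fisher_p(k, doc_len, term_total, corpus_len):
            # One-tailed Fisher's exact test = hypergeometric tail:
            # probability of drawing >= k term occurrences in doc_len tokens,
            # given term_total occurrences among corpus_len tokens overall.
            return -log(hypergeom.sf(k - 1, corpus_len, term_total, doc_len))

        # Hypothetical numbers, purely to exercise the two quantities:
        print(tfidf(tf=7, n_docs=10_000, df=120))
        print(neg_log_fisher_p(k=7, doc_len=500, term_total=150,
                               corpus_len=5_000_000))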

    An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise

    Collaborative filtering-based recommender systems have proven extremely successful in settings where user preference data on items is abundant. However, collaborative filtering algorithms are hindered by their weakness against the item cold-start problem and their general lack of interpretability. Ontology-based recommender systems exploit hierarchical organizations of users and items to enhance browsing, recommendation, and profile construction. While ontology-based approaches address the shortcomings of their collaborative filtering counterparts, ontological organizations of items can be difficult to obtain for items that mostly belong to the same category (e.g., television series episodes). In this paper, we present an ontology-based recommender system that integrates the knowledge represented in a large ontology of literary themes to produce fiction content recommendations. The main novelty of this work is an ontology-based method for computing similarities between items and its integration with the classical Item-KNN (K-nearest neighbors) algorithm. As a case study, we evaluated the proposed method against other approaches by performing the classical rating prediction task on a collection of Star Trek television series episodes in an item cold-start scenario. This transverse evaluation provides insights into the utility of different information resources and methods for the initial stages of recommender system development. We found our proposed method to be a convenient alternative to collaborative filtering approaches for collections of mostly similar items, particularly when other content-based approaches are not applicable or otherwise unavailable. Aside from the new methods, this paper contributes a testbed for future research and an online framework to collaboratively extend the ontology of literary themes to cover other narrative content.
    Comment: 25 pages, 6 figures, 5 tables, minor revision
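
    To make the integration concrete, here is a minimal Item-KNN sketch with a pluggable item-item similarity. A plain Jaccard overlap between theme sets stands in for the paper's ontology-based similarity, which is the actual contribution; the function names and data layout are our assumptions.

        def theme_similarity(themes_a, themes_b):
            # Illustrative stand-in for the ontology-based similarity:
            # Jaccard overlap between two items' sets of theme labels.
            union = themes_a | themes_b
            return len(themes_a & themes_b) / len(union) if union else 0.0

        def predict_rating(user_ratings, target, item_themes, k=20):
            # Classical Item-KNN: weight the user's known ratings by each
            # rated item's similarity to the target, keeping the top k.
            neighbours = sorted(
                ((theme_similarity(item_themes[i], item_themes[target]), r)
                 for i, r in user_ratings.items() if i != target),
                reverse=True)[:k]
            denom = sum(s for s, _ in neighbours)
            return sum(s * r for s, r in neighbours) / denom if denom else None

        # Hypothetical toy data: episode ids mapped to theme sets and ratings.
        themes = {"ep1": {"honor", "duty"}, "ep2": {"duty", "sacrifice"},
                  "ep3": {"first contact"}}
        print(predict_rating({"ep1": 5.0, "ep3": 2.0}, "ep2", themes))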

    Propagation connectivity of random hypergraphs

    We study the concept of propagation connectivity on random 3-uniform hypergraphs. This concept is inspired by a simple linear-time algorithm for solving instances of certain constraint satisfaction problems. We derive upper and lower bounds for the propagation connectivity threshold and point out some algorithmic implications.
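
    For intuition, one common formalization of the propagation process (our reading, not necessarily the paper's exact definition): starting from a marked pair of vertices, any hyperedge with two marked vertices marks its third, and the hypergraph is propagation connected if some starting pair eventually marks everything. A brute-force sketch:

        from itertools import combinations

        def propagates(n, edges, start_pair):
            # Repeatedly fire any 3-vertex edge with exactly two marked
            # endpoints, marking the third, until nothing changes.
            marked = set(start_pair)
            changed = True
            while changed:
                changed = False
                for e in edges:
                    unknown = [v for v in e if v not in marked]
                    if len(unknown) == 1:
                        marked.add(unknown[0])
                        changed = True
            return len(marked) == n

        def propagation_connected(n, edges):
            # Brute force over all starting pairs, for clarity rather
            # than speed.
            return any(propagates(n, edges, p)
                       for p in combinations(range(n), 2))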

    A Simple Message Passing Algorithm for Graph Partitioning

    Motivated by belief propagation, we propose a simple and deterministic message passing algorithm for the Graph Bisection problem and related problems. The running time of the main algorithm is linear in the number of vertices and edges. For evaluating its average-case correctness, planted solution models are used. For the Graph Bisection problem under the standard planted solution model with probability parameters p and r, we prove that our algorithm yields a planted solution with probability > 1 − δ if p − r = Ω(n^{−1/2} log(n/δ)).
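
    The planted solution model referenced here is straightforward to instantiate. The sketch below generates such an instance and runs a crude majority-style label update in the spirit of message passing; it is our illustration of the setting, not the paper's algorithm, and the anchor-vertex initialization is an assumption.

        import random

        def planted_bisection(n, p, r, seed=0):
            # Standard planted solution model: the first and second halves of
            # the vertex set are the planted parts; an edge appears with
            # probability p inside a part and r across the cut (p > r).
            rng = random.Random(seed)
            same = lambda u, v: (u < n // 2) == (v < n // 2)
            return [(u, v) for u in range(n) for v in range(u + 1, n)
                    if rng.random() < (p if same(u, v) else r)]

        def majority_bisect(n, edges, rounds=30):
            # Crude message-passing-flavored heuristic: pin two anchors to
            # opposite labels, then let every vertex repeatedly adopt the sign
            # of the sum of its neighbors' labels (p > r makes neighbors
            # mostly same-side).
            adj = [[] for _ in range(n)]
            for u, v in edges:
                adj[u].append(v)
                adj[v].append(u)
            label = [0.0] * n
            for _ in range(rounds):
                label[0], label[n - 1] = 1.0, -1.0   # re-pin the anchors
                total = [sum(label[w] for w in adj[v]) for v in range(n)]
                label = [(t > 0) - (t < 0) for t in total]
            return {v for v in range(n) if label[v] > 0}

        part = majority_bisect(200, planted_bisection(200, p=0.2, r=0.05))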

    Finding Most Likely Solutions

    As a framework for simple but basic statistical inference problems, we introduce the generic Most Likely Solution problem, the task of finding a most likely solution (MLS for short) for a given problem instance under some given probability model. Although many MLS problems are NP-hard, we propose, for these problems, to study their average-case complexity under their assumed probability models. We show three examples of MLS problems, and explain that “message passing algorithms” (e.g., belief propagation) work reasonably well for these problems. Some of the technical results of this paper are from the author's recent work [WY06, OW06].
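
    As a toy illustration of the MLS framing (our example, not one of the paper's three): decoding over a binary symmetric channel. With flip probability q < 1/2, the likelihood of a candidate codeword falls off in its Hamming distance from the received word, so the most likely solution is simply the nearest codeword.

        def most_likely_codeword(received, codebook, q=0.1):
            # Under a binary symmetric channel, P(received | sent) equals
            # q**d * (1 - q)**(n - d) with d the Hamming distance, which is
            # decreasing in d whenever q < 1/2: the MLS is the nearest
            # codeword.
            assert 0 < q < 0.5
            dist = lambda a, b: sum(x != y for x, y in zip(a, b))
            return min(codebook, key=lambda c: dist(c, received))

        # Hypothetical 3-repetition code over 2 message bits:
        codebook = ["000000", "000111", "111000", "111111"]
        print(most_likely_codeword("010110", codebook))  # -> "000111"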