178 research outputs found

    Geometric Perspectives of the BM25

    Abstract. In this paper, we present the initial findings about a possible geometric interpretation of the BM25 model and a comparison of the BM25 with the Binary Independence Model (BIM) on a two-dimensional space. A Web application was developed in R to show an example of this geometric view on a standard TREC collection. The application is accessible at the following link: http://gmdn.shinyapps.io/shinyRF0

    Generate to Understand for Representation

    In recent years, a significant number of high-quality pretrained models have emerged, greatly impacting Natural Language Understanding (NLU), Natural Language Generation (NLG), and Text Representation tasks. Traditionally, these models are pretrained on custom domain corpora and finetuned for specific tasks, resulting in high costs related to GPU usage and labor. Unfortunately, recent trends in language modeling have shifted towards enhancing performance through scaling, further exacerbating the associated costs. Introducing GUR: a pretraining framework that combines language modeling and contrastive learning objectives in a single training step. We select similar text pairs based on their Longest Common Substring (LCS) from raw unlabeled documents and train the model using masked language modeling and unsupervised contrastive learning. The resulting model, GUR, achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever at the recall benchmark in a zero-shot setting. Additionally, GUR maintains its language modeling ability, as demonstrated in our ablation experiment. Our code is available at https://github.com/laohur/GUR
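
    The pair-selection idea in this abstract (choosing similar texts by their Longest Common Substring) can be illustrated with a minimal sketch. This is not the GUR implementation: the function names, the `min_ratio` threshold, and the rule of comparing the LCS length to the shorter text are all assumptions made for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    Classic dynamic programme, O(len(a) * len(b)) time, O(len(b)) memory.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def select_pairs(texts, min_ratio=0.3):
    """Hypothetical selection rule: keep pairs whose LCS covers at least
    min_ratio of the shorter text. The threshold is an assumption."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            lcs = longest_common_substring(texts[i], texts[j])
            shorter = min(len(texts[i]), len(texts[j]))
            if shorter and len(lcs) / shorter >= min_ratio:
                pairs.append((i, j, lcs))
    return pairs
```

    The quadratic pairwise loop is only workable for small collections; a real pipeline over raw documents would presumably need an index or suffix structure to find candidate pairs.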

    A Multi-criteria Decision Support System for Ph.D. Supervisor Selection: A Hybrid Approach

    Selection of a suitable Ph.D. supervisor is a very important step in a student’s career. This paper presents a multi-criteria decision support system to assist students in making this choice. The system employs a hybrid method that first utilizes a fuzzy analytic hierarchy process to extract the relative importance of the identified criteria and sub-criteria to consider when selecting a supervisor. Then, it applies an information retrieval-based similarity algorithm (TF/IDF or Okapi BM25) to retrieve relevant candidate supervisor profiles based on the student’s research interest. The selected profiles are then re-ranked based on other relevant factors chosen by the user, such as publication record, research grant record, and collaboration record. The ranking method evaluates the potential supervisors objectively based on various metrics that are defined in terms of detailed domain-specific knowledge, making part of the decision making automatic. In contrast with other existing works, this system does not require the professor’s involvement and no subjective measures are employed
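
    As an illustration of the retrieval step mentioned in this abstract, the following is a minimal Okapi BM25 sketch for scoring tokenized profiles against a tokenized statement of research interest. It is not the system described in the paper: the function name, the parameter values `k1=1.2` and `b=0.75` (commonly used defaults), and the non-negative IDF variant are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the tokenized query.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    Assumes docs is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

    Sorting the profiles by descending score would give a candidate list that a second stage could then re-rank on other criteria.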

    A probabilistic approach for cluster based polyrepresentative information retrieval

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Document clustering in information retrieval (IR) is considered an alternative to rank-based retrieval approaches, because of its potential to support user interactions beyond just typing in queries. Similarly, the Principle of Polyrepresentation (multi-evidence: combining multiple cognitively and/or functionally different information need or information object representations for improving an IR system's performance) is an established approach in cognitive IR with plausible applicability in the domain of information seeking and retrieval. The combination of these two approaches can assimilate their respective individual strengths in order to further improve the performance of IR systems. The main goal of this study is to combine cognitive and cluster-based IR approaches for improving the effectiveness of (interactive) information retrieval systems. In order to achieve this goal, polyrepresentative information retrieval strategies for cluster browsing and retrieval have been designed, focusing on the evaluation aspect of such strategies. This thesis addresses the challenge of designing and evaluating an Optimum Clustering Framework (OCF) based model, implementing probabilistic document clustering for interactive IR. Thus, polyrepresentative cluster browsing strategies have been devised. With these strategies a simulated user based method has been adopted for evaluating the polyrepresentative cluster browsing and searching strategies. The proposed approaches are evaluated for information need based polyrepresentative clustering as well as document based polyrepresentation and the combination thereof. For document-based polyrepresentation, the notion of citation context is exploited, which has special applications in scientometrics and bibliometrics for science literature modelling. The information need polyrepresentation, on the other hand, utilizes the various aspects of user information need, which is crucial for enhancing the retrieval performance. Besides describing a probabilistic framework for polyrepresentative document clustering, one of the main findings of this work is that the proposed combination of the Principle of Polyrepresentation with document clustering has the potential of enhancing the user interactions with an IR system, provided that the various representations of information need and information objects are utilized. The thesis also explores interactive IR approaches in the context of polyrepresentative interactive information retrieval when it is combined with document clustering methods. Experiments suggest there is a potential in the proposed cluster-based polyrepresentation approach, since statistically significant improvements were found when comparing the approach to a BM25-based baseline in an ideal scenario. Further marginal improvements were observed when cluster-based re-ranking and cluster-ranking based comparisons were made. The performance of the approach depends on the underlying information object and information need representations used, which confirms findings of previous studies where the Principle of Polyrepresentation was applied in different ways.

    Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions

    Many open-domain questions are under-specified and thus have multiple possible answers, each of which is correct under a different interpretation of the question. Answering such ambiguous questions is challenging, as it requires retrieving and then reasoning about diverse information from multiple passages. We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia. On the challenging ASQA benchmark, which requires generating long-form answers that summarize the multiple answers to an ambiguous question, our method improves performance by 15% (relative improvement) on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs. Retrieving from the database of generated questions also gives large improvements in diverse passage retrieval (by matching user questions q to passages p indirectly, via questions q' generated from p)

    Probability models for information retrieval based on divergence from randomness

    This thesis devises a novel methodology based on probability theory, suitable for the construction of term-weighting models of Information Retrieval. Our term-weighting functions are created within a general framework made up of three components. Each of the three components is built independently from the others. We obtain the term-weighting functions from the general model in a purely theoretic way, instantiating each component with different probability distribution forms. The thesis begins by investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference, and we display and employ the derived estimation techniques in the context of Information Retrieval. We initially pay great attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is extensively discussed and used throughout the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach. In revisiting the main information retrieval models in the literature, we show that even the language modelling approach can be exploited to assign term-frequency normalization to the models of divergence from randomness. We finally introduce a novel framework for query expansion. This framework is based on the models of divergence from randomness and it can be applied to arbitrary models of IR, including divergence-based, language modelling and probabilistic models. We have carried out a very large number of experiments, and the results show that the framework generates highly effective Information Retrieval models.

    Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models

    A retrieval model should not only interpolate the training data but also extrapolate well to the queries that are different from the training data. While neural retrieval models have demonstrated impressive performance on ad-hoc search benchmarks, we still know little about how they perform in terms of interpolation and extrapolation. In this paper, we demonstrate the importance of separately evaluating the two capabilities of neural retrieval models. Firstly, we examine existing ad-hoc search benchmarks from the two perspectives. We investigate the distribution of training and test data and find a considerable overlap in query entities, query intent, and relevance labels. This finding implies that the evaluation on these test sets is biased toward interpolation and cannot accurately reflect the extrapolation capacity. Secondly, we propose a novel evaluation protocol to separately evaluate the interpolation and extrapolation performance on existing benchmark datasets. It resamples the training and test data based on query similarity and utilizes the resampled dataset for training and evaluation. Finally, we leverage the proposed evaluation protocol to comprehensively revisit a number of widely-adopted neural retrieval models. Results show models perform differently when moving from interpolation to extrapolation. For example, representation-based retrieval models perform almost as well as interaction-based retrieval models in terms of interpolation but not extrapolation. Therefore, it is necessary to separately evaluate both interpolation and extrapolation performance and the proposed resampling method serves as a simple yet effective evaluation tool for future IR studies. Comment: CIKM 2022 Full Paper

    Timeout Reached, Session Ends?

    The identification of sessions as a means of understanding user behaviour is a common research area of web usage mining. Different definitions and concepts have been discussed for over 20 years. Research shows that session identification should not be an arbitrary process. There is a questionable tendency towards simplistic mechanical sessions instead of more complex logical segmentations. This dissertation aims to demonstrate how differing session-identification approaches lead to diverging results and interpretations. The overarching research question asks: will different session-identification approaches impact analysis and machine learning tasks? A comprehensive methodological framework for implementing, comparing and evaluating sessions is given. The dissertation provides implementation guidelines for 135 session-identification approaches, utilizing a complete year (2018) of traffic data from a German price-comparison e-commerce platform. The implementation includes mechanical concepts, logical constructs and the combination of multiple methods. It shows how logical sessions were constructed from user sequences by employing embedding algorithms on interaction logs, taking a novel approach to logical session identification that utilizes the topical proximity of interactions instead of search queries alone. All approaches are compared and quantitatively described. The application in three machine-learning tasks (such as recommendation) is intended to show that using different sessions as input data has a marked impact on the outcome. The main contribution of this dissertation is to provide a comprehensive comparison of session-identification algorithms. The research provides a methodology to implement, analyse and compare a wide variety of mechanics, making it possible to better understand user behaviour and the effects of session modelling. The main results show that differently structured input data may drastically change the results of algorithms or analysis.
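
    The "mechanical" session concept that the title alludes to can be illustrated with a minimal inactivity-timeout sketch. It is not taken from the dissertation: the function name and the 30-minute default are assumptions (30 minutes is a widely used convention), and the 135 approaches compared in the thesis are not reproduced here.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's event timestamps into sessions.

    A gap longer than `timeout` between consecutive events starts a new
    session. The 30-minute default is an assumed convention.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            # Still within the inactivity window: extend the current session.
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [
    datetime(2018, 1, 1, 10, 0),
    datetime(2018, 1, 1, 10, 10),
    datetime(2018, 1, 1, 10, 50),
    datetime(2018, 1, 1, 11, 0),
]
print(len(split_sessions(events)))
print(len(split_sessions(events, timeout=timedelta(minutes=5))))
```

    The two calls segment the same four events differently depending only on the chosen timeout, which is the kind of sensitivity to session definition that the abstract describes.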