129 research outputs found

    Database Optimization Aspects for Information Retrieval

    Get PDF
    There is a growing need for systems that can process queries, combining both structured data and text. One way to provide such functionality is to integrate information retrieval (IR) techniques in a database management system (DBMS). However, both IR and database research have been separate research fields for decades, resulting in different - even conflicting - approaches to data management. Each DBMS has a component called a "query optimizer", which plays a crucial role in the efficiency and flexibility of the system. So, for successful integration the IR techniques and data structures, as well as the DBMS query optimizer, should be adapted to enable mutual cooperation. The author concentrates on top-N queries - a common class of IR queries. An IR top-N query asks for the N best documents given a set of keywords. The author proposes processing the data in batches as a compromise between IR and DBMS query processing. Experiments with this technique show that porting IR optimization techniques is (still) not a promising option due to the additional administrative overhead. Two new mathematical models are introduced to eliminate this overhead: a model that predicts selectivity, which is a crucial factor in the execution costs, and a model that predicts the quality of the top-N

    Large Language Models for Information Retrieval: A Survey

    Full text link
    As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions within this expanding field

    Ph.D. Thesis Proprosal: Transportable Agents

    Get PDF
    One of the paradigms that has been suggested for allowing efficient access to remote resources is transportable agents. A transportable agent is a named program that can migrate from machine to machine in a heterogeneous network. The program chooses when and where to migrate. It can suspend its execution at an arbitrary point, transport to another machine and resume execution on the new machine. Transportable agents have several advantages over the traditional client/server model. Transportable agents consume less network bandwidth and do not require a connection between communicating machines -- this is attractive in all networks and particularly attractive in wireless networks. Transportable agents are a convenient paradigm for distributed computing since they hide the communication channels but not the location of the computation. Transportable agents allow clients and servers to program each other. However transportable agents pose numerous challenges such as security, privacy and efficiency. Existing transportable agent systems do not meet all of these challenges. In addition there has been no formal characterization of the performance of transportable agents. This thesis addresses these weakness. The thesis has two parts -- (1) formally characterize the performance of transportable agents through mathematical analysis and network simulation and (2) implement a complete transportable agent system

    Cross-Platform Question Answering in Social Networking Services

    Get PDF
    The last two decades have made the Internet a major source for knowledge seeking. Several platforms have been developed to find answers to one's questions such as search engines and online encyclopedias. The wide adoption of social networking services has pushed the possibilities even further by giving people the opportunity to stimulate the generation of answers that are not already present on the Internet. Some of these social media services are primarily community question answering (CQA) sites, while the others have a more general audience but can also be used to ask and answer questions. The choice of a particular platform (e.g., a CQA site, a microblogging service, or a search engine) by some user depends on several factors such as awareness of available resources and expectations from different platforms, and thus will sometimes be suboptimal. Hence, we introduce \emph{cross-platform question answering}, a framework that aims to improve our ability to satisfy complex information needs by returning answers from different platforms, including those where the question has not been originally asked. We propose to build this core capability by defining a general architecture for designing and implementing real-time services for answering naturally occurring questions. This architecture consists of four key components: (1) real-time detection of questions, (2) a set of platforms from which answers can be returned, (3) question processing by the selected answering systems, which optionally involves question transformation when questions are answered by services that enforce differing conventions from the original source, and (4) answer presentation, including ranking, merging, and deciding whether to return the answer. We demonstrate the feasibility of this general architecture by instantiating a restricted development version in which we collect the questions from one CQA website, one microblogging service or directly from the asker, and find answers from among some subset of those CQA and microblogging services. To enable the integration of new answering platforms in our architecture, we introduce a framework for automatic evaluation of their effectiveness

    Evaluating Temporal Persistence Using Replicability Measures

    Full text link
    In real-world Information Retrieval (IR) experiments, the Evaluation Environment (EE) is exposed to constant change. Documents are added, removed, or updated, and the information need and the search behavior of users is evolving. Simultaneously, IR systems are expected to retain a consistent quality. The LongEval Lab seeks to investigate the longitudinal persistence of IR systems, and in this work, we describe our participation. We submitted runs of five advanced retrieval systems, namely a Reciprocal Rank Fusion (RRF) approach, ColBERT, monoT5, Doc2Query, and E5, to both sub-tasks. Further, we cast the longitudinal evaluation as a replicability study to better understand the temporal change observed. As a result, we quantify the persistence of the submitted runs and see great potential in this evaluation method.Comment: To be published in Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece 18 - 21, 202

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    Get PDF
    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create create a simple, languageindependent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conation step and useful in case of few language-specific resources. For English, the corpusbased stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best, for Bengali and Marathi, overlapping 3-grams obtain the best result, and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages

    Filtrage et agrégation d'informations vitales relatives à des entités

    Get PDF
    Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources to access information on a wide variety of entities (an entity is a thing that can be distinctly identified such a person, an organization, a product, an event, etc.). However, the update of these sources with new information related to a given entity is done manually by contributors with a significant latency time particularly if that entity is not popular. A system that analyzes documents when published on the Web to filter important information about entities will probably accelerate the update of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, concerning the entities. We aim at answering the following two issues: (1) How to detect if a document is vital (i.e., it provides timely relevant information) to an entity? and (2) How to extract vital information from these documents to build a temporal summary about the entity that can be seen as a reference for updating the corresponding knowledge base entry?Regarding the first issue, we proposed two methods. The first proposal is fully supervised. It is based on a vitality language model. The second proposal measures the freshness of temporal expressions in a document to decide its vitality. Concerning the second issue, we proposed a method that selects the sentences based on the presence of triggers words automatically retrieved from the knowledge already represented in the knowledge base (such as the description of similar entities).We carried out our experiments on the TREC Stream corpus 2013 and 2014 with 1.2 billion documents and different types of entities (persons, organizations, facilities and events). For vital documents filtering approaches, we conducted our experiments in the context of the task "knowledge Base Acceleration (KBA)" for the years 2013 and 2014. Our method based on leveraging the temporal expressions in the document obtained good results outperforming the best participant system in the task KBA 2013. In addition, we showed the importance of our generated temporal summaries to accelerate the update of knowledge bases.Aujourd'hui, les bases de connaissances telles que Wikipedia et DBpedia représentent les sources principales pour accéder aux informations disponibles sur une grande variété d'entités (une entité est une chose qui peut être distinctement identifiée par exemple une personne, une organisation, un produit, un événement, etc.). Cependant, la mise à jour de ces sources avec des informations nouvelles en rapport avec une entité donnée se fait manuellement par des contributeurs et avec un temps de latence important en particulier si cette entité n'est pas populaire. Concevoir un système qui analyse les documents dès leur publication sur le Web pour filtrer les informations importantes relatives à des entités pourra sans doute accélérer la mise à jour de ces bases de connaissances. Dans cette thèse, nous nous intéressons au filtrage d'informations pertinentes et nouvelles, appelées vitales, relatives à des entités. Ces travaux rentrent dans le cadre de la recherche d'information mais visent aussi à enrichir les techniques d'ingénierie de connaissances en aidant à la sélection des informations à traiter. Nous souhaitons répondre principalement aux deux problématiques suivantes: (1) Comment détecter si un document est vital (c.à.d qu'il apporte une information pertinente et nouvelle) par rapport à une entité donnée? et (2) Comment extraire les informations vitales à partir de ces documents qui serviront comme référence pour mettre à jour des bases de connaissances? Concernant la première problématique, nous avons proposé deux méthodes. La première proposition est totalement supervisée. Elle se base sur un modèle de langue de vitalité. La deuxième proposition mesure la fraîcheur des expressions temporelles contenues dans un document afin de décider de sa vitalité. En ce qui concerne la deuxième problématique relative à l'extraction d'informations vitales à partir des documents vitaux, nous avons proposé une méthode qui sélectionne les phrases comportant potentiellement ces informations vitales, en nous basant sur la présence de mots déclencheurs récupérés automatiquement à partir de la connaissance déjà représentée dans la base de connaissances (comme la description d'entités similaires).L'évaluation des approches proposées a été effectuée dans le cadre de la campagne d'évaluation internationale TREC sur une collection de 1.2 milliard de documents avec différents types d'entités (personnes, organisations, établissements et événements). Pour les approches de filtrage de documents vitaux, nous avons mené nos expérimentations dans le cadre de la tâche "Knwoledge Base Acceleration (KBA)" pour les années 2013 et 2014. L'exploitation des expressions temporelles dans le document a permis d'obtenir de bons résultats dépassant le meilleur système proposé dans la tâche KBA 2013. Pour évaluer les contributions concernant l'extraction des informations vitales relatives à des entités, nous nous sommes basés sur le cadre expérimental de la tâche "Temporal Summarization (TS)". Nous avons montré que notre approche permet de minimiser le temps de latence des mises à jour de bases de connaissances
    corecore