
    Vector Search with OpenAI Embeddings: Lucene Is All You Need

    We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection. The main goal of our work is to challenge the prevailing narrative that a dedicated vector store is necessary to take advantage of recent advances in deep neural networks as applied to search. Quite the contrary, we show that hierarchical navigable small-world network (HNSW) indexes in Lucene are adequate to provide vector search capabilities in a standard bi-encoder architecture. This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure.
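
    As a concrete illustration of the setup described above, the following is a minimal Java sketch of indexing and querying pre-computed embeddings with Lucene's HNSW-backed vector fields (Lucene 9.4+ API). The class name, field names, stub encoder, and 768-dimensional unit vectors are illustrative assumptions, not the paper's actual code; note also that older Lucene releases capped vector dimensions at 1024, which matters for 1536-dimensional OpenAI ada-002 embeddings.

        import java.nio.file.Paths;
        import java.util.Arrays;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.KnnFloatVectorField;
        import org.apache.lucene.document.StoredField;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.VectorSimilarityFunction;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.KnnFloatVectorQuery;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.store.FSDirectory;

        public class LuceneHnswSketch {
            public static void main(String[] args) throws Exception {
                // Index one passage with its pre-computed embedding; the HNSW
                // graph is built inside Lucene's default vector codec.
                try (FSDirectory dir = FSDirectory.open(Paths.get("hnsw-index"));
                     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                    Document doc = new Document();
                    doc.add(new StoredField("id", "passage-0"));
                    doc.add(new KnnFloatVectorField("embedding",
                            embed("some passage text"),
                            VectorSimilarityFunction.DOT_PRODUCT));
                    writer.addDocument(doc);
                }
                // Approximate top-10 nearest-neighbour search over the same field.
                try (DirectoryReader reader =
                             DirectoryReader.open(FSDirectory.open(Paths.get("hnsw-index")))) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    KnnFloatVectorQuery query =
                            new KnnFloatVectorQuery("embedding", embed("a query"), 10);
                    for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                        System.out.println(hit.doc + "\t" + hit.score);
                    }
                }
            }

            // Hypothetical stand-in for an external encoder (e.g., an embedding
            // API call); returns a unit-length placeholder vector.
            private static float[] embed(String text) {
                float[] v = new float[768];
                Arrays.fill(v, 1f / (float) Math.sqrt(768));
                return v;
            }
        }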

    End-to-End Retrieval with Learned Dense and Sparse Representations Using Lucene

    The bi-encoder architecture provides a framework for understanding machine-learned retrieval models based on dense and sparse vector representations. Although these representations capture parametric realizations of the same underlying conceptual framework, their respective implementations of top-k similarity search require the coordination of different software components (e.g., inverted indexes, HNSW indexes, and toolkits for neural inference), often knitted together in complex architectures. In this work, we ask the following question: What's the simplest design, in terms of requiring the fewest changes to existing infrastructure, that can support end-to-end retrieval with modern dense and sparse representations? The answer appears to be that Lucene is sufficient, as we demonstrate in Anserini, a toolkit for reproducible information retrieval research. That is, effective retrieval with modern single-vector neural models can be efficiently performed directly in Java on the CPU. We examine the implications of this design for information retrieval researchers pushing the state of the art as well as for software engineers building production search systems.
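
    The sparse half of this design is commonly realized by baking quantized learned term weights into ordinary term frequencies and scoring with a similarity that simply reads them back. The Java sketch below illustrates that idea against Lucene's SimilarityBase; it is a hedged illustration, not Anserini's actual implementation, and the class name and quantization convention (repeat each document token round(weight) times at index time) are assumptions.

        import org.apache.lucene.search.similarities.BasicStats;
        import org.apache.lucene.search.similarities.SimilarityBase;

        // Illustrative "impact" similarity: with document-side weights encoded
        // as term frequencies, scoring a term match by its raw frequency times
        // the query-side boost turns retrieval into a sparse inner product.
        public class ImpactSimilaritySketch extends SimilarityBase {
            @Override
            protected double score(BasicStats stats, double freq, double docLen) {
                // No IDF and no length normalization: the boost carries the
                // learned query weight, freq the quantized document weight.
                return stats.getBoost() * freq;
            }

            @Override
            public String toString() {
                return "ImpactSimilaritySketch";
            }
        }

    Setting this similarity on both the IndexWriterConfig and the IndexSearcher, and wrapping each query term in a BoostQuery carrying its learned weight, would then make the top-k scores the dot products the bi-encoder framework calls for.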

    Graph-of-Entity: A Model for Combined Data Representation and Retrieval

    Managing large volumes of digital documents, along with the information they contain or are associated with, can be challenging. As systems become more intelligent, it increasingly makes sense to power retrieval with all available data, where every lead makes it easier to reach relevant documents or entities. Modern search is heavily powered by structured knowledge, but users still query using keywords or, at best, telegraphic natural language. As search becomes increasingly dependent on the integration of text and knowledge, novel approaches for a unified representation of combined data present the opportunity to unlock new ranking strategies. We tackle entity-oriented search using graph-based approaches for representation and retrieval. In particular, we propose the graph-of-entity, a novel approach for indexing combined data, where terms, entities, and their relations are jointly represented. We compare the graph-of-entity with the graph-of-word, a text-only model, verifying that, overall, it does not yet achieve better performance, despite obtaining higher precision. Our assessment was based on a small subset of the INEX 2009 Wikipedia Collection, created from a sample of 10 topics and their respective judged documents. The offline evaluation we present here is complementary to its counterpart from the TREC 2017 OpenSearch track, where, during our participation, we assessed the graph-of-entity in an online setting through team-draft interleaving.
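
    The team-draft interleaving used in that online assessment is simple enough to sketch: in each round, the ranker with fewer contributions so far (ties broken by coin flip) places its highest-ranked document not yet shown, and user clicks are credited to whichever ranker contributed the clicked document. The Java sketch below covers only the interleaving step, with invented names and no credit assignment.

        import java.util.ArrayList;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Random;
        import java.util.Set;

        // Minimal sketch of team-draft interleaving over two ranked lists.
        public class TeamDraftSketch {
            public static List<String> interleave(List<String> a, List<String> b,
                                                  Random rnd) {
                List<String> out = new ArrayList<>();
                Set<String> seen = new HashSet<>();
                int ia = 0, ib = 0;        // cursors into each ranking
                int teamA = 0, teamB = 0;  // picks made by each ranker so far
                while (true) {
                    // Skip documents the other ranker has already placed.
                    while (ia < a.size() && seen.contains(a.get(ia))) ia++;
                    while (ib < b.size() && seen.contains(b.get(ib))) ib++;
                    if (ia >= a.size() && ib >= b.size()) break;
                    // Fewer picks goes next; ties are broken randomly.
                    boolean pickA = ib >= b.size()
                            || (ia < a.size()
                                && (teamA < teamB
                                    || (teamA == teamB && rnd.nextBoolean())));
                    if (pickA) {
                        seen.add(a.get(ia));
                        out.add(a.get(ia++));
                        teamA++;
                    } else {
                        seen.add(b.get(ib));
                        out.add(b.get(ib++));
                        teamB++;
                    }
                }
                return out;
            }
        }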

    Lucene4IR: Developing information retrieval evaluation resources using Lucene

    The workshop and hackathon on developing Information Retrieval Evaluation Resources using Lucene (L4IR) was held on the 8th and 9th of September, 2016 at the University of Strathclyde in Glasgow, UK, and was funded by the ESF Elias Network. The event featured three main elements: (i) a series of keynote and invited talks on industry, teaching, and evaluation; (ii) planning, coding, and hacking sessions, in which a number of groups created modules and infrastructure for using Lucene to undertake TREC-based evaluations; and (iii) a number of breakout groups discussing the challenges, opportunities, and problems in bridging the divide between academia and industry, and how Lucene can be used for teaching and learning Information Retrieval (IR). The event brought together a mix of academics, experts, and students wanting to learn, share, and create evaluation resources for the community. The hacking was intense and the discussions lively, creating the basis of many useful tools but also raising numerous issues. It was clear that, by adopting and contributing to the most widely used and supported open-source IR toolkit, there were many benefits for academics, students, researchers, developers, and practitioners: a basis for stronger evaluation practices, increased reproducibility, more efficient knowledge transfer, greater collaboration between academia and industry, and shared teaching and training resources.

    Report on the 1st Simulation for Information Retrieval Workshop (Sim4IR 2021) at SIGIR 2021

    Simulation is used as a low-cost and repeatable means of experimentation. As Information Retrieval (IR) researchers, we are no strangers to the idea of using simulation within our own field, such as the traditional means of IR system evaluation manifested through the Cranfield paradigm. While simulation has been used in other areas of IR research (such as the study of user behaviours), we argue that its potential has so far been recognised by relatively few IR researchers. To this end, the Sim4IR workshop was held online on July 15th, 2021, in conjunction with ACM SIGIR 2021. Building on past efforts, the goal of the workshop was to create a forum for researchers and practitioners to promote methodology for, and development of, more widespread use of simulation for IR evaluation. Around 80 participants took part over two sessions, which featured two keynotes, three original paper presentations, and eight 'encore talks'. The main conclusions from the resulting discussion were that simulation has the potential to address the limitations of existing evaluation methodologies, but more research is needed toward developing realistic user simulators, and that the development and sharing of simulators, in the form of toolkits and online services, is critical for successful uptake.
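
    To make the notion of a user simulator concrete, here is a small, hedged Java sketch of a position-biased click model of the kind such toolkits typically implement. The decay and click probabilities are invented for illustration and are not drawn from the workshop.

        import java.util.Random;

        // Illustrative simulated user: the probability of examining rank i
        // decays geometrically, and a click requires both examination and a
        // relevance-dependent click decision.
        public class ClickSimulatorSketch {
            private final Random rnd = new Random(42);
            private final double examineDecay = 0.7;     // P(examine rank i) = 0.7^i
            private final double clickIfRelevant = 0.9;  // assumed click probabilities
            private final double clickIfNotRelevant = 0.1;

            public boolean[] simulateClicks(boolean[] relevant) {
                boolean[] clicks = new boolean[relevant.length];
                for (int i = 0; i < relevant.length; i++) {
                    double pExamine = Math.pow(examineDecay, i);
                    double pClick = relevant[i] ? clickIfRelevant : clickIfNotRelevant;
                    clicks[i] = rnd.nextDouble() < pExamine * pClick;
                }
                return clicks;
            }
        }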

    A Living Lab Architecture for Reproducible Shared Task Experimentation

    No existing evaluation infrastructure for shared tasks currently supports both reproducible online and offline experiments. In this work, we present an architecture that ties together both types of experiments with a focus on reproducibility. Readers are provided with a technical description of the infrastructure and details of how to contribute their own experiments to upcoming evaluation tasks.

    Automatic keyword extraction for a partial search engine index

    Full-text search engines play a critical role in many enterprise applications, where the quantity and complexity of the information are overwhelming. Promptly finding documents that contain information relevant to pressing questions is a necessity for efficient operation. This is especially the case for financial and legal teams executing Mergers and Acquisitions deals. The goal of this thesis is to provide search services for such teams without storing the sensitive documents involved, minimising the risk of potential data leaks. A literature review of related methods and concepts is presented. As search engine technologies that use encrypted indices are still in their early stages for commercial applications, the solution proposed in the thesis is partial indexing by keyword extraction. A cosine similarity-based evaluation was used to measure the performance difference between the keyword-based partial index and the complete index. The partial indices were constructed using unsupervised keyword extraction methods based on term frequency, document graphs, and topic modelling. The frequency-based methods were term frequency, TF-IDF, and YAKE!; the graph-based method was TextRank; and the topic modelling-based methods were NMF, LDA, and LSI. The methods were evaluated by running 51 reference queries on the LEDGAR data set, which contains 60,540 contracts. The results show that, using only five keywords per document from the TF-IDF or YAKE! methods, the best-matching documents in the result lists had a cosine similarity of 0.7 on average. This value is reasonably high, particularly considering the small number of keywords. The topic modelling-based methods were found to perform poorly because they were too general, while the term frequency and TextRank methods were mediocre.
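
    As a hedged sketch of the frequency-based extraction evaluated above: score each term in a document by tf·idf over the corpus and keep the top five as that document's partial-index keywords. The toy corpus, tokenization, and class name below are illustrative and do not reproduce the thesis's actual pipeline.

        import java.util.Comparator;
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Map;
        import java.util.stream.Collectors;

        // Illustrative TF-IDF keyword extraction for a partial index.
        public class TfIdfKeywordsSketch {
            public static List<String> topKeywords(List<List<String>> corpus,
                                                   int docId, int k) {
                // Document frequency of each term across the corpus.
                Map<String, Long> df = new HashMap<>();
                for (List<String> doc : corpus) {
                    for (String t : new HashSet<>(doc)) {
                        df.merge(t, 1L, Long::sum);
                    }
                }
                // Term frequencies within the target document.
                Map<String, Long> tf = corpus.get(docId).stream()
                        .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
                int n = corpus.size();
                // Rank terms by descending tf * idf and keep the top k.
                return tf.entrySet().stream()
                        .sorted(Comparator.comparingDouble(
                                (Map.Entry<String, Long> e) -> -(e.getValue()
                                        * Math.log((double) n / df.get(e.getKey())))))
                        .limit(k)
                        .map(Map.Entry::getKey)
                        .collect(Collectors.toList());
            }

            public static void main(String[] args) {
                List<List<String>> corpus = List.of(
                        List.of("merger", "agreement", "party", "closing", "merger"),
                        List.of("lease", "tenant", "landlord", "premises", "lease"),
                        List.of("merger", "lease", "notice", "notice", "governing"));
                System.out.println(topKeywords(corpus, 0, 5));  // keywords for doc 0
            }
        }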

    Continuous evaluation of large-scale information access systems : a case for living labs

    A/B testing is increasingly being adopted for the evaluation of commercial information access systems with a large user base, since it provides the advantage of observing the efficiency and effectiveness of such systems under real conditions. Unfortunately, unless university-based researchers closely collaborate with industry or develop their own infrastructure or user base, they cannot validate their ideas in live settings with real users. Without online testing opportunities open to the research community, academic researchers are unable to employ online evaluation on a larger scale. This means that they do not get feedback on their ideas and cannot advance their research further. Businesses, in turn, miss the opportunity for higher customer satisfaction due to improved systems, and users miss the chance to benefit from improved information access systems. In this chapter, we introduce two evaluation initiatives at CLEF, NewsREEL and Living Labs for IR (LL4IR), that aim to address this growing “evaluation gap” between academia and industry. We explain the challenges and discuss our experiences organizing these living labs.
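
    For readers unfamiliar with the mechanics, the assignment step of A/B testing is itself straightforward; below is a hedged Java sketch of deterministic user bucketing, in which a stable user id is hashed into a control or treatment arm. The hashing scheme and split ratio are illustrative, not NewsREEL's or LL4IR's actual protocol.

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;

        // Illustrative deterministic A/B bucketing: the same user always lands
        // in the same arm, so exposure stays consistent across sessions.
        public class AbBucketSketch {
            /** Returns true if the user falls into the treatment arm. */
            public static boolean inTreatment(String userId, String experiment,
                                              double treatmentShare) {
                try {
                    MessageDigest md = MessageDigest.getInstance("SHA-256");
                    byte[] h = md.digest((experiment + ":" + userId)
                            .getBytes(StandardCharsets.UTF_8));
                    // Interpret the first four bytes as an unsigned 32-bit bucket.
                    long bucket = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                            | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
                    return bucket / (double) (1L << 32) < treatmentShare;
                } catch (NoSuchAlgorithmException e) {
                    throw new IllegalStateException(e);
                }
            }
        }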