Modelos, algoritmos y aplicaciones en búsquedas a gran escala
The publication of digital information grows day by day at exponential rates. This demands greater hardware capacity from service providers and imposes restrictions on users in terms of ease of access. Moreover, since users require relevant information as quickly as possible, the high rate at which new content appears challenges search tools, which must account for and efficiently handle the size, complexity, and dynamism of today's digital information sources.
In the case of processing massive document collections, one of the efficiency challenges is to examine as few documents as possible while still satisfying a query. If, on the other hand, documents arrive in real time (as streams), efficient strategies are needed for routing them to search nodes and for incremental indexing.
These problems generally call for distributed and parallel processing and highly efficient algorithms. In most cases, the partitioning of the problem and the distribution of the workload are the aspects of these strategies that must be optimized for the problem at hand. Track: Databases and Data Mining. Red de Universidades con Carreras en Informática.
Estrategias algorítmicas y estructuras de datos eficientes para búsquedas en datos masivos
The digital world exposes us daily to a constantly growing amount of data, which demands effective and highly efficient tools to process and access it. The diversity of applications that produce and consume data, together with an equally growing number of users, imposes computational challenges, both algorithmic and in the available hardware. Typical examples are large-scale search systems (such as web search engines) and real-time search services (such as those available on social networks).
These scenarios not only demand greater capacity from service providers (which affects their operations) but also conceptual and practical improvements to the data structures and algorithms needed for systems to scale properly and cope with the demand. Efficiency is a fundamental requirement in today's digital world, characterized by massive, heterogeneous, and dynamic data.
These research lines address search problems over massive data, from the data structures to the algorithms needed to process documents, social-media posts, and queries, with the goal of making search systems scalable and, ultimately, of making more rational use of resources.
Efficient query processing for scalable web search
Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. 
Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time search, energy efficiency, and modern hardware and software architectures.
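To make the DAAT dynamic pruning discussion concrete, below is a minimal sketch of WAND-style top-k processing over a toy in-memory index. The postings lists, impact scores, and function are illustrative assumptions for this summary, not code from the survey; real systems run over compressed, block-structured indexes (as in BMW):

```python
import heapq

# Toy postings: term -> sorted list of (docid, impact score).
POSTINGS = {
    "sparse":    [(1, 1.2), (3, 0.4), (7, 2.0)],
    "retrieval": [(1, 0.8), (2, 1.5), (7, 1.1), (9, 0.3)],
    "index":     [(3, 0.9), (7, 0.6), (9, 1.4)],
}

def wand_top_k(terms, k):
    """Document-at-a-time WAND: skip documents whose best-case score
    (sum of per-term upper bounds) cannot enter the current top-k."""
    cursors = []  # per query term: [position, postings, max score]
    for t in terms:
        plist = POSTINGS[t]
        cursors.append([0, plist, max(s for _, s in plist)])
    heap = []  # min-heap of (score, docid): the current top-k
    while True:
        live = [c for c in cursors if c[0] < len(c[1])]
        if not live:
            break
        live.sort(key=lambda c: c[1][c[0]][0])  # order by current docid
        threshold = heap[0][0] if len(heap) == k else 0.0
        # Find the pivot: first cursor at which the accumulated upper
        # bounds exceed the threshold; earlier docs cannot make the top-k.
        ub, pivot = 0.0, None
        for i, c in enumerate(live):
            ub += c[2]
            if ub > threshold:
                pivot = i
                break
        if pivot is None:
            break  # no remaining document can beat the threshold
        pivot_doc = live[pivot][1][live[pivot][0]][0]
        if live[0][1][live[0][0]][0] == pivot_doc:
            # All cursors up to the pivot sit on pivot_doc: score it fully.
            score = sum(c[1][c[0]][1] for c in live
                        if c[1][c[0]][0] == pivot_doc)
            for c in live:  # advance every cursor past pivot_doc
                while c[0] < len(c[1]) and c[1][c[0]][0] <= pivot_doc:
                    c[0] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
        else:
            # Skip: jump the leading cursor forward to pivot_doc.
            c = live[0]
            while c[0] < len(c[1]) and c[1][c[0]][0] < pivot_doc:
                c[0] += 1
    return sorted(heap, key=lambda x: -x[0])

print(wand_top_k(["sparse", "retrieval", "index"], 2))  # doc 7, then doc 1
```

The pivot computation is what enables skipping: every document before the pivot has a best-case score that cannot beat the current top-k threshold, so it is never fully scored.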
Optimizing Guided Traversal for Fast Learned Sparse Retrieval
Recent studies show that BM25-driven dynamic index skipping can greatly accelerate MaxScore-based document retrieval based on the learned sparse representation derived by DeepImpact. This paper investigates the effectiveness of such a traversal guidance strategy during top-k retrieval when using other models such as SPLADE and uniCOIL, and finds that unconstrained BM25-driven skipping can visibly degrade relevance when the BM25 model is not well aligned with a learned weight model or when the retrieval depth k is small. This paper generalizes the previous work and optimizes BM25-guided index traversal with a two-level pruning control scheme and model alignment for fast retrieval using a sparse representation. Although there can be a cost of increased latency, the proposed scheme is much faster than the original MaxScore method without BM25 guidance while retaining relevance effectiveness. This paper analyzes the competitiveness of this two-level pruning scheme, and evaluates its trade-off between ranking relevance and time efficiency when searching several test datasets.
Comment: This paper is published in WWW'2
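The guided-traversal idea can be caricatured at the document level. The sketch below is not the paper's algorithm (which skips at the postings level inside MaxScore); it only illustrates, under our own assumptions, how a cheap guide score with a hypothetical pruning-control knob `mu` can avoid expensive learned-model evaluations, and how an aggressive setting can discard a relevant document when the two models disagree:

```python
import heapq

TOY_DOCS = [
    # (doc_id, cheap guide score, expensive learned score); invented data
    (1, 5.0, 2.0),
    (2, 1.0, 9.0),
    (3, 4.0, 3.0),
    (4, 0.1, 10.0),  # guide model badly underestimates this document
]

def guided_top_k(docs, k, mu):
    """Toy guided pruning: skip a document, without computing its
    learned score, when its guide score falls below mu times the
    weakest guide score currently retained in the top-k heap.
    mu=0 disables skipping (exact); larger mu prunes more aggressively."""
    top = []       # min-heap of (learned score, doc_id)
    guide_of = {}  # guide score of each doc currently in the heap
    for doc_id, guide, learned in docs:
        if len(top) == k:
            floor = min(guide_of[d] for _, d in top)
            if guide < mu * floor:
                continue  # pruned: learned score never evaluated
        if len(top) < k:
            heapq.heappush(top, (learned, doc_id))
            guide_of[doc_id] = guide
        elif learned > top[0][0]:
            _, evicted = heapq.heapreplace(top, (learned, doc_id))
            del guide_of[evicted]
            guide_of[doc_id] = guide
    return sorted(top, reverse=True)
```

With mu=0.8 the sketch prunes document 4 on its weak guide score even though its learned score is the best in the collection; with mu=0 pruning is disabled and document 4 is recovered. This mirrors the abstract's observation that unconstrained guide-driven skipping can degrade relevance when the guide and learned models are misaligned.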
An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors
Maximum Inner Product Search, or top-k retrieval on sparse vectors, is well understood in information retrieval, with a number of mature algorithms that solve it exactly. However, all existing algorithms are tailored to text and frequency-based similarity measures. To achieve optimal memory footprint and query latency, they rely on the near-stationarity of documents and on laws governing natural languages. We consider, instead, a setup in which collections are streaming -- necessitating dynamic indexing -- and where indexing and retrieval must work with arbitrarily distributed real-valued vectors. As we show, existing algorithms are no longer competitive in this setup, even against naive solutions. We investigate this gap and present a novel approximate solution, called Sinnamon, that can efficiently retrieve the top-k results for sparse real-valued vectors drawn from arbitrary distributions. Notably, Sinnamon offers levers to trade off memory consumption, latency, and accuracy, making the algorithm suitable for constrained applications and systems. We give theoretical results on the error introduced by the approximate nature of the algorithm, and present an empirical evaluation of its performance on two hardware platforms and on synthetic and real-valued datasets. We conclude by laying out concrete directions for future research on this general top-k retrieval problem over sparse vectors.
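The naive exact baseline the abstract alludes to fits in a few lines: a dynamic, coordinate-wise inverted index that accepts insertions at any time and answers exact top-k queries by score accumulation. This is an illustrative sketch of that baseline, not Sinnamon itself (the class and method names are our assumptions):

```python
import heapq
from collections import defaultdict

class StreamingSparseIndex:
    """Minimal dynamic inverted index for real-valued sparse vectors:
    vectors may arrive at any time, and a query scans only the postings
    of its own non-zero coordinates."""

    def __init__(self):
        self.postings = defaultdict(list)  # dim -> [(vec_id, value)]

    def insert(self, vec_id, sparse_vec):
        """sparse_vec: dict mapping dimension -> non-zero value."""
        for dim, val in sparse_vec.items():
            self.postings[dim].append((vec_id, val))

    def top_k(self, query, k):
        """Exact maximum-inner-product top-k via score accumulation."""
        scores = defaultdict(float)
        for dim, qval in query.items():
            for vec_id, val in self.postings[dim]:
                scores[vec_id] += qval * val
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

Sinnamon replaces the exact postings with fixed-size probabilistic sketches, which is what buys the memory/latency/accuracy levers mentioned in the abstract; this baseline has neither the levers nor the bounded footprint.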
Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals, despite being manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-k retrieval methods. We study IVF-based retrieval, where vectors are partitioned into clusters and only a fraction of the clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS over general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.
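The IVF recipe described above (partition with KMeans, probe only the highest-scoring clusters, rank the surviving candidates exactly) can be sketched in plain Python. This is a toy illustration under our own simplifications (Lloyd iterations on small dense tuples, no dimensionality reduction, standard rather than spherical KMeans):

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclid2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, n_clusters, n_iter=10, seed=0):
    """Toy IVF: partition vectors with standard (Lloyd) KMeans and keep
    one inverted list of vector ids per cluster."""
    rng = random.Random(seed)
    centroids = [list(vectors[i])
                 for i in rng.sample(range(len(vectors)), n_clusters)]
    assign = [0] * len(vectors)
    for _ in range(n_iter):
        for i, v in enumerate(vectors):  # assign to nearest centroid
            assign[i] = min(range(n_clusters),
                            key=lambda j: euclid2(v, centroids[j]))
        for j in range(n_clusters):      # recompute centroids
            members = [vectors[i] for i in range(len(vectors))
                       if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    lists = [[i for i in range(len(vectors)) if assign[i] == j]
             for j in range(n_clusters)]
    return centroids, lists

def ivf_mips(vectors, centroids, lists, q, k, n_probe):
    """Probe only the n_probe clusters whose centroids have the highest
    inner product with the query, then rank those candidates exactly."""
    probe = sorted(range(len(centroids)),
                   key=lambda j: -dot(centroids[j], q))[:n_probe]
    cand = [i for j in probe for i in lists[j]]
    return sorted(cand, key=lambda i: -dot(vectors[i], q))[:k]
```

Raising n_probe trades latency for recall: probing every cluster recovers the exact MIPS answer, while probing a single cluster returns candidates only from the most promising partition.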
Managing tail latency in large scale information retrieval systems
As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users?
This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience such as "how many queries take more than 200ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?"
While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness.
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments.
Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
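The dissertation's framing, moving from median latency to questions such as "how many queries take more than 200ms?", corresponds to simple statistics over measured per-query latencies. A minimal sketch (the function names and the nearest-rank percentile convention are our assumptions, not the dissertation's):

```python
import math

def tail_latency(latencies_ms, pct=99.0):
    """pct-th percentile latency via the nearest-rank method: the value
    that the slowest (100 - pct)% of queries meet or exceed."""
    xs = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100.0 * len(xs)))
    return xs[rank - 1]

def frac_over(latencies_ms, budget_ms):
    """Fraction of queries that exceed a latency budget."""
    xs = list(latencies_ms)
    return sum(x > budget_ms for x in xs) / len(xs)
```

Reporting the p99 alongside the fraction of queries over a fixed budget captures the worst-case experience that median latency hides.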
XXIII Edición del Workshop de Investigadores en Ciencias de la Computación : Pósters
This volume collects the posters presented at the XXIII Workshop de Investigadores en Ciencias de la Computación (WICC), organized by the Universidad Nacional de Chilecito and held virtually on April 15 and 16, 2021.
Actas del XXIV Workshop de Investigadores en Ciencias de la Computación: WICC 2022
A compilation of the papers presented at the XXIV Workshop de Investigadores en Ciencias de la Computación (WICC), held in Mendoza in April 2022.