
    DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries

    We study the problem of vector set search with vector set queries. This task is analogous to traditional near-neighbor search, except that both the query and each element in the collection are sets of vectors. We identify this problem as a core subroutine for semantic search applications and find that existing solutions are unacceptably slow. To this end, we present a new approximate search algorithm, DESSERT (DESSERT Efficiently Searches Sets of Embeddings via Retrieval Tables). DESSERT is a general tool with strong theoretical guarantees and excellent empirical performance. When we integrate DESSERT into ColBERT, a state-of-the-art semantic search model, we find a 2-5x speedup on the MS MARCO and LoTTE retrieval benchmarks with minimal loss in recall, underscoring the effectiveness and practical applicability of our proposal.
    Comment: Code available, https://github.com/ThirdAIResearch/Desser
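    As an illustration of the problem setting (not of the DESSERT index itself), the following is a minimal brute-force sketch of vector set search using a ColBERT-style MaxSim aggregation; the scoring choice and all names are assumptions for illustration, not taken from the paper.

    import numpy as np

    def set_similarity(query_set: np.ndarray, doc_set: np.ndarray) -> float:
        """MaxSim-style score: for each query vector, take its best match in the
        document's vector set, then sum over the query vectors.
        query_set has shape (q, d); doc_set has shape (m, d)."""
        sims = query_set @ doc_set.T          # (q, m) pairwise inner products
        return float(sims.max(axis=1).sum())  # best match per query vector, summed

    def vector_set_search(query_set, collection, top_k=3):
        """Brute-force baseline that indexes like DESSERT accelerate: score every
        vector set in the collection against the query set and return the top-k."""
        scores = [(i, set_similarity(query_set, doc)) for i, doc in enumerate(collection)]
        scores.sort(key=lambda t: t[1], reverse=True)
        return scores[:top_k]

    # Toy usage: 100 documents, each a set of 32 vectors of dimension 16.
    rng = np.random.default_rng(0)
    collection = [rng.standard_normal((32, 16)) for _ in range(100)]
    query = rng.standard_normal((8, 16))
    print(vector_set_search(query, collection))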

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, in both the batch and streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare for pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm, on both synthetic and real data, the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
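    For readers unfamiliar with the task, here is a small brute-force sketch of mining frequent closed itemsets from a window of transactions; it illustrates the output that IncMine approximates incrementally, not the IncMine algorithm or its MOA integration, and the toy data is assumed.

    from itertools import combinations
    from collections import Counter

    def frequent_closed_itemsets(transactions, min_support):
        """Brute-force enumeration: an itemset is frequent if its support meets
        min_support, and closed if no proper superset has the same support."""
        transactions = [frozenset(t) for t in transactions]
        items = sorted(set().union(*transactions))
        support = Counter()
        for r in range(1, len(items) + 1):
            for cand in combinations(items, r):
                s = frozenset(cand)
                count = sum(1 for t in transactions if s <= t)
                if count >= min_support:
                    support[s] = count
        return {s: c for s, c in support.items()
                if not any(s < t and support[t] == c for t in support)}

    # Toy window of transactions, absolute support threshold of 2.
    window = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, count in frequent_closed_itemsets(window, 2).items():
        print(set(itemset), count)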

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
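    As a reminder of the programming model the surveyed systems build on, here is a minimal, single-process word-count sketch of the map, shuffle, and reduce phases; real frameworks add the distribution, scheduling, and fault tolerance discussed above, and the function names here are illustrative only.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document: str):
        """Map: emit (key, value) pairs; here, (word, 1) for every word."""
        for word in document.split():
            yield word.lower(), 1

    def shuffle(pairs):
        """Shuffle: group all values by key (handled by the framework in practice)."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        """Reduce: aggregate the values of one key; here, sum the counts."""
        return key, sum(values)

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}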

    Engineering an Open Web Syndication Interchange with Discovery and Recommender Capabilities

    Web syndication has become a popular means of delivering relevant information to people online, but the complexity of standards, algorithms, and applications poses considerable challenges to engineers. This paper describes the design and development of a novel Web-based syndication intermediary called InterSynd and a simple Web client as a proof of concept. We developed format-neutral middleware that sits between content sources and the user. Additional objectives were to add feed discovery and recommendation components to the intermediary. A search-based feed discovery module helps users find relevant feed sources. Implicit collaborative recommendations of new feeds are also made to the user. The syndication software uses open-standard XML technologies and free open-source libraries. Extensibility and re-configurability were explicit goals. The experience shows that a modular architecture can combine open source modules to build state-of-the-art syndication middleware and applications. The software metrics data indicate that a high degree of modularity was retained.
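    As a sketch of the kind of implicit collaborative feed recommendation described above (the actual InterSynd component may work differently), the following ranks feeds a user does not yet follow by their similarity to the feeds the user already follows; the subscription data and names are assumed for illustration.

    from math import sqrt

    # Hypothetical subscription data: user -> set of subscribed feed identifiers.
    subscriptions = {
        "alice": {"feedA", "feedB", "feedC"},
        "bob":   {"feedB", "feedC", "feedD"},
        "carol": {"feedA", "feedD"},
    }

    def cosine(users_x, users_y):
        """Cosine similarity between two feeds' subscriber sets."""
        if not users_x or not users_y:
            return 0.0
        return len(users_x & users_y) / sqrt(len(users_x) * len(users_y))

    def recommend(user, top_n=2):
        """Item-based collaborative filtering over implicit subscription data."""
        feeds = {f for subs in subscriptions.values() for f in subs}
        subscribers = {f: {u for u, s in subscriptions.items() if f in s} for f in feeds}
        seen = subscriptions[user]
        scores = {f: sum(cosine(subscribers[f], subscribers[g]) for g in seen)
                  for f in feeds - seen}
        return sorted(scores.items(), key=lambda t: t[1], reverse=True)[:top_n]

    print(recommend("alice"))  # feeds alice does not follow, ranked by overlap with hers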

    A context- and template-based data compression approach to improve resource-constrained IoT systems interoperability.

    The goal of the Internet of Things (IoT) is to interconnect all kinds of things, from simple devices, such as a light bulb or a thermostat, to more complex and abstract elements, such as a machine or a house. These devices and elements vary enormously among themselves, especially in the capabilities they possess and the types of technologies they use. This heterogeneity introduces great complexity into integration processes as far as interoperability is concerned. A common approach to addressing interoperability at the data representation level in IoT systems is to structure the data following a standard data model together with text-based data formats (e.g., XML). However, the kinds of devices normally used in IoT systems have limited capabilities and scarce processing and communication resources. Because of these limitations, text-based data formats cannot be integrated simply and efficiently on resource-constrained devices and networks. In this thesis, we present a novel data compression solution for text-based data formats that is specifically designed with the limitations of resource-constrained devices and networks in mind. We call this solution Context- and Template-based Compression (CTC). CTC improves data-level interoperability in IoT systems while requiring very few resources in terms of communication bandwidth, memory size, and processing power.
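    To make the general idea concrete, here is a minimal sketch of template-based compression of a text payload: sender and receiver share a numbered template, so only a template ID and the variable field values travel over the wire. This illustrates the technique under an assumed message format and packet layout; it is not the CTC design itself.

    import json
    import struct

    # Shared context: both endpoints know the templates ahead of time (assumed format).
    TEMPLATES = {
        1: '{"sensor": "%s", "temperature": %s, "unit": "C"}',
    }

    def compress(template_id: int, values: list) -> bytes:
        """Send only the template ID and the field values, not the full text."""
        payload = "\x1f".join(values).encode()                  # values joined by a unit separator
        return struct.pack("!BH", template_id, len(payload)) + payload

    def decompress(packet: bytes) -> str:
        """Rebuild the original text message from the shared template."""
        template_id, length = struct.unpack("!BH", packet[:3])
        values = packet[3:3 + length].decode().split("\x1f")
        return TEMPLATES[template_id] % tuple(values)

    message = compress(1, ["kitchen", "21.5"])
    print(len(message), "bytes on the wire:", message)
    print("reconstructed:", json.loads(decompress(message)))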