1,819 research outputs found
DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries
We study the problem of with . This task is analogous to traditional near-neighbor search, with the
exception that both the query and each element in the collection are
of vectors. We identify this problem as a core subroutine for
semantic search applications and find that existing solutions are unacceptably
slow. Towards this end, we present a new approximate search algorithm, DESSERT
(ESSERT ffeciently earches ets of mbeddings via etrieval ables). DESSERT is a general tool
with strong theoretical guarantees and excellent empirical performance. When we
integrate DESSERT into ColBERT, a state-of-the-art semantic search model, we
find a 2-5x speedup on the MS MARCO and LoTTE retrieval benchmarks with minimal
loss in recall, underscoring the effectiveness and practical applicability of
our proposal.Comment: Code available, https://github.com/ThirdAIResearch/Desser
An efficient closed frequent itemset miner for the MOA stream mining system
Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.Postprint (published version
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Engineering an Open Web Syndication Interchange with Discovery and Recommender Capabilities
Web syndication has become a popular means of delivering relevant information to people online but the complexity of standards, algorithms and applications pose considerable challenges to engineers. This paper describes the design and development of a novel Web-based syndication intermediary called InterSynd and a simple Web client as a proof of concept. We developed format-neutral middleware that sits between content sources and the user. Additional objectives were to add feed discovery and recommendation components to the intermediary. A search-based feed discovery module helps users find relevant feed sources. Implicit collaborative recommendations of new feeds are also made to the user. The syndication software built uses open standard XML technologies and the free open source libraries. Extensibility and re-configurability were explicit goals. The experience shows that a modular architecture can combine open source modules to build state-of-the-art syndication middleware and applications. The data produced by software metrics indicate the high degree of modularity retained
A context -and template- based data compression approach to improve resource-constrained IoT systems interoperability.
170 p.El objetivo del Internet de las Cosas (the Internet of Things, IoT) es el de interconectar todo tipo de cosas, desde dispositivos simples, como una bombilla o un termostato, a elementos más complejos y abstractoscomo una máquina o una casa. Estos dispositivos o elementos varÃan enormemente entre sÃ, especialmente en las capacidades que poseen y el tipo de tecnologÃas que utilizan. Esta heterogeneidad produce una gran complejidad en los procesos integración en lo que a la interoperabilidad se refiere.Un enfoque común para abordar la interoperabilidad a nivel de representación de datos en sistemas IoT es el de estructurar los datos siguiendo un modelo de datos estándar, asà como formatos de datos basados en texto (e.g., XML). Sin embargo, el tipo de dispositivos que se utiliza normalmente en sistemas IoT tiene capacidades limitadas, asà como recursos de procesamiento y de comunicación escasos. Debido a estas limitaciones no es posible integrar formatos de datos basados en texto de manera sencilla y e1ciente en dispositivos y redes con recursos restringidos. En esta Tesis, presentamos una novedosa solución de compresión de datos para formatos de datos basados en texto, que está especialmente diseñada teniendo en cuenta las limitaciones de dispositivos y redes con recursos restringidos. Denominamos a esta solución Context- and Template-based Compression (CTC). CTC mejora la interoperabilidad a nivel de los datos de los sistemas IoT a la vez que requiere muy pocos recursos en cuanto a ancho de banda de las comunicaciones, tamaño de memoria y potencia de procesamiento
- …