1,104 research outputs found

    The Case for Learned Index Structures

    Full text link
    Indexes are models: a B-Tree index can be seen as a model that maps a key to the position of a record within a sorted array, a hash index as a model that maps a key to the position of a record within an unsorted array, and a bitmap index as a model that indicates whether a data record exists. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze the conditions under which learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future system designs, and that this work provides just a glimpse of what might be possible.
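
    As a rough illustration of the idea above, the sketch below (not the paper's recursive-model index) fits a simple linear model to the cumulative distribution of sorted keys, predicts a record's position, and falls back to a binary search within the model's worst-case error bound; the class and all names are illustrative.

```python
import bisect

class LearnedIndex:
    """Minimal learned index sketch: a linear model approximates the CDF
    of sorted keys to predict a record's position; a bounded local search
    corrects the model's error. Illustrative only, not the paper's RMI."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p)
                  for i, k in enumerate(sorted_keys))
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst-case prediction error to bound the search window.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        # Binary search only within the error-bounded window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None

idx = LearnedIndex(sorted(range(0, 1000, 3)))
print(idx.lookup(999))   # position of key 999
print(idx.lookup(1000))  # None: key absent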

    More Analysis of Double Hashing for Balanced Allocations

    Full text link
    With double hashing, for a key x, one generates two hash values f(x) and g(x), and then uses the combinations (f(x) + i·g(x)) mod n for i = 0, 1, 2, ... to generate multiple hash values in the range [0, n−1] from the initial two. For balanced allocations, keys are hashed into a hash table where each bucket can hold multiple keys, and each key is placed in the least loaded of d choices. It has been shown previously, using fluid limit methods, that the asymptotic performance of double hashing and fully random hashing is the same in the balanced allocation paradigm. Here we extend a coupling argument used by Lueker and Molodowitch, which showed that double hashing and ideal uniform hashing are asymptotically equivalent for open-address hash tables, to the balanced allocation setting, providing further insight into this phenomenon. We also discuss the potential for, and the bottlenecks limiting, the use of this approach for other multiple-choice hashing schemes. Comment: 13 pages; current draft; will be submitted to a conference shortly.
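
    A minimal sketch of the scheme described above, assuming SHA-256-based stand-ins for f and g: each key derives its d candidate buckets via (f(x) + i·g(x)) mod n and is placed in the least loaded one.

```python
import hashlib

def h(seed, key, n):
    """Derive a hash value in [0, n-1] from a seeded SHA-256 digest."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

def double_hash_choices(key, n, d):
    """The d candidate buckets (f(x) + i*g(x)) mod n for i = 0..d-1."""
    f, g = h("f", key, n), h("g", key, n)
    if g == 0:  # keep the probe sequence from collapsing to one bucket
        g = 1
    return [(f + i * g) % n for i in range(d)]

def insert(table, key, n, d):
    """Balanced allocation: place the key in the least loaded of d choices."""
    choices = double_hash_choices(key, n, d)
    bucket = min(choices, key=lambda b: len(table[b]))
    table[bucket].append(key)

n, d = 16, 3
table = [[] for _ in range(n)]
for k in range(100):
    insert(table, k, n, d)
print("max load:", max(len(b) for b in table))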

    Private membership test protocol with low communication complexity

    Get PDF
    Ramezanian S, Meskanen T, Naderpour M, Junnila V, Niemi V. Private membership test protocol with low communication complexity. Digital Communications and Networks. 2019 May 13.
    We introduce a practical method to perform private membership tests. In this method, clients are able to test whether an item is in a set controlled by the server, without revealing their query item to the server. After the queries are executed, the content of the server's set remains secret. One use case for a private membership test is to check whether a file contains malware by checking its signature against a database of malware samples in a privacy-preserving way. We apply the Bloom filter and the Cuckoo filter in the membership test procedure. To achieve the privacy properties, we present a novel protocol based on homomorphic encryption schemes, in which we rearrange the data in the set into N-dimensional hypercubes. We have implemented our method in a realistic scenario where a client of an anti-malware company wants to privately check whether the hash value of a given file is in the company's malware database. The evaluation shows that our method is feasible for real-world applications. We have also tested the performance of our protocol for databases of different sizes and for data structures of different dimensions: 2-dimensional, 3-dimensional, and 4-dimensional hypercubes. We present formulas to estimate the computation and communication costs of our protocol. Peer reviewed.
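
    The protocol itself layers homomorphic encryption over the filter, which is beyond a short sketch; the snippet below shows only the plain (non-private) Bloom-filter membership test that underlies it, with hypothetical parameters and sample hash values.

```python
import hashlib
import math

class BloomFilter:
    """Plain Bloom filter: k hash functions set/test bits in an m-bit array.
    A 'no' answer is definite; a 'yes' may be a false positive."""

    def __init__(self, capacity, fp_rate=0.01):
        # Standard sizing formulas for a target false-positive rate.
        self.m = math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Server-side malware digest set (hypothetical sample hashes).
bf = BloomFilter(capacity=1000)
bf.add("9f86d081884c7d65")
print("9f86d081884c7d65" in bf)  # True
print("deadbeefdeadbeef" in bf)  # False (with high probability)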

    Discreet - Pub/Sub for Edge Systems

    Get PDF
    The number of devices connected to the Internet has been growing exponentially over the last few years. Today, the amount of information available to users has reached a point that makes it impossible to consume it all, showing that we need better ways to filter what kind of information is sent our way. At the same time, while users are online and access all this information, their actions are also being collected, scrutinized, and commercialized with little regard for privacy. This thesis addresses those issues in the context of a decentralized Publish/Subscribe solution for edge systems. Working at the edge of the Internet aims to prevent centralized control by a single entity and lessen the chance of abuse. Our goal was to devise a solution that achieves efficient message delivery, with good load-balancing properties, without revealing its participants' subscription interests, in order to preserve user privacy. Our solution uses cryptography and probabilistic data structures to obfuscate event topics and user subscriptions. We modeled a cooperative solution, where publisher and subscriber nodes work in concert to route events among themselves by leveraging a one-hop structured overlay. Through an experimental evaluation, we attest to the scalability and general performance of the proposed algorithms, including latency, false negative and false positive rates, and other useful metrics.
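
    As a rough sketch of how a probabilistic data structure can obfuscate subscriptions (the thesis's cryptographic layer and overlay routing are omitted, and all names here are hypothetical), a subscriber can share only a Bloom-filter bit pattern of its topics; peers route events against that pattern, and its false positives are precisely what hide the true interests.

```python
import hashlib

M, K = 512, 4  # filter size in bits and number of hash functions

def positions(topic):
    """Bit positions a topic maps to; only these, not the topic, are shared."""
    for i in range(K):
        d = hashlib.sha256(f"{i}:{topic}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % M

def subscribe(filter_bits, topic):
    """A subscriber records a topic locally; peers see only the bit pattern."""
    for p in positions(topic):
        filter_bits[p // 8] |= 1 << (p % 8)

def maybe_interested(filter_bits, topic):
    """Routing test a peer runs on an event's topic against the shared
    filter: 'no' is definite, 'yes' may be a false positive, which is
    what obfuscates the subscriber's true interests."""
    return all(filter_bits[p // 8] & (1 << (p % 8)) for p in positions(topic))

sub = bytearray(M // 8)
subscribe(sub, "sensors/temperature")
print(maybe_interested(sub, "sensors/temperature"))  # True
print(maybe_interested(sub, "sensors/humidity"))     # False (w.h.p.)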

    Face Recognition using the LCS algorithm

    Get PDF
    Today, identifying humans based on physical characteristics is a necessity in various fields. As a biometric system, a facial recognition system is fundamentally a pattern recognition system that identifies a person based on specific physiological or behavioral feature vectors. The feature vector is typically stored in a database upon extraction. The main objective of this research is to study and assess the effect of selecting the proper image attributes using the Cuckoo search algorithm. Given the large dimensionality of the feature vector, selecting an optimal subset of features is essential to expedite the facial recognition algorithm. Initially, using the existing database, image characteristics are extracted and a binary optimal subset of facial features is selected with the Cuckoo search algorithm. This subset of optimal features is then evaluated with nearest-neighbor and neural-network classifiers. The resulting classification accuracy shows that the proposed method is more accurate than previous facial recognition methods, owing to its selection of significant features.
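
    A toy sketch of binary Cuckoo-search feature selection in the spirit described above; the fitness function stands in for classifier accuracy, bit flips stand in for Levy flights, and all parameters are illustrative rather than taken from the paper.

```python
import random

N_FEATURES, N_NESTS, ITERATIONS, PA = 20, 15, 200, 0.25
RELEVANT = set(random.sample(range(N_FEATURES), 6))  # stand-in ground truth

def fitness(mask):
    """Stand-in for classifier accuracy (e.g., nearest-neighbor on the
    selected features): reward covering relevant features, penalize size."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in RELEVANT)
    return hits - 0.1 * sum(mask)

def flip_some(mask, rate=0.1):
    """Binary stand-in for a Levy flight: flip each bit with prob `rate`."""
    return [b ^ (random.random() < rate) for b in mask]

nests = [[random.randint(0, 1) for _ in range(N_FEATURES)]
         for _ in range(N_NESTS)]
for _ in range(ITERATIONS):
    # A cuckoo lays a new solution; it replaces a random nest if better.
    new = flip_some(random.choice(nests))
    j = random.randrange(N_NESTS)
    if fitness(new) > fitness(nests[j]):
        nests[j] = new
    # Abandon the worst fraction PA of nests and rebuild them randomly.
    nests.sort(key=fitness)
    for j in range(int(PA * N_NESTS)):
        nests[j] = [random.randint(0, 1) for _ in range(N_FEATURES)]

best = max(nests, key=fitness)
print("selected features:", [i for i, b in enumerate(best) if b])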

    Insertion Time of Random Walk Cuckoo Hashing below the Peeling Threshold

    Get PDF

    Similarity Digest Search: A Survey and Comparative Analysis of Strategies to Perform Known File Filtering Using Approximate Matching

    Get PDF
    Digital forensics is a branch of computer science that investigates and analyzes electronic devices in search of crime evidence. There are several ways to perform this search. Known File Filtering (KFF) is one of them: a list of objects of interest is used to reduce or separate the data under analysis. Holding a database of hashes of such objects, the examiner performs lookups for matches against the target device. However, due to a limitation of cryptographic hash functions (their inability to detect similar objects), new methods, called approximate matching, have been designed. This sort of function has interesting characteristics for KFF investigations, but suffers mainly from high costs when dealing with huge data sets, as the search is usually done by brute force. To mitigate this problem, strategies have been developed to perform lookups more efficiently. In this paper, we present the state of the art of similarity digest search strategies, along with a detailed comparison involving several aspects, such as time complexity, memory requirements, and search precision. Our results show that none of the approaches addresses all of these aspects. Finally, we discuss future directions and present requirements for a new strategy that aims to overcome the current limitations.
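
    To make the cost concrete, the sketch below shows the brute-force similarity digest search that these strategies try to avoid, using a toy n-gram digest and Jaccard similarity in place of real schemes such as ssdeep or sdhash; every lookup costs one comparison per database entry.

```python
import hashlib

def digest(data, ngram=7):
    """Toy similarity digest: the set of hashed n-grams of the input.
    Real schemes (ssdeep, sdhash, TLSH) are far more elaborate."""
    return {hashlib.sha256(data[i:i + ngram]).digest()[:4]
            for i in range(max(1, len(data) - ngram + 1))}

def similarity(d1, d2):
    """Jaccard similarity of two digests, in [0, 1]."""
    return len(d1 & d2) / len(d1 | d2) if d1 | d2 else 0.0

def brute_force_kff(target_digest, known_digests, threshold=0.5):
    """The brute-force lookup the surveyed strategies try to avoid:
    compare the target against every digest in the database, O(n)."""
    return [name for name, d in known_digests.items()
            if similarity(target_digest, d) >= threshold]

db = {"malware_a": digest(b"malicious payload version one"),
      "malware_b": digest(b"a completely different sample")}
target = digest(b"malicious payload version two")
print(brute_force_kff(target, db))  # likely ['malware_a']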