An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors
Maximum Inner Product Search or top-k retrieval on sparse vectors is
well-understood in information retrieval, with a number of mature algorithms
that solve it exactly. However, all existing algorithms are tailored to text
and frequency-based similarity measures. To achieve optimal memory footprint
and query latency, they rely on the near stationarity of documents and on laws
governing natural languages. We consider, instead, a setup in which collections
are streaming -- necessitating dynamic indexing -- and where indexing and
retrieval must work with arbitrarily distributed real-valued vectors. As we
show, existing algorithms are no longer competitive in this setup, even against
naive solutions. We investigate this gap and present a novel approximate
solution, called Sinnamon, that can efficiently retrieve the top-k results for
sparse real-valued vectors drawn from arbitrary distributions. Notably,
Sinnamon offers levers to trade off memory consumption, latency, and accuracy,
making the algorithm suitable for constrained applications and systems. We give
theoretical results on the error introduced by the approximate nature of the
algorithm, and present an empirical evaluation of its performance on two
hardware platforms using synthetic and real-world datasets. We conclude by
laying out concrete directions for future research on this general top-k
retrieval problem over sparse vectors.
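To ground the problem, here is a minimal exact baseline (not Sinnamon itself, which is approximate): an inverted index over vector coordinates, with top-k retrieval by accumulating partial inner products. All names and the example data are illustrative.

```python
import heapq
from collections import defaultdict

def build_index(docs):
    """Inverted index: coordinate -> list of (doc_id, value) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):      # vec is {coordinate: value}
        for coord, val in vec.items():
            index[coord].append((doc_id, val))
    return index

def top_k(index, query, k):
    """Exact top-k by accumulating partial inner products per document."""
    scores = defaultdict(float)
    for coord, qval in query.items():
        for doc_id, dval in index.get(coord, ()):
            scores[doc_id] += qval * dval    # contribution of one coordinate
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

In the streaming setting described above, such an index must additionally support insertions and deletions at query time, which is precisely where the memory, latency and accuracy trade-offs of the abstract come into play.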
Towards an efficient OLAP engine based on linear algebra
Integrated master's dissertation in Computer Science
Relational database engines associated with the widely used Structured Query Language
(SQL) suffer from unsatisfactory performance in complex business queries, due
to ever-increasing volumes of stored data. To retrieve and process data more efficiently,
Online Analytical Processing (OLAP) models have been proposed, with an increased
focus on attributes (measures and dimensions) over records.
OLAP is based on a row-oriented theory, while a columnar-oriented theory could considerably
improve the performance of analytical systems. The Typed Linear Algebra (TLA)
approach is an example of such a theory: it encodes each database attribute in a distinct matrix.
These matrices are combined in a single Linear Algebra (LA) expression to obtain the
result of a query.
This dissertation combines concepts from relational databases, OLAP, TLA and performance
engineering to design, implement and validate an efficient TLA-DB engine. SQL queries are
converted into their equivalent LA expressions using Type Diagrams (TDs), which represent
each matrix as an arrow pointing from the number of columns to the number of rows. The
TDs are converted to an LA expression encoded in the Linear Algebra Query language (LAQ),
and the LAQ script of a query is automatically coded in C++.
An efficient TLA-DB engine required the encoding of the sparse matrices in an adequate
format, namely Compressed Sparse Column (CSC), while the operations specified in LAQ
expressions had their performance improved by optimised algorithms and an optimised
query processor.
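As a concrete illustration of the attribute-as-matrix encoding (a simplified sketch, not the TLA-DB implementation), an attribute column can be encoded as a 0/1 matrix in CSC form, so that a GROUP-BY aggregation becomes a matrix-vector product. The example data are invented.

```python
# Encode the attribute "city" of 5 records as a 0/1 matrix:
# entry (i, j) = 1 iff record j holds the i-th city value.
cities = ["A", "B", "A", "C", "B"]
sales  = [10, 20, 30, 40, 50]

domain = sorted(set(cities))                 # row labels: A, B, C
row_of = {v: i for i, v in enumerate(domain)}

# CSC layout: each column (record) has exactly one nonzero entry.
indptr  = list(range(len(cities) + 1))       # column j spans [j, j+1)
indices = [row_of[c] for c in cities]        # row index of the single 1
data    = [1] * len(cities)

# SELECT city, SUM(sales) GROUP BY city  ==  matrix-vector product.
totals = [0] * len(domain)
for j in range(len(cities)):                 # walk the CSC columns
    for p in range(indptr[j], indptr[j + 1]):
        totals[indices[p]] += data[p] * sales[j]
```

Because each record contributes exactly one nonzero per attribute, CSC stores the matrix in space linear in the number of records, which is what makes the format adequate here.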
The functionality of the resulting LAQ engine was validated with several TPC Benchmark
H (TPC-H) queries for various dataset sizes. A comparative evaluation of the TLA-DB with
two popular Database Management Systems (DBMSs), PostgreSQL and MySQL, showed
that the developed framework outperforms both DBMSs in most TPC-H queries.
The performance improvements of relational database management systems have not
kept pace with the growth in the volume of data they handle. To address the
consequent need for more efficient solutions, the OLAP theory was proposed. It
introduces the notions of measures and dimensions, storing pre-aggregations of
the measures based on the dimensions in order to speed up the process of data
analysis.
However, even though its rules are more restrictive, OLAP still rests on relational algebra.
Proposing a column-oriented theory may open the door to large performance
improvements in analytical queries. Typed linear algebra is a good example. Under
this theory, each attribute is converted into an independent matrix, and these matrices
are then combined through a linear algebra expression that defines the result
of the query.
This dissertation combines concepts from relational databases, OLAP, linear algebra,
type theory, and efficient computing to design, implement and validate a robust and
efficient OLAP engine. To this end, SQL queries are converted into the equivalent
linear algebra expression, using type diagrams that represent each matrix as
an arrow pointing from the number of columns to the number of rows of the matrix.
The resulting expression is then encoded in LAQ and automatically implemented in C++.
To guarantee the efficiency of the developed tool, all matrices were stored
in a suitable format, namely CSC. In turn, the operations specified in
LAQ were implemented using optimised algorithms.
The correctness of the implemented system was ensured by validating the results of
a set of queries extracted from TPC-H, executed over databases of multiple
sizes. Finally, a comparison with two conventional database systems
(PostgreSQL and MySQL) on the metrics of execution time and memory usage demonstrated
the greater efficiency of the developed tool in most queries.
This work was financed by the ERDF – European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project «POCI-01-0145-FEDER-006961», and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, as part of project «UID/EEA/50014/2013».
Graph Pattern Matching on Symmetric Multiprocessor Systems
Graph-structured data can be found in nearly every aspect of today's world, be it road networks, social networks or the internet itself.
From a processing perspective, finding comprehensive patterns in graph-structured data is a core processing primitive in a variety of applications, such as fraud detection, biological engineering or social graph analytics.
On the hardware side, multiprocessor systems, which consist of multiple processors in a single scale-up server, are the next important wave on top of multi-core systems.
In particular, symmetric multiprocessor systems (SMPs) are characterized by the fact that each processor has the same architecture, e.g. every processor is a multi-core processor, and all processors share a common and huge main memory space.
Moreover, large SMPs feature non-uniform memory access (NUMA), whose impact on the design of efficient data processing concepts should not be neglected.
The efficient usage of SMP systems, which continue to increase in size, is an interesting and ongoing research topic.
Current state-of-the-art architectural design principles provide different and partly disjoint suggestions on which data should be partitioned and/or how intra-process communication should be realized.
In this thesis, we propose a new synthesis of four of the most well-known principles, Shared Everything, Partition Serial Execution, Data Oriented Architecture and Delegation, to create the NORAD architecture, which stands for NUMA-aware DORA with Delegation.
We built our research prototype called NeMeSys on top of the NORAD architecture to fully exploit the provided hardware capacities of SMPs for graph pattern matching.
Being an in-memory engine, NeMeSys allows for online data ingestion as well as online query generation and processing through a terminal based user interface.
Storing a graph on a NUMA system inherently requires data partitioning to cope with the mentioned NUMA effect.
Hence, we need to dissect the graph into a disjoint set of partitions, which can then be stored on the individual memory domains.
This thesis analyzes the capabilities of the NORAD architecture to perform scalable graph pattern matching on SMP systems.
To increase the system's performance, we further develop, integrate and evaluate suitable optimization techniques.
That is, we investigate the influence of the inherent data partitioning, the interplay of messaging with and without sufficient locality information, and the actual partition placement across the NUMA sockets in the system.
To underline the applicability of our approach, we evaluate NeMeSys on synthetic datasets and perform an end-to-end evaluation of the whole system stack on the real-world knowledge graph of Wikidata.
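The partitioning step described above can be sketched minimally as follows (illustrative only, not NeMeSys's actual placement logic): edges are hash-partitioned by source vertex, so each vertex's adjacency data lives in exactly one memory domain, and the same hash doubles as the locality information needed to route messages.

```python
def partition_edges(edges, num_domains):
    """Assign each edge to the domain that owns its source vertex."""
    parts = [[] for _ in range(num_domains)]
    for src, dst in edges:
        parts[src % num_domains].append((src, dst))  # owner = hash of src
    return parts

def owner(vertex, num_domains):
    """Locality information: which domain to message for this vertex."""
    return vertex % num_domains
```

With such a scheme, a query worker that needs the neighbors of vertex v either reads them locally or delegates the lookup to `owner(v, ...)`, which is exactly the messaging/locality interplay evaluated in the thesis.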
Space-Efficient Data Structures for Information Retrieval
The amount of data that people and companies store has grown exponentially over the last few years. Storing this information alone is not enough, because in order to make it useful we need to be able to efficiently search inside it.
Furthermore, it is highly valuable to keep the historic data of each stored document, allowing access and search not only over the newest version, but also over the whole history of the documents.
Grammar-based compression has proven to be very effective for repetitive data, which is the case for versioned documents. In this thesis we present several results on representing textual information and searching in it. In particular, we present text indexes for grammar-based compressed text that support searching for a pattern and extracting substrings of the input text. These are the first
general indexes for grammar-based compressed text that support searching in sublinear time.
In order to build our indexes, we present new results on representing binary relations in a space-efficient manner, and construction algorithms that use little space to achieve their goal. These two results have a wide range of applications. In particular, the representations for binary relations can be used as a building block for several structures in computer science, such as graphs, inverted indexes, etc.
Finally, we present a new index, based on grammar-based compression, that solves the document listing problem. This problem deals with representing a collection of texts and searching for the documents that contain a given pattern. In spite of being similar to the classical text indexing problem, it has proven to be a challenge when we do not want to pay time proportional to the number of occurrences, but rather time proportional to the size of the result. Our proposal is designed particularly for versioned text, allowing the storage of a collection of documents with all their historic versions in little space. This is currently the smallest structure for such a purpose in practice.
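To illustrate the extraction primitive on grammar-compressed text (a toy sketch of the underlying idea, not the indexes proposed in the thesis): a straight-line grammar stores each nonterminal as a pair of symbols, and precomputed expansion lengths let extraction descend only into subtrees that overlap the requested range, never expanding the whole text.

```python
# Terminals are 1-char strings; nonterminals are ints mapped to a pair.
# This grammar encodes "abab":  0 -> "ab",  1 -> 0 0.
rules = {0: ("a", "b"), 1: (0, 0)}

def expansion_lengths(rules):
    """Length of the expansion of every nonterminal, memoized."""
    memo = {}
    def length(sym):
        if isinstance(sym, str):
            return 1
        if sym not in memo:
            left, right = rules[sym]
            memo[sym] = length(left) + length(right)
        return memo[sym]
    for nt in rules:
        length(nt)
    return memo

def extract(sym, i, j, rules, lens):
    """Characters [i, j) of sym's expansion, skipping disjoint subtrees."""
    if isinstance(sym, str):
        return sym if i <= 0 < j else ""
    left, right = rules[sym]
    llen = 1 if isinstance(left, str) else lens[left]
    out = ""
    if i < llen:                                  # range touches left child
        out += extract(left, i, min(j, llen), rules, lens)
    if j > llen:                                  # range touches right child
        out += extract(right, max(0, i - llen), j - llen, rules, lens)
    return out
```

Extraction time is proportional to the grammar depth plus the length of the extracted substring, which is the property the indexes above build on.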
Secure and Efficient Comparisons between Untrusted Parties
A vast number of online services are based on users contributing their personal information. Examples are manifold, including social networks, electronic commerce, sharing websites, lodging platforms, and genealogy. In all cases user privacy depends on a collective trust in all involved intermediaries, like service providers, operators, administrators or even help desk staff. A single adversarial party in the whole chain of trust voids user privacy. Even more, the number of intermediaries is ever growing. Thus, user privacy must be preserved at every time and stage, independent of the intrinsic goals of any involved party. Furthermore, next to these new services, traditional offline analytic systems are being replaced by online services run in large data centers. Centralized processing of electronic medical records, genomic data or other health-related information is anticipated due to advances in medical research, better analytic results based on large amounts of medical information, and lowered costs. In these scenarios privacy is of utmost concern due to the large amount of personal information contained within the centralized data.
We focus on the challenge of privacy-preserving processing of genomic data, specifically comparing genomic sequences. The problem that arises is how to efficiently compare private sequences of two parties while preserving the confidentiality of the compared data. It follows that the privacy of the data owner must be preserved, which means that as little information as possible must be leaked to any party participating in the comparison. Leakage can happen at several points during a comparison. The secured inputs for the comparing party might leak some information about the original input, or the output might leak information about the inputs. In the latter case, results of several comparisons can be combined to infer information about the confidential input of the party under observation. Genomic sequences serve as a use-case, but the proposed solutions are more general and can be applied to the generic field of privacy-preserving comparison of sequences. The solution should be efficient, such that performing a comparison yields runtimes linear in the length of the input sequences, thus producing acceptable costs for a typical use-case. To tackle the problem of efficient, privacy-preserving sequence comparisons, we propose a framework consisting of three main parts.
a) The basic protocol presents an efficient sequence comparison algorithm, which transforms a sequence into a set representation, allowing distance measures over input sequences to be approximated by distance measures over sets. The sets are then represented by an efficient data structure, the Bloom filter, which allows evaluation of certain set operations without storing the actual elements of the possibly large set. This representation yields low distortion for comparing similar sequences. Operations upon the set representation are carried out using efficient, partially homomorphic cryptographic systems for data confidentiality of the inputs. The output can be adjusted to either return the actual approximated distance or the result of an in-range check of the approximated distance.
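A plaintext sketch of the set-representation step in part (a), with the encryption layer omitted (all parameters and names are illustrative): sequences are shingled into q-grams, each q-gram set is summarized in a Bloom filter, and shared set bits serve as a crude proxy for set intersection size, and hence for sequence similarity.

```python
import hashlib

def qgrams(seq, q=3):
    """Set representation of a sequence: its overlapping q-grams."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}

def bloom(items, m=256, k=4):
    """m-bit Bloom filter with k hash functions, stored as an int bitmask."""
    bits = 0
    for item in items:
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(h[:4], "big") % m)
    return bits

def approx_overlap(a, b):
    """Shared set bits of two filters: a proxy for |A ∩ B|."""
    return bin(a & b).count("1")
```

The bitwise AND used here is one of the set operations that, in the actual protocol, is evaluated under partially homomorphic encryption rather than in the clear.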
b) Building upon this efficient basic protocol, we introduce the first mechanism to reduce the success of inference attacks by detecting and rejecting similar queries in a privacy-preserving way. This is achieved by generating generalized commitments for inputs. This generalization is done by treating inputs as messages received from a noisy channel, upon which error-correction from coding theory is applied. This way, similar inputs are defined as inputs whose generalizations have a Hamming distance below a certain predefined threshold. We present a protocol to perform a zero-knowledge proof to assess whether the generalized input is indeed a generalization of the actual input. Furthermore, we generalize a very efficient inference attack on privacy-preserving sequence comparison protocols and use it to evaluate our inference-control mechanism.
c) The third part of the framework lightens the computational load of the client taking part in the comparison protocol by presenting a compression mechanism for partially homomorphic cryptographic schemes. It reduces the transmission and storage overhead induced by the semantically secure homomorphic encryption schemes, as well as encryption latency. The compression is achieved by constructing an asymmetric stream cipher such that the generated ciphertext can be converted into a ciphertext of an associated homomorphic encryption scheme without revealing any information about the plaintext. This is the first compression scheme available for partially homomorphic encryption schemes. Compression of ciphertexts of fully homomorphic encryption schemes is several orders of magnitude slower at the conversion from the transmission ciphertext to the homomorphically encrypted ciphertext. Indeed, our compression scheme achieves optimal conversion performance. It further allows keystreams to be generated offline and thus supports offloading to trusted devices. This way transmission, storage and power efficiency are improved.
We give security proofs for all relevant parts of the proposed protocols and algorithms to evaluate their security. A performance evaluation of the core components demonstrates the practicability of our proposed solutions including a theoretical analysis and practical experiments to show the accuracy as well as efficiency of approximations and probabilistic algorithms. Several variations and configurations to detect similar inputs are studied during an in-depth discussion of the inference-control mechanism. A human mitochondrial genome database is used for the practical evaluation to compare genomic sequences and detect similar inputs as described by the use-case.
In summary, we show that it is indeed possible to construct an efficient and privacy-preserving comparison of (genomic) sequences, while being able to control the amount of information that leaves the comparison. To the best of our knowledge, we also contribute to the field by proposing the first efficient privacy-preserving inference detection and control mechanism, as well as the first ciphertext compression system for partially homomorphic cryptographic systems.
Integration of search theories and evidential analysis to Web-wide Discovery of information for decision support
The main contribution of this research is that it addresses the issues associated with traditional information gathering and presents a novel semantic approach to Web-based discovery of previously unknown intelligence for effective decision making. It provides a comprehensive theoretical background to the proposed solution together with a demonstration of the effectiveness of the method from the results of the experiments, showing how the quality of collected information can be significantly enhanced by previously unknown information derived from the available known facts.
The quality of decisions made in business and government relates directly to the quality of the information used to formulate the decision. This information may be retrieved from an organisation's knowledge base (Intranet) or from the World Wide Web. The purpose of this thesis is to investigate the specifics of information gathering from these sources. It has studied a number of search techniques that rely on statistical and semantic analysis of unstructured information, and identified benefits and limitations of these techniques. It was concluded that enterprise search technologies can efficiently manipulate Intranet-held information, but require complex processing of large amounts of textual information, which is not feasible or scalable when applied to the Web.
Based upon the search methods investigations, this thesis introduces a new semantic Web-based search method that automates the correlation of topic-related content for discovery of hitherto unknown information from disparate and widely diverse Web-sources. This method is in contrast to traditional search methods that are constrained to specific or narrowly defined topics. It addresses the three key aspects of the information: semantic closeness to search topic, information completeness, and quality. The method is based on algorithms from Natural Language Processing combined with techniques adapted from grounded theory and Dempster-Shafer theory to significantly enhance the discovery of topic related Web-sourced intelligence.
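For the evidential-analysis side, the core operation of Dempster-Shafer theory mentioned above, Dempster's rule of combination, can be sketched as follows: mass functions assign belief mass to focal sets, pairwise set intersections pool the masses, and conflicting (empty-intersection) mass is normalized away. The example masses below are invented, not taken from the thesis.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: combine two mass functions keyed by frozenset."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:                       # compatible evidence: pool the mass
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:                           # contradictory evidence
            conflict += wa * wb
    norm = 1.0 - conflict               # renormalize over non-empty sets
    return {s: w / norm for s, w in combined.items()}
```

Combining two sources that both lean toward hypothesis "A" concentrates mass on "A", which is how independent Web-sourced evidence about the same topic can be fused into a single belief assignment.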
This thesis also describes the development of the new search solution by showing the integration of the mathematical methods used as well as the development of the working model. Real-world experiments demonstrate the effectiveness of the model with supporting performance analysis, showing that the quality of the extracted content is significantly enhanced compared to traditional Web-search approaches.
30th International Conference on Information Modelling and Knowledge Bases
Information modelling is becoming a more and more important topic for researchers, designers, and users of information systems. The amount and complexity of information itself, the number of abstraction levels of information, and the size of databases and knowledge bases are continuously growing. Conceptual modelling is one of the sub-areas of information modelling. The aim of this conference is to bring together experts from different areas of computer science and other disciplines who have a common interest in understanding and solving problems of information modelling and knowledge bases, as well as in applying the results of research to practice. We also aim to recognize and study new areas of modelling and knowledge bases to which more attention should be paid. Therefore philosophy and logic, cognitive science, knowledge management, linguistics and management science are relevant areas, too. The conference will feature three categories of presentations: full papers, short papers and position papers.