The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the
position of a record within a sorted array, a Hash-Index as a model to map a
key to a position of a record within an unsorted array, and a BitMap-Index as a
model to indicate if a data record exists or not. In this exploratory research
paper, we start from this premise and posit that all existing index structures
can be replaced with other types of models, including deep-learning models,
which we term learned indexes. The key idea is that a model can learn the sort
order or structure of lookup keys and use this signal to effectively predict
the position or existence of records. We theoretically analyze under which
conditions learned indexes outperform traditional index structures and describe
the main challenges in designing learned index structures. Our initial results
show that by using neural nets we are able to outperform cache-optimized
B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over
several real-world data sets. More importantly though, we believe that the idea
of replacing core components of a data management system through learned models
has far-reaching implications for future systems designs and that this work
just provides a glimpse of what might be possible.
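The "index as a model" idea above can be illustrated with a minimal sketch. It is not the paper's neural-network design: it assumes a single least-squares line as the model, a non-empty list of numeric keys, and a hypothetical class name `LinearLearnedIndex`. The model predicts a position in the sorted array, and a binary search limited to the model's worst-case error corrects the prediction.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index (illustrative only): a fitted line predicts a key's
    position in a sorted array; a bounded search corrects the prediction."""

    def __init__(self, keys):
        # Assumption: keys is a non-empty iterable of numbers.
        self.keys = sorted(keys)
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(self.keys))
        # Least-squares fit of position as a function of key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # The worst prediction error over the stored keys bounds the search.
        self.max_err = max(abs(self._predict(k) - p)
                           for p, k in enumerate(self.keys))

    def _predict(self, key):
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return an index i with keys[i] == key, or None if key is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None
```

The closer the key distribution is to the fitted line, the smaller `max_err` becomes and the fewer comparisons a lookup needs, which is the intuition behind replacing tree traversal with a model evaluation.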
High Performance Computing using Infiniband-based clusters
The abstract is in the attachment.
Runtime protection of software programs against control- and data-oriented attacks
Software programs are everywhere and continue to create value for us at an incredible pace. But this comes at the cost of facing new risks as our well-being and the stability of societies become strongly dependent on their correctness. Even if the software loaded in the memory is considered legitimate or benign, this does not mean that the code will execute as expected at runtime. Software programs, particularly the ones developed in unsafe languages (e.g., C/C++), inevitably contain many memory bugs. Attackers exploiting these bugs can achieve malicious computations outside the original specification of the program by corrupting its control and data variables in the memory.
A potential solution to such runtime attacks must either ensure the integrity of those variables or check the validity of the values they hold. A complete version of the former method, which requires inspection of all memory accesses, can eliminate all the performance benefits of the language used. Alternatively, checking whether specific variables constitute a legitimate state is a non-trivial task that needs to handle state explosion and over-approximation issues. Regardless of the method preferred, most runtime protections are subject to common challenges. For example, as the scope of protection widens, such as the inclusion of data-oriented attacks (in addition to control-oriented attacks), performance costs inevitably increase as well. This is especially true for software-based methods, which also suffer from weaker security guarantees. In contrast, most hardware-based techniques promise better security and performance. However, they face substantial deployment challenges and offer no solution for devices that are already deployed.
In this thesis, we aim to tackle these research challenges by delivering multiple runtime protections in different settings. First, the thesis presents the design of a non-invasive hardware module that can enable attesting runtime correctness on critical embedded systems in real-time. Second, we address the performance burden of covering data-oriented attacks by suggesting a novel technique to distinguish critical variables from those that are unlikely to be attacked. The goal is a selective protection scheme with practical performance overheads that does not have to check all data variables or the corresponding memory accesses. Third, the thesis presents a software-based solution that promises hardware-level protection for critical variables. For this purpose, it leverages the CPU registers available in any architecture with extra help from cryptography. Lastly, we explore the use of runtime interactions with the operating system to identify malicious software executions.
Communication Architectures for Scalable GPU-centric Computing Systems
In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has led to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities.
Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently entirely handled by the CPU. This work addresses that issue and presents solutions to allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed.
First, important large-scale exascale applications are studied to better understand their communication behavior and requirements. Several metrics are presented, including time spent for communication, message sizes, and the length of queues that are required to match messages with receive requests. One aspect the analysis revealed is that messages are becoming smaller at scale, which renders the matching of messages and receive requests an important problem to address.
The next part analyzes how the GPU can directly access the network, presenting and benchmarking various communication models. It is shown that a flat address space of distributed GPU memories achieves higher bandwidth than put/get communication or CPU-controlled message passing, but less communication can be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy consumption.
The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, an algorithm for message matching is presented that yields performance similar to that of CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive amount of parallelism provided by the GPU’s architecture.
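To make the matching of messages with receive requests concrete, here is a minimal sequential sketch in the style of two-queue message matching (a posted-receive queue and an unexpected-message queue). It is not the GPU algorithm from the thesis; the class name `Matcher`, the `(source, tag)` envelope, and the `ANY` wildcard are assumptions made for illustration.

```python
from collections import deque

ANY = None  # hypothetical wildcard: matches any source or any tag


class Matcher:
    """Sequential two-queue message matching (illustrative sketch)."""

    def __init__(self):
        self.posted = deque()      # receive requests still waiting
        self.unexpected = deque()  # messages that arrived before a receive

    @staticmethod
    def _matches(recv, envelope):
        # A receive field matches if it is a wildcard or equal to the message's.
        return all(r is ANY or r == e for r, e in zip(recv, envelope))

    def post_receive(self, source=ANY, tag=ANY):
        """Return the payload of the oldest matching unexpected message,
        or queue the request and return None."""
        recv = (source, tag)
        for msg in self.unexpected:
            if self._matches(recv, msg[:2]):
                self.unexpected.remove(msg)
                return msg[2]
        self.posted.append(recv)
        return None

    def on_message(self, source, tag, payload):
        """Return the oldest matching posted receive, or store the message
        as unexpected and return None."""
        for recv in self.posted:
            if self._matches(recv, (source, tag)):
                self.posted.remove(recv)
                return recv
        self.unexpected.append((source, tag, payload))
        return None
```

Both operations scan a queue linearly, so their cost grows with queue length. This is why queue lengths are a relevant metric, and why relaxing ordering or wildcard rules can open up more parallelism.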
Performance Benchmarking of State-of-the-Art Software Switches for NFV
With the ultimate goal of replacing proprietary hardware appliances with
Virtual Network Functions (VNFs) implemented in software, Network Function
Virtualization (NFV) has been gaining popularity in the past few years.
Software switches route traffic between VNFs and physical Network Interface
Cards (NICs). It is of paramount importance to compare the performance of
different switch designs and architectures. In this paper, we propose a
methodology to compare fairly and comprehensively the performance of software
switches. We first explore the design spaces of seven state-of-the-art software
switches and then compare their performance under four representative test
scenarios. Each scenario corresponds to a specific case of routing NFV traffic
between NICs and/or VNFs. In our experiments, we evaluate the throughput and
latency between VNFs in two of the most popular virtualization environments,
namely virtual machines (VMs) and containers. Our experimental results show
that no single software switch prevails in all scenarios. It is, therefore,
crucial to choose the most suitable solution for the given use case. At the
same time, the presented results and analysis provide deeper insight into the
design tradeoffs and identify potential performance bottlenecks that could
inspire new designs.
Comment: 17 pages
Analysis and application of hash-based similarity estimation techniques for biological sequence analysis
In Bioinformatics, a large group of problems requires the computation or estimation of sequence similarity. However, the analysis of biological sequence data faces, among many others, three major challenges: the large amount of generated data, the technology-specific errors it contains (which can be mistaken for biological signals), and the possible need to analyze it without access to a reference genome. Through the use of locality sensitive hashing methods, both the efficient estimation of sequence similarity and tolerance against the errors specific to biological data can be achieved.
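As a rough illustration of hash-based similarity estimation, the following sketch uses MinHash, one well-known locality sensitive hashing scheme, to estimate the Jaccard similarity of two sequences' k-mer sets. The function names, the choice of salted BLAKE2b as the hash family, and the default parameters are assumptions for this example. Each sequence is assumed to be at least `k` characters long.

```python
import hashlib


def kmer_set(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # One member of a hash family, selected by using the seed as a salt.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=4, num_hashes=64):
    """For each hash function, keep the minimum hash value over all k-mers."""
    kmers = kmer_set(seq, k)
    return [min(_hash(km, seed) for km in kmers) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots estimates the Jaccard
    similarity of the underlying k-mer sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A single sequencing error changes only the k-mers that overlap it, so the signatures of a read and its error-free original still agree in most slots. That is the source of the error tolerance mentioned above.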
We developed a variant of the winnowing algorithm for local minimizer computation, which is specifically geared to deal with repetitive regions within biological sequences. By compressing redundant information, we can reduce both the size of the hash tables required to store minimizer sketches and the number of redundant low-quality alignment candidates.
Analyzing the distribution of segment lengths generated by this approach, we can better judge the size of required data structures, as well as identify hash functions feasible for this technique.
Our evaluation verified that simple and fast hash functions, even with small hash value spaces (hash functions with a small codomain), are sufficient to compute compressed minimizers and perform comparably to uniformly randomly chosen hash values. We also outlined an index for a taxonomic protein database using multiple compressed winnowings to identify alignment candidates. To store MinHash values, we present a cache-optimized implementation of a hash table using Hopscotch hashing to resolve collisions.
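For readers unfamiliar with minimizers, the sketch below shows classic winnowing, not the compressed variant developed in the thesis. In every window of `w` consecutive k-mers the k-mer with the smallest hash is selected, and a selection is recorded only when it changes. The function name, the CRC32 default hash, and the rightmost tie-breaking rule are assumptions for illustration.

```python
import zlib


def minimizers(seq, k, w, hash_fn=None):
    """Return (position, k-mer) pairs: the minimum-hash k-mer of each window
    of w consecutive k-mers, recorded once per run of identical picks."""
    if hash_fn is None:
        # Assumption: a simple, fast hash; any comparable value works.
        hash_fn = lambda kmer: zlib.crc32(kmer.encode())
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [hash_fn(km) for km in kmers]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest hash wins; ties go to the rightmost position.
        best = min(window, key=lambda i: (hashes[i], -i))
        if not picked or picked[-1][0] != best:
            picked.append((best, kmers[best]))
    return picked
```

Passing `hash_fn=lambda km: km` orders k-mers lexicographically, which makes small examples easy to follow by hand.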
As a biological application of similarity-based analysis, we describe the analysis of double digest restriction site associated DNA sequencing (ddRADseq). We implemented simulation software that models the biological and technological influences of this technology to allow better development and testing of ddRADseq analysis software. Using datasets generated by our software, as well as data obtained from population genetic experiments, we developed an analysis workflow for ddRADseq data, based on the Stacks software. Since the quality of results generated by Stacks strongly depends on how well the chosen parameters are adapted to the specific dataset, we developed a Snakemake workflow that automates preprocessing tasks while also allowing the automatic exploration of different parameter sets. As part of this workflow, we developed a PCR deduplication approach able to generate consensus reads incorporating the base quality values (as reported by the sequencing device), without performing an alignment first.
As an outlook, we outline a MinHashing approach that can be used for faster and more robust clustering, while addressing incomplete digestion and null alleles, two effects specific to ddRADseq that current analysis tools cannot reliably detect.
Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes---such as operating system or instruction set architecture (ISA) innovations---that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to---among other possible uses---drive the execution of a full-system, cross-ISA simulator with support for accelerators.
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration.
Large scale parallel state space search utilizing graphics processing units and solid state disks
The evolution of science is a two-track process composed of theoretical insights on
the one hand and practical inventions on the other. While in most cases new theoretical
insights motivate hardware developers to build systems that follow the theory,
in some cases new hardware solutions force theoretical research to predict the
results to be expected.
Progress in computer science relies on two aspects: processing information and storing
it. Improving one side without touching the other will evidently create new problems
without producing a real alternative solution. While decreasing
the time needed to solve a challenge may help with long-running problems, it will fail
to solve problems that require much storage. In contrast, increasing the available
amount of space for information storage will allow harder problems to be
solved, provided enough time is available.
This work studies two recent developments in hardware and utilizes them in the
domain of graph search: the trend to move information storage from magnetic
disks to electronic media, and the tendency to parallelize computation
to speed up information processing.
Storing information on rotating magnetic disks has been the standard approach for
some years and has reached a point where the storage capacity can be seen as
infinite, since new drives can be added instantly at low cost. However,
while the possible storage capacity increases every year, the transfer speed does
not. At the beginning of this work, solid state media appeared on the market, slowly
displacing hard disks in speed-demanding applications. Today, at the completion of this
work, solid state drives are replacing magnetic disks in mobile computing, and computing
centers use them as caching media to increase information retrieval speed.
The reason is their large advantage in random access, where the speed does not drop as
significantly as with magnetic drives.
While storing and retrieving huge amounts of information is one side of the coin,
the other is processing speed. Here the trend of increasing the clock frequency
of single processors stagnated in 2006, and manufacturers started to combine
multiple cores in one processor. While a CPU is a general-purpose processor, the
manufacturers of graphics processing units (GPUs) face the challenge of performing
the same computation for a large number of image points. Here, parallelization offers
huge advantages, so modern graphics cards have evolved into highly parallel computing
devices with several hundred cores. The challenge is to utilize these processors
in domains other than graphics processing.
One of the most widely used tasks in computer science is search. Searching a graph is the
crucial aspect not only in disciplines with an obvious search component but also in software testing.
Strategies that make it possible to examine larger graphs, be it by reducing the number of
considered nodes or by increasing the search speed, have to be developed to meet
the rising challenges. This work enhances search in multiple scientific domains,
such as explicit-state Model Checking, Action Planning, Game Solving and Probabilistic
Model Checking, proposing strategies to find solutions for the search problems.
Providing a universal search strategy that can be used in all environments to
utilize solid state media and graphics processing units is not possible due to the
heterogeneous aspects of the domains. Thus, this work presents a toolkit of strategies tied
together in a universal three-stage strategy. In the first stage, the edges leaving a node
are determined; in the second stage, the algorithm follows the edges to generate nodes.
The duplicate detection in stage three compares all newly generated nodes to existing
ones and avoids multiple expansions.
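The three-stage structure can be sketched as a layered breadth-first search. This is a plain sequential illustration, not the GPU or solid-state-disk strategies of the thesis; the function name and the callback parameters `actions`, `apply_action`, and `is_goal` are hypothetical.

```python
def three_stage_bfs(start, actions, apply_action, is_goal):
    """Layered BFS split into edge determination, successor generation and
    duplicate detection. Returns the goal depth, or None if unreachable."""
    visited = {start}
    frontier = [start]
    depth = 0
    while frontier:
        if any(is_goal(state) for state in frontier):
            return depth
        # Stage 1: determine the edges leaving each node of the layer.
        edges = [(state, a) for state in frontier for a in actions(state)]
        # Stage 2: follow the edges to generate successor nodes.
        generated = [apply_action(state, a) for state, a in edges]
        # Stage 3: duplicate detection against all previously seen nodes.
        frontier = []
        for state in generated:
            if state not in visited:
                visited.add(state)
                frontier.append(state)
        depth += 1
    return None
```

Because each stage processes a whole layer at once, stages 1 and 2 are natural candidates for data-parallel execution, while the `visited` set in stage 3 is the part that grows with the state space and may need external storage.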
For each stage at least two strategies are proposed, and decision hints are given to
simplify the selection of the proper strategy. After describing the strategies, the kit is
evaluated in four domains, explaining the choice of strategy, evaluating its outcome
and giving pointers for future work on the topic.
Towards Network-Accelerated Databases
Over the last few years, data processing systems have seen substantial changes, notably moving towards disaggregation of resources. This shift separates compute and storage resources into distinct servers for better resource utilization, as they can now be scaled independently based on demand. This development is crucial for cloud-native Database Management Systems (DBMS), which mainly build on such disaggregated structures.
This thesis examines two significant hardware trends in disaggregated architectures for DBMSs: modern networks and heterogeneous computing.
Modern network technologies such as Remote Direct Memory Access (RDMA) are critical for efficient, high-throughput, low-latency data transfer, but present challenges for achieving optimal performance for DBMSs. The reason for this is that RDMA comes with a low-level interface with a multitude of performance-critical aspects to consider. To address this challenge, this thesis introduces a high-level programming interface, the Data Flow Interface, specifically targeting the needs of data-intensive processing systems.
In addition, this thesis highlights the emerging trend toward programmable network devices that offer data processing capabilities in the network. This trend is especially interesting for distributed DBMSs, since they not only have to transfer large amounts of data over the network due to the disaggregated architecture, but also have to shuffle data between compute nodes for typical distributed data processing operations such as joins. In the thesis, in-network processing devices are evaluated with typical DBMS operations to investigate the benefits and potential shortcomings.
Another trend in the data center is the increasing heterogeneity of computing units such as GPUs and FPGAs due to their fast processing capabilities.
Incorporating these heterogeneous devices into disaggregated architectures with fast networks has many merits. The reason is that specialized compute units can be exposed as network-attached disaggregated accelerator pools and thus provide flexible and scalable high-performance data processing.
This integration of heterogeneous compute units and fast RDMA-capable networks is, however, non-trivial, since networks like RDMA are typically not directly supported for devices other than CPUs and are therefore difficult to integrate efficiently.
The challenge of how to achieve efficient communication between different types of compute devices is addressed by proposing a network-driven communication scheme that leverages a programmable switch to carry out the network communication on behalf of the compute devices.
Collaborative autonomy in heterogeneous multi-robot systems
As autonomous mobile robots become increasingly connected and widely deployed in different domains, managing multiple robots and their interaction is key to the future of ubiquitous autonomous systems. Indeed, robots are not individual entities anymore. Instead, many robots today are deployed as part of larger fleets or in teams. The benefits of multi-robot collaboration, especially in heterogeneous groups, are multiple. Significantly higher degrees of situational awareness and understanding of their environment can be achieved when robots with different operational capabilities are deployed together. Examples of this include the Perseverance rover and the Ingenuity helicopter that NASA has deployed on Mars, or the highly heterogeneous robot teams that explored caves and other complex environments during the last DARPA Sub-T competition.
This thesis delves into the wide topic of collaborative autonomy in multi-robot systems, encompassing some of the key elements required for achieving robust collaboration: solving collaborative decision-making problems; securing their operation, management and interaction; providing means for autonomous coordination in space and accurate global or relative state estimation; and achieving collaborative situational awareness through distributed perception and cooperative planning. The thesis covers novel formation control algorithms, and new ways to achieve accurate absolute or relative localization within multi-robot systems. It also explores the potential of distributed ledger technologies as an underlying framework to achieve collaborative decision-making in distributed robotic systems.
Throughout the thesis, I introduce novel approaches to utilizing cryptographic elements and blockchain technology for securing the operation of autonomous robots, showing that sensor data and mission instructions can be validated in an end-to-end manner. I then shift the focus to localization and coordination, studying ultra-wideband (UWB) radios and their potential. I show how UWB-based ranging and localization can enable aerial robots to operate in GNSS-denied environments, with a study of the constraints and limitations. I also study the potential of UWB-based relative localization between aerial and ground robots for more accurate positioning in areas where GNSS signals degrade. In terms of coordination, I introduce two new algorithms for formation control that require little to no communication, provided that sufficient awareness of neighboring robots is available. These algorithms are validated in simulation and real-world experiments. The thesis concludes with the integration of a new approach to cooperative path planning algorithms and UWB-based relative localization for dense scene reconstruction using lidar and vision sensors in ground and aerial robots.