
    Database System Acceleration on FPGAs

    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. 
    Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.

    Table of Contents:
    I Database Systems & FPGAs
    1 INTRODUCTION
      1.1 Databases & the Importance of Performance
      1.2 Accelerators & FPGAs
      1.3 Requirements
      1.4 Outline & Summary of Contributions
    2 BACKGROUND ON DATABASE SYSTEMS
      2.1 Databases
        2.1.1 Storage Model
        2.1.2 Storage Medium
      2.2 Database Operators
        2.2.1 Projection
        2.2.2 Filter
        2.2.3 Sort
        2.2.4 Aggregation
        2.2.5 Join
        2.2.6 Operator Classification
      2.3 Database Queries
      2.4 Impact of Acceleration
    3 BACKGROUND ON FPGAS
      3.1 FPGA
        3.1.1 Logic Element
        3.1.2 Block RAM (BRAM)
        3.1.3 Digital Signal Processor (DSP)
        3.1.4 IO Element
        3.1.5 Programmable Interconnect
      3.2 FPGA Design Flow
        3.2.1 Specifications
        3.2.2 RTL Description
        3.2.3 Verification
        3.2.4 Synthesis, Mapping, Placement, and Routing
        3.2.5 Timing Analysis
        3.2.6 Bitstream Generation and FPGA Programming
      3.3 Implementation Quality Metrics
      3.4 FPGA Cards
      3.5 Benefits of Using FPGAs
      3.6 Challenges of Using FPGAs
    4 RELATED WORK
      4.1 Summary of Related Work
      4.2 Platform Type
        4.2.1 Accelerator Card
        4.2.2 Coprocessor
        4.2.3 Smart Storage
        4.2.4 Network Processor
      4.3 Implementation
        4.3.1 Loop-Based Implementation
        4.3.2 Sort-Based Implementation
        4.3.3 Hash-Based Implementation
        4.3.4 Mixed Implementation
      4.4 A Note on Quantitative Performance Comparisons
    II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
    5 OBJECTIVES AND ARCHITECTURE OVERVIEW
      5.1 From Requirements to Objectives
      5.2 Architecture Overview
      5.3 Outline of Part II
    6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
      6.1 Programming FPGAs
      6.2 Related Work
      6.3 Architecture
        6.3.1 Global Architecture
        6.3.2 Sorter Architecture
        6.3.3 Merger Architecture
        6.3.4 Scalability and Resource Adaptability
      6.4 Experiments
        6.4.1 OpenCL Sort-Merge Implementation
        6.4.2 RTL Sorters
        6.4.3 RTL Mergers
        6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
      6.5 Summary & Discussion
    7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
      7.1 The Case for Resource Efficiency
      7.2 Related Work
      7.3 Architecture
        7.3.1 Sorters
        7.3.2 Sort-Network
        7.3.3 X:Y Mergers
        7.3.4 Merge-Network
        7.3.5 Join Materialiser (JoinMat)
      7.4 Experiments
        7.4.1 Experimental Setup
        7.4.2 Implementation Description & Tuning
        7.4.3 Sort Benchmarks
        7.4.4 Aggregation Benchmarks
        7.4.5 Join Benchmarks
      7.5 Summary
    8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
      8.1 The Scope of Database System Accelerators
      8.2 Related Work
      8.3 Key-Reduce Radix Sort (KeRRaS)
        8.3.1 Time Complexity
        8.3.2 Space Complexity (Memory Utilization)
        8.3.3 Discussion and Optimizations
      8.4 Architecture
        8.4.1 MSM
        8.4.2 MSMK: Extending MSM with KeRRaS
        8.4.3 Payload, Aggregation and Join Processing
        8.4.4 Limitations
      8.5 Experiments
        8.5.1 Experimental Setup
        8.5.2 Datasets
        8.5.3 MSMK vs. MSM
        8.5.4 Payload-Less Benchmarks
        8.5.5 Payload-Based Benchmarks
        8.5.6 Flexibility
      8.6 Summary
    9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
      9.1 Early Aggregation
      9.2 Background & Related Work
        9.2.1 Sort-Based Early Aggregation
        9.2.2 Cache-Based Early Aggregation
      9.3 Simulations
        9.3.1 Datasets
        9.3.2 Metrics
        9.3.3 Sort-Based Versus Cache-Based Early Aggregation
        9.3.4 Comparison of Set-Associative Caches
        9.3.5 Comparison of Cache Structures
        9.3.6 Comparison of Replacement Policies
        9.3.7 Cache Selection Methodology
      9.4 Cache System Architecture
        9.4.1 Window Aggregator
        9.4.2 Compressor & Hasher
        9.4.3 Collision Detector
        9.4.4 Collision Resolver
        9.4.5 Cache
      9.5 Experiments
        9.5.1 Experimental Setup
        9.5.2 Resource Utilization and Parameter Tuning
        9.5.3 Datasets
        9.5.4 Benchmarks on Synthetic Data
        9.5.5 Benchmarks on Real Data
      9.6 Summary
    10 THE FULL PICTURE
      10.1 System Architecture
      10.2 Benchmarks
      10.3 Meeting the Objectives
    III Conclusion
    11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
      11.1 Summary
      11.2 Future Work
    BIBLIOGRAPHY
    LIST OF FIGURES
    LIST OF TABLES
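
    The multi-way merging optimization mentioned in the abstract is easy to illustrate in software: merging k sorted runs at once reduces the number of merge passes over r initial runs from roughly ⌈log₂ r⌉ to ⌈log_k r⌉. The following minimal sketch (our own C++ illustration, not the CbMSMK hardware pipeline, which would realize this as a merge tree of compare-and-swap units) merges k runs in a single pass using a min-heap.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge k sorted runs in one pass. In a two-way merge sort, r runs need
// ceil(log2(r)) passes; a k-way merge needs only ceil(log_k(r)).
std::vector<uint32_t> kWayMerge(const std::vector<std::vector<uint32_t>>& runs) {
    using Entry = std::tuple<uint32_t, size_t, size_t>;  // (value, run, position)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, size_t{0});

    std::vector<uint32_t> out;
    while (!heap.empty()) {
        auto [v, r, i] = heap.top();     // smallest head among all runs
        heap.pop();
        out.push_back(v);
        if (i + 1 < runs[r].size()) heap.emplace(runs[r][i + 1], r, i + 1);
    }
    return out;
}
```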

    Dimension reduction of image and audio space

    The reduction of data necessary for storage or transmission is a desirable goal in the digital video and audio domain. Compression schemes strive to reduce the amount of storage space or bandwidth necessary to keep or move the data. Data reduction can be accomplished by removing or recoding visually or audibly unnecessary data, thus aiding the compression phase of the data processing. The characterization and identification of data that can be successfully removed or reduced is the purpose of this work. New philosophy, theory, and methods for data processing are presented towards the goal of data reduction. The philosophy and theory developed in this work establish a foundation for high-speed data reduction suitable for multimedia applications. The developed methods encompass motion detection and edge detection as features of the systems. The philosophy of energy flow analysis in video processing enables the consideration of noise in digital video data. Research into noise versus motion leads to an efficient and successful method of identifying motion in a sequence. The research of the underlying statistical properties of vector quantization provides insight into the performance characteristics of vector quantization and leads to successful improvements in application. The underlying statistical properties of the vector quantization process are analyzed, and three theorems are developed and proved. The theorems establish the statistical distributions and probability densities of various metrics of the vector quantization process. From these properties, an intelligent and efficient algorithm design is developed and tested. The performance improvements in both time and quality are established through algorithm analysis and empirical testing. The empirical results are presented.
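
    As a concrete reference for the vector quantization process analyzed above, the following sketch shows the basic encoding step: each input vector is mapped to the index of its nearest codeword, so only the index needs to be stored or transmitted. This is a generic illustration assuming squared-Euclidean distortion, not the improved algorithm developed in the work.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Return the index of the codeword nearest to v under squared Euclidean
// distance. Codebook entries are assumed to have the same dimension as v.
size_t quantize(const std::vector<float>& v,
                const std::vector<std::vector<float>>& codebook) {
    size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (size_t c = 0; c < codebook.size(); ++c) {
        float d = 0.0f;
        for (size_t i = 0; i < v.size(); ++i) {
            float diff = v[i] - codebook[c][i];
            d += diff * diff;                    // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;  // index stored/transmitted instead of the full vector
}
```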

    Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

    This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.
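
    As a point of reference for the string sorting problem, the sketch below shows a plain sequential MSD (most-significant-digit-first) radix sort, one of the classic building blocks behind scalable string sorters. It is our own illustration, not the dissertation's parallel shared-memory implementation.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sort strings by distributing them into 256 buckets on the byte at `depth`,
// then recursing on each bucket with depth + 1.
void msdSort(std::vector<std::string>& a, size_t depth = 0) {
    if (a.size() < 2) return;
    std::vector<std::string> done;                      // strings ending at `depth`
    std::vector<std::vector<std::string>> bucket(256);  // one bucket per byte value
    for (auto& s : a) {
        if (depth >= s.size()) done.push_back(std::move(s));
        else bucket[static_cast<unsigned char>(s[depth])].push_back(std::move(s));
    }
    a = std::move(done);                                // shorter strings sort first
    for (auto& b : bucket) {
        msdSort(b, depth + 1);                          // recurse on next character
        for (auto& s : b) a.push_back(std::move(s));
    }
}
```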

    Algorithm Engineering for fundamental Sorting and Graph Problems

    Fundamental algorithms form the basic knowledge expected of every computer science undergraduate or professional programmer: a set of core techniques one can find in any (good) textbook on algorithms and data structures. In this thesis we try to close the gap between theoretically worst-case optimal classical algorithms and the real-world circumstances one faces under the constraints imposed by data size, limited main memory, or available parallelism.

    A Survey on Array Storage, Query Languages, and Systems

    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete, though; we greatly appreciate pointers towards any work we might have forgotten to mention.
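
    To make the chunking discussion concrete, the sketch below shows the addressing arithmetic that regular chunking implies: a cell's multidimensional index splits into a chunk coordinate (which chunk to fetch) and an offset within that chunk. The names and the assumption of aligned, regular chunks are ours for illustration.

```cpp
#include <cstddef>
#include <vector>

struct ChunkAddress {
    std::vector<size_t> chunk;   // which chunk along each dimension
    std::vector<size_t> offset;  // position inside that chunk
};

// Map a cell's index to its chunk coordinate and in-chunk offset, assuming
// the array is partitioned into equal-sized, axis-aligned chunks.
ChunkAddress locate(const std::vector<size_t>& cell,
                    const std::vector<size_t>& chunkShape) {
    ChunkAddress addr;
    for (size_t d = 0; d < cell.size(); ++d) {
        addr.chunk.push_back(cell[d] / chunkShape[d]);
        addr.offset.push_back(cell[d] % chunkShape[d]);
    }
    return addr;
}
```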

    Doctor of Philosophy

    Dataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs, such as an SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring task-parallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline, along with their APIs being considerably low-level, it remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes support for platforms ranging from multicore processors to clusters consisting of thousands of CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules, along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions. We have implemented these techniques in HyperFlow, which consists of an API with all basic dataflow constructs described in the dissertation, and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU, and cluster-based platforms.
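
    As a toy illustration of the dynamic dispatch idea behind Virtual Processing Elements, the sketch below drains a shared task queue with one worker thread per simulated element. The real HyperFlow scheduler is far richer (heterogeneous kernels, streaming, pipeline coherency); the class and method names here are hypothetical.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class MiniScheduler {
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void submit(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push(std::move(t));
    }
    // Spawn one thread per "processing element"; each pulls tasks until the
    // queue is drained (this simple version does not accept late submissions).
    void runOn(unsigned elements) {
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < elements; ++i)
            pool.emplace_back([this] {
                for (;;) {
                    std::function<void()> t;
                    {
                        std::lock_guard<std::mutex> g(m_);
                        if (tasks_.empty()) return;
                        t = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    t();  // execute the module's task on this element
                }
            });
        for (auto& th : pool) th.join();
    }
};
```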

    Time of flight simulation and reconstruction in Hybrid MR-PET Systems

    Integrated master's thesis, Biomedical Engineering and Biophysics (Clinical Engineering and Medical Instrumentation), Universidade de Lisboa, Faculdade de Ciências, 2018. In traditional PET, coincidence electronics are used to determine the line of response along which an annihilation has occurred. With time-of-flight (TOF), the approximate position of the annihilation along the line of response is calculated by measuring the difference between the arrival times of the photons in the detectors. In the literature, TOF images show (in general) a lower level of noise and better resolution compared to non-TOF images. The lower noise and amplified sensitivity of TOF reconstruction could favour a better use of the full resolution potential of PET scanners. The first part of this thesis focuses on the possibility of using faster simulation methods; more specifically, the possibility of replacing the time-consuming GATE simulations with a script (from Paola Solevi of Otto-von-Guericke-Universität Magdeburg) was studied. The results show that the values obtained in the simulations with the Hoffman Brain Phantom are very similar between the two methods, showing the viability of this script with this phantom. Then, the same procedure was performed using a Voxelized Brain Phantom. This time, the values obtained with the two methods were very different. Therefore, it is important to determine whether the discrepancy originates from some kind of problem with the phantom used or from the script itself. The second part of this thesis focuses on the development of reconstruction procedures for simulations done with the GE Signa PET-MR scanner. The method includes the simulation of three phantoms (an off-center cylinder and the Hoffman Brain Phantom to reconstruct, and a large cylinder for normalisation), a coordinate algorithm developed in MATLAB that calculates the correct sinogram coordinates from the GATE coordinate output, and a method that, from an uncorrected sinogram, obtains an arc-corrected sinogram that can be used in reconstructions. The results show that the reconstructions were successful, without any artifacts. Reconstructions done without each of the corrections show artifacts in both phantoms. These results show the importance of applying the corrections before reconstructing the data.
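
    The TOF principle described above reduces to a one-line computation: the annihilation point lies at an offset Δx = c·Δt/2 from the midpoint of the line of response, where Δt is the measured arrival-time difference between the two photons. The sketch below evaluates this for an example timing difference; detector timing resolution is ignored.

```cpp
#include <cstdio>

int main() {
    const double c = 299792458.0;  // speed of light, m/s
    double dt = 400e-12;           // example arrival-time difference: 400 ps
    double offset = c * dt / 2.0;  // metres from the midpoint of the LOR
    std::printf("offset = %.1f mm\n", offset * 1e3);  // ~60.0 mm
}
```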

    Switching techniques for broadband ISDN

    The properties of switching techniques suitable for use in broadband networks have been investigated. Methods for evaluating the performance of such switches have been reviewed. A notation has been introduced to describe a class of binary self-routing networks. Hence a technique has been developed for determining the nature of the equivalence between two networks drawn from this class. The necessary and sufficient condition for two packets not to collide in a binary self-routing network has been obtained. This has been used to prove the non-blocking property of the Batcher-banyan switch. A condition for a three-stage network with channel grouping and link speed-up to be non-blocking has been obtained, of which previous conditions are special cases. A new three-stage switch architecture has been proposed, based upon a novel cell-level algorithm for path allocation in the intermediate stage of the switch. The algorithm is suited to hardware implementation using parallelism to achieve a very short execution time. An array of processors is required to implement the algorithm. The processor has been shown to be of simple design. It must be initialised with a count representing the number of cells requesting a given output module. A fast method has been described for performing the request counting using a non-blocking binary self-routing network. Hardware is also required to forward routing tags from the processors to the appropriate data cells, when they have been allocated a path through the intermediate stage. A method of distributing these routing tags by means of a non-blocking copy network has been presented. The performance of the new path allocation algorithm has been determined by simulation. The rate of cell loss can increase substantially in a three-stage switch when the output modules are non-uniformly loaded. It has been shown that the appropriate use of channel grouping in the intermediate stage of the switch can reduce the effect of non-uniform loading on performance.

    Hardware acceleration of the trace transform for vision applications

    Computer Vision is a rapidly developing field in which machines process visual data to extract meaningful information. Digitised images in their pixels and bits serve no purpose of their own. It is only by interpreting the data, and extracting higher level information, that a scene can be understood. The algorithms that enable this process are often complex and data-intensive, limiting the processing rate when implemented in software. Hardware-accelerated implementations provide a significant performance boost that can enable real-time processing. The Trace transform is a newly proposed algorithm that has been proven effective in image categorisation and recognition tasks. It is flexibly defined, allowing the mathematical details to be tailored to the target application. However, it is highly computationally intensive, which limits its applications. Modern heterogeneous FPGAs provide an ideal platform for accelerating the Trace transform for real-time performance, while also allowing an element of flexibility, which highly suits the generality of the Trace transform. This thesis details the implementation of an extensible Trace transform architecture for vision applications, before extending this architecture to a fully flexible platform suited to the exploration of Trace transform applications. As part of the work presented, a general set of architectures for large-windowed median and weighted median filters is presented, as required for a number of Trace transform implementations. Finally, an acceleration of Pseudo 2-Dimensional Hidden Markov Model decoding, usable in a person detection system, is presented. Such a system can be used to extract frames of interest from a video sequence, to be subsequently processed by the Trace transform. All these architectures emphasise the need for considered, platform-driven design in achieving maximum performance through hardware acceleration.
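
    As a software reference for the large-windowed median filters mentioned above, the sketch below computes each output sample as the median of a window centred on the corresponding input sample. A hardware implementation would use sorting networks instead; this naive version simply selects the median per window.

```cpp
#include <algorithm>
#include <vector>

// 1-D median filter with window [i - radius, i + radius], clamped at the
// signal boundaries. Each output is the median of its window.
std::vector<int> medianFilter(const std::vector<int>& x, int radius) {
    std::vector<int> y(x.size());
    for (int i = 0; i < static_cast<int>(x.size()); ++i) {
        int lo = std::max(0, i - radius);
        int hi = std::min(static_cast<int>(x.size()) - 1, i + radius);
        std::vector<int> w(x.begin() + lo, x.begin() + hi + 1);
        std::nth_element(w.begin(), w.begin() + w.size() / 2, w.end());
        y[i] = w[w.size() / 2];  // median of the current window
    }
    return y;
}
```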

    An e-science infrastructure for collecting, sharing, retrieving, and analyzing heterogeneous scientific data

    The process of collecting, sharing, retrieving, and analyzing data is common in many areas of scientific work. While each field has its own workflows and best practices, the general process can be aided by an e-Science infrastructure. The contribution of this thesis is to support the workflow of scientists, which can be split into four parts: In the first part, we introduce xBook, a framework which aids the creation of database applications to collect, back up, and share data. In the second part, we describe the synchronization mechanism, a vital part of the xBook framework that, with the use of timestamps, allows data to be entered offline. The data can then be shared with coworkers for analyses or further processing. It can also be used as a backup system to avoid data loss. Third, we present an architecture allowing data from distributed data sources to be retrieved without a central managing instance. This is achieved with the use of minimal search parameters which are guaranteed to exist in all connected data sources. This architecture is based on the concept of mediators, but gives data owners full control over their data sources, as opposed to the traditional mediator where the connected data sources are managed by a central administrator. Fourth, we describe an embeddable analysis tool which can be integrated into a base application where the data is gathered. With the aid of simple modules, called "Workers", this tool empowers domain experts to easily create analyses particularly designed for their area of work in a familiar working environment. Additionally, we present another tool which allows the graphical display of temporal and spatial information of archaeological excavations. This tool uses an interactive Harris Matrix to order findings temporally and allows the comparison with their spatial location.
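
    A minimal sketch of the timestamp-based synchronization idea described in the second part, assuming a simple last-write-wins policy: offline edits carry a timestamp, and on synchronization the newer version of each record is kept. xBook's actual synchronization is more involved; all names here are ours.

```cpp
#include <map>
#include <string>

struct Record {
    std::string value;
    long long timestamp;  // e.g. milliseconds since epoch, set on edit
};

// Merge offline edits into the shared store, keeping the newest version of
// each record (last-write-wins, keyed by record id).
void synchronize(std::map<int, Record>& store,
                 const std::map<int, Record>& offlineEdits) {
    for (const auto& [id, edit] : offlineEdits) {
        auto it = store.find(id);
        if (it == store.end() || edit.timestamp > it->second.timestamp)
            store[id] = edit;  // new record, or newer edit wins
    }
}
```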