Search CORE

7 research outputs found

Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

Author: Weber Lukas Max
Publication venue
Publication date: 19/10/2023
Field of study

Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While raising the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at highly current workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data or In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent works on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use-case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from developing novel architectures while also showing that Machine Learning can be applied to improve the applications of novel architectures

TUbiblio

tuprints

PQC-HA: A Framework for Prototyping and In-Hardware Evaluation of Post-Quantum Cryptography Hardware Accelerators

Author: Heinz Carsten
Koch Andreas
Sattel Richard
Spang Christoph
Publication venue
Publication date: 12/08/2023
Field of study

In the third round of the NIST Post-Quantum Cryptography standardization project, the focus is on optimizing software and hardware implementations of candidate schemes. The winning schemes are CRYSTALS Kyber and CRYSTALS Dilithium, which serve as a Key Encapsulation Mechanism (KEM) and Digital Signature Algorithm (DSA), respectively. This study utilizes the TaPaSCo open-source framework to create hardware building blocks for both schemes using High-level Synthesis (HLS) from minimally modified ANSI C software reference implementations across all security levels. Additionally, a generic TaPaSCo host runtime application is developed in Rust to verify their functionality through the standard NIST interface, utilizing the corresponding Known Answer Test mechanism on actual hardware. Building on this foundation, the communication overhead for TaPaSCo hardware accelerators on PCIe-connected FPGA devices is evaluated and compared with previous work and optimized AVX2 software reference implementations. The results demonstrate the feasibility of verifying and evaluating the performance of Post-Quantum Cryptography accelerators on real hardware using TaPaSCo. Furthermore, the off-chip accelerator communication overhead of the NIST standard interface is measured, which, on its own, outweighs the execution wall clock time of the optimized software reference implementation of Kyber at Security Level 1.Comment: 20 pages, 6 figures, Open Source Software availabl

arXiv.org e-Print Archive

A comparative survey of open-source application-class RISC-V processor implementations

Author: Albers Mark
Berekovic Mladen
Blochwitz Christopher
Dörflinger Alexander
Guan Yejun
Kleinbeck Benedikt
Klink Raphael
Michalik Harald
Nechi Anouar
Publication venue: ACM
Publication date: 01/01/2021
Field of study

The numerous emerging implementations of RISC-V processors and frameworks underline the success of this Instruction Set Architecture (ISA) specification. The free and open source character of many implementations facilitates their adoption in academic and commercial projects. As yet it is not easy to say which implementation fits best for a system with given requirements such as processing performance or power consumption. With varying backgrounds and histories, the developed RISC-V processors are very different from each other. Comparisons are difficult, because results are reported for arbitrary technologies and configuration settings. Scaling factors are used to draw comparisons, but this gives only rough estimates. In order to give more substantiated results, this paper compares the most prominent open-source application-class RISC-V projects by running identical benchmarks on identical platforms with defined configuration settings. The Rocket, BOOM, CVA6, and SHAKTI C-Class implementations are evaluated for processing performance, area and resource utilization, power consumption as well as efficiency. Results are presented for the Xilinx Virtex UltraScale+ family and GlobalFoundries 22FDX ASIC technology

Digitale Bibliothek Braunschweig

A comparative survey of open-source application-class RISC-V processor implementations

Author: Albers Mark
Berekovic Mladen
Blochwitz Christopher
Dörflinger Alexander
Guan Yejun
Kleinbeck Benedikt
Klink Raphael
Michalik Harald
Nechi Anouar
Publication venue: ACM
Publication date: 01/01/2021
Field of study

Revision notice: This version does not contain CVA6 SPEC CPU2017 scores. There is an updated version available with additional CVA6 SPEC CPU2017 scores: https://doi.org/10.24355/dbbs.084-202105101615-

Digitale Bibliothek Braunschweig

A Modular Platform for Adaptive Heterogeneous Many-Core Architectures

Author: Atef Ahmed Kamaleldin
Publication venue
Publication date: 18/12/2023
Field of study

Multi-/many-core heterogeneous architectures are shaping current and upcoming generations of compute-centric platforms which are widely used starting from mobile and wearable devices to high-performance cloud computing servers. Heterogeneous many-core architectures sought to achieve an order of magnitude higher energy efficiency as well as computing performance scaling by replacing homogeneous and power-hungry general-purpose processors with multiple heterogeneous compute units supporting multiple core types and domain-specific accelerators. Drifting from homogeneous architectures to complex heterogeneous systems is heavily adopted by chip designers and the silicon industry for more than a decade. Recent silicon chips are based on a heterogeneous SoC which combines a scalable number of heterogeneous processing units from different types (e.g. CPU, GPU, custom accelerator). This shifting in computing paradigm is associated with several system-level design challenges related to the integration and communication between a highly scalable number of heterogeneous compute units as well as SoC peripherals and storage units. Moreover, the increasing design complexities make the production of heterogeneous SoC chips a monopoly for only big market players due to the increasing development and design costs. Accordingly, recent initiatives towards agile hardware development open-source tools and microarchitecture aim to democratize silicon chip production for academic and commercial usage. Agile hardware development aims to reduce development costs by providing an ecosystem for open-source hardware microarchitectures and hardware design processes. Therefore, heterogeneous many-core development and customization will be relatively less complex and less time-consuming than conventional design process methods. In order to provide a modular and agile many-core development approach, this dissertation proposes a development platform for heterogeneous and self-adaptive many-core architectures consisting of a scalable number of heterogeneous tiles that maintain design regularity features while supporting heterogeneity. The proposed platform hides the integration complexities by supporting modular tile architectures for general-purpose processing cores supporting multi-instruction set architectures (multi-ISAs) and custom hardware accelerators. By leveraging field-programmable-gate-arrays (FPGAs), the self-adaptive feature of the many-core platform can be achieved by using dynamic and partial reconfiguration (DPR) techniques. This dissertation realizes the proposed modular and adaptive heterogeneous many-core platform through three main contributions. The first contribution proposes and realizes a many-core architecture for heterogeneous ISAs. It provides a modular and reusable tilebased architecture for several heterogeneous ISAs based on open-source RISC-V ISA. The modular tile-based architecture features a configurable number of processing cores with different RISC-V ISAs and different memory hierarchies. To increase the level of heterogeneity to support the integration of custom hardware accelerators, a novel hybrid memory/accelerator tile architecture is developed and realized as the second contribution. The hybrid tile is a modular and reusable tile that can be configured at run-time to operate as a scratchpad shared memory between compute tiles or as an accelerator tile hosting a local hardware accelerator logic. The hybrid tile is designed and implemented to be seamlessly integrated into the proposed tile-based platform. The third contribution deals with the self-adaptation features by providing a reconfiguration management approach to internally control the DPR process through processing cores (RISC-V based). The internal reconfiguration process relies on a novel DPR controller targeting FPGA design flow for RISC-V-based SoC to change the types and functionalities of compute tiles at run-time

Technische Universität Dresden: Qucosa

Data-intensive Systems on Modern Hardware : Leveraging Near-Data Processing to Counter the Growth of Data

Author: Vinçon Tobias
Publication venue
Publication date: 01/01/2022
Field of study

Over the last decades, a tremendous change toward using information technology in almost every daily routine of our lives can be perceived in our society, entailing an incredible growth of data collected day-by-day on Web, IoT, and AI applications. At the same time, magneto-mechanical HDDs are being replaced by semiconductor storage such as SSDs, equipped with modern Non-Volatile Memories, like Flash, which yield significantly faster access latencies and higher levels of parallelism. Likewise, the execution speed of processing units increased considerably as nowadays server architectures comprise up to multiple hundreds of independently working CPU cores along with a variety of specialized computing co-processors such as GPUs or FPGAs. However, the burden of moving the continuously growing data to the best fitting processing unit is inherently linked to today’s computer architecture that is based on the data-to-code paradigm. In the light of Amdahl's Law, this leads to the conclusion that even with today's powerful processing units, the speedup of systems is limited since the fraction of parallel work is largely I/O-bound. Therefore, throughout this cumulative dissertation, we investigate the paradigm shift toward code-to-data, formally known as Near-Data Processing (NDP), which relieves the contention on the I/O bus by offloading processing to intelligent computational storage devices, where the data is originally located. Firstly, we identified Native Storage Management as the essential foundation for NDP due to its direct control of physical storage management within the database. Upon this, the interface is extended to propagate address mapping information and to invoke NDP functionality on the storage device. As the former can become very large, we introduce Physical Page Pointers as one novel NDP abstraction for self-contained immutable database objects. Secondly, the on-device navigation and interpretation of data are elaborated. Therefore, we introduce cross-layer Parsers and Accessors as another NDP abstraction that can be executed on the heterogeneous processing capabilities of modern computational storage devices. Thereby, the compute placement and resource configuration per NDP request is identified as a major performance criteria. Our experimental evaluation shows an improvement in the execution durations of 1.4x to 2.7x compared to traditional systems. Moreover, we propose a framework for the automatic generation of Parsers and Accessors on FPGAs to ease their application in NDP. Thirdly, we investigate the interplay of NDP and modern workload characteristics like HTAP. Therefore, we present different offloading models and focus on an intervention-free execution. By propagating the Shared State with the latest modifications of the database to the computational storage device, it is able to process data with transactional guarantees. Thus, we achieve to extend the design space of HTAP with NDP by providing a solution that optimizes for performance isolation, data freshness, and the reduction of data transfers. In contrast to traditional systems, we experience no significant drop in performance when an OLAP query is invoked but a steady and 30% faster throughput. Lastly, in-situ result-set management and consumption as well as NDP pipelines are proposed to achieve flexibility in processing data on heterogeneous hardware. As those produce final and intermediary results, we continue investigating their management and identified that an on-device materialization comes at a low cost but enables novel consumption modes and reuse semantics. Thereby, we achieve significant performance improvements of up to 400x by reusing once materialized results multiple times

TUbiblio

Repositorium und Bibliografie der Hochschule Reutlingen

tuprints

Advances in ILP-based Modulo Scheduling for High-Level Synthesis

Author: Oppermann Julian
Publication venue
Publication date: 01/01/2019
Field of study

In today's heterogenous computing world, field-programmable gate arrays (FPGA) represent the energy-efficient alternative to generic processor cores and graphics accelerators. However, due to their radically different computing model, automatic design methods, such as high-level synthesis (HLS), are needed to harness their full power. HLS raises the abstraction level to behavioural descriptions of algorithms, thus freeing designers from dealing with tedious low-level concerns, and enabling a rapid exploration of different microarchitectures for the same input specification. In an HLS tool, scheduling is the most influential step for the performance of the generated accelerator. Specifically, modulo schedulers enable a pipelined execution, which is a key technique to speed up the computation by extracting more parallelism from the input description. In this thesis, we make a case for the use of integer linear programming (ILP) as a framework for modulo scheduling approaches. First, we argue that ILP-based modulo schedulers are practically usable in the HLS context. Secondly, we show that the ILP framework enables a novel approach for the automatic design of FPGA accelerators. We substantiate the first claim by proposing a new, flexible ILP formulation for the modulo scheduling problem, and evaluate it experimentally with a diverse set of realistic test instances. While solving an ILP may incur an exponential runtime in the worst case, we observe that simple countermeasures, such as setting a time limit, help to contain the practical impact of outlier instances. Furthermore, we present an algorithm to compress problems before the actual scheduling. An HLS-generated microarchitecture is comprised of operators, i.e. single-purpose functional units such as a floating-point multiplier. Usually, the allocation of operators is determined before scheduling, even though both problems are interdependent. To that end, we investigate an extension of the modulo scheduling problem that combines both concerns in a single model. Based on the extension, we present a novel multi-loop scheduling approach capable of finding the fastest microarchitecture that still fits on a given FPGA device - an optimisation problem that current commercial HLS tools cannot solve. This proves our second claim

TUbiblio

tuprints