733 research outputs found

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces

    FPGA Implementation of a Resource Efficient Vedic Multiplier using SPST Adders

    Get PDF
    Nowadays, the requirement for very high-speed operations in processors constantly increases. Multiplication is a crucial operation in high power-consuming processes such as image and signal processing. The main characteristics of a multiplier are good accuracy, speed, reduction in area, and little power consumption. Speed plays a major role in multiplication operations, and an increase in speed can be obtained by reducing the number of steps involved in the computation process. Since a multiplier has the largest delay among the basic blocks in a digital system, the critical path is determined by it. Furthermore, the multiplier consumes more area and dissipates more power. Hence, designing multipliers that offer high speed, lower power consumption, less area, or a combination of them is of prime concern. Hence, an attempt is made in this paper to achieve the above design metrics using a Spurious Power Suppression Technique (SPST) adder. A resource-efficient SPST-based Vedic multiplier is developed and implemented using Artix 7 FPGA and is finally compared with the ripple carry adder-based Vedic multiplier

    One Size Fits All—Venomics of the Iberian Adder (Vipera seoanei, Lataste 1878) reveals low levels of venom variation across its distributional range

    Get PDF
    European vipers (genus Vipera) are medically important snakes displaying considerable venom variation, occurring at different levels in this group. The presence of intraspecific venom variation, however, remains understudied in several Vipera species. Vipera seoanei is a venomous snake endemic to the northern Iberian Peninsula and south-western France, presenting notable phenotypic variation and inhabiting several diverse habitats across its range. We analysed the venoms of 49 adult specimens of V. seoanei from 20 localities across the species’ Iberian distribution. We used a pool of all individual venoms to generate a V. seoanei venom reference proteome, produced SDS-PAGE profiles of all venom samples, and visualised patterns of variation using NMDS. By applying linear regression, we then assessed presence and nature of venom variation between localities, and investigated the effect of 14 predictors (biological, eco-geographic, genetic) on its occurrence. The venom comprised at least 12 different toxin families, of which five (i.e., PLA2, svSP, DI, snaclec, svMP) accounted for about 75% of the whole proteome. The comparative analyses of the SDS-PAGE venom profiles showed them to be remarkably similar across the sampled localities, suggesting low geographic variability. The regression analyses suggested significant effects of biological and habitat predictors on the little variation we detected across the analysed V. seoanei venoms. Other factors were also significantly associated with the presence/absence of individual bands in the SDS-PAGE profiles. The low levels of venom variability we detected within V. seoanei might be the result of a recent population expansion, or of processes other than directional positive selection

    Heterogeneous Acceleration for 5G New Radio Channel Modelling Using FPGAs and GPUs

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications

    Full text link
    The challenging deployment of compute-intensive applications from domains such Artificial Intelligence (AI) and Digital Signal Processing (DSP), forces the community of computing systems to explore new design approaches. Approximate Computing appears as an emerging solution, allowing to tune the quality of results in the design of a system in order to improve the energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the art approximation techniques from all layers of the traditional computing stack. In Part II of our survey, we classify and present the technical details of application-specific and architectural approximation techniques, which both target the design of resource-efficient processors/accelerators & systems. Moreover, we present a detailed analysis of the application spectrum of Approximate Computing and discuss open challenges and future directions.Comment: Under Review at ACM Computing Survey

    Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

    Get PDF
    Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs, to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute-time. Suboptimal candidate proposals, combined with their serial measurement in an isolated target-device lead to prolonged optimisation time and reduced resource availability, ultimately reducing cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about the serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve cost-efficiency of autotuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    LightRW: FPGA Accelerated Graph Dynamic Random Walks

    Full text link
    Graph dynamic random walks (GDRWs) have recently emerged as a powerful paradigm for graph analytics and learning applications, including graph embedding and graph neural networks. Despite the fact that many existing studies optimize the performance of GDRWs on multi-core CPUs, massive random memory accesses and costly synchronizations cause severe resource underutilization, and the processing of GDRWs is usually the key performance bottleneck in many graph applications. This paper studies an alternative architecture, FPGA, to address these issues in GDRWs, as FPGA has the ability of hardware customization so that we are able to explore fine-grained pipeline execution and specialized memory access optimizations. Specifically, we propose {LightRW}, a novel FPGA-based accelerator for GDRWs. LightRW embraces a series of optimizations to enable fine-grained pipeline execution on the chip and to exploit the massive parallelism of FPGA while significantly reducing memory accesses. As current commonly used sampling methods in GDRWs do not efficiently support fine-grained pipeline execution, we develop a parallelized reservoir sampling method to sample multiple vertices per cycle for efficient pipeline execution. To address the random memory access issues, we propose a degree-aware configurable caching method that buffers hot vertices on-chip to alleviate random memory accesses and a dynamic burst access engine that efficiently retrieves neighbors. Experimental results show that our optimization techniques are able to improve the performance of GDRWs on FPGA significantly. Moreover, LightRW delivers up to 9.55x and 9.10x speedup over the state-of-the-art CPU-based MetaPath and Node2vec random walks, respectively. This work is open-sourced on GitHub at https://github.com/Xtra-Computing/LightRW.Comment: Accepted to SIGMOD 202
    corecore