120 research outputs found

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    Improving GPU Shared Memory Access Efficiency

    Get PDF
    Graphics Processing Units (GPUs) often employ shared memory to provide efficient storage for threads within a computational block. This shared memory is divided into multiple banks to improve performance by enabling concurrent accesses across the banks. Conflicts occur when multiple memory accesses attempt to access a particular bank simultaneously, resulting in serialized access and a corresponding performance reduction. Identifying and eliminating these bank conflicts is therefore critical for achieving high performance on GPUs; however, for common 1D and 2D access patterns, understanding the potential bank conflicts can prove difficult. Current GPUs support memory bank accesses with configurable bit-widths; optimizing these bit-widths can yield data layouts with fewer conflicts and better performance. This dissertation presents a framework for bank conflict analysis and automatic optimization. Given static access pattern information for a kernel, the tool analyzes the number of conflicts for each pattern and then searches for an optimized solution for all shared memory buffers. This data layout solution is parameterized by inter-padding, intra-padding, and the bank access bit-width. The experimental results show that static bank conflict analysis is a practical solution, independent of the workload size of a given access pattern. For 13 kernels from 6 benchmark suites (including RODINIA and the NVIDIA CUDA SDK) that exhibit shared memory bank conflicts, tests indicate this approach gains a 5%-35% improvement in runtime.
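The static analysis described above can be sketched in a few lines. This is an illustrative model, not the dissertation's tool: it assumes 32 banks of one word each (common on NVIDIA GPUs) and a single padding parameter standing in for the inter-/intra-padding search; the function name is hypothetical.

```python
# Sketch: count shared-memory bank conflicts for a strided 1D access
# pattern across one warp. The bank count (32) follows common NVIDIA
# GPUs; `pad` inserts extra words per 32-word row, mimicking the
# padding parameters the framework searches over.

def max_bank_conflicts(stride, num_threads=32, num_banks=32, pad=0):
    """Return the worst-case serialization factor: the largest number
    of threads in the warp that map to the same bank."""
    counts = {}
    for t in range(num_threads):
        idx = t * stride                  # logical word index
        idx += (idx // num_banks) * pad   # padded physical word index
        bank = idx % num_banks
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())

# A stride of 32 words maps every thread to bank 0 (32-way conflict);
# one word of padding per row makes the same pattern conflict-free.
print(max_bank_conflicts(32, pad=0))  # 32
print(max_bank_conflicts(32, pad=1))  # 1
```

Because the analysis only depends on the access pattern and layout parameters, not on how many times the kernel executes it, it is independent of workload size, matching the abstract's claim.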

    Greedy Coordinate Descent CMP Multi-Level Cache Resizing

    Get PDF
    Hardware designers constantly look for ways to squeeze waste out of architectures to achieve better power efficiency. Cache resizing is a technique that can remove wasteful power consumption in caches. The idea is to determine the minimum cache capacity a program needs to run at near-peak performance, and then reconfigure the cache to implement this efficient capacity. While there has been significant previous work on cache resizing, existing techniques have focused on controlling resizing for a single level of cache only. This sacrifices significant opportunities for power savings in modern CPU hierarchies, which routinely employ 3 levels of cache. Moreover, as CMP scaling will likely continue for the foreseeable future, eliminating wasteful power consumption from a CMP multi-level cache hierarchy is crucial to achieving better power efficiency. In this dissertation, we propose a novel technique, greedy coordinate descent CMP multi-level cache resizing, that minimizes power consumption while maintaining high performance. We simultaneously resize all caches in a modern CMP cache hierarchy to minimize power consumption. Specifically, our approach predicts the power consumption and the performance level without direct evaluations. We also develop a greedy coordinate descent method that searches for an optimal cache configuration using the power efficiency gain (PEG) metric proposed in this dissertation. This dissertation makes three contributions to CMP multi-level cache resizing. First, we discover the limits of power savings and performance. This limit study identifies the potential power savings in a CMP multi-level cache hierarchy when wasteful power consumption is eliminated. Second, we propose a prediction-based greedy coordinate descent (GCD) method to find an optimal cache configuration and to orchestrate the cache levels. Third, we implement online GCD techniques for CMP multi-level cache resizing.
Our approach exhibits 13.9% power savings and achieves 91% of the power savings of the static oracle cache hierarchy configuration.
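The greedy coordinate descent search can be sketched as follows. The power and performance models below are toy stand-ins for the dissertation's predictors, and the PEG formula (power saved per unit of performance lost) is a plausible reading of the metric, not the authors' exact definition; `gcd_resize` is a hypothetical name.

```python
# Sketch of greedy coordinate descent over a multi-level cache
# hierarchy: each coordinate is one cache level's size (here, a way
# count), and each step shrinks the level with the best power
# efficiency gain while a performance floor is respected.

def gcd_resize(levels, power, perf, perf_floor):
    """levels: current way-counts per cache level. Shrink one level
    per iteration, chosen by power-efficiency gain, until no further
    shrink keeps performance at or above perf_floor."""
    config = list(levels)
    while True:
        best = None
        for i in range(len(config)):
            if config[i] <= 1:
                continue
            trial = list(config)
            trial[i] -= 1  # candidate move: shrink level i by one way
            if perf(trial) < perf_floor:
                continue
            peg = (power(config) - power(trial)) / max(
                perf(config) - perf(trial), 1e-9)
            if best is None or peg > best[0]:
                best = (peg, trial)
        if best is None:
            return config  # no legal move left: local optimum
        config = best[1]

# Toy models: power grows with total capacity, performance saturates.
power = lambda c: sum(c)
perf = lambda c: min(1.0, 0.7 + 0.01 * sum(c))
print(gcd_resize([8, 8, 16], power, perf, perf_floor=0.95))
```

Because each iteration evaluates only one candidate move per cache level, the search scales linearly with the number of levels rather than exploring the full configuration space.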

    Parallel machine architecture and compiler design facilities

    Get PDF
    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of the Delta project, whose objective is to provide a facility for rapid prototyping of parallelizing compilers that can target different machine architectures, is summarized. Included are surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.

    Automatic synthesis and optimization of floating point hardware.

    Get PDF
    Ho Chun Hok. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 74-78). Abstracts in English and Chinese. Contents: Chapter 1, Introduction; Chapter 2, Background and Literature Review; Chapter 3, Floating Point Arithmetic; Chapter 4, FLY - Hardware Compiler; Chapter 5, Float - Floating Point Design Environment; Chapter 6, Function Approximation using Lookup Table; Chapter 7, Results; Chapter 8, Conclusion.

    Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators

    Full text link
    To employ a Convolutional Neural Network (CNN) in an energy-constrained embedded system, it is critical for the CNN implementation to be highly energy-efficient. Many recent studies propose CNN accelerator architectures with custom computation units that try to improve the energy efficiency and performance of CNNs by minimizing data transfers from DRAM-based main memory. However, in these architectures, DRAM is still responsible for half of the overall energy consumption of the system, on average. A key contributor to the high energy consumption of DRAM is the refresh overhead, which is estimated to consume 40% of the total DRAM energy. In this paper, we propose a new mechanism, Refresh Triggered Computation (RTC), that exploits the memory access patterns of CNN applications to reduce the number of refresh operations. We propose three RTC designs (min-RTC, mid-RTC, and full-RTC), each of which requires a different level of aggressiveness in terms of customization to the DRAM subsystem. All of our designs have small overhead. Even the most aggressive RTC design (i.e., full-RTC) imposes an area overhead of only 0.18% in a 16 Gb DRAM chip and can have less overhead for denser chips. Our experimental evaluation on six well-known CNNs shows that RTC reduces average DRAM energy consumption by 24.4% and 61.3% for the least aggressive and the most aggressive RTC implementations, respectively. Beyond CNNs, we also evaluate our RTC mechanism on three workloads from other domains. We show that RTC saves 31.9% and 16.9% DRAM energy for Face Recognition and a Bayesian Confidence Propagation Neural Network (BCPNN), respectively. We believe RTC can be applied to other applications whose memory access patterns remain predictable for a sufficiently long time.
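The core intuition can be illustrated with a small model. This is a sketch of the idea only, with made-up timing values and a hypothetical function name; the actual RTC designs operate at the DRAM-command level, not in software.

```python
# Sketch of the Refresh Triggered Computation intuition: a DRAM row
# that the application itself reads (and can therefore restore) within
# the retention time needs no separate refresh command. Times and the
# retention value below are illustrative, not real DRAM parameters.

def refreshes_needed(access_times, retention, horizon):
    """Count refresh operations for one row over [0, horizon), given
    the times at which the application reads the row. Each access or
    refresh restores the row for `retention` time units."""
    refreshes = 0
    last_restore = 0.0
    for t in sorted(a for a in access_times if a < horizon):
        # refresh only when the gap to the next access exceeds retention
        while t - last_restore >= retention:
            last_restore += retention
            refreshes += 1
        last_restore = t  # the access itself restores the row
    while horizon - last_restore >= retention:
        last_restore += retention
        refreshes += 1
    return refreshes

# A row read every 40 time units with retention 64 needs no refreshes;
# an idle row over the same horizon still needs periodic refreshes.
print(refreshes_needed([40 * i for i in range(1, 25)], 64, 1000))  # 0
print(refreshes_needed([], 64, 1000))  # 15
```

This is why predictable access patterns matter: the savings depend on knowing, ahead of time, that each row will be touched again before its retention deadline.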

    Extempore: The design, implementation and application of a cyber-physical programming language

    Get PDF
    There is a long history of experimental and exploratory programming supported by systems that expose interaction through a programming language interface. These live programming systems enable software developers to create, extend, and modify the behaviour of executing software by changing source code without perceptual breaks for recompilation. Live programming systems have taken many forms, but have generally been limited in their ability to express low-level programming concepts and to generate efficient native machine code. These shortcomings have limited the effectiveness of live programming in domains that require highly efficient numerical processing and explicit memory management. The most general questions addressed by this thesis are what a systems language designed for live programming might look like, and how such a language might influence the development of live programming in performance-sensitive domains requiring real-time support, direct hardware control, or high-performance computing. This thesis answers these questions by exploring the design, implementation, and application of Extempore, a new systems programming language designed specifically for live interactive programming.

    Development of FPGA based Standalone Tunable Fuzzy Logic Controllers

    Get PDF
    Soft computing techniques differ from conventional (hard) computing in that they are tolerant of imprecision, uncertainty, partial truth, and approximation. In effect, the role model for soft computing is the human mind and its ability to address day-to-day problems. The principal constituents of Soft Computing (SC) are Fuzzy Logic (FL), Evolutionary Computation (EC), Machine Learning (ML), and Artificial Neural Networks (ANNs). This thesis presents a generic hardware architecture for type-1 and type-2 standalone tunable Fuzzy Logic Controllers (FLCs) on a Field Programmable Gate Array (FPGA). The designed FLC system can be remotely configured or tuned according to expert knowledge and deployed in different applications to replace traditional Proportional Integral Derivative (PID) controllers. This re-configurability is a feature added over existing FLCs in the literature. The FLC parameters needed for tuning are mainly the input range, output range, number of inputs, number of outputs, the membership function parameters such as slope and center points, and an If-Else rule base for the fuzzy inference process. Online tuning enables users to change these FLC parameters in real time and eliminates repeated hardware programming whenever a change is needed. Realizing these systems in real time is difficult because the computational complexity increases exponentially with the number of inputs. Hence, the challenge lies in reducing the rule base significantly, so that the inference and throughput times remain acceptable for real-time applications. To achieve these objectives, Modified Rule Active 2 Overlap Membership Function (MRA2-OMF), Modified Rule Active 3 Overlap Membership Function (MRA3-OMF), Modified Rule Active 4 Overlap Membership Function (MRA4-OMF), and Genetic Algorithm (GA) based rule optimization methods are proposed and implemented.
These methods reduce the effective rules without compromising system accuracy and improve the cycle time in terms of Fuzzy Logic Inferences Per Second (FLIPS). In the proposed system architecture, the FLC is segmented into three independent modules: fuzzifier, inference engine with rule base, and defuzzifier. Fuzzy systems employ a fuzzifier to convert real-world crisp inputs into fuzzy outputs. In type-2 fuzzy systems, two fuzzifications happen simultaneously, from the upper and lower membership functions (UMF and LMF), involving subtractions and divisions. Non-restoring, very high radix, and Newton-Raphson approximation are the most widely used division algorithms in hardware implementations; however, these prevalent methods come at the cost of higher latency. To overcome this problem, a type-2 fuzzifier based on a successive approximation division algorithm is introduced. It has been observed that the successive-approximation-based fuzzifier computes faster than the other type-2 fuzzifiers. A hardware-software co-design is established on a Virtex 5 LX110T FPGA board. A MATLAB Graphical User Interface (GUI) acquires the fuzzy (type-1 or type-2) parameters from users, and a Universal Asynchronous Receiver/Transmitter (UART) is dedicated to data communication between the hardware and the fuzzy toolbox. This GUI is provided to initiate control, transfer inputs and rules, and then observe the crisp output on the computer. A proposed method that supports canonical fuzzy IF-THEN rules, including special cases of the fuzzy rule base, is incorporated into the Digital Fuzzy Logic Controller (DFLC) architecture. For this purpose, a Mealy state machine is incorporated into the design. The proposed FLCs are implemented on a Xilinx Virtex-5 LX110T. DFLC peripheral integration with a MicroBlaze (MB) processor through the Processor Local Bus (PLB) is established for Intellectual Property (IP) core validation. The performance of the proposed systems is compared to the Fuzzy Logic Toolbox of MATLAB.
Analysis of these designs is carried out using Hardware-In-the-Loop (HIL) tests to control various plant models in MATLAB/Simulink environments.
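The successive approximation division mentioned above can be sketched in software. This is a behavioural model only, assuming an unsigned fixed-point quotient; the thesis's actual bit widths and scaling are omitted, and `sar_divide` is a hypothetical name.

```python
# Sketch of successive-approximation division: like a SAR ADC, the
# quotient is built one bit at a time from MSB to LSB, keeping a bit
# only if the resulting trial quotient does not overshoot.

def sar_divide(numerator, denominator, bits=16):
    """Unsigned integer division via successive approximation.
    Returns floor(numerator / denominator) for quotients < 2**bits."""
    q = 0
    for i in reversed(range(bits)):
        trial = q | (1 << i)          # tentatively set bit i
        if trial * denominator <= numerator:
            q = trial                 # keep the bit, else discard it
    return q

print(sar_divide(1000, 7))  # 142
```

Each iteration is a single compare-and-keep step, so a hardware implementation completes in a fixed number of cycles equal to the quotient width, which is the latency advantage the fuzzifier exploits.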