Algorithm and Hardware Co-Design for Local/Edge Computing
Advances in VLSI manufacturing and design technology over the decades have created many computing paradigms for disparate computing needs. Given concerns about the transmission cost, security, and latency of centralized computing, edge/local computing is increasingly prevalent in fast-growing sectors such as the Internet-of-Things (IoT) and in other sectors that require energy- and connectivity-autonomous systems, such as biomedical and industrial applications.
Energy and power efficiency are the main design constraints in local and edge computing. While a wide range of low-power design techniques exists, they are often underutilized in custom circuit designs because the algorithms are developed independently of the hardware. Such a compartmentalized design approach fails to take advantage of the many compatible algorithmic and hardware techniques that can improve the efficiency of the entire system. Algorithm-hardware co-design explores the design space with whole-stack awareness.
The main goal of the algorithm-hardware co-design methodology is to enable and improve small-form-factor edge and local VLSI systems operating under strict area and energy-efficiency constraints. This thesis presents selected works of application-specific digital and mixed-signal integrated circuit designs. The application space ranges from implantable biomedical devices to edge machine learning acceleration.
Compilers for portable programming of heterogeneous parallel & approximate computing systems
Programming heterogeneous systems such as the System-on-chip (SoC) processors in modern mobile devices can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. This is further complicated by software and hardware approximate computing optimizations. Different compute units on an SoC use different approximate computing methods, and an application is usually composed of multiple compute kernels, each specialized to run on different hardware. Determining how best to map such an application to a modern heterogeneous system is an open research problem.
First, we propose a parallel abstraction of heterogeneous hardware that is a carefully chosen combination of well-known parallel models and is able to capture the parallelism in a wide range of popular parallel hardware. This abstraction uses a hierarchical dataflow graph with side effects and vector SIMD instructions. We use this abstraction to define a parallel program representation called HPVM that aims to address both functional portability and performance portability across heterogeneous systems.
Second, we further extend the HPVM representation to enable accuracy-aware performance and energy tuning on heterogeneous systems with multiple compute units and approximation methods. We call this extension ApproxHPVM; it automatically translates end-to-end application-level accuracy constraints into accuracy requirements for individual operations. ApproxHPVM uses a hardware-agnostic accuracy-tuning phase to do this translation, which greatly speeds up the analysis, enables greater portability, and enables future capabilities like accuracy-aware dynamic scheduling and design space exploration. We have implemented a prototype HPVM system, defining the HPVM IR as an extension of the LLVM compiler IR, compiler optimizations that operate directly on HPVM graphs, and code generators that translate the virtual ISA to NVIDIA GPUs, Intel’s AVX vector units, and multicore x86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, that HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Furthermore, our evaluation of ApproxHPVM shows that our framework can offload chunks of approximable computations to special-purpose accelerators that provide significant gains in performance and energy, while staying within a user-specified application-level accuracy constraint with high probability.
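The idea of turning one end-to-end accuracy constraint into per-operation choices can be illustrated with a toy sketch. This is not ApproxHPVM's tuning algorithm: it is a simple greedy heuristic, and the `Knob` class, the `tune` function, the operation names, and all numbers are hypothetical. It also assumes that per-operation accuracy losses simply add up, which real programs need not obey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Knob:
    """One hypothetical approximation setting for an operation."""
    name: str
    loss: float  # assumed accuracy loss, in percentage points
    time: float  # assumed relative execution time (1.0 = exact)


def tune(ops, budget):
    """Greedily pick one knob per operation within a total loss budget.

    ops maps an operation name to its knobs; index 0 must be the exact
    (zero-loss) setting. Assumes losses are additive across operations.
    """
    choice = {op: knobs[0] for op, knobs in ops.items()}
    spent = 0.0
    while True:
        best = None
        for op, knobs in ops.items():
            cur = choice[op]
            for k in knobs:
                extra = k.loss - cur.loss   # additional loss incurred
                saved = cur.time - k.time   # execution time saved
                if saved <= 0 or extra <= 0 or spent + extra > budget:
                    continue
                ratio = saved / extra
                if best is None or ratio > best[0]:
                    best = (ratio, op, k, extra)
        if best is None:
            return choice, spent
        _, op, k, extra = best
        choice[op] = k
        spent += extra


# Illustrative numbers only; they are not measurements.
ops = {
    "conv1": [Knob("exact", 0.0, 1.0), Knob("fp16", 0.2, 0.6),
              Knob("perforated", 1.0, 0.4)],
    "conv2": [Knob("exact", 0.0, 1.0), Knob("fp16", 0.1, 0.6)],
    "dense": [Knob("exact", 0.0, 1.0), Knob("fp16", 0.5, 0.9)],
}
choice, spent = tune(ops, budget=0.5)
for op, knob in choice.items():
    print(op, knob.name)
```

With a budget of 0.5 points, the sketch approximates the two convolutions and leaves `dense` exact, because approximating `dense` would save little time for a large share of the budget.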