3,540 research outputs found
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
HMC-Based Accelerator Design For Compressed Deep Neural Networks
Deep Neural Networks (DNNs) offer remarkable performance of classifications and regressions in many high dimensional problems and have been widely utilized in real-word cognitive applications. In DNN applications, high computational cost of DNNs greatly hinder their deployment in resource-constrained applications, real-time systems and edge computing platforms. Moreover, energy consumption and performance cost of moving data between memory hierarchy and computational units are higher than that of the computation itself. To overcome the memory bottleneck, data locality and temporal data reuse are improved in accelerator design. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory where multiple layers of memory arrays are stacked on top of each other. Inherited from the concept of Process-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantages of high internal bandwidth during computation.
In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling controller.
In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation.
In this dissertation, we are going to investigate hardware/software co-design for neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compres- sion and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload the primitives and functions to PIM logic layer through a latency-aware scheduling con- troller.
In our compression framework, we formulate filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over filters in a convolutional layer, i.e., distribution of entropy conditioned on network loss. Based on the definition, different pruning efficiencies of global and layer-wise pruning strategies are compared, and two-phase pruning method is proposed. The proposed pruning method can achieve a reduction of 88% filters and 46% inference time reduction on VGG16 within 2% accuracy degradation
- …