HP-DAEMON: High Performance Distributed Adaptive Energy-efficient Matrix-multiplicatiON
The demand for improving the energy efficiency of high performance scientific applications has become crucial. Software-controlled hardware solutions directed by Dynamic Voltage and Frequency Scaling (DVFS) have proven widely effective. Although DVFS benefits green computing, DVFS itself can incur non-negligible overhead if it issues a large number of frequency switches. In this paper, we propose a strategy to achieve optimal energy savings for distributed matrix multiplication by algorithmically trading more computation and communication at a time, adaptively and within user-specified memory costs, for fewer DVFS switches; this saves 7.5% more energy on average than a classic strategy. Moreover, we leverage a high performance communication scheme that fully exploits network bandwidth via pipeline broadcast. Overall, the integrated approach achieves substantial energy savings (up to 51.4%) and performance gains (28.6% on average) compared to ScaLAPACK pdgemm() on a cluster with an Ethernet switch, and outperforms ScaLAPACK and DPLASMA pdgemm() by 33.3% and 32.7% on average, respectively, on a cluster with an InfiniBand switch.
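The core trade-off, buffering more work per frequency level to cut the number of DVFS transitions, can be made concrete with a toy model. The sketch below is a minimal illustration under assumed constants (per-phase energy, per-switch overhead, block size); it is not the paper's energy model and the numbers are not measured values.

```python
# Minimal sketch (not the paper's model): aggregating k compute/communicate
# phases per DVFS switch trades buffer memory for fewer frequency switches.
# All constants below are illustrative assumptions, not measured values.

def energy(num_phases, phases_per_switch,
           e_phase=1.0,      # assumed energy per phase at the chosen frequency
           e_switch=0.05):   # assumed overhead per DVFS frequency switch
    """Total energy when DVFS switches once per aggregated group of phases."""
    switches = -(-num_phases // phases_per_switch)  # ceiling division
    return num_phases * e_phase + switches * e_switch

def memory_cost(block_bytes, phases_per_switch):
    """Aggregating k phases buffers k blocks at once (the memory trade-off)."""
    return block_bytes * phases_per_switch

for k in (1, 2, 4, 8):
    print(f"k={k}: energy={energy(1024, k):.1f}, "
          f"buffer={memory_cost(4 << 20, k) >> 20} MiB")
```

Larger k amortizes the switch overhead across more phases, at the price of a proportionally larger user-specified memory budget.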
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
A recent trend in DNN development is to extend the reach of deep learning
applications to platforms that are more resource and energy constrained, e.g.,
mobile devices. These endeavors aim to reduce the DNN model size and improve
the hardware processing efficiency, and have resulted in DNNs that are much
more compact in their structures and/or have high data sparsity. These compact
or sparse models differ from traditional large ones in that their layer shapes
and sizes vary much more, and they often require specialized hardware to
exploit sparsity for performance improvement. Thus,
many DNN accelerators designed for large DNNs do not perform well on these
models. In this work, we present Eyeriss v2, a DNN accelerator architecture
designed for running compact and sparse DNNs. To deal with the widely varying
layer shapes and sizes, it introduces a highly flexible on-chip network, called
hierarchical mesh, that can adapt to the different amounts of data reuse and
bandwidth requirements of different data types, which improves the utilization
of the computation resources. Furthermore, Eyeriss v2 can process sparse data
directly in the compressed domain for both weights and activations, and
therefore is able to improve both processing speed and energy efficiency with
sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS
process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J
at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than
the original Eyeriss running MobileNet. We also present an analysis methodology
called Eyexam that provides a systematic way of understanding the performance
limits for DNN processors as a function of specific characteristics of the DNN
model and accelerator design; it applies these characteristics as sequential
steps to increasingly tighten the bound on the performance limits.
Comment: Accepted for publication in IEEE Journal on Emerging and Selected Topics in Circuits and Systems. This extended version on arXiv also includes Eyexam in the appendix.
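The compressed-domain processing described above can be illustrated with a toy example: if both weights and activations are stored as (index, value) pairs, a multiply-accumulate only touches matching nonzeros. This is a minimal sketch of that general idea, not Eyeriss v2's actual PE datapath or compression encoding.

```python
# Minimal sketch (not Eyeriss v2's datapath): multiply-accumulate over weights
# and activations kept in a compressed (index, value) form, so zero entries
# are never fetched or multiplied.

def to_compressed(dense):
    """Compress a dense vector to (index, value) pairs, dropping zeros."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(w_comp, a_comp):
    """Dot product of two compressed vectors via an index-matching merge."""
    acc, i, j = 0, 0, 0
    while i < len(w_comp) and j < len(a_comp):
        wi, wv = w_comp[i]
        ai, av = a_comp[j]
        if wi == ai:
            acc += wv * av          # only nonzero x nonzero pairs do work
            i += 1; j += 1
        elif wi < ai:
            i += 1                  # skip weight with no matching activation
        else:
            j += 1                  # skip activation with no matching weight
    return acc

w = to_compressed([0, 3, 0, 0, 2, 0, 1])
a = to_compressed([5, 0, 0, 0, 4, 0, 2])
print(sparse_dot(w, a))  # 2*4 + 1*2 = 10
```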
MG3MConv: Multi-Grained Matrix-Multiplication-Mapping Convolution Algorithm toward the SW26010 Processor
As the core of artificial intelligence applications, convolution has become a
hot research topic in high performance computing. With the rapid adoption of
the emerging SW26010 processor in artificial intelligence, there is an urgent
need for high-performance convolution algorithms on the processor. However,
current support for convolution on SW26010 remains rudimentary: existing
studies deliver adequate peak performance but lack adaptability to diverse
convolution scenarios. To improve convolution
algorithms on SW26010, we propose a multi-grained matrix-multiplication-mapping
convolution algorithm called MG3MConv, which targets the architectural features
of SW26010. MG3MConv supports diversified mapping schemes of convolution tasks
based on the concept of the thread block proposed in this paper. All the
architecture-oriented optimization methods are elaborately designed from four
levels to fully exploit the hardware efficiency of SW26010. Experiments show
that the hardware efficiency of MG3MConv reaches up to 84.78%, 1.75 times that
of cuDNN on an NVIDIA K80m GPU. Moreover, MG3MConv outperforms cuDNN in most
convolution scenarios. We also evaluate six representative CNNs as real-world
cases; on the VGG network model, the hardware efficiency of MG3MConv reaches
67.04%, which is 1.37 times and 1.96 times that of cuDNN and swDNN,
respectively.
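For readers unfamiliar with matrix-multiplication-mapping convolution, the sketch below shows the underlying idea in its simplest form (im2col followed by one GEMM). MG3MConv's multi-grained, thread-block-based mappings for SW26010 are not reproduced here; this is only an assumed NumPy illustration of the general technique.

```python
# Minimal sketch of the general idea behind matrix-multiplication-mapping
# convolution (im2col + GEMM); MG3MConv's SW26010-specific multi-grained
# mapping schemes are not reproduced here.
import numpy as np

def conv2d_im2col(x, w):
    """x: (C, H, W) input, w: (K, C, R, S) filters; stride 1, no padding."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    # Unfold every receptive field into a column: (C*R*S, OH*OW).
    cols = np.empty((C * R * S, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + R, j:j + S].ravel()
    # Convolution becomes a single GEMM: (K, C*R*S) @ (C*R*S, OH*OW).
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, OH, OW)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_im2col(x, w).shape)  # (4, 6, 6)
```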
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Many modern workloads, such as neural networks, databases, and graph
processing, are fundamentally memory-bound. For such workloads, the data
movement between main memory and CPU cores imposes a significant overhead in
terms of both latency and energy. A major reason is that this communication
happens through a narrow bus with high latency and limited bandwidth, and the
low data reuse in memory-bound workloads is insufficient to amortize the cost
of main memory access. Fundamentally addressing this data movement bottleneck
requires a paradigm where the memory system assumes an active role in computing
by integrating processing capabilities. This paradigm is known as
processing-in-memory (PIM).
Recent research explores different forms of PIM architectures, motivated by
the emergence of new 3D-stacked memory technologies that integrate memory with
a logic layer where processing elements can be easily placed. Past works
evaluate these architectures in simulation or, at best, with simplified
hardware prototypes. In contrast, the UPMEM company has designed and
manufactured the first publicly-available real-world PIM architecture.
This paper provides the first comprehensive analysis of the first
publicly-available real-world PIM architecture. We make two key contributions.
First, we conduct an experimental characterization of the UPMEM-based PIM
system using microbenchmarks to assess various architecture limits such as
compute throughput and memory bandwidth, yielding new insights. Second, we
present PrIM, a benchmark suite of 16 workloads from different application
domains (e.g., linear algebra, databases, graph processing, neural networks,
bioinformatics).
Comment: Our open source software is available at https://github.com/CMU-SAFARI/prim-benchmark
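The memory-bound argument above can be checked with a back-of-the-envelope roofline calculation: a kernel is memory-bound when its arithmetic intensity (flops per byte moved) falls below the machine's balance point. The sketch below uses assumed peak numbers purely for illustration; they are not UPMEM or CPU measurements.

```python
# Minimal sketch of the memory-bound argument: compare a kernel's arithmetic
# intensity against the machine balance point (roofline model).
# Peak numbers below are illustrative assumptions, not measurements.

PEAK_FLOPS = 1e12    # assumed peak compute, flop/s
PEAK_BW    = 50e9    # assumed main-memory bandwidth, byte/s
BALANCE    = PEAK_FLOPS / PEAK_BW  # flop/byte needed to be compute-bound

def attainable(intensity_flops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(PEAK_FLOPS, intensity_flops_per_byte * PEAK_BW)

# Vector add: 1 flop per 3 x 8-byte accesses -> ~0.04 flop/byte.
vec_add = 1 / (3 * 8)
print(f"balance point: {BALANCE:.1f} flop/byte")
print(f"vector add: {attainable(vec_add) / 1e9:.1f} Gflop/s "
      f"({'memory' if vec_add < BALANCE else 'compute'}-bound)")
```

With low data reuse, intensity stays far below the balance point, so performance is pinned to the memory bandwidth line no matter how fast the cores are, which is exactly the bottleneck PIM targets.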
5G enabled dual vision and speech enhancement architecture for multimodal hearing-aids
This paper presents the algorithmic framework for a multimodal hearing aid (HA) prototype designed on a Field Programmable Gate Array (FPGA), specifically the AMD RFSoC 4x2 board, and evaluates the transmitter performance through simulation studies. The proposed architecture integrates audio and video inputs, processes them using advanced algorithms, and employs the 5G New Radio (NR) communication protocol to upload the processed signal to the cloud. The core transmission uses Orthogonal Frequency Division Multiplexing (OFDM), a scheme that multiplexes the processed signals onto orthogonal subcarriers, improving bandwidth efficiency and reducing interference. The design is divided into modules for the sounding reference signal (SRS), demodulation reference signal (DMRS), physical broadcast channel (PBCH), and physical uplink shared channel (PUSCH). The modulation algorithm has been optimized for the FPGA's parallel processing capabilities, making it well suited to the hearing aid's low-latency requirements. The optimized algorithm achieves a transmission time of only 4.789 ms and uses fewer hardware resources, enhancing performance in a cost-effective and energy-efficient manner.
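To make the OFDM step concrete: modulation places one data symbol on each orthogonal subcarrier via an inverse FFT and prepends a cyclic prefix before transmission. The sketch below is a minimal NumPy illustration with assumed parameters (64 subcarriers, QPSK, 16-sample cyclic prefix); it is not the paper's 5G NR FPGA pipeline.

```python
# Minimal sketch of OFDM modulation (not the paper's FPGA pipeline): QPSK
# symbols are multiplexed onto orthogonal subcarriers with an IFFT, and a
# cyclic prefix guards against multipath. Sizes are illustrative assumptions.
import numpy as np

N_SC, CP = 64, 16  # assumed subcarrier count and cyclic-prefix length

def ofdm_modulate(bits):
    """Map bit pairs to QPSK, one symbol per subcarrier, IFFT, add CP."""
    pairs = bits.reshape(-1, 2)
    qpsk = ((1 - 2 * pairs[:, 0]) + 1j * (1 - 2 * pairs[:, 1])) / np.sqrt(2)
    sym = np.fft.ifft(qpsk, n=N_SC)  # orthogonal-subcarrier multiplexing
    return np.concatenate([sym[-CP:], sym])  # prepend cyclic prefix

def ofdm_demodulate(samples):
    """Drop the cyclic prefix and recover subcarrier symbols with an FFT."""
    return np.fft.fft(samples[CP:])

bits = np.random.randint(0, 2, 2 * N_SC)
rx = ofdm_demodulate(ofdm_modulate(bits))
print(np.round(rx[:4], 3))  # the transmitted QPSK symbols, recovered
```

Because each subcarrier occupies an orthogonal frequency, the FFT at the receiver separates the multiplexed symbols without inter-carrier interference, which is the bandwidth-efficiency property the abstract refers to.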