Data Cache-Energy and Throughput Models: Design Exploration for Embedded Processors
Most modern 16-bit and 32-bit embedded processors contain cache memories to further increase the instruction throughput of the device. Embedded processors that contain cache memories open an opportunity for the low-power research community to model the impact of cache energy consumption and throughput gains. Mathematical models for optimal cache memory configuration have been proposed in the past. Most of these models are too complex to be adapted for modern applications like run-time cache reconfiguration. This paper improves and validates previously proposed energy and throughput models for a data cache, which can be used for overhead analysis of various cache types with a relatively small number of inputs. These models analyze the energy and throughput of a data cache on a per-application basis, thus providing the hardware and software designer with the feedback vital to tune the cache or application for a given energy budget. The models are suitable for use at design time in the cache optimization process for embedded processors, considering time and energy overhead, or could be employed at runtime for reconfigurable architectures.
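The abstract does not give the models themselves. As a rough illustration of what a per-application data cache model with few inputs can look like, the sketch below uses a generic textbook-style formulation; the function name, parameters, and numbers are assumptions for illustration and are not the paper's equations.

```python
def cache_cost(accesses, miss_rate, t_hit_ns, t_miss_penalty_ns,
               e_hit_nj, e_miss_nj):
    """Estimate total access time and energy of a data cache for one application.

    Hypothetical first-order model (not taken from the paper):
    every access pays the hit cost, and every miss pays an extra penalty.
    """
    misses = accesses * miss_rate
    time_ns = accesses * t_hit_ns + misses * t_miss_penalty_ns
    energy_nj = accesses * e_hit_nj + misses * e_miss_nj
    return time_ns, energy_nj


# Illustrative inputs only: 1000 accesses, 5% miss rate.
time_ns, energy_nj = cache_cost(
    accesses=1000, miss_rate=0.05,
    t_hit_ns=1.0, t_miss_penalty_ns=20.0,
    e_hit_nj=0.5, e_miss_nj=10.0,
)
print(f"time: {time_ns:.1f} ns, energy: {energy_nj:.1f} nJ")
```

Comparing the outputs for two candidate configurations (for example, different miss rates and per-access energies) is the kind of design-time trade-off such models are meant to support.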
swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture
The flourishing of deep learning frameworks and hardware platforms has been
demanding an efficient compiler that can shield the diversity in both software
and hardware in order to provide application portability. Among the existing
deep learning compilers, TVM is well known for its efficiency in code
generation and optimization across diverse hardware devices. Meanwhile,
the Sunway many-core processor renders itself as a competitive candidate for
its attractive computational power in both scientific and deep learning
applications. This paper combines the trends in these two directions.
Specifically, we propose swTVM, which extends the original TVM to support
ahead-of-time compilation for architectures requiring cross-compilation, such as
Sunway. In addition, we leverage architecture features during
compilation, such as core groups for massive parallelism, DMA for high-bandwidth
memory transfer, and local device memory for data locality, in order to generate
efficient code for deep learning applications on Sunway. The experimental
results show the ability of swTVM to automatically generate code for various
deep neural network models on Sunway. The code automatically
generated by swTVM for AlexNet and VGG-19 achieves average speedups of 6.71x and 2.45x
over hand-optimized OpenACC implementations on convolution and fully
connected layers, respectively. This work is the first attempt from the compiler
perspective to bridge the gap between deep learning and high-performance
architecture, particularly with productivity and efficiency in mind. We would
like to open source the implementation so that more people can embrace the
power of deep learning compilers and the Sunway many-core processor.
Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review
Technological evolution has significantly increased the number of transistors for a given die area and raised the switching speed from a few MHz to the GHz range. Such an inversely proportional decline in size and boost in performance consequently demands a lower supply voltage and effective power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level, but also visits related domains such as buses and memories. There are various processor parameters and features, such as supply voltage, clock frequency, cache and pipelining, which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Emerging power-efficient processor architectures are also reviewed, and research activities are discussed that should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
IndexMAC: A Custom RISC-V Vector Instruction to Accelerate Structured-Sparse Matrix Multiplications
Structured sparsity has been proposed as an efficient way to prune the
complexity of modern Machine Learning (ML) applications and to simplify the
handling of sparse data in hardware. The acceleration of ML models - for both
training and inference - relies primarily on equivalent matrix multiplications
that can be executed efficiently on vector processors or custom matrix engines.
The goal of this work is to incorporate the simplicity of structured sparsity
into vector execution, thereby accelerating the corresponding matrix
multiplications. Toward this objective, a new vector index-multiply-accumulate
instruction is proposed, which enables the implementation of low-cost indirect
reads from the vector register file. This reduces unnecessary memory traffic
and increases data locality. The proposed new instruction was integrated in a
decoupled RISC-V vector processor with negligible hardware cost. Extensive
evaluation demonstrates significant speedups of 1.80x-2.14x, as compared to
state-of-the-art vectorized kernels, when executing layers of varying sparsity
from state-of-the-art Convolutional Neural Networks (CNNs). Comment: DATE 202
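To make the idea of index-based multiply-accumulate on structured-sparse data concrete, the sketch below computes C = A x B in plain Python, where A follows an N:M pattern (at most N nonzeros in every group of M consecutive columns). The helper names and the 2:4 default are assumptions for illustration; this models the general technique in software and is not the semantics of the proposed instruction.

```python
def compress_n_m(dense_row, n=2, m=4):
    """Compress one row with N:M structured sparsity into (values, indices)."""
    values, indices = [], []
    for start in range(0, len(dense_row), m):
        block = dense_row[start:start + m]
        nonzeros = [(v, start + k) for k, v in enumerate(block) if v != 0]
        if len(nonzeros) > n:
            raise ValueError("row violates the N:M sparsity pattern")
        # Pad with explicit zeros so every block stores exactly n entries.
        nonzeros += [(0, start)] * (n - len(nonzeros))
        for v, j in nonzeros:
            values.append(v)
            indices.append(j)
    return values, indices


def sparse_matmul(a_dense, b, n=2, m=4):
    """Multiply an N:M sparse matrix A by a dense matrix B."""
    cols = len(b[0])
    c = []
    for row in a_dense:
        values, indices = compress_n_m(row, n, m)
        acc = [0] * cols
        for v, j in zip(values, indices):
            b_row = b[j]  # indirect read: the stored index selects the row of B
            for k in range(cols):
                acc[k] += v * b_row[k]  # multiply-accumulate
        c.append(acc)
    return c


a = [[1, 0, 0, 2],
     [0, 3, 0, 0]]
b = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(sparse_matmul(a, b))
```

Only the stored nonzeros of A drive the inner loop, so the work per row scales with N/M of the dense cost, and each row of B is fetched by index rather than by a sequential sweep.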
Advanced information processing system for advanced launch system: Hardware technology survey and projections
The major goals of this effort are as follows: (1) to examine technology insertion options to optimize Advanced Information Processing System (AIPS) performance in the Advanced Launch System (ALS) environment; (2) to examine the AIPS concepts to ensure that valuable new technologies are not excluded from the AIPS/ALS implementations; (3) to examine advanced microprocessors applicable to AIPS/ALS; (4) to examine radiation hardening technologies applicable to AIPS/ALS; (5) to reach conclusions on implementation technologies for the AIPS hardware building blocks; and (6) to reach conclusions on appropriate architectural improvements. The hardware building blocks are the Fault-Tolerant Processor, the Input/Output Sequencers (IOS), and the Intercomputer Interface Sequencers (ICIS).
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main
memory have improved tremendously, by 128x and 20x, respectively. These
improvements are mainly due to the continuous technology scaling of DRAM
(dynamic random-access memory), which has been used as the physical substrate
for main memory. In stark contrast with capacity and bandwidth, DRAM latency
has remained almost constant, reducing by only 1.3x in the same time frame.
Therefore, long DRAM latency continues to be a critical performance bottleneck
in modern systems. Increasing core counts and the emergence of increasingly
data-intensive and latency-critical applications further stress the
importance of providing low-latency memory access.
In this dissertation, we identify three main problems that contribute
significantly to long latency of DRAM accesses. To address these problems, we
present a series of new techniques. Our new techniques significantly improve
both system performance and energy efficiency. We also examine the critical
relationship between supply voltage and latency in modern DRAM chips and
develop new mechanisms that exploit this voltage-latency trade-off to improve
energy efficiency.
The key conclusion of this dissertation is that augmenting DRAM architecture
with simple and low-cost features, and developing a better understanding of
manufactured DRAM chips together lead to significant memory latency reduction
as well as energy efficiency improvement. We hope and believe that the proposed
architectural techniques and the detailed experimental data and observations on
real commodity DRAM chips presented in this dissertation will enable
development of other new mechanisms to improve the performance, energy
efficiency, or reliability of future memory systems. Comment: PhD Dissertatio
Programming MPSoC platforms: Road works ahead
This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer's viewpoint, the classical sequential von Neumann programming model needs to be overcome at the same time. Efficient utilization of the MPSoC HW resources demands radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful deployment of heterogeneous embedded MPSoC. On the other hand, at least for the coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code, which demands a more pragmatic, gradual tool and code replacement strategy.
Supporting Custom Instructions with the LLVM Compiler for RISC-V Processor
The rise of hardware accelerators with custom instructions necessitates
custom compiler backends supporting these accelerators. This study provides
detailed analyses of LLVM and its RISC-V backend, supplemented with case
studies providing an end-to-end overview of the mentioned transformations.
We argue that instruction design should consider both the hardware and software
design spaces. The necessary compiler modifications may indicate that the
instruction is not well designed and needs to be reconsidered. We also note that
RISC-V standard extensions provide exemplary instructions that can guide
instruction designers.
In this study, the process of adding a custom instruction to the compiler is
split into two parts: assembler support and pattern-matching support. Without
pattern-matching support, conventional software requires manually written
inline assembly for the accelerator, which is not scalable. While it is trivial
to add assembler support regardless of the instruction semantics, adding
pattern-matching support is not. Pattern-matching support, and choosing the
right stage for the modification, requires knowledge of the internal
transformations in the compiler. This study delves deep into pattern matching
and presents multiple ways to approach the problem of pattern-matching support.
Depending on the pattern's complexity, higher-level
transformations, e.g. at the IR level, can be more maintainable than those in the
Instruction Selection phase. Comment: Electronics and Communication Engineering B.Sc. Graduation Project.
Source can be found in https://github.com/eymay/Senior-Design-Projec