Simulation Environment with Customized RISC-V Instructions for Logic-in-Memory Architectures
Nowadays, memory-hungry applications such as machine learning algorithms are
hitting the "memory wall". To address this, emerging memories with
computational capability are seen as a promising solution: they perform data
processing inside the memory itself, so-called computation-in-memory, and
eliminate the need for costly data movement. Recent research shows that custom
extensions of the RISC-V instruction set architecture are an effective way to
support computation-in-memory operations. To further evaluate the
applicability of such methods, this work enhances the standard GNU binary
utilities to generate RISC-V executables with Logic-in-Memory (LiM) operations
and develops a new gem5 simulation environment, which simulates the entire
system (CPU, peripherals, etc.) in a cycle-accurate manner together with a
user-defined LiM module integrated into the system. This work provides a
modular testbed for the research community to evaluate potential LiM solutions
and co-designs between hardware and software.
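On the binutils side, custom operations like these are typically encoded into one of RISC-V's reserved custom opcode spaces. The following sketch packs R-type fields into a 32-bit instruction word in the custom-0 space; the field values and the `lim_and` mnemonic are entirely hypothetical, since the paper's actual LiM encodings are not given here:

```python
def encode_rtype(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack R-type fields into a 32-bit RISC-V instruction word
    (funct7 | rs2 | rs1 | funct3 | rd | opcode, per the base ISA layout)."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

CUSTOM0 = 0x0B  # reserved "custom-0" major opcode in the RISC-V opcode map

# Hypothetical LiM operation: result in x5, operands in x6 and x7.
lim_and = encode_rtype(CUSTOM0, rd=5, funct3=0, rs1=6, rs2=7, funct7=0)
```

An enhanced assembler would emit such words for LiM mnemonics, and the gem5 model would decode the custom-0 opcode and dispatch the operation to the LiM module instead of the ALU.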
SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions
Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, use Arm or x86
SIMD intrinsics to optimize programs for performance. With the emergence of
the RISC-V Vector Extension (RVV), there is a need to migrate this
performance-critical legacy code to RVV. Currently, migrating NEON code to RVV
code requires manual rewriting, which is a time-consuming and error-prone
process. In this work, we use the open-source tool "SIMD Everywhere" (SIMDe)
to automate the migration. Our primary task is to enhance SIMDe so that it can
convert ARM NEON intrinsic types and functions to their corresponding RVV
intrinsic types and functions. For type conversion, we devise strategies to
convert NEON intrinsic types to RVV intrinsics while accounting for RVV's
vector-length-agnostic (VLA) architecture. For function conversion, we analyze
the conversion methods commonly used in SIMDe and develop customized
conversions for each function based on the RVV code they generate. In our
experiments with the Google XNNPACK library, our enhanced SIMDe achieves
speedups ranging from 1.51x to 5.13x over the original SIMDe, which does not
use customized RVV implementations for the conversions.
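The core difficulty in the type conversion above is that NEON types have a fixed 128-bit width while RVV selects the active vector length at run time. The strip-mining idea can be modeled in a few lines of Python; this is a toy illustration of the concept, not SIMDe's actual implementation:

```python
def vaddq_f32_vla(a, b, vlmax):
    """Toy model of NEON's 4-lane vaddq_f32 on a VLA target:
    process the fixed 4 elements in chunks of at most vlmax lanes,
    mimicking RVV's vsetvl-style length selection."""
    out = [0.0] * 4
    i = 0
    while i < 4:
        vl = min(vlmax, 4 - i)        # hardware picks vl <= vlmax
        for j in range(i, i + vl):    # one "vector" add of vl lanes
            out[j] = a[j] + b[j]
        i += vl
    return out
```

Regardless of the target's vector length, the fixed-width NEON semantics are preserved, which is exactly the contract a NEON-to-RVV shim must maintain.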
Effective Code Generation for Distributed and Ping-Pong Register Files: a Case Study on PAC VLIW DSP Cores
The compiler is generally regarded as the most important software component that supports a processor design to achieve success. This paper describes our application of the open research compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high quality code. Our preliminary experimental results indicate that our developed compiler can efficiently utilize the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those involved in developing compilers for similar architectures.
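The port restriction is the constraint that makes allocation on such register files hard: each bank can service only a few reads per cycle. A greedy sketch of bank assignment under that constraint is shown below; the interface and policy are hypothetical, and the paper's actual scheme is considerably more involved:

```python
def assign_banks(uses, banks, ports_per_bank):
    """Greedy sketch: place each virtual register in the first bank
    that still has spare read-port capacity in its cycle of use."""
    pressure = {b: {} for b in banks}   # bank -> {cycle: ports used}
    placement = {}
    for vreg, cycle in uses:
        for b in banks:
            if pressure[b].get(cycle, 0) < ports_per_bank:
                pressure[b][cycle] = pressure[b].get(cycle, 0) + 1
                placement[vreg] = b
                break
    return placement
```

Even this toy version shows why a conventional graph-coloring allocator is insufficient: two values that do not interfere can still be unplaceable in the same bank if their uses exhaust its ports in the same cycle.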
Copy Propagation Optimizations for VLIW DSP Processors with Distributed Register Files
High-performance and low-power VLIW DSP processors are increasingly deployed on embedded devices to process video and multimedia applications. To reduce power and cost in designs of VLIW DSP processors, distributed register files and multi-bank register architectures are being adopted to reduce the number of read/write ports in register files. This presents new challenges for devising compiler optimization schemes for such architectures. In our research work, we address the compiler optimization issues for the PAC architecture, which is a 5-way issue DSP processor with distributed register files. We show how to support an important class of compiler optimization problems, known as copy propagations, for such an architecture. We illustrate that a naive deployment of copy propagation in embedded VLIW DSP processors with distributed register files might result in performance anomaly. In our proposed scheme, we derive a communication cost model from cluster distance, register port pressures, and the movement type of register sets. This cost model is used to guide the data flow analysis for supporting copy propagation over the PAC architecture. Experimental results show that our schemes are effective in preventing performance anomaly with copy propagation over embedded VLIW DSP processors with distributed register files.
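The shape of such a cost-guided decision can be sketched as follows. The weights, the move-type categories, and the interface are all made up for illustration; the paper derives its model from cluster distance, port pressure, and movement type, but its actual formulation is not reproduced here:

```python
def propagation_cost(cluster_distance, port_pressure, move_type):
    """Hypothetical communication cost of propagating a copy's source
    across clusters (made-up weights)."""
    move_weight = {"intra": 0, "inter": 2, "broadcast": 3}
    return 2 * cluster_distance + port_pressure + move_weight[move_type]

def should_propagate(copies_removed, cluster_distance, port_pressure, move_type):
    # Propagate only when the saving outweighs the communication cost;
    # an unconditional propagation is what triggers the performance anomaly.
    return copies_removed > propagation_cost(
        cluster_distance, port_pressure, move_type)
```

The point the sketch makes is the qualitative one from the abstract: on a clustered machine, a "free" copy elimination can cost more than the copy it removes, so the data-flow analysis must consult a cost model before rewriting.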
Language and Environment Support for Parallel Object I/O on Distributed Memory Environments
The paper describes a parallel file object environment to support distributed array store on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distribution from memory levels to file levels. It allows parallel I/O according to the distribution of objects in an application. For objects that are read and/or written by multiple applications using different distributions, we present a novel scheme to help programmers select the best data distribution pattern according to the minimum amount of remote data movement for the store of array objects on distributed file systems.
Parallel Array Object I/O Support on Distributed Environments
This paper presents a parallel file object environment to support distributed array store on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distributions from memory levels to file levels. It allows parallel I/O that facilitates the distribution of objects in an application. For objects that are read and/or written by multiple applications using different distributions, we present a novel scheme to help programmers select the best data distribution pattern according to the minimum amount of remote data movement for the store of array objects on distributed file systems. Our selection scheme, to the best of our knowledge, is the first work to attempt to optimize the distribution patterns in the secondary storage for HPF-like programs with inter-application cases. This is especially important for a class of problems called multidisciplinary optimization (MDO) problems. Our testbed is built on an 8-node DEC Farm connected with an Ethernet, FDDI, or ATM switch. Our experimental results with scientific applications show that not only can our parallel file system provide aggregate bandwidth, but our selection scheme also effectively reduces the communication traffic in the system.
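The selection problem can be illustrated with a small owner-computes model: for each candidate file distribution, count the elements whose file owner differs from the owner in the application's in-memory distribution, and pick the candidate with the least total remote traffic. A simplified sketch, covering only block and cyclic distributions (the paper's scheme targets general HPF-like distributions and multiple applications):

```python
def owner_block(i, n, p):
    """BLOCK distribution: ceil(n/p) contiguous elements per node."""
    return i // -(-n // p)

def owner_cyclic(i, n, p):
    """CYCLIC distribution: elements dealt round-robin to nodes."""
    return i % p

def remote_moves(file_owner, app_owner, n, p):
    """Elements whose file owner differs from the application's owner."""
    return sum(1 for i in range(n) if file_owner(i, n, p) != app_owner(i, n, p))

def best_distribution(app_owners, n, p, candidates):
    """Pick the file distribution minimizing total remote traffic
    over all applications that access the array."""
    return min(candidates, key=lambda name: sum(
        remote_moves(candidates[name], a, n, p) for a in app_owners))
```

When several applications use conflicting distributions, no candidate drives the count to zero, and the minimization trades their traffic off against each other, which is the inter-application case the scheme addresses.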
Object Oriented Parallel Programming Experiments and Results
We present an object-oriented, parallel programming paradigm, called the distributed collection model, and an experimental language, PC++, based on this model. In the distributed collection model, programmers can describe the data distribution of elements among processors to exploit memory locality, and a collection construct is employed to build distributed structures. The model also supports the expression of massive parallelism and a new mechanism for building hierarchies of abstractions. We have implemented PC++ on a variety of machines including the VAX8800, Alliant FX/8, Alliant FX/2800, and BBN GP1000. Our experience with application programs in these environments as well as performance results are also described in the paper.
1 Introduction. Massively parallel systems consisting of thousands of processors offer huge aggregate computing power. Unfortunately, a new machine of this class is almost useless unless there is a reasonable mechanism for porting software to it so that the resulting..
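The collection idea can be caricatured in a few lines: a container that owns a block-distributed set of elements and applies an operation locally on each "processor". This is a toy Python sketch of the concept, not PC++ syntax:

```python
class DistributedCollection:
    """Toy sketch of a distributed collection: elements are
    block-distributed over nprocs 'processors' (modeled as lists)."""

    def __init__(self, data, nprocs):
        chunk = -(-len(data) // nprocs)   # ceil(len/nprocs) per processor
        self.parts = [data[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

    def parallel_apply(self, f):
        # Each "processor" maps f over its local elements only,
        # so the operation touches no remote data.
        for p in self.parts:
            p[:] = [f(x) for x in p]

    def gather(self):
        return [x for p in self.parts for x in p]
```

Making the distribution explicit in the container is what lets element-wise operations run without communication, which is the locality argument the model rests on.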
- …