71 research outputs found

    Simulation Environment with Customized RISC-V Instructions for Logic-in-Memory Architectures

    Nowadays, various memory-hungry applications such as machine learning algorithms are running into the "memory wall". Emerging memories with computational capability are foreseen as a promising solution: they perform data processing inside the memory itself, so-called computation-in-memory, eliminating the need for costly data movement. Recent research shows that extending the RISC-V instruction set architecture with custom instructions is an effective way to support computation-in-memory operations. To further evaluate the applicability of such methods, this work enhances the standard GNU binary utilities to generate RISC-V executables with Logic-in-Memory (LiM) operations and develops a new gem5 simulation environment that simulates the entire system (CPU, peripherals, etc.) in a cycle-accurate manner, with a user-defined LiM module integrated into the system. This work provides a modular testbed for the research community to evaluate potential LiM solutions and hardware/software co-designs.
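    As a rough illustration of the kind of toolchain support described above, the sketch below issues a hypothetical LiM operation from C through the GNU assembler's .insn directive, assuming a user-defined instruction placed in the RISC-V custom-0 opcode space. The opcode choice, funct fields, and "in-memory AND" semantics are illustrative assumptions, not the encoding used by this work's modified binutils or its gem5 model.

```c
/* Hedged sketch: invoking a hypothetical Logic-in-Memory (LiM) operation from C
 * via the RISC-V GNU assembler's .insn directive. The custom-0 opcode (0x0b),
 * funct fields, and in-memory AND semantics are illustrative assumptions, not
 * the encoding defined by the paper's toolchain. Requires a RISC-V GCC/binutils
 * toolchain; on a core without the custom instruction it traps as illegal. */
#include <stdint.h>

static inline uint64_t lim_and(volatile uint64_t *row, uint64_t mask)
{
    uint64_t result;
    /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2 */
    __asm__ volatile (".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                      : "=r"(result)
                      : "r"(row), "r"(mask)
                      : "memory");
    return result;
}
```

    In a setup like the one described above, a simulator-side LiM module would decode the custom opcode and perform the operation near the memory array rather than in the CPU datapath.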

    SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions

    Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, use Arm or x86 SIMD intrinsics to optimize programs for performance. With the emergence of the RISC-V Vector Extensions (RVV), there is a need to migrate this legacy, performance-oriented code to RVV. Currently, migrating NEON code to RVV code requires manual rewriting, which is a time-consuming and error-prone process. In this work, we use the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration. Our primary task is to enhance SIMDe so that it can convert ARM NEON intrinsic types and functions to their corresponding RVV intrinsic types and functions. For type conversion, we devise strategies to map NEON intrinsic types to RVV intrinsic types while accounting for RVV's vector-length-agnostic (VLA) architecture. For function conversion, we analyze the conversion methods commonly used in SIMDe and develop customized conversions for each function based on the RVV code they generate. In our experiments with the Google XNNPACK library, our enhanced SIMDe achieves speedups ranging from 1.51x to 5.13x over the original SIMDe, which does not use customized RVV implementations for the conversions.
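    As a rough sketch of the type- and function-conversion problem (not SIMDe's actual implementation), the example below emulates the 128-bit NEON intrinsic vaddq_s32 with vector-length-agnostic RVV intrinsics by pinning the active vector length to the four lanes the NEON type implies; the my_ names are placeholders for SIMDe's internal types.

```c
/* Hedged sketch, not SIMDe's real code: emulating the fixed-width NEON intrinsic
 * vaddq_s32 (4 x int32 lanes) with vector-length-agnostic RVV intrinsics.
 * Assumes a toolchain providing <riscv_vector.h> and VLEN >= 128 so that an
 * i32m1 register holds at least 4 elements; my_int32x4_t stands in for the
 * NEON type int32x4_t. */
#include <stdint.h>
#include <riscv_vector.h>

typedef struct { int32_t v[4]; } my_int32x4_t;   /* placeholder for int32x4_t */

static inline my_int32x4_t my_vaddq_s32(my_int32x4_t a, my_int32x4_t b)
{
    size_t vl = __riscv_vsetvl_e32m1(4);                /* fix vl at 4 lanes   */
    vint32m1_t va = __riscv_vle32_v_i32m1(a.v, vl);     /* load operand a      */
    vint32m1_t vb = __riscv_vle32_v_i32m1(b.v, vl);     /* load operand b      */
    vint32m1_t vr = __riscv_vadd_vv_i32m1(va, vb, vl);  /* element-wise add    */
    my_int32x4_t r;
    __riscv_vse32_v_i32m1(r.v, vr, vl);                 /* store the 4 results */
    return r;
}
```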

    Effective Code Generation for Distributed and Ping-Pong Register Files: a Case Study on PAC VLIW DSP Cores

    No full text
    The compiler is generally regarded as the most important software component in making a processor design successful. This paper describes our application of the Open Research Compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high-quality code. Our preliminary experimental results indicate that our compiler can efficiently exploit the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those developing compilers for similar architectures.

    Copy Propagation Optimizations for VLIW DSP Processors with Distributed Register Files

    No full text
    High-performance and low-power VLIW DSP processors are increasingly deployed in embedded devices to process video and multimedia applications. To reduce the power and cost of VLIW DSP processor designs, distributed register files and multi-bank register architectures are being adopted to reduce the number of read/write ports in register files. This presents new challenges for devising compiler optimization schemes for such architectures. In this work, we address compiler optimization issues for the PAC architecture, a 5-way issue DSP processor with distributed register files. We show how to support an important class of compiler optimizations, known as copy propagation, for such an architecture. We illustrate that a naive deployment of copy propagation in embedded VLIW DSP processors with distributed register files might result in a performance anomaly. In our proposed scheme, we derive a communication cost model based on cluster distance, register port pressure, and the movement type of register sets. This cost model is used to guide the data-flow analysis that supports copy propagation over the PAC architecture. Experimental results show that our scheme effectively prevents the performance anomaly of copy propagation on embedded VLIW DSP processors with distributed register files.
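    A minimal sketch of the kind of communication cost model described above, with field names, weights, and decision rule that are illustrative assumptions rather than the PAC compiler's actual model: before propagating a copy, the compiler weighs cluster distance, register port pressure, and the type of register move the propagated use would require.

```c
/* Hedged sketch of a communication cost model for guiding copy propagation on a
 * clustered VLIW DSP with distributed register files. Field names, weights, and
 * the decision rule are illustrative assumptions, not the paper's model. */
#include <stdbool.h>

enum move_type { MOVE_INTRA_CLUSTER, MOVE_INTER_CLUSTER, MOVE_VIA_MEMORY };

struct copy_site {
    int src_cluster;       /* cluster that defines the copy source           */
    int dst_cluster;       /* cluster where the propagated value is used     */
    int port_pressure;     /* read/write ports already demanded in that slot */
    enum move_type kind;
};

/* Estimated extra communication cost a propagated use would incur. */
static int comm_cost(const struct copy_site *s)
{
    int distance = s->src_cluster > s->dst_cluster
                 ? s->src_cluster - s->dst_cluster
                 : s->dst_cluster - s->src_cluster;
    int base = (s->kind == MOVE_INTRA_CLUSTER) ? 0
             : (s->kind == MOVE_INTER_CLUSTER) ? 1 : 3;  /* illustrative weights */
    return base + distance + s->port_pressure;
}

/* Propagate the copy only when doing so is no more expensive than keeping it. */
static bool should_propagate(const struct copy_site *s, int cost_of_keeping_copy)
{
    return comm_cost(s) <= cost_of_keeping_copy;
}
```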

    Language and Environment Support for Parallel Object I/O on Distributed Memory Environments

    No full text
    The paper describes a parallel file object environment that supports distributed array storage on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distribution from the memory level to the file level. It allows parallel I/O according to the distribution of objects in an application. For objects that are read and/or written by multiple applications using different distributions, we present a novel scheme that helps programmers select the best data distribution pattern, based on the minimum amount of remote data movement, for storing array objects on distributed file systems.

    Parallel Array Object I/O Support on Distributed Environments

    No full text
    This paper presents a parallel file object environment to support distributed array storage on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distribution from the memory level to the file level. It allows parallel I/O according to the distribution of objects in an application. For objects that are read and/or written by multiple applications using different distributions, we present a novel scheme that helps programmers select the best data distribution pattern, based on the minimum amount of remote data movement, for storing array objects on distributed file systems. To the best of our knowledge, our selection scheme is the first work that attempts to optimize distribution patterns in secondary storage for HPF-like programs in inter-application cases. This is especially important for a class of problems called multiple disciplinary optimization (MDO) problems. Our testbed is built on an 8-node DEC Farm connected with an Ethernet, FDDI, or ATM switch. Our experimental results with scientific applications show that not only does our parallel file system provide aggregate bandwidth, but our selection scheme also effectively reduces communication traffic in the system.
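    The selection idea can be pictured with a small sketch: given the in-memory distributions used by the applications that access an array object, enumerate candidate file-level distributions and keep the one that minimizes the number of elements whose file-level owner differs from the application's owner (a stand-in cost for remote data movement). The owner functions, the two candidates, and the cost metric below are illustrative assumptions, not the paper's actual model.

```c
/* Hedged sketch: choosing a file-level array distribution by minimizing remote
 * data movement across the applications that read/write the object. The owner
 * functions, candidates, and cost metric are illustrative assumptions. */
#include <stddef.h>

typedef int (*owner_fn)(size_t index, size_t n, int nprocs);  /* element -> node */

static int block_owner(size_t i, size_t n, int p)
{
    size_t block = (n + (size_t)p - 1) / (size_t)p;   /* ceil(n / p) per node */
    return (int)(i / block);
}

static int cyclic_owner(size_t i, size_t n, int p)
{
    (void)n;
    return (int)(i % (size_t)p);
}

/* Elements that must move between nodes if the file uses file_d while an
 * application accesses the array with distribution app_d. */
static size_t remote_elems(owner_fn file_d, owner_fn app_d, size_t n, int p)
{
    size_t cost = 0;
    for (size_t i = 0; i < n; i++)
        if (file_d(i, n, p) != app_d(i, n, p))
            cost++;
    return cost;
}

/* Pick the candidate file distribution with the smallest total remote traffic,
 * summed over the distributions of the applications that share the object. */
static owner_fn pick_distribution(owner_fn apps[], int napps, size_t n, int p)
{
    owner_fn candidates[2] = { block_owner, cyclic_owner };
    owner_fn best = candidates[0];
    size_t best_cost = (size_t)-1;
    for (int c = 0; c < 2; c++) {
        size_t cost = 0;
        for (int a = 0; a < napps; a++)
            cost += remote_elems(candidates[c], apps[a], n, p);
        if (cost < best_cost) { best_cost = cost; best = candidates[c]; }
    }
    return best;
}
```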

    Object Oriented Parallel Programming Experiments and Results

    No full text
    We present an object-oriented, parallel programming paradigm, called the distributed collection model, and an experimental language, PC++, based on this model. In the distributed collection model, programmers can describe the data distribution of elements among processors to utilize memory locality, and a collection construct is employed to build distributed structures. The model also supports the expression of massive parallelism and a new mechanism for building hierarchies of abstractions. We have implemented PC++ on a variety of machines, including the VAX8800, Alliant FX/8, Alliant FX/2800, and BBN GP1000. Our experience with application programs in these environments, as well as performance results, are also described in the paper. 1 Introduction Massively parallel systems consisting of thousands of processors offer huge aggregate computing power. Unfortunately, a new machine of this class is almost useless unless there is a reasonable mechanism for porting software to it so that the resulting..
