13,445 research outputs found

    Introduction to SVE Architecture evaluation in gem5

    ABSTRACT: The Scalable Vector Extension (SVE) is an extension of the ARM ISA for vector processing that allows the vector length to be scaled flexibly. The gem5 simulator supports computer architecture modelling by simulating different configurations on various ISAs, including ARM. This Final Degree Project presents an introduction to the evaluation of the SVE architecture in gem5. To that end, it gives a detailed description of the methodology needed to run Full-System simulations in the gem5 environment, using the tools developed by the Computer Architecture and Technology Group (ATC) of the University of Cantabria. These simulations allow the performance of different benchmarks to be evaluated as both the SVE vector length and the number of cores are scaled. Two benchmarks were developed for this evaluation: matrix, which multiplies two matrices, and gauss, which applies a Gaussian filter to a pixel matrix. The preliminary results for the vectorized code generally show better performance when scaling the vector length than when scaling the number of cores. Degree in Computer Engineering (Grado en Ingeniería Informática)

    The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

    The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in evaluating the performance of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center workloads and workloads with irregular memory access patterns, so its popularity and acceptance are rising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of the Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to boost the tuning of the HPCG benchmark within the Arm community. Postprint (author's final draft)

    Towards a Scalable Hardware/Software Co-Design Platform for Real-time Pedestrian Tracking Based on a ZYNQ-7000 Device

    Currently, most designers working on hardware/software co-design face the daunting task of researching different design flows and learning the intricacies of specific software from various manufacturers. Creating a scalable hardware/software co-design platform has therefore become a key strategic element in developing hardware/software integrated systems. In this paper, we propose a new design flow for building a scalable co-design platform on an FPGA-based system-on-chip. We employ an integrated approach to implement histogram of oriented gradients (HOG) feature extraction and support vector machine (SVM) classification on a programmable device for pedestrian tracking. We report a hardware resource analysis and also analyse the precision and success rates of pedestrian tracking on nine open-access image data sets. Finally, our proposed design flow can be used for any real-time image processing-related product on programmable ZYNQ-based embedded systems; it reduces design time and provides a scalable solution for embedded image processing products.

    Is Arm software ecosystem ready for HPC?

    In recent years, the HPC community has grown increasingly interested in the Arm architecture, with research projects primarily targeting the installation of Arm-based clusters. State-of-the-art examples are the European Mont-Blanc, the Japanese Post-K, and the UK's GW4/EPSRC projects. Primary attention is usually given to hardware platforms, and the Arm HPC community is growing as the hardware evolves towards HPC workloads via solutions borrowed from the mobile market, e.g., big.LITTLE, and additions such as the Armv8-A Scalable Vector Extension (SVE) technology. However, the availability of a mature software ecosystem and the possibility of running large and complex HPC applications play a key role in the consolidation of a new technology, especially in a conservative market like HPC. For this reason, in this poster we present a preliminary evaluation of the Arm system software ecosystem, limited here to the Arm HPC Compiler and the Arm Performance Libraries, together with the porting and testing of three fairly complex HPC code suites: QuantumESPRESSO, WRF and FEniCS. The selection of these codes has not been random: they were proposed as HPC challenges during the last two editions of the Student Cluster Competition at ISC, where all the authors were involved in operating an Arm-based cluster that received the Fan Favorite award. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [3], grant agreements n. 288777, 610402 and 671697. The authors would also like to thank E4 Computer Engineering for providing part of the hardware resources needed for the evaluation carried out in this poster, as well as for greatly supporting the Student Cluster Competition team. Postprint (author's final draft)

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    As the scale of High-Performance Computing (HPC) systems continues to grow, researchers are devoting themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software. First, as systems become larger, their mean time to failure (MTTF) tends to decrease, and handling system failures becomes a prime challenge. My research presents a general design and implementation of an efficient runtime-level failure detection and propagation strategy that targets large-scale, dynamic systems and is able to detect both node and process failures. It uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation significantly outperforms non-HPC solutions and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I explore instruction-level parallelism to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced the Advanced Vector Extensions (AVX2 and AVX-512) instructions for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research uses long vector reduction instructions to improve the performance of MPI reduction operations. I also use the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures and efficient