
    A similarity study of I/O traces via string kernels

    Understanding I/O for data-intensive applications is the foundation for optimizing them. Classifying applications according to their expressed I/O access patterns eases the analysis: an access pattern can be seen as a fingerprint of an application. In this paper, we address the classification of traces by first converting them into a weighted string representation. Because string objects can be easily compared using kernel methods, we explore their use for fingerprinting I/O patterns. To improve accuracy, we propose a novel string kernel function called the kast2 spectrum kernel. The similarity matrices obtained after applying this kernel to a set of examples from a real application were analyzed using kernel principal component analysis and hierarchical clustering. The evaluation showed that two out of four I/O access pattern groups were completely identified, while the other two formed a single cluster due to the intrinsic similarity of their members. The proposed strategy can promisingly be applied to other similarity problems involving tree-like structured data.
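
For intuition, a plain k-spectrum kernel (a standard construction, not the paper's kast2 kernel, whose definition is not reproduced here) compares two strings by the inner product of their k-mer count vectors. A minimal sketch:

```python
from collections import Counter

def spectrum(s, k):
    # Count every length-k substring (k-mer) of the string.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=2):
    # Inner product of the two k-mer count vectors.
    a, b = spectrum(s, k), spectrum(t, k)
    return sum(a[m] * b[m] for m in a)

def normalized_similarity(s, t, k=2):
    # Cosine-style normalization so identical strings score 1.0.
    return spectrum_kernel(s, t, k) / (
        spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k)) ** 0.5
```

Pairwise evaluation of such a kernel over a trace collection yields exactly the kind of similarity matrix the paper feeds into kernel PCA and hierarchical clustering.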

    Comparison of Clang Abstract Syntax Trees using string kernels

    Abstract Syntax Trees (ASTs) are intermediate representations widely used by compiler frameworks. One of their strengths is that they can be used to determine the similarity among a collection of programs. In this paper we propose a novel comparison method that converts ASTs into weighted strings in order to obtain similarity matrices and quantify the level of correlation among codes. To evaluate the approach, we leveraged the strings derived from the Clang ASTs of a set of 100 source code examples written in C. Our kernel and two other string kernels from the literature were used to obtain similarity matrices among those examples. Next, we used hierarchical clustering to visualize the results. Our solution was able to identify distinct clusters formed by examples that share similar semantics. We demonstrated that the proposed strategy can promisingly be applied to similarity problems involving trees or strings.
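
The clustering step can be illustrated independently of the AST conversion: given a similarity matrix, a naive single-linkage agglomerative pass (a simplified stand-in for the hierarchical clustering used in the paper) repeatedly merges the closest pair of clusters. All names and values below are illustrative:

```python
def single_linkage(dist, n_clusters):
    # dist: symmetric matrix of pairwise distances between items.
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Find the cluster pair with the smallest inter-item distance.
        a, b = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: min(dist[i][j]
                              for i in clusters[p[0]] for j in clusters[p[1]]))
        clusters[a] |= clusters.pop(b)
    return clusters

# Distances derived from a similarity matrix, e.g. d = 1 - s.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
dist = [[1 - s for s in row] for row in sim]
```

Here items 0 and 1 (similarity 0.9) end up in one cluster and item 2 in another, mirroring how semantically close code examples group together.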

    Convolutional neural nets for estimating the run time and energy consumption of the sparse matrix-vector product

    Modeling the performance and energy consumption of the sparse matrix-vector product (SpMV) is essential to perform off-line analysis and, for example, choose a target computer architecture that delivers the best performance-energy consumption ratio. However, this task is especially complex given the memory-bounded nature and irregular memory accesses of the SpMV, mainly dictated by the input sparse matrix. In this paper, we propose a Machine Learning (ML)-driven approach that leverages Convolutional Neural Networks (CNNs) to provide accurate estimations of the performance and energy consumption of the SpMV kernel. The proposed CNN-based models use a blockwise approach to make the CNN architecture independent of the matrix size. These models are trained to estimate execution time as well as total, package, and DRAM energy consumption at different processor frequencies. The experimental results reveal that the overall relative error ranges between 0.5% and 14%, while at the matrix level it does not exceed 10%. To demonstrate the applicability and accuracy of the SpMV CNN-based models, this study is complemented with an ad hoc time-energy model for the PageRank algorithm, a popular web information retrieval algorithm used by search engines that internally relies on the SpMV kernel.
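
The blockwise idea can be sketched as follows: partition the sparse matrix into a fixed grid of blocks and record per-block nonzero densities, so the model input has the same shape regardless of matrix dimensions. The grid size and function below are hypothetical, not the paper's exact scheme:

```python
def block_density_map(rows, cols, shape, grid=4):
    # Map a sparse matrix (given by its nonzero coordinates) onto a
    # fixed grid x grid image of per-block nonzero densities, so the
    # CNN input size is independent of the matrix dimensions.
    n, m = shape
    img = [[0] * grid for _ in range(grid)]
    for r, c in zip(rows, cols):
        img[r * grid // n][c * grid // m] += 1
    cells = (n / grid) * (m / grid)   # matrix entries per block
    return [[v / cells for v in row] for row in img]
```

For a diagonal 8x8 matrix, only the diagonal blocks of the 4x4 map are nonzero, capturing the sparsity pattern in a size-independent way.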

    A Pipeline for the QR Update in Digital Signal Processing

    The input and output signals of a digital signal processing system can often be represented by a rectangular matrix, as is the case for the beamformer algorithm, a particularly useful algorithm that allows recovery of the original input signal once it has been cleaned of noise and room reverberation. We use a version of this algorithm in which the system matrix must be factorized to solve a least squares problem. The matrix changes periodically according to the sampled input signal; therefore, the factorization needs to be recalculated as fast as possible. In this paper, we propose to exploit parallelism through a pipeline pattern. With our pipeline, some partial computations are advanced so that the final time required to update the factorization is greatly reduced.

    This work was supported by the Spanish Ministry of Economy and Competitiveness under MINECO and FEDER projects TIN2014-53495-R and TEC2015-67387-C4-1-R.

    Dolz, MF.; Alventosa, FJ.; Alonso-Jordá, P.; Vidal Maciá, AM. (2019). A Pipeline for the QR Update in Digital Signal Processing. Computational and Mathematical Methods. 1:1-13. https://doi.org/10.1002/cmm4.1022
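
One standard way to refresh a QR factorization after the matrix changes, rather than refactorizing from scratch, is to absorb a new row into the triangular factor with Givens rotations; the paper pipelines computations of this kind. A minimal sketch (illustrative, not the authors' implementation):

```python
import math

def givens_row_update(R, new_row):
    # Absorb an appended row into an upper-triangular factor R with
    # Givens rotations, restoring triangular form without recomputing
    # the full QR factorization.
    n = len(R)
    R = [row[:] for row in R]
    v = new_row[:]
    for k in range(n):
        a, b = R[k][k], v[k]
        r = math.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        for j in range(k, n):
            rkj, vj = R[k][j], v[j]
            R[k][j] = c * rkj + s * vj   # rotated diagonal row
            v[j] = -s * rkj + c * vj     # annihilate v[k]
    return R
```

The updated factor satisfies R_new^T R_new = R^T R + row row^T, which is what a least squares solver needs after one more sample arrives.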

    An analytical methodology to derive power models based on hardware and software metrics

    The use of models to predict the power consumption of a system is an appealing alternative to wattmeters, since models avoid hardware costs and are easy to deploy. In this paper, we present an analytical methodology to build models with a reduced number of features in order to estimate power consumption at the node level. We aim at building simple power models by performing a per-component analysis (CPU, memory, network, I/O) through the execution of four standard benchmarks. While they execute, information from all the available hardware counters and resource utilization metrics provided by the system is collected. Based on correlations among the recorded metrics and their correlation with the instantaneous power, our methodology allows us to (i) identify the significant metrics and (ii) assign weights to the selected metrics in order to derive reduced models. The reduction also aims at extracting models based on a set of hardware counters and utilization metrics that can be obtained simultaneously and, thus, gathered and computed on-line. The utility of our procedure is validated using real-life applications on an Intel Sandy Bridge architecture.
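
The correlation-driven selection step can be sketched as computing each metric's Pearson correlation with the measured power and keeping only the strongly correlated ones. The names, data, and threshold below are illustrative:

```python
def pearson(x, y):
    # Pearson correlation coefficient of two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_features(metrics, power, threshold=0.9):
    # Keep only the counters whose absolute correlation with the
    # measured instantaneous power exceeds the threshold.
    return [name for name, series in metrics.items()
            if abs(pearson(series, power)) >= threshold]
```

A subsequent least squares fit over the surviving metrics would then supply the weights of the reduced model.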

    Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor

    Paper presented at the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), held in Bordeaux, France.

    The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for computing the convolution: the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
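
The lowering approach mentioned above reduces convolution to a matrix product by unfolding receptive fields (the classic im2col transformation). A minimal single-channel sketch, checked against direct convolution:

```python
def direct_conv(x, w):
    # Valid 2D convolution (cross-correlation) of image x with filter w.
    fh, fw = len(w), len(w[0])
    oh, ow = len(x) - fh + 1, len(x[0]) - fw + 1
    return [[sum(x[i + a][j + b] * w[a][b]
                 for a in range(fh) for b in range(fw))
             for j in range(ow)] for i in range(oh)]

def lowering_conv(x, w):
    # im2col lowering: unfold every receptive field into a row, flatten
    # the filter, and reduce the convolution to a matrix-vector product
    # (a GEMM in the multi-filter case).
    fh, fw = len(w), len(w[0])
    oh, ow = len(x) - fh + 1, len(x[0]) - fw + 1
    patches = [[x[i + a][j + b] for a in range(fh) for b in range(fw)]
               for i in range(oh) for j in range(ow)]
    wf = [w[a][b] for a in range(fh) for b in range(fw)]
    flat = [sum(p * q for p, q in zip(row, wf)) for row in patches]
    return [flat[i * ow:(i + 1) * ow] for i in range(oh)]
```

In production libraries the resulting GEMM is where long-SIMD units such as those of the A64FX do their work; this sketch only captures the data-layout idea.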

    Study of the turbocharger shaft motion by means of infrared sensors

    This work describes a technique for measuring the precession movement of the shaft of small automotive turbochargers. Its main novelty is that it is based on infrared light diode sensors. With the presented technique it is possible to mount the electronics securely and to measure, with good accuracy, far enough from the turbocharger shaft. Both advantages allow applying it even in critical lubrication conditions and when blade contact occurs. The technique's main difficulties arise from the small size of the turbocharger shaft and the large precession movement in critical conditions. To generate the optimum albedo reflection for the infrared measurement, a special cylindrical nut with a larger diameter than the original one is mounted at the shaft tip on the compressor side. Shaft balancing, sensor calibration, and the compensation of errors from different sources are then required before the method can identify the main frequencies of the shaft motion. Once the synchronous and sub-synchronous frequencies have been obtained, it is possible to reconstruct the instantaneous position of the shaft and determine its precession movement.

    This research has also been partially supported by the Programa de Desarrollo del Talento Humano de la Secretaria Nacional de Educacion Superior, Ciencia, Tecnologia e Innovacion del Gobierno Ecuatoriano, No. 20100289.

    Serrano Cruz, JR.; Guardiola, C.; Dolz García, VM.; López, M.; Bouffaud, F. (2015). Study of the turbocharger shaft motion by means of infrared sensors. Mechanical Systems and Signal Processing. 56-57:246-258. https://doi.org/10.1016/j.ymssp.2014.11.006
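
Identifying the synchronous and sub-synchronous frequencies amounts to locating peaks in the spectrum of the sensor signal. A naive DFT-based sketch (purely illustrative; the paper's processing chain, with calibration and error compensation, is far more involved):

```python
import math

def dominant_frequencies(signal, sample_rate, count=2):
    # Naive DFT magnitude spectrum; return the 'count' strongest
    # frequency bins below Nyquist (excluding the DC component).
    n = len(signal)
    mags = []
    for k in range(1, n // 2):
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n)
                 for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n)
                  for t in range(n))
        mags.append((math.hypot(re, im), k * sample_rate / n))
    mags.sort(reverse=True)
    return sorted(freq for _, freq in mags[:count])
```

Once the dominant frequencies are known, the instantaneous shaft position can be reconstructed as a sum of sinusoids at those frequencies, which is the basis for tracing the precession orbit.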

    Efficient and portable Winograd convolutions for multi-core processors

    We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is augmented via the introduction of vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, to exploit the single-instruction multiple-data capabilities of current processors, together with OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct processors, namely Intel Xeon Skylake, ARM Cortex-A57, and Fujitsu A64FX, show that the impact is affordable and still renders a Winograd-based solution that is competitive with the lowering, GEMM-based convolution.
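
The Winograd minimal filtering idea is easiest to see in its smallest 1D instance, F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs:

```python
def winograd_f23(d, g):
    # Winograd F(2,3): two outputs of a 3-tap filter g over a
    # 4-element input tile d, using only 4 multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    # Reference: plain sliding-window cross-correlation.
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

The 2D variants used for DL convolutions tile the image and apply nested transforms of this kind; the saved multiplications are what makes the method attractive despite its extra additions.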

    Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks

    For many distributed applications, data communication poses an important bottleneck in terms of both performance and energy consumption. As more cores are integrated per node, the global performance of the system generally increases, yet it eventually becomes limited by the interconnection network. This is the case for distributed data-parallel training of convolutional neural networks (CNNs), which usually proceeds on a cluster with a small to moderate number of nodes. In this paper, we analyze the performance of the Allreduce collective communication primitive, a key to the efficient data-parallel distributed training of CNNs. Our study targets the distinct realizations of this primitive in three high performance instances of the Message Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a cluster equipped with state-of-the-art processor and network technologies. In addition, we apply the insights gained from the experimental analysis to the optimization of the TensorFlow framework when running on top of Horovod. Our study reveals that a careful selection of the most convenient MPI library and Allreduce (ARD) realization accelerates the training throughput by a factor of 1.2× compared with the default algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI libraries in a number of relevant combinations of CNN model+dataset.
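
One common Allreduce realization in MPI libraries is the ring algorithm (a reduce-scatter phase followed by an all-gather phase). Below is a pure-Python simulation of its data movement, with no actual MPI calls; each rank's buffer is split into one chunk per rank:

```python
def ring_allreduce(buffers):
    # Simulate a ring Allreduce over p ranks, each holding a buffer of
    # p chunks (one value per chunk here). After reduce-scatter plus
    # all-gather, every rank holds the full elementwise sum.
    p = len(buffers)
    data = [buf[:] for buf in buffers]
    # Reduce-scatter: circulate partial sums of each chunk around the ring.
    for step in range(p - 1):
        sends = [(r, (r - step) % p, data[r][(r - step) % p]) for r in range(p)]
        for r, c, val in sends:
            data[(r + 1) % p][c] += val
    # All-gather: circulate the fully reduced chunks.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p, data[r][(r + 1 - step) % p]) for r in range(p)]
        for r, c, val in sends:
            data[(r + 1) % p][c] = val
    return data
```

Each rank sends and receives only 2(p-1)/p of the buffer in total, which is why the ring variant scales well for the large gradient buffers exchanged during data-parallel CNN training.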

    A simulator to assess energy saving strategies and policies in HPC workloads

    In recent years, the power consumption of high performance computing (HPC) clusters has become a growing problem due, e.g., to the economic cost of electricity, the emission of carbon dioxide (with a negative impact on the environment), and the generation of heat (which reduces hardware reliability). In past work, we developed EnergySaving cluster, a software package that regulates the number of active nodes in an HPC facility to match the users' demands. In this paper, we extend that work by presenting a simulator for this tool that allows the evaluation and analysis of the benefits of applying different energy-saving strategies and policies, under realistic workloads, to different cluster configurations.
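
The core of such a simulator can be caricatured as replaying a workload trace under a pluggable node-activation policy while accounting a proxy for energy. Everything below is a hypothetical sketch, not the actual EnergySaving cluster tool:

```python
def simulate(policy, arrivals, capacity_per_node, max_nodes):
    # Replay a workload trace: at every step the policy chooses how
    # many nodes stay powered on given the pending demand; we count
    # node-steps consumed (an energy proxy) and the remaining backlog.
    energy, backlog = 0, 0
    for demand in arrivals:
        backlog += demand
        active = policy(backlog, capacity_per_node, max_nodes)
        energy += active
        backlog = max(0, backlog - active * capacity_per_node)
    return energy, backlog

def demand_driven(backlog, cap, max_nodes):
    # Power on exactly as many nodes as the backlog requires.
    return min(max_nodes, -(-backlog // cap))   # ceiling division
```

Comparing policies on the same trace, e.g. demand_driven versus an always-on baseline, is exactly the kind of what-if analysis the simulator enables.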