1,032 research outputs found

    The Evolution of Neural Network-Based Chart Patterns: A Preliminary Study

    A neural network-based chart pattern represents adaptive parametric features, including non-linear transformations, and a template that can be applied in the feature space. The search for neural network-based chart patterns remains unexplored despite its potential expressiveness. In this paper, we formulate a general chart pattern search problem to enable cross-representational quantitative comparison of various search schemes. We suggest a HyperNEAT framework that applies state-of-the-art deep neural network techniques to find attractive neural network-based chart patterns; these techniques enable fast evaluation and search of robust patterns and also bring a performance gain. The proposed framework successfully found attractive patterns on the Korean stock market. We compared the newly found patterns with those found by different search schemes, showing that the proposed approach has potential.
    Comment: 8 pages. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2017), Berlin, Germany.

    Performance analysis and optimization of automotive GPUs

    Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) have drastically increased the performance demands of automotive systems. Suitable high-performance platforms built on Graphics Processing Units (GPUs) have been developed to respond to this demand, with the NVIDIA Jetson TX2 as a relevant representative. However, whether high-performance GPU configurations are appropriate for automotive setups remains an open question. This paper aims to shed light on this question by modelling an automotive GPU (Jetson TX2), analyzing its microarchitectural parameters against relevant benchmarks, and identifying specific configurations able to meaningfully increase performance within similar cost envelopes, or to decrease costs while preserving original performance levels. Overall, our analysis opens the door to the optimization of automotive GPUs for further system efficiency.
    This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 772773) and the HiPEAC Network of Excellence. Pedro Benedicte and Jaume Abella have been partially supported by the MINECO under FPU15/01394 grant and Ramon y Cajal postdoctoral fellowship number RYC-2013-14717 respectively, and Leonidas Kosmidis under Juan de la Cierva-Formación postdoctoral fellowship (FJCI-2017-34095).
    Peer reviewed. Postprint (author's final draft).
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    Department of Computer Science and Engineering
    As the performance and energy efficiency requirements of GPGPUs have risen, GPGPU memory management techniques have improved to meet them by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher memory bandwidth. However, they do not always guarantee improved performance and energy efficiency, owing to the small cache size and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques. In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings. Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when the baseline is enhanced with advanced architectural technologies (e.g., higher capacity, associativity). Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (i.e., by 13.4% and 16.7%) and performs similarly to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.

    A virtualized software based on the NVIDIA cuFFT library for image denoising: performance analysis

    Generic Virtualization Service (GVirtuS) is a new solution for enabling GPGPU on virtual machines or low-powered devices. This paper focuses on analyzing the performance that can be obtained using virtualized GPGPU software. Recently, GVirtuS has been extended to support CUDA ancillary libraries, with good results. Here, our aim is to analyze the applicability of this tool to a real problem that uses the NVIDIA cuFFT library. As a case study we consider a simple denoising algorithm, implementing virtualized GPU-parallel software based on the convolution theorem in order to perform the noise removal procedure in the frequency domain. We report some preliminary tests in both physical and virtualized environments to study and analyze the potential scalability of such an algorithm. Peer-review under responsibility of the Conference Program Chairs.
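The frequency-domain denoising idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a naive pure-Python DFT on a 1-D signal as a stand-in for a cuFFT call on an image, and the signal, noise pattern, and three-tap smoothing kernel are all assumptions chosen for illustration. It shows that multiplying the two spectra gives the same result as a direct circular convolution, which is what the convolution theorem states.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(x, h):
    """Direct circular convolution, used here only to check the theorem."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


def denoise(x, h):
    """Smooth x with kernel h by pointwise multiplication of their spectra."""
    spectrum = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(spectrum, inverse=True)]


n = 16
# Assumed test signal: one period of a sine wave plus alternating +/-0.3 noise.
clean = [math.sin(2 * math.pi * t / n) for t in range(n)]
noisy = [c + (0.3 if t % 2 == 0 else -0.3) for t, c in enumerate(clean)]

# Assumed kernel: a centred [0.25, 0.5, 0.25] average, wrapped circularly.
kernel = [0.0] * n
kernel[0], kernel[1], kernel[n - 1] = 0.5, 0.25, 0.25

smoothed = denoise(noisy, kernel)
direct = circular_convolve(noisy, kernel)
max_error_noisy = max(abs(a - c) for a, c in zip(noisy, clean))
max_error_smoothed = max(abs(a - c) for a, c in zip(smoothed, clean))
```

In a GPU setting the two forward transforms and the inverse transform would be the library calls, and the pointwise product is the only custom kernel required, which is why this class of algorithm is a natural test case for a virtualized FFT library.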

    Verifying a Systematic Application to Accelerator Roadmap using Shallow Water Wave Equations

    With the advent of parallel computing, a number of hardware architectures have become available for data-parallel applications. Every architecture is unique with respect to characteristics such as floating point operations per second, memory bandwidth and synchronization costs. Data-parallel applications possess inherent parallelism that needs to be studied so that the hardware that can best exploit this parallelism can be identified and selected for large-scale implementation. The application that I have considered for my thesis is the numerical solution of the shallow water wave equations using the finite difference method. These equations are a set of partial differential equations that model the propagation of disturbances in water and other incompressible liquids. This application fits in the category of a Synchronous Iterative Algorithm (SIA), and hence the Synchronous Iterative GPGPU Execution (SIGE) model can be directly applied for performance modeling. In the high performance computing community, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have become highly popular architectures. Homogeneous clusters comprising multiple processors and heterogeneous clusters whose nodes consist of both CPU and GPU are the architectures of interest for this thesis. An initial, high-level comparison between the two architectures is performed with regard to the chosen application using a technique known as the Initial Application to Accelerator (A2A) mapping, which ranks the architectures by execution time for large-scale implementation. The subsequent part of the thesis focuses on a low-level abstraction of the application of interest to accurately predict the runtime using the multi-level SIGE performance-modeling suite. Through this abstraction, performance modeling of the computation and communication portions of the application is undertaken. The behavior of the computation and communication portions is captured through several instrumented iterations of the application, and regression analysis is performed on the execution times. The predicted run time is the sum of the computation and communication run time predictions and is validated by executing the application at larger data sizes. The thesis concludes with the pros and cons of applying the A2A fitness model and the low-level abstraction for run time prediction to the chosen application. A critique of the SIGE model and a Strength, Weakness, Opportunities (SWO) analysis are presented.
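The regression-based prediction step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the SIGE model itself: it assumes that computation and communication time each grow linearly with problem size, and the timing values are synthetic placeholders generated from made-up coefficients, not measurements from the thesis. The helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, size):
    """Evaluate a fitted (slope, intercept) model at a given problem size."""
    slope, intercept = model
    return slope * size + intercept


# Synthetic placeholder timings (seconds) for small instrumented runs.
# The linear form and the coefficients are assumptions for illustration only.
sizes = [256, 512, 1024, 2048]
computation_times = [0.002 * s + 0.1 for s in sizes]
communication_times = [0.0005 * s + 0.05 for s in sizes]

computation_model = fit_line(sizes, computation_times)
communication_model = fit_line(sizes, communication_times)

# Predicted total run time at a larger size is the sum of the two predictions.
target_size = 8192
predicted_total = (predict(computation_model, target_size)
                   + predict(communication_model, target_size))
```

Fitting the two portions separately keeps each model simple and lets the validation run at a larger data size show which portion, if either, departs from the fitted trend.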

    DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

    Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. GPU design optimization for efficient CNN training acceleration requires accurate modeling of how performance improves when compute and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust for different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.