
    Performance Modeling, Performance Tuning, and Quantization for GPU Programs

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Jaejin Lee (이재진).
    GPUs have played an important role in solving many scientific problems that range across different domains. Writing GPU programs might be easy, but writing them efficiently is much more difficult. To achieve the best performance, the compiler and runtime need advanced techniques to compile and run the program efficiently. These techniques should be transparent to programmers and spare them the burden of knowing many details of the underlying architecture. Among the most important aspects that help improve the performance of a GPU program, we focus on performance modeling, performance tuning, and quantization. Performance modeling estimates the execution time of the program and can be useful in analyzing program characteristics or partitioning the workload in a heterogeneous system. Performance tuning finds the optimal solution in an optimization space in a reasonable time. Quantization reduces the precision needed to execute the program without losing significant output accuracy. The proposed techniques can be integrated into GPU compilers and runtimes to make them more efficient.
    Contents:
    1 Introduction
        1.1 Introduction
    2 Performance Modeling
        2.1 Introduction
        2.2 Related Work
        2.3 Background
            2.3.1 OpenCL Framework
            2.3.2 GPU Architecture
            2.3.3 Support Vector Regression
        2.4 Prerequisites to efficient profiling: An insight to warp scheduling
        2.5 Performance Estimation
            2.5.1 Linear Model
            2.5.2 Model based on Machine Learning
        2.6 Evaluation
            2.6.1 Evaluation Setup
            2.6.2 Performance estimation results
            2.6.3 The ML-based model on different classes of kernels
            2.6.4 The performance at different saturation points
        2.7 Conclusions
    3 Performance Auto-tuning
        3.1 Introduction
        3.2 Related Work
        3.3 OpenCL and GPU Architectures
        3.4 Effects of the Work-group Size
            3.4.1 Occupancy
            3.4.2 Global Memory Coalescing
            3.4.3 Cache Contention
            3.4.4 Amount of Work
            3.4.5 Work-group Scheduling and Barriers
            3.4.6 Benchmark Applications
        3.5 Auto-tuning Work-group Size
            3.5.1 Workload Tuner
            3.5.2 Non-coalescing Factor Tuner
            3.5.3 Concurrency Tuner
            3.5.4 Exhaustive-search Tuner
        3.6 Evaluation
            3.6.1 Overall Tuning Quality
            3.6.2 Overall Tuning Cost
            3.6.3 Effect of the Workload Tuner
            3.6.4 Effect of the Non-coalescing Factor Tuner
            3.6.5 Effect of the Concurrency Tuner
        3.7 Conclusions
    4 Quantization for Deep Learning Programs
        4.1 Introduction
        4.2 Related Work
        4.3 Background
            4.3.1 Integer Quantization
            4.3.2 Standard Techniques Used
        4.4 Quantization Framework
            4.4.1 Inference Phase
            4.4.2 Training Phase
            4.4.3 Adding Noise to the Scale
            4.4.4 Adaptively Adjusting Precisions
            4.4.5 Computation of Histogram
        4.5 Experiments
            4.5.1 Image Classification Tasks
            4.5.2 Natural Language Processing
        4.6 Conclusions
    5 Conclusion
    Acknowledgements
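    The abstract above describes quantization as reducing precision without losing significant output accuracy. As a rough illustration only, the following is a minimal sketch of generic symmetric uniform integer quantization in Python; it is not the thesis's framework, and the function names, the 8-bit default, and the sample weights are all assumptions made for this example.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with one shared scale (symmetric, uniform)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]              # illustrative values, not from the thesis
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
# Rounding moves each value by at most about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

    With this scheme the worst-case rounding error is about half of `scale`, which is why fewer bits (a larger `scale`) trade accuracy for cheaper arithmetic.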

    Activity recognition from videos with parallel hypergraph matching on GPUs

    In this paper, we propose a method for activity recognition from videos based on sparse local features and hypergraph matching. We exploit special properties of the temporal domain in the data to derive a sequential and fast graph matching algorithm for GPUs. Traditionally, graphs and hypergraphs are frequently used to recognize complex and often non-rigid patterns in computer vision, either through graph matching or point-set matching with graphs. Most formulations resort to the minimization of a difficult discrete energy function mixing geometric or structural terms with data-attached terms involving appearance features. Traditional methods solve this minimization problem approximately, for instance with spectral techniques. In this work, instead of solving the problem approximately, the exact solution for the optimal assignment is calculated in parallel on GPUs. The graphical structure is simplified and regularized, which allows us to derive an efficient recursive minimization algorithm. The algorithm distributes subproblems over the calculation units of a GPU, which solves them in parallel, allowing the system to run faster than real-time on medium-end GPUs.

    Soft MIMO Detection on Graphics Processing Units and Performance Study of Iterative MIMO Decoding

    In this thesis we present an implementation of soft Multi Input Multi Output (MIMO) detection, based on the single tree search algorithm, on Graphics Processing Units (GPUs). We compare its performance on different GPUs and a Central Processing Unit (CPU). We also carry out a performance study of iterative decoding algorithms and show that increasing the number of outer iterations further improves error rate performance. GPUs are specialized devices designed to accelerate graphics processing. They are massively parallel devices that can run thousands of threads simultaneously. Because of their tremendous processing power, there is increasing interest in using them for scientific and general-purpose computations. Hence companies such as Nvidia and Advanced Micro Devices (AMD) have started to support General Purpose GPU (GPGPU) applications. Nvidia introduced the Compute Unified Device Architecture (CUDA) to program its GPUs. Efforts have also been made to define a standard language for parallel computing that can be used across platforms; OpenCL is the first such language supported by all major GPU and CPU vendors. A MIMO detector has high computational complexity. We implemented a soft MIMO detector on GPUs and studied its throughput and latency. We show that a GPU can deliver a throughput of up to 4 Mbps for a soft detection algorithm, which is more than sufficient for most general-purpose tasks such as voice communication. Compared to a CPU, a throughput increase of ~7x is achieved. We also compare the performance of two GPUs, one with low and one with high computational power. These comparisons show the effect of thread serialization on the algorithm: the lower-end GPU's execution time curve shows a slope of 1/2. To further improve error rate performance, iterative decoding techniques are employed, in which a feedback path connects the detector and the decoder. With an eye towards GPU implementation, we explore these algorithms. Better error rate performance, however, comes at the price of higher power dissipation and more latency. Through simulations we show that one can predict from the Signal to Noise Ratio (SNR) how many iterations are needed to reach an acceptable Bit Error Rate (BER) and Frame Error Rate (FER). Iterative decoding yields an SNR gain of ~1.5 dB when the number of outer iterations is increased from zero. To reduce complexity, one can adjust the number of candidates the algorithm may generate. We show that, for a 4x4 MIMO system using 16-QAM modulation, a candidate list of 128 is not sufficient for acceptable error rate performance, whereas list sizes of 512 and 1024 give comparable performance.

    Simulation methodologies for mobile GPUs

    GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration.

    Efficient Algorithms for Coastal Geographic Problems

    The increasing performance of computers has made it possible to solve algorithmically problems for which manual and possibly inaccurate methods were previously used. Nevertheless, one must still pay attention to the performance of an algorithm if huge datasets are used or if the problem is computationally difficult. Two geographic problems are studied in the articles included in this thesis. In the first problem the goal is to determine distances from points, called study points, to shorelines in predefined directions. Together with other information, mainly related to wind, these distances can be used to estimate wave exposure in different areas. In the second problem the input consists of a set of sites where water quality observations have been made and the results of the measurements at those sites. The goal is to select a subset of the observation sites such that water quality is still measured with sufficient accuracy when monitoring at the other sites is stopped to reduce cost. Most of the thesis concentrates on the first problem, known as the fetch length problem. The main challenge is that the two-dimensional map is represented as a set of polygons with millions of vertices in total, and the distances may also be computed for millions of study points in several directions. Efficient algorithms are developed for the problem, one of them approximate and the others exact except for rounding errors. The solutions also differ in that three of them are targeted at serial operation or a small number of CPU cores, whereas one, together with its further developments, is also suitable for parallel machines such as GPUs.
    Finnish abstract, translated: The growth in computer performance has made it possible to solve algorithmically problems that were previously addressed with labor-intensive and possibly inaccurate methods. Nevertheless, attention must sometimes still be paid to algorithm performance because of the large volume of input data or the computational difficulty of the problem. The articles included in the dissertation study two geographic problems. In the first, distances must be determined from points at sea to the nearest shoreline in predefined directions. Using the distances together with data on wind strength, it is possible to estimate, for example, wave intensity. In the second problem, the input is a set of monitoring stations and data previously collected at them on various water quality parameters, such as turbidity and nutrient levels. The task is to select a subset of the stations such that water quality can still be monitored with sufficient accuracy when measurements at the other sites are discontinued to save costs. The dissertation concentrates mainly on solving the first problem, that of directed distances. The challenge is that the two-dimensional map typically represents the shoreline as a set of polygons consisting of millions of vertices, and distances must be computed for millions of study points in dozens of directions. Efficient solution methods are developed for the problem, one of them approximate and the others exact apart from rounding errors. The solutions also differ in that three of the methods are designed to run serially or on a small number of processor cores, whereas one method, with its subsequent improvements, is also suited to massively parallel devices such as GPUs. In the water quality problem, the given set of stations has a large number of possible subsets. In addition, the task uses time-consuming operations such as linear regression, which further limits how many subsets can be examined. The solution therefore uses heuristics, which do not necessarily produce an optimal result.
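    To make the fetch length problem described above concrete, here is a minimal brute-force sketch in Python: it casts a ray from a study point in a given direction and tests it against every polygon edge. This is only a naive baseline for illustration, not one of the thesis's algorithms (which target maps with millions of vertices); the function name `fetch_length`, the angle convention (degrees counterclockwise from the positive x-axis), and the sample `island` polygon are assumptions.

```python
import math


def fetch_length(point, angle_deg, polygons):
    """Distance from `point` to the nearest polygon edge along a direction.

    Returns math.inf when the ray hits no edge. Each polygon is a list of
    (x, y) vertices; consecutive vertices (and last-to-first) form edges.
    """
    px, py = point
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    best = math.inf
    for poly in polygons:
        for i in range(len(poly)):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
            ex, ey = x2 - x1, y2 - y1
            denom = dx * ey - dy * ex          # cross(direction, edge)
            if abs(denom) < 1e-12:
                continue                       # ray is parallel to this edge
            wx, wy = x1 - px, y1 - py
            # Solve point + t * direction == vertex1 + u * edge for t and u.
            t = (wx * ey - wy * ex) / denom    # distance along the unit-length ray
            u = (wx * dy - wy * dx) / denom    # position along the edge, 0..1
            if t >= 0 and 0 <= u <= 1:
                best = min(best, t)
    return best


# Hypothetical island: a square lying east of a study point at the origin.
island = [(2, -1), (3, -1), (3, 1), (2, 1)]
east = fetch_length((0, 0), 0, [island])       # ray hits the island's near side
west = fetch_length((0, 0), 180, [island])     # open water in this direction
```

    The cost is proportional to (study points) x (directions) x (edges), which is the scaling the abstract identifies as the main challenge and the reason more efficient algorithms are needed.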

    Performance engineering of data-intensive applications

    Data-intensive programs deal with big chunks of data and often contain compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. Such a requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the code to other platforms. In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights into thread communications and relevant code optimizations. Then, by narrowing down our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNets) in the inference and training phases. Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that the data shared between a pair of threads should be reused at reasonably short intervals to preserve data locality, yet existing profilers neglect this and mainly report the communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. Our experiments show that applying relevant optimizations improves the performance in Rodinia benchmarks by up to 56%. For the next contribution, we developed a framework for automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline with runtimes close to vendor deep-learning libraries, and the minimum required programming effort confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising the numerical stability. Lastly, we investigate possible methods to scale the performance of ConvNets in training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression. Thus, AI model training can be easily scaled to a multitude of compute nodes, leading to faster model design with lower operating costs. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for a more pertinent deployment. Altogether, this work demonstrates the necessity and shows the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
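    For readers unfamiliar with the Winograd convolutions mentioned above, the following Python sketch shows the textbook one-dimensional F(2,3) case: two outputs of a 3-tap filter computed with four data multiplications instead of six. It illustrates the general technique only and is not the dissertation's generated kernels or its symbolic execution method; the function names and sample values are assumptions.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs, using 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs with 4 multiplications between data and filter terms."""
    # Filter-side terms depend only on g, so they can be precomputed once
    # and reused for every input tile.
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1.0, 2.0, 3.0, 4.0]       # illustrative input tile
taps = [0.5, -1.0, 2.0]           # illustrative filter
reference = direct_f23(data, taps)
fast = winograd_f23(data, taps)
```

    The saving comes at the price of extra additions and the division by two in the filter transform, which is why numerical stability is a concern worth checking when such transforms are optimized further.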

    PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

    With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging considering the high computation and storage demands, specifically if real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate, but not hardware friendly; structured pruning is coarse-grained and hardware-efficient, but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.
    Comment: To be published in the Proceedings of Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 20