A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
We present a new parallel code for computing the dynamical evolution of
collisional N-body systems with up to N~10^7 particles. Our code is based on
the Hénon Monte Carlo method for solving the Fokker-Planck equation, and
makes assumptions of spherical symmetry and dynamical equilibrium. The
principal algorithmic developments involve optimizing data structures, and the
introduction of a parallel random number generation scheme, as well as a
parallel sorting algorithm, required to find nearest neighbors for interactions
and to compute the gravitational potential. The new algorithms we introduce
along with our choice of decomposition scheme minimize communication costs and
ensure optimal distribution of data and workload among the processing units.
The implementation uses the Message Passing Interface (MPI) library for
communication, which makes it portable to many different supercomputing
architectures. We validate the code by calculating the evolution of clusters
with initial Plummer distribution functions up to core collapse with the number
of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find
that our results are in good agreement with self-similar core-collapse
solutions, and the core collapse times generally agree with expectations from
the literature. Also, we observe good total energy conservation, within less
than 0.04% throughout all simulations. We analyze the performance of the code,
and demonstrate near-linear scaling of the runtime with the number of
processors up to 64 processors for N=10^5, 128 for N=10^6 and 256 for N=10^7.
The runtime saturates when more processors are added beyond
these limits, which is characteristic of the parallel sorting algorithm. The
resulting maximum speedups we achieve are approximately 60x, 100x, and 220x,
respectively.
Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement
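The scaling figures quoted in this abstract can be turned into parallel efficiencies (speedup divided by processor count). The sketch below assumes that each maximum speedup is reached at the corresponding processor limit named in the abstract (64, 128, and 256); the abstract does not state this pairing explicitly, so the resulting numbers are only indicative.

```python
# Approximate (processor count, maximum speedup) pairs taken from the abstract.
# Assumption: each maximum speedup is attained at the listed processor count.
reported = {
    "N=1e5": (64, 60.0),
    "N=1e6": (128, 100.0),
    "N=1e7": (256, 220.0),
}


def parallel_efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency is speedup divided by the number of processors."""
    return speedup / processors


efficiencies = {
    label: parallel_efficiency(speedup, procs)
    for label, (procs, speedup) in reported.items()
}

for label, eff in efficiencies.items():
    print(f"{label}: efficiency ~ {eff:.2f}")
```

Under that assumption the efficiencies are roughly 0.94, 0.78, and 0.86, which is consistent with the "near-linear" description.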
Towards a verified compiler prototype for the synchronous language SIGNAL
SIGNAL belongs to the family of synchronous languages, which are widely used in the design of safety-critical real-time systems such as avionics, space systems, and nuclear power plants. This paper reports a compiler prototype for SIGNAL. Compared with the existing SIGNAL compiler, we propose a new intermediate representation (named S-CGA, a variant of clocked guarded actions) so that more synchronous programs can be integrated into our compiler prototype in the future. The front-end of the compiler, i.e., the translation from SIGNAL to S-CGA, is presented, and the proof of semantics preservation is mechanized in the theorem prover Coq. Moreover, we present the back-end of the compiler, including sequential code generation and multithreaded code generation with time-predictable properties. With the rising importance of multi-core processors in safety-critical embedded systems and cyber-physical systems (CPS), there is a growing need for model-driven generation of multithreaded code and for mapping it onto multi-core platforms. We propose a time-predictable multi-core architecture model in the Architecture Analysis and Design Language (AADL), and map the multithreaded code to this model.
Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors
Master's thesis -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), August 2020. Dongsuk Jeon.
While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generation optimizes the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application code so that each core is used at full capacity regardless of its microarchitecture. We also propose adding a simple but powerful backend optimization pass to the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of the Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, and by 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.
Chapter 1 Introduction
1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors
1.2 Proposed approach
Chapter 2 Related works and Motivation
2.1 Function Multi Versioning (FMV)
2.2 General Matrix Multiplication (GEMM)
2.3 Function Multi Versioning (FMV)
Chapter 3 Microarchitecture-Aware Code Generation
3.1 FMV Scheme for Heterogeneous Multi-Core Processors
3.1.1 Runtime Selector
3.1.2 Multi-Versioned Functions
3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation
3.2.1 LLVM Backend Passes
3.2.2 Load Split Optimization Pass
3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow
Chapter 4 Experimental Evaluation
4.1 Experimental Setup
4.2 Floating-point CNN Models
4.3 Quantized CNN Models
4.4 Question Answering Models
Chapter 5 Conclusion
Bibliography
Abstract in Korean
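The function multi-versioning idea in this abstract (several versions of a hot function plus a runtime selector) can be illustrated with a small dispatch table. This is a minimal sketch, not the thesis's LLVM-based flow: the names `gemm_generic`, `gemm_a55`, `gemm_a75`, `VERSIONS`, and `select_version` are hypothetical, the core name is passed in as a string rather than detected, and the "tuned" variants only differ in loop structure as stand-ins for real microarchitecture-specific code.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def gemm_generic(a: Matrix, b: Matrix) -> Matrix:
    """Baseline matrix multiply (i-j-k loop order)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def gemm_a55(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a version tuned for a small in-order core (i-k-j order)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def gemm_a75(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a version tuned for a big out-of-order core."""
    columns = list(zip(*b))  # transpose b once, then take row-by-column dot products
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Hypothetical table mapping a core name to its function version.
VERSIONS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {
    "cortex-a55": gemm_a55,
    "cortex-a75": gemm_a75,
}


def select_version(core_name: str) -> Callable[[Matrix, Matrix], Matrix]:
    """Runtime selector: choose the version for this core, else the generic one."""
    return VERSIONS.get(core_name.lower(), gemm_generic)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = select_version("Cortex-A75")(a, b)
print(result)
```

All versions must compute the same result; only their performance on a given core is meant to differ, which is why the selector can fall back to the generic version on an unknown core.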
Multi-core Code Generation from Polychronous Programs with Time-Predictable Properties (ACVI 2014)
Workshop of ACM/IEEE 17th International Conference on Model Driven Engineering Languages and Systems (MoDELS 2014)
Synchronous programming models capture concurrency in computation quite naturally, especially in their dataflow multi-clock (polychronous) flavor. With the rising importance of multi-core processors in safety-critical embedded systems and cyber-physical systems (CPS), there is a growing need for model-driven generation of multi-threaded code for multi-core systems. This paper proposes a method for building time-predictable systems on multi-core platforms, based on synchronous-model development. At the modeling level, the synchronous abstraction allows deterministic time semantics, so synchronous programming is a good choice for time-predictable system design. At the compiler level, the verified compiler from the synchronous language SIGNAL to our intermediate representation (S-CGA, a variant of guarded actions) and then to multi-threaded code preserves time predictability. At the platform level, we propose a time-predictable multi-core architecture model in AADL (Architecture Analysis and Design Language), and then we map the multi-threaded code to this model. Our method therefore integrates time predictability across several design layers.
Spherical harmonic transform with GPUs
We describe an algorithm for computing an inverse spherical harmonic
transform suitable for graphic processing units (GPU). We use CUDA and base our
implementation on a Fortran90 routine included in a publicly available parallel
package, S2HAT. We focus our attention on the two major sequential steps
involved in the transform computation, retaining the efficient parallel
framework of the original code. We detail optimization techniques used to
enhance the performance of the CUDA-based code and contrast them with those
implemented in the Fortran90 version. We also present performance comparisons
of a single CPU plus GPU unit with the S2HAT code running on either a single or
4 processors. In particular we find that use of the latest generation of GPUs,
such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms
by as much as 18 times with respect to S2HAT executed on one core, and by as
much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance
being limited by the Fast Fourier transforms. The work presented here has been
performed in the context of the Cosmic Microwave Background simulations and
analysis. However, we expect that the developed software will be of more
general interest and applicability.
0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit
Near-term quantum computers will soon reach sizes that are challenging to
directly simulate, even when employing the most powerful supercomputers. Yet,
the ability to simulate these early devices using classical computers is
crucial for calibration, validation, and benchmarking. In order to make use of
the full potential of systems featuring multi- and many-core processors, we use
automatic code generation and optimization of compute kernels, which also
enables performance portability. We apply a scheduling algorithm to quantum
supremacy circuits in order to reduce the required communication and simulate a
45-qubit circuit on the Cori II supercomputer using 8,192 nodes and 0.5
petabytes of memory. To our knowledge, this constitutes the largest quantum
circuit simulation to date. Our highly tuned kernels in combination with
the reduced communication requirements allow an improvement in time-to-solution
over state-of-the-art simulations by more than an order of magnitude at every
scale.
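The 0.5 petabyte figure is consistent with storing a full 45-qubit state vector. The sketch below assumes one double-precision complex amplitude (16 bytes) per basis state and an even split across the 8,192 nodes; both are assumptions for illustration, and the calculation ignores any communication buffers or other working memory.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (two 8-byte floats)


def state_vector_bytes(n_qubits: int, bytes_per_amplitude: int = BYTES_PER_AMPLITUDE) -> int:
    """Memory needed for all 2**n_qubits amplitudes of a full state vector."""
    return (2 ** n_qubits) * bytes_per_amplitude


total_bytes = state_vector_bytes(45)
total_pib = total_bytes / 2 ** 50            # pebibytes
per_node_gib = total_bytes / 8192 / 2 ** 30  # gibibytes per node, even split assumed

print(f"total: {total_pib} PiB, per node: {per_node_gib} GiB")
```

Each additional qubit doubles the requirement, which is why simulations of this kind reach supercomputer memory limits within a few qubits of this size.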
Efficient Parallelization of Short-Range Molecular Dynamics Simulations on Many-Core Systems
This article introduces a highly parallel algorithm for molecular dynamics
simulations with short-range forces on single node multi- and many-core
systems. The algorithm is designed to achieve high parallel speedups for
strongly inhomogeneous systems like nanodevices or nanostructured materials. In
the proposed scheme the calculation of the forces and the generation of
neighbor lists is divided into small tasks. The tasks are then executed by a
thread pool according to a dependent task schedule. This schedule is
constructed in such a way that a particle is never accessed by two threads at
the same time. Benchmark simulations on a typical 12-core machine show that the
described algorithm achieves excellent parallel efficiencies above 80 % for
different kinds of systems and all numbers of cores. For inhomogeneous systems
the speedups are strongly superior to those obtained with spatial
decomposition. Further benchmarks were performed on an Intel Xeon Phi
coprocessor. These simulations demonstrate that the algorithm scales well to
large numbers of cores.
Comment: 12 pages, 8 figures
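The constraint that a particle is never accessed by two threads at the same time can be illustrated with conflict-free batches of pair tasks. The abstract does not describe how the authors build their dependent task schedule, so this sketch substitutes a simple greedy batching with a barrier between batches; `build_batches`, `compute_forces`, and the toy one-dimensional linear force are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Pair = Tuple[int, int]


def build_batches(pairs: List[Pair]) -> List[List[Pair]]:
    """Greedily group pair tasks so that tasks in one batch share no particle."""
    batches = []  # each entry: (set of particles already used, list of tasks)
    for pair in pairs:
        particles = set(pair)
        for used, members in batches:
            if used.isdisjoint(particles):
                used |= particles  # in-place update of the batch's particle set
                members.append(pair)
                break
        else:
            batches.append((set(particles), [pair]))
    return [members for _, members in batches]


def compute_forces(positions: List[float], pairs: List[Pair], workers: int = 4) -> List[float]:
    """Accumulate a toy linear pair force, running each batch on a thread pool."""
    forces = [0.0] * len(positions)

    def apply_pair(pair: Pair) -> None:
        i, j = pair
        f = positions[j] - positions[i]  # toy force on particle i, for illustration only
        forces[i] += f
        forces[j] -= f                   # equal and opposite force on particle j

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in build_batches(pairs):
            # Consuming the map waits for the whole batch, acting as a barrier,
            # so no two concurrently running tasks ever touch the same particle.
            list(pool.map(apply_pair, batch))
    return forces


batches = build_batches([(0, 1), (2, 3), (1, 2)])
forces = compute_forces([0.0, 1.0, 3.0], [(0, 1), (1, 2), (0, 2)])
print(batches, forces)
```

Because tasks within a batch touch disjoint particles, the force updates need no locks; the cost of this simplification is a global barrier per batch, which a dependency-driven schedule such as the one the abstract describes would avoid.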