712 research outputs found
๋ฉํฐ ํ์คํน ํ๊ฒฝ์์ GPU๋ฅผ ์ฌ์ฉํ ๋ฒ์ฉ์ ๊ณ์ฐ ์์ฉ์ ํจ์จ์ ์ธ ์์คํ ์์ ํ์ฉ์ ์ํ GPU ์์คํ ์ต์ ํ
ํ์๋
ผ๋ฌธ (๋ฐ์ฌ) -- ์์ธ๋ํ๊ต ๋ํ์ : ๊ณต๊ณผ๋ํ ์ ๊ธฐยท์ปดํจํฐ๊ณตํ๋ถ, 2020. 8. ์ผํ์.Recently, General Purpose GPU (GPGPU) applications are playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature exists in these applications is that all of them require massive computation power, which follows the high parallelism characteristics of the graphics processing unit (GPU). However, because of the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU systems resources to achieve the best performance of the GPU since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resource in parallel. However, the current GPU mul- titasking scheme focuses just on co-launching the kernels rather than making them execute more efficiently. Besides, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to be optimized, since the GPGPU applications and the GPU system are unaware of the feature of each other. In this dissertation, we claim that using the support from framework between the GPU system and the GPGPU applications without modifying the application can yield better performance. We design and implement the frame- work while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscripted in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a
i
multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches can solve the problems related to GPGPU applications than the existing approaches while delivering better performance.์ต๊ทผ ๋ฒ์ฉ GPU (GPGPU) ์์ฉ ํ๋ก๊ทธ๋จ์ ๊ณ ์ฑ๋ฅ ์ปดํจํ
(HPC) ๋ฐ ๋ฅ ๋ฌ๋ (DL)๊ณผ ๊ฐ์ ๋ค์ํ ์ฐ๊ตฌ ๋ถ์ผ์์ ํต์ฌ์ ์ธ ์ญํ ์ ์ํํ๊ณ ์๋ค. ์ด๋ฌํ ์ ์ฉ ๋ถ์ผ์ ๊ณตํต์ ์ธ ํน์ฑ์ ๊ฑฐ๋ํ ๊ณ์ฐ ์ฑ๋ฅ์ด ํ์ํ ๊ฒ์ด๋ฉฐ ๊ทธ๋ํฝ ์ฒ๋ฆฌ ์ฅ์น (GPU)์ ๋์ ๋ณ๋ ฌ ์ฒ๋ฆฌ ํน์ฑ๊ณผ ๋งค์ฐ ์ ํฉํ๋ค. ๊ทธ๋ฌ๋ GPU ์์คํ
์ ํน์ ์ ํ์ ์์ฉ ํ๋ก๊ทธ๋จ์ ์ต์ ํํ๋ ๋์ ๋ชจ๋ ์์ฉ ํ๋ก๊ทธ๋จ์ ์์คํ
์์ค์ ๊ณต์ ์ฑ์ ์ ๊ณตํ๋๋ก ์ค๊ณ๋์ด ์์ผ๋ฉฐ ๊ฐ GPGPU ์์ฉ ํ๋ก๊ทธ๋จ์ ์์ ์ฌ์ฉ ํจํด์ด ๋ค์ํ๊ธฐ ๋๋ฌธ์ ๋จ์ผ ์์ฉ ํ๋ก๊ทธ๋จ์ด GPU ์์คํ
์ ๋ฆฌ์์ค๋ฅผ ์์ ํ ํ์ฉํ์ฌ GPU์ ์ต๊ณ ์ฑ๋ฅ์ ๋ฌ์ฑ ํ ์๋ ์๋ค.
๋ฐ๋ผ์ GPU ๋ฉํฐ ํ์คํน์ ๋ค์ํ ๋ฆฌ์์ค ์ฌ์ฉ ํจํด์ ๊ฐ์ง ์ฌ๋ฌ ์์ฉ ํ๋ก๊ทธ ๋จ์ ํจ๊ป ๋ฐฐ์นํ์ฌ GPU ๋ฆฌ์์ค๋ฅผ ๊ณต์ ํจ์ผ๋ก์จ GPU ์์ ์ฌ์ฉ๋ฅ ์ ํ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ ์ ์๋ค. ๊ทธ๋ฌ๋ ๊ธฐ์กด GPU ๋ฉํฐ ํ์คํน ๊ธฐ์ ์ ์์ ์ฌ์ฉ๋ฅ ๊ด์ ์์ ์ ์ฉ ํ๋ก๊ทธ๋จ์ ํจ์จ์ ์ธ ์คํ๋ณด๋ค ๊ณต๋์ผ๋ก ์คํํ๋ ๋ฐ ์ค์ ์ ๋๋ค. ๋ํ ํ์ฌ GPU ๋ฉํฐ ํ์คํน ๊ธฐ์ ์ ์คํ ์์ค๊ฐ ์๋๋ฏ๋ก ์์ฉ ํ๋ก๊ทธ๋จ๊ณผ GPU ์์คํ
์ด ์๋ก์ ๊ธฐ๋ฅ์ ์ธ์ํ์ง ๋ชปํ๊ธฐ ๋๋ฌธ์ ์ต์ ํํ๊ธฐ๊ฐ ๋ ์ด๋ ค์ธ ์๋ ์๋ค.
๋ณธ ๋
ผ๋ฌธ์์๋ ์์ฉ ํ๋ก๊ทธ๋จ์ ์์ ์์ด GPU ์์คํ
๊ณผ GPGPU ์์ฉ ์ฌ ์ด์ ํ๋ ์์ํฌ๋ฅผ ํตํด ์ฌ์ฉํ๋ฉด ๋ณด๋ค ๋์ ์์ฉ์ฑ๋ฅ๊ณผ ์์ ์ฌ์ฉ์ ๋ณด์ผ ์ ์์์ ์ฆ๋ช
ํ๊ณ ์ ํ๋ค. ๊ทธ๋ฌ๊ธฐ ์ํด GPU ํ์คํฌ ๊ด๋ฆฌ ํ๋ ์์ํฌ๋ฅผ ๊ฐ๋ฐํ์ฌ GPU ๋ฉํฐ ํ์คํน ํ๊ฒฝ์์ ๋ฐ์ํ๋ ๋ ๊ฐ์ง ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ์๋ค. ์ฒซ์งธ, ๋ฉํฐ ํ ์คํน ํ๊ฒฝ์์ GPU ๋ฉ๋ชจ๋ฆฌ ์ด๊ณผ ํ ๋นํ ์ ์๋ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๊ธฐ ์ํด ํธ์คํธ ๋ฉ๋ชจ๋ฆฌ์ ๋๋ฐ์ด์ค ๋ฉ๋ชจ๋ฆฌ์ ์ฒดํฌํฌ์ธํธ ๋ฐฉ์์ ๋์
ํ์๋ค. ๋์งธ, ๋ฉํฐ ํ์คํน ํ ๊ฒฝ์์ GPU ์์ ์ฌ์ฉ์จ ์ ํ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๊ธฐ ์ํด ๋์ฑ ์ธ๋ถํ ๋ GPU ์ปค๋ ๊ด๋ฆฌ ์์คํ
์ ์ ์ํ์๋ค.
๋ณธ ๋
ผ๋ฌธ์์๋ ์ ์ํ ๋ฐฉ๋ฒ๋ค์ ํจ๊ณผ๋ฅผ ์ฆ๋ช
ํ๊ธฐ ์ํด ์ค์ GPU ์์คํ
์
92
๊ตฌํํ๊ณ ๊ทธ ์ฑ๋ฅ์ ํ๊ฐํ์๋ค. ์ ์ํ ์ ๊ทผ๋ฐฉ์์ด ๊ธฐ์กด ์ ๊ทผ ๋ฐฉ์๋ณด๋ค GPGPU ์์ฉ ํ๋ก๊ทธ๋จ๊ณผ ๊ด๋ จ๋ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ ์ ์์ผ๋ฉฐ ๋ ๋์ ์ฑ๋ฅ์ ์ ๊ณตํ ์ ์์์ ํ์ธํ ์ ์์๋ค.Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Contribution . 7
1.3 Outline 8
Chapter 2 Background 10
2.1 GraphicsProcessingUnit(GPU) and CUDA 10
2.2 CheckpointandRestart . 11
2.3 ResourceSharingModel. 11
2.4 CUDAContext 12
2.5 GPUThreadBlockScheduling . 13
2.6 Multi-ProcessServicewithHyper-Q 13
Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16
3.1 Motivation 16
3.2 RelatedWork. 18
3.3 DesignandImplementation . 20
3.3.1 System Design 21
3.3.2 CUDAAPIwrappingmodule 22
3.3.3 Scheduler . 28
3.4 Evaluation. 31
3.4.1 Evaluationsetup . 31
3.4.2 OverheadofFlexGPU 32
3.4.3 Performance with GPU Benchmark Suits 34
3.4.4 Performance with Real-world Workloads 36
3.4.5 Performance of workloads composed of multiple applications 39
3.5 Summary 42
Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43
4.1 Motivation 43
4.2 RelatedWork. 45
4.2.1 GPUresourcesharing 45
4.2.2 GPUscheduling . 46
4.3 DesignandImplementation . 47
4.3.1 SystemArchitecture . 47
4.3.2 CUDAAPIWrappingModule . 49
4.3.3 smCompactorRuntime . 50
4.3.4 ImplementationDetails . 57
4.4 Analysis on the relation between performance and workload usage pattern 60
4.4.1 WorkloadDefinition . 60
4.4.2 Analysisonperformancesaturation 60
4.4.3 Predict the necessary SMs and thread blocks for best performance . 64
4.5 Evaluation. 69
4.5.1 EvaluationMethodology. 70
4.5.2 OverheadofsmCompactor . 71
4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72
4.5.4 Performance with Concurrent Kernel and Resource Sharing 74
4.6 Summary . 79
Chapter 5 Conclusion. 81
์์ฝ. 92Docto
ะะพัะพััะฑะฐ ัะบัะฐัะฝััะบะพั ะณัะพะผะฐะดััะบะพััั ะทะฐ ัะพะทะฒโัะทะฐะฝะฝั ะผะพะฒะฝะพั ะฟัะพะฑะปะตะผะธ ะฒ ะฝะฐัะพะดะฝะธั ัะบะพะปะฐั (ะดััะณะฐ ะฟะพะปะพะฒะธะฝะฐ ะฅะะฅ โ ะฟะพัะฐัะพะบ ะฅะฅ ัั.)
(uk) ะฃ ััะฐััั ะฒะธัะฒััะปะตะฝะพ ะผะฐะปะพะฒัะดะพะผั ััะพััะฝะบะธ ัััะพััั ะฑะพัะพััะฑะธ ัะบัะฐัะฝััะบะพั ะณัะพะผะฐะดััะบะพััั ะทะฐ ัะพะทะฒโัะทะฐะฝะฝั ะผะพะฒะฝะพั ะฟัะพะฑะปะตะผะธ ะฒ ะฝะฐัะพะดะฝะธั
ัะบะพะปะฐั
ั ะดััะณัะน ะฟะพะปะพะฒะธะฝั ะฅะะฅ โ ะฝะฐ ะฟะพัะฐัะบั ะฅะฅ ัั. ะะฐะถะปะธะฒั ัะพะปั ั ัะธั
ะทะผะฐะณะฐะฝะฝัั
ะฒัะดัะณัะฐะปะธ ะฟะตะดะฐะณะพะณััะฝั ะทโัะทะดะธ ัะฐ ะทโัะทะดะธ ะพัััะฒ-ะทะฐะบะพะฝะพะฒัะธัะตะปัะฒ.(en) The article covers the little-known pages of the history of the struggle of the Ukrainian Community for the decision of the language problem in national schools in the second half of the 19th century and at the beginning of the 20th century. Pedagogical congresses played an important part in those contests
A posteriori error bounds for the block-Lanczos method for matrix function approximation
We extend the error bounds from [SIMAX, Vol. 43, Iss. 2, pp. 787-811 (2022)]
for the Lanczos method for matrix function approximation to the block
algorithm. Numerical experiments suggest that our bounds are fairly robust to
changing block size and have the potential for use as a practical stopping
criteria. Further experiments work towards a better understanding of how
certain hyperparameters should be chosen in order to maximize the quality of
the error bounds, even in the previously studied block-size one case
Evaluation of diffuse mismatch model for phonon scattering at disordered interfaces
Diffuse phonon scattering strongly affects the phonon transport through a
disordered interface. The often-used diffuse mismatch model assumes that
phonons lose memory of their origin after being scattered by the interface.
Using mode-resolved atomic Green's function simulation, we demonstrate that
diffuse phonon scattering by a single disordered interface cannot make a phonon
lose its memory and thus the applicability of diffusive mismatch model is
limited. An analytical expression for diffuse scattering probability based on
the continuum approximation is also derived and shown to work reasonably well
at low frequencies.Comment: 13 pages, 7 figure
SpinView: General Interactive Visual Analysis Tool for Multiscale Computational Magnetism
Multiscale magnetic simulations, including micromagnetic and atomistic spin
dynamics simulations, are widely used in the study of complex magnetic systems
over a wide range of spatial and temporal scales. The advances in these
simulation technologies have generated considerable amounts of data. However, a
versatile and general tool for visualization, filtering, and denoising this
data is largely lacking. To overcome these limitations, we have developed
SpinView, a general interactive visual analysis tool for graphical exploration
and data distillation. Combined with dynamic filters and a built-in database,
it is possible to generate reproducible publication-quality images, videos, or
portable interactive webpages within seconds. Since the basic input to SpinView
is a vector field, it can be directly integrated with any spin dynamics
simulation tool. With minimal effort on the part of the user, SpinView delivers
a simplified workflow, speeds up analysis of complex datasets and trajectories,
and enables new types of analysis and insight. SpinView is available from
https://mxjk851.github.io/SpinView
Finding influential users of web event in social media
Users of social media have different influences on the evolution of a Web event. Finding influential users could benefit such information services as recommendation and market analysis. However, most of the existing methods are only based on social networks of users or user behaviors while the role of the contents contributed by users in social media is ignored. In fact, a Web event evolves with both user behaviors and the contents. This paper proposes an approach to find influential users by extracting user behavior network and association network of words within the contents and then uses PageRank algorithm and HITS algorithm to calculate the influence of users on the integration of two networks. The proposed approach is effective on several real-world datasets
Autoscaling Method for Docker Swarm Towards Bursty Workload
The autoscaling mechanism of cloud computing can automatically adjust computing resources according to user needs, improve quality of service (QoS) and avoid over-provision. However, the traditional autoscaling methods suffer from oscillation and degradation of QoS when dealing with burstiness. Therefore, the autoscaling algorithm should consider the effect of bursty workloads. In this paper, we propose a novel AmRP (an autoscaling method that combines reactive and proactive mechanisms) that uses proactive scaling to launch some containers in advance, and then the reactive module performs vertical scaling based on existing containers to increase resources rapidly. Our method also integrates burst detection to alleviate the oscillation of the scaling algorithm and improve the QoS. Finally, we evaluated our approach with state-of-the-art baseline scaling methods under different workloads in a Docker Swarm cluster. Compared with the baseline methods, the experimental results show that AmRP has fewer SLA violations when dealing with bursty workloads, and its resource cost is also lower
- โฆ