Magnetic resonance multitasking for motion-resolved quantitative cardiovascular imaging.
Quantitative cardiovascular magnetic resonance (CMR) imaging can be used to characterize fibrosis, oedema, ischaemia, inflammation and other disease conditions. However, the need to reduce artefacts arising from body motion through a combination of electrocardiography (ECG) control, respiration control, and contrast-weighting selection makes CMR exams lengthy. Here, we show that physiological motions and other dynamic processes can be conceptualized as multiple time dimensions that can be resolved via low-rank tensor imaging, allowing for motion-resolved quantitative imaging with up to four time dimensions. This continuous-acquisition approach, which we name cardiovascular MR multitasking, captures - rather than avoids - motion, relaxation and other dynamics to efficiently perform quantitative CMR without the use of ECG triggering or breath holds. We demonstrate that CMR multitasking allows for T1 mapping, T1-T2 mapping and time-resolved T1 mapping of myocardial perfusion without ECG information and/or in free-breathing conditions. CMR multitasking may provide a foundation for the development of setup-free CMR imaging for the quantitative evaluation of cardiovascular health.
GPU System Optimization for Efficient System Resource Utilization of General-Purpose GPU Applications in a Multitasking Environment
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Yeom Heon-young.
Recently, general-purpose GPU (GPGPU) applications have come to play key roles in many research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature of these applications is that they all require massive computation power, which matches the highly parallel nature of the graphics processing unit (GPU). However, because the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU system's resources to achieve the GPU's best performance, since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resources in parallel. However, the current GPU multitasking scheme focuses just on co-launching the kernels rather than on making them execute more efficiently. Besides, the current GPU multitasking scheme is not open source, which makes it more difficult to optimize, since the GPGPU applications and the GPU system are unaware of each other's features. In this dissertation, we claim that support from a framework placed between the GPU system and the GPGPU applications, without modifying the applications, can yield better performance. We design and implement the framework while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be oversubscribed in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches solve the problems related to GPGPU applications better than the existing approaches while delivering better performance.
Abstract (translated from Korean): Recently, general-purpose GPU (GPGPU) applications have been playing key roles in diverse research fields such as high-performance computing (HPC) and deep learning (DL). What these application areas have in common is a need for massive computing power, which fits the highly parallel nature of the graphics processing unit (GPU) very well. However, GPU systems are designed to provide system-level fairness to all applications rather than to optimize for a particular type of application, and because the resource usage patterns of GPGPU applications vary, a single application cannot fully exploit the resources of the GPU system to achieve the GPU's peak performance.
GPU multitasking can therefore address the problem of low GPU resource utilization by co-locating multiple applications with diverse resource usage patterns so that they share GPU resources. However, existing GPU multitasking techniques focus on launching applications together rather than on executing them efficiently from a resource-utilization standpoint. Moreover, current GPU multitasking technology is not open source, and because applications and the GPU system are unaware of each other's features, it can be even harder to optimize.
In this dissertation, we aim to show that placing a framework between the GPU system and GPGPU applications, without modifying the applications, can yield higher application performance and resource utilization. To that end, we developed a GPU task management framework and solved two problems that arise in GPU multitasking environments. First, to address the problem that GPU memory cannot be oversubscribed in a multitasking environment, we introduced a checkpointing scheme between host memory and device memory. Second, to address GPU resource under-utilization in a multitasking environment, we presented a finer-grained GPU kernel management system.
To demonstrate the effectiveness of the proposed methods, we implemented them on a real GPU system and evaluated their performance. We confirmed that the proposed approach solves the problems related to GPGPU applications better than existing approaches and delivers higher performance.
Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Outline
Chapter 2 Background
2.1 Graphics Processing Unit (GPU) and CUDA
2.2 Checkpoint and Restart
2.3 Resource Sharing Model
2.4 CUDA Context
2.5 GPU Thread Block Scheduling
2.6 Multi-Process Service with Hyper-Q
Chapter 3 Checkpoint based solution for GPU memory over-subscription problem
3.1 Motivation
3.2 Related Work
3.3 Design and Implementation
3.3.1 System Design
3.3.2 CUDA API wrapping module
3.3.3 Scheduler
3.4 Evaluation
3.4.1 Evaluation setup
3.4.2 Overhead of FlexGPU
3.4.3 Performance with GPU Benchmark Suites
3.4.4 Performance with Real-world Workloads
3.4.5 Performance of workloads composed of multiple applications
3.5 Summary
Chapter 4 A Workload-aware Fine-grained Resource Management Framework for GPGPUs
4.1 Motivation
4.2 Related Work
4.2.1 GPU resource sharing
4.2.2 GPU scheduling
4.3 Design and Implementation
4.3.1 System Architecture
4.3.2 CUDA API Wrapping Module
4.3.3 smCompactor Runtime
4.3.4 Implementation Details
4.4 Analysis on the relation between performance and workload usage pattern
4.4.1 Workload Definition
4.4.2 Analysis on performance saturation
4.4.3 Predict the necessary SMs and thread blocks for best performance
4.5 Evaluation
4.5.1 Evaluation Methodology
4.5.2 Overhead of smCompactor
4.5.3 Performance with Different Thread Block Counts on Different Number of SMs
4.5.4 Performance with Concurrent Kernel and Resource Sharing
4.6 Summary
Chapter 5 Conclusion
Abstract (in Korean)
Docto
Data Resource Management in Throughput Processors
Graphics Processing Units (GPUs) are becoming common in data centers for tasks like neural network training and image processing due to their high performance and efficiency. GPUs maintain high throughput by running thousands of threads simultaneously, issuing instructions from ready threads to hide latency in others that are stalled. While this is effective for keeping the arithmetic units busy, the challenge in GPU design is moving the data for computation at the same high rate. Any inefficiency in data movement and storage will compromise the throughput and energy efficiency of the system.
Since energy consumption and cooling make up a large part of the cost of provisioning and running a data center, making GPUs more suitable for this environment requires removing the bottlenecks and overheads that limit their efficiency. The performance of GPU workloads is often limited by the throughput of the memory resources inside each GPU core, and though many of the power-hungry structures in CPUs are not found in GPU designs, there is overhead for storing each thread's state. When sharing a GPU between workloads, contention for resources also causes interference and slowdown.
This thesis develops techniques to manage and streamline the data movement and storage resources in GPUs in each of these places. The first part of this thesis resolves data movement restrictions inside each GPU core. The GPU memory system is optimized for sequential accesses, but many workloads load data in irregular or transposed patterns that cause a throughput bottleneck even when all loads are cache hits. This work identifies and leverages opportunities to merge requests across threads before sending them to the cache. While requests are waiting for merges, they can be reordered to achieve a higher cache hit rate. These methods yielded a 38% speedup for memory throughput limited workloads.
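The idea of merging requests across threads can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the thesis proposes: the 128-byte line size, the grouping of 32 threads into a "warp", and all function names are hypothetical choices made for the example.

```python
from collections import OrderedDict

LINE_SIZE = 128  # assumed cache-line size in bytes (not taken from the source)


def merge_requests(addresses, line_size=LINE_SIZE):
    """Collapse byte addresses into one request per distinct cache line.

    Returns the line base addresses in first-seen order.
    """
    lines = OrderedDict()
    for addr in addresses:
        lines.setdefault(addr - addr % line_size, None)
    return list(lines)


def requests_per_warp(warps, line_size=LINE_SIZE):
    """Count cache requests when loads are merged only within each warp."""
    return sum(len(merge_requests(w, line_size)) for w in warps)


def requests_merged_across(warps, line_size=LINE_SIZE):
    """Count cache requests when loads from all given warps are merged together."""
    combined = [addr for w in warps for addr in w]
    return len(merge_requests(combined, line_size))


# Sequential 4-byte loads from 32 threads all fall in a single line.
sequential = [4 * t for t in range(32)]

# Transposed-style access: every thread touches a different line,
# and a second warp touches the same lines at a 4-byte offset.
warp0 = [LINE_SIZE * t for t in range(32)]
warp1 = [LINE_SIZE * t + 4 for t in range(32)]
```

In this toy model the sequential pattern needs one request, while the transposed pattern needs one request per thread. Merging the two warps together halves the request count, because both warps touch the same 32 lines.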
Another opportunity for optimization is found in the register file. Since it must store the registers for thousands of active threads, it is the largest on-chip data storage structure on a GPU. The second work in this thesis replaces the register file with a smaller, more energy-efficient register buffer. Compiler directives allow the GPU to know ahead of time which registers will be accessed, allowing the hardware to store only the registers that will be imminently accessed in the buffer, with the rest moved to main memory. This technique reduced total GPU energy by 11%.
Finally, in a data center, many different applications will be launching GPU jobs, and just as multiple processes can share the same CPU to increase its utilization, running multiple workloads on the same GPU can increase its overall throughput. However, co-runners interfere with each other in unpredictable ways, especially when sharing memory resources. The final part of this thesis controls this interference, allowing a GPU to be shared between two tiers of workloads: one tier with a high performance target and another suitable for batch jobs without deadlines. At a 90% performance target, this technique increased GPU throughput by 9.3%.
GPUs' high efficiency and performance make them a valuable accelerator in the data center. The contributions in this thesis further increase their efficiency by removing data movement and storage overheads and unlock additional performance by enabling resources to be shared between workloads while controlling interference.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/146122/1/jklooste_1.pd
Looking for change? Roll the Dice and demand Attention
Change detection, i.e. identification per pixel of changes for some classes
of interest from a set of bi-temporal co-registered images, is a fundamental
task in the field of remote sensing. It remains challenging due to unrelated
forms of change that appear at different times in input images. Here, we
propose a reliable deep learning framework for the task of semantic change
detection in very high-resolution aerial images. Our framework consists of a
new loss function, new attention modules, new feature extraction building
blocks, and a new backbone architecture that is tailored for the task of
semantic change detection. Specifically, we define a new form of set
similarity, that is based on an iterative evaluation of a variant of the Dice
coefficient. We use this similarity metric to define a new loss function as
well as a new spatial and channel convolution Attention layer (the FracTAL).
The new attention layer, designed specifically for vision tasks, is memory
efficient, thus suitable for use in all levels of deep convolutional networks.
Based on these, we introduce two new efficient self-contained feature
extraction convolution units. We validate the performance of these feature
extraction building blocks on the CIFAR10 reference data and compare the
results with standard ResNet modules. Further, we introduce a new
encoder/decoder scheme, a network macro-topology, that is tailored for the task
of change detection. Our network moves away from any notion of subtraction of
feature layers for identifying change. We validate our approach by showing
excellent performance and achieving state of the art score (F1 and Intersection
over Union-hereafter IoU) on two building change detection datasets, namely,
the LEVIRCD (F1: 0.918, IoU: 0.848) and the WHU (F1: 0.938, IoU: 0.882)
datasets.
Comment: 28 pages, under review in ISPRS P&RS, 1st revision. Figures of low
quality due to compression for arXiv. Reduced abstract in arXiv due to
character limitation.
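For readers unfamiliar with the metrics named in the abstract, the following sketch computes the standard Dice coefficient and IoU on sets of pixel indices. It does not implement the paper's iterative Dice variant or the FracTAL layer; the function names are illustrative assumptions. For a binary mask the F1 score equals the Dice coefficient D, and IoU = D / (2 - D), so the reported LEVIRCD pair (F1 0.918, IoU 0.848) is consistent with this identity up to rounding.

```python
def dice(pred, truth):
    """Standard Dice coefficient between two sets of positive pixel indices."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def iou(pred, truth):
    """Intersection over Union between two sets of positive pixel indices."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)


def iou_from_dice(d):
    """Convert a Dice (binary F1) score to IoU via IoU = D / (2 - D)."""
    return d / (2 - d)


# Toy masks: four predicted pixels, four true pixels, two in common.
pred_pixels = {1, 2, 3, 4}
true_pixels = {3, 4, 5, 6}
```

Here `dice(pred_pixels, true_pixels)` is 0.5 and `iou(pred_pixels, true_pixels)` is one third, which matches `iou_from_dice(0.5)`.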