563 research outputs found

    A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems

    Full text link
    Recent technological advances have greatly improved the performance and features of embedded systems. With the number of just mobile devices now reaching nearly equal to the population of earth, embedded systems have truly become ubiquitous. These trends, however, have also made the task of managing their power consumption extremely challenging. In recent years, several techniques have been proposed to address this issue. In this paper, we survey the techniques for managing power consumption of embedded systems. We discuss the need of power management and provide a classification of the techniques on several important parameters to highlight their similarities and differences. This paper is intended to help the researchers and application-developers in gaining insights into the working of power management techniques and designing even more efficient high-performance embedded systems of tomorrow

    ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ GPU๋ฅผ ์‚ฌ์šฉํ•œ ๋ฒ”์šฉ์  ๊ณ„์‚ฐ ์‘์šฉ์˜ ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ์ž์› ํ™œ์šฉ์„ ์œ„ํ•œ GPU ์‹œ์Šคํ…œ ์ตœ์ ํ™”

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ์—ผํ—Œ์˜.Recently, General Purpose GPU (GPGPU) applications are playing key roles in many different research fields, such as high-performance computing (HPC) and deep learning (DL). The common feature exists in these applications is that all of them require massive computation power, which follows the high parallelism characteristics of the graphics processing unit (GPU). However, because of the resource usage pattern of each GPGPU application varies, a single application cannot fully exploit the GPU systems resources to achieve the best performance of the GPU since the GPU system is designed to provide system-level fairness to all applications instead of optimizing for a specific type. GPU multitasking can address the issue by co-locating multiple kernels with diverse resource usage patterns to share the GPU resource in parallel. However, the current GPU mul- titasking scheme focuses just on co-launching the kernels rather than making them execute more efficiently. Besides, the current GPU multitasking scheme is not open-sourced, which makes it more difficult to be optimized, since the GPGPU applications and the GPU system are unaware of the feature of each other. In this dissertation, we claim that using the support from framework between the GPU system and the GPGPU applications without modifying the application can yield better performance. We design and implement the frame- work while addressing two issues in GPGPU applications. First, we introduce a GPU memory checkpointing approach between the host memory and the device memory to address the problem that GPU memory cannot be over-subscripted in a multitasking environment. Second, we present a fine-grained GPU kernel management scheme to avoid the GPU resource under-utilization problem in a i multitasking environment. We implement and evaluate our schemes on a real GPU system. The experimental results show that our proposed approaches can solve the problems related to GPGPU applications than the existing approaches while delivering better performance.์ตœ๊ทผ ๋ฒ”์šฉ GPU (GPGPU) ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… (HPC) ๋ฐ ๋”ฅ ๋Ÿฌ๋‹ (DL)๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ ๋ถ„์•ผ์—์„œ ํ•ต์‹ฌ์ ์ธ ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‘ ์šฉ ๋ถ„์•ผ์˜ ๊ณตํ†ต์ ์ธ ํŠน์„ฑ์€ ๊ฑฐ๋Œ€ํ•œ ๊ณ„์‚ฐ ์„ฑ๋Šฅ์ด ํ•„์š”ํ•œ ๊ฒƒ์ด๋ฉฐ ๊ทธ๋ž˜ํ”ฝ ์ฒ˜๋ฆฌ ์žฅ์น˜ (GPU)์˜ ๋†’์€ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ํŠน์„ฑ๊ณผ ๋งค์šฐ ์ ํ•ฉํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ GPU ์‹œ์Šคํ…œ์€ ํŠน์ • ์œ  ํ˜•์˜ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์— ์ตœ์ €ํ™”ํ•˜๋Š” ๋Œ€์‹  ๋ชจ๋“  ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์— ์‹œ์Šคํ…œ ์ˆ˜์ค€์˜ ๊ณต์ • ์„ฑ์„ ์ œ๊ณตํ•˜๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ์œผ๋ฉฐ ๊ฐ GPGPU ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ์ž์› ์‚ฌ์šฉ ํŒจํ„ด์ด ๋‹ค์–‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ผ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์ด GPU ์‹œ์Šคํ…œ์˜ ๋ฆฌ์†Œ์Šค๋ฅผ ์™„์ „ํžˆ ํ™œ์šฉํ•˜์—ฌ GPU์˜ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ ํ•  ์ˆ˜๋Š” ์—†๋‹ค. ๋”ฐ๋ผ์„œ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น์€ ๋‹ค์–‘ํ•œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ ํŒจํ„ด์„ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ์‘์šฉ ํ”„๋กœ๊ทธ ๋žจ์„ ํ•จ๊ป˜ ๋ฐฐ์น˜ํ•˜์—ฌ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ๊ณต์œ ํ•จ์œผ๋กœ์จ GPU ์ž์› ์‚ฌ์šฉ๋ฅ  ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ๊ธฐ์ˆ ์€ ์ž์› ์‚ฌ์šฉ๋ฅ  ๊ด€์ ์—์„œ ์‘ ์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ ํšจ์œจ์ ์ธ ์‹คํ–‰๋ณด๋‹ค ๊ณต๋™์œผ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘”๋‹ค. ๋˜ํ•œ ํ˜„์žฌ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ๊ธฐ์ˆ ์€ ์˜คํ”ˆ ์†Œ์Šค๊ฐ€ ์•„๋‹ˆ๋ฏ€๋กœ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ๊ณผ GPU ์‹œ์Šคํ…œ์ด ์„œ๋กœ์˜ ๊ธฐ๋Šฅ์„ ์ธ์‹ํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ตœ์ ํ™”ํ•˜๊ธฐ๊ฐ€ ๋” ์–ด๋ ค์šธ ์ˆ˜๋„ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์„ ์ˆ˜์ • ์—†์ด GPU ์‹œ์Šคํ…œ๊ณผ GPGPU ์‘์šฉ ์‚ฌ ์ด์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉํ•˜๋ฉด ๋ณด๋‹ค ๋†’์€ ์‘์šฉ์„ฑ๋Šฅ๊ณผ ์ž์› ์‚ฌ์šฉ์„ ๋ณด์ผ ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•˜๊ณ ์ž ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด GPU ํƒœ์Šคํฌ ๊ด€๋ฆฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ฐœ๋ฐœํ•˜์—ฌ GPU ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์ฒซ์งธ, ๋ฉ€ํ‹ฐ ํƒœ ์Šคํ‚น ํ™˜๊ฒฝ์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ณผ ํ• ๋‹นํ•  ์ˆ˜ ์—†๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํ˜ธ์ŠคํŠธ ๋ฉ”๋ชจ๋ฆฌ์™€ ๋””๋ฐ”์ด์Šค ๋ฉ”๋ชจ๋ฆฌ์— ์ฒดํฌํฌ์ธํŠธ ๋ฐฉ์‹์„ ๋„์ž…ํ•˜์˜€๋‹ค. ๋‘˜์งธ, ๋ฉ€ํ‹ฐ ํƒœ์Šคํ‚น ํ™˜ ๊ฒฝ์—์„œ GPU ์ž์› ์‚ฌ์šฉ์œจ ์ €ํ•˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋”์šฑ ์„ธ๋ถ„ํ™” ๋œ GPU ์ปค๋„ ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•๋“ค์˜ ํšจ๊ณผ๋ฅผ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์ œ GPU ์‹œ์Šคํ…œ์— 92 ๊ตฌํ˜„ํ•˜๊ณ  ๊ทธ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ์ ‘๊ทผ๋ฐฉ์‹์ด ๊ธฐ์กด ์ ‘๊ทผ ๋ฐฉ์‹๋ณด๋‹ค GPGPU ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.Chapter 1 Introduction 1 1.1 Motivation 2 1.2 Contribution . 7 1.3 Outline 8 Chapter 2 Background 10 2.1 GraphicsProcessingUnit(GPU) and CUDA 10 2.2 CheckpointandRestart . 11 2.3 ResourceSharingModel. 11 2.4 CUDAContext 12 2.5 GPUThreadBlockScheduling . 13 2.6 Multi-ProcessServicewithHyper-Q 13 Chapter 3 Checkpoint based solution for GPU memory over- subscription problem 16 3.1 Motivation 16 3.2 RelatedWork. 18 3.3 DesignandImplementation . 20 3.3.1 System Design 21 3.3.2 CUDAAPIwrappingmodule 22 3.3.3 Scheduler . 28 3.4 Evaluation. 31 3.4.1 Evaluationsetup . 31 3.4.2 OverheadofFlexGPU 32 3.4.3 Performance with GPU Benchmark Suits 34 3.4.4 Performance with Real-world Workloads 36 3.4.5 Performance of workloads composed of multiple applications 39 3.5 Summary 42 Chapter 4 A Workload-aware Fine-grained Resource Manage- ment Framework for GPGPUs 43 4.1 Motivation 43 4.2 RelatedWork. 45 4.2.1 GPUresourcesharing 45 4.2.2 GPUscheduling . 46 4.3 DesignandImplementation . 47 4.3.1 SystemArchitecture . 47 4.3.2 CUDAAPIWrappingModule . 49 4.3.3 smCompactorRuntime . 50 4.3.4 ImplementationDetails . 57 4.4 Analysis on the relation between performance and workload usage pattern 60 4.4.1 WorkloadDefinition . 60 4.4.2 Analysisonperformancesaturation 60 4.4.3 Predict the necessary SMs and thread blocks for best performance . 64 4.5 Evaluation. 69 4.5.1 EvaluationMethodology. 70 4.5.2 OverheadofsmCompactor . 71 4.5.3 Performance with Different Thread Block Counts on Dif- ferentNumberofSMs 72 4.5.4 Performance with Concurrent Kernel and Resource Sharing 74 4.6 Summary . 79 Chapter 5 Conclusion. 81 ์š”์•ฝ. 92Docto

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    Get PDF
    With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications

    MURAC: A unified machine model for heterogeneous computers

    Get PDF
    Includes bibliographical referencesHeterogeneous computing enables the performance and energy advantages of multiple distinct processing architectures to be efficiently exploited within a single machine. These systems are capable of delivering large performance increases by matching the applications to architectures that are most suited to them. The Multiple Runtime-reconfigurable Architecture Computer (MURAC) model has been proposed to tackle the problems commonly found in the design and usage of these machines. This model presents a system-level approach that creates a clear separation of concerns between the system implementer and the application developer. The three key concepts that make up the MURAC model are a unified machine model, a unified instruction stream and a unified memory space. A simple programming model built upon these abstractions provides a consistent interface for interacting with the underlying machine to the user application. This programming model simplifies application partitioning between hardware and software and allows the easy integration of different execution models within the single control ow of a mixed-architecture application. The theoretical and practical trade-offs of the proposed model have been explored through the design of several systems. An instruction-accurate system simulator has been developed that supports the simulated execution of mixed-architecture applications. An embedded System-on-Chip implementation has been used to measure the overhead in hardware resources required to support the model, which was found to be minimal. An implementation of the model within an operating system on a tightly-coupled reconfigurable processor platform has been created. This implementation is used to extend the software scheduler to allow for the full support of mixed-architecture applications in a multitasking environment. Different scheduling strategies have been tested using this scheduler for mixed-architecture applications. The design and implementation of these systems has shown that a unified abstraction model for heterogeneous computers provides important usability benefits to system and application designers. These benefits are achieved through a consistent view of the multiple different architectures to the operating system and user applications. This allows them to focus on achieving their performance and efficiency goals by gaining the benefits of different execution models during runtime without the complex implementation details of the system-level synchronisation and coordination

    A scheduling theory framework for GPU tasks e๏ฌƒcient execution

    Get PDF
    Concurrent execution of tasks in GPUs can reduce the computation time of a workload by overlapping data transfer and execution commands. However it is di๏ฌƒcult to implement an e๏ฌƒcient run- time scheduler that minimizes the workload makespan as many execution orderings should be evaluated. In this paper, we employ scheduling theory to build a model that takes into account the device capabili- ties, workload characteristics, constraints and objec- tive functions. In our model, GPU tasks schedul- ing is reformulated as a ๏ฌ‚ow shop scheduling prob- lem, which allow us to apply and compare well known methods already developed in the operations research ๏ฌeld. In addition we develop a new heuristic, specif- ically focused on executing GPU commands, that achieves better scheduling results than previous tech- niques. Finally, a comprehensive evaluation, showing the suitability and robustness of this new approach, is conducted in three di๏ฌ€erent NVIDIA architectures (Kepler, Maxwell and Pascal).Proyecto TIN2016- 0920R, Universidad de Mรกlaga (Campus de Excelencia Internacional Andalucรญa Tech) y programa de donaciรณn de NVIDIA Corporation

    XinuPi3: Teaching Multicore Concepts Using Embedded Xinu

    Get PDF
    As computer platforms become more advanced, the need to teach advanced computing concepts grows accordingly. This paper addresses one such need by presenting XinuPi3, a port of the lightweight instructional operating system Embedded Xinu to the Raspberry Pi 3. The Raspberry Pi 3 improves upon previous generations of inexpensive, credit card-sized computers by including a quad-core, ARM-based processor, opening the door for educators to demonstrate essential aspects of modern computing like inter-core communication and genuine concurrency. Embedded Xinu has proven to be an effective teaching tool for demonstrating low-level concepts on single-core platforms, and it is currently used to teach a range of systems courses at multiple universities. As of this writing, no other bare metal educational operating system supports multicore computing. XinuPi3 provides a suitable learning environment for beginners on genuinely concurrent hardware. This paper provides an overview of the key features of the XinuPi3 system, as well as the novel embedded system education experiences it makes possible

    Cooperative kernels: GPU multitasking for blocking algorithms

    Get PDF
    There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking , so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploit- ing scheduling quirks of todayโ€™s GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels , an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel frame- work implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking
    • โ€ฆ
    corecore