338 research outputs found

    A Model-Based Code Generation Framework for Parallel and Distributed Embedded Systems

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Soonhoi Ha.
Although various software development methodologies have been proposed to improve software design productivity and maintainability, most of this work focuses on running application software on a single processor. Moreover, it does not consider the non-functional requirements of embedded systems, such as latency and resource requirements, so general software development methodologies are not well suited to developing embedded software. This thesis introduces a development methodology that expresses software targeting parallel and distributed embedded systems as a model and uses that model for software analysis and development. In our model, an application consists of multiple tasks that can be expressed hierarchically and is specified independently of the hardware platform. Communication and synchronization between tasks follow rules defined by the model, and these rules make it possible to detect software errors through static analysis before the program is actually executed, which helps reduce the verification complexity of the application. A program that runs on a designated hardware platform can be synthesized automatically after the tasks are mapped onto processors. This thesis proposes the program synthesizer used in this model-based development methodology; based on the specified platform requirements, it generates code that runs on parallel and distributed embedded systems. The dynamic behavior of an application is expressed by composing multiple formal models hierarchically, and the synthesizer can execute tasks from this hierarchical model while taking parallelism into account. We also show how the program synthesizer manages code so that various platforms and networks can be supported. The applicability of the proposed development methodology is evaluated with a real-life surveillance software system example consisting of six hardware platforms and three kinds of networks, and with a remote deep learning example that uses heterogeneous multiprocessors. In addition, the development cost required for the program synthesizer to support a new platform or network is measured and estimated, confirming that a new platform can be supported with relatively little effort. Since many embedded systems must tolerate unexpected hardware errors, we also studied automatic generation of fault tolerance code. The technique modifies the task graph according to the selected fault tolerance configuration, so that application developers can easily apply the non-functional requirement of fault tolerance. We compared this support against a manual implementation of fault tolerance and performed experiments that reproduce fault scenarios or inject faults at random using a fault injection tool. Finally, the fault injection tool used in the fault tolerance experiments is another contribution of this thesis: a tool that injects faults into the application and kernel layers of Linux-based systems. Reproducing fault scenarios by injecting faults is a widely used way to verify the robustness of a system, and the tool developed here can inject reproducible faults while the system is running. For kernel-level fault injection it provides two methods, one based on the Kernel GNU Debugger and the other on ARM hardware breakpoints; for application-level fault injection, a GDB-based method can inject faults into an application on the same or a remote system. Experiments with the fault injection tool were conducted on an ODROID-XU4 board.
While various software development methodologies have been proposed to increase the design productivity and maintainability of software, they usually focus on the development of application software running on a single processing element, without concern for the non-functional requirements of an embedded system such as latency and resource requirements. In this thesis, we present a model-based software development method for parallel and distributed embedded systems. An application is specified, independently of the hardware platform, as a hierarchical set of tasks that follow given rules for communication and synchronization. Having such rules enables us to perform static analysis that checks for certain software errors at compile time, reducing the verification difficulty. A platform-specific program is synthesized automatically after the mapping of tasks onto processing elements is determined. We also propose the program synthesizer, which generates code that satisfies the platform requirements of parallel and distributed embedded systems. Since multiple models expressing dynamic behavior can be composed hierarchically, the synthesizer manages multiple task graphs with different hierarchies and runs tasks in parallel. The synthesizer also demonstrates how code for heterogeneous platforms is managed and how various communication methods are generated. The viability of the proposed software development method is verified with a real-life surveillance application that runs on six processing elements connected by three remote communication methods, and with a remote deep learning example that uses heterogeneous multiprocessing components on distributed systems. Measurement and estimation of the development cost further show that supporting a new platform or network requires only a small effort. Since tolerance to unexpected errors is a required feature of many embedded systems, we also support automatic fault-tolerant code generation. Fault tolerance is applied by modifying the task graph according to the selected fault-tolerance configuration, so this non-functional requirement can easily be adopted by an application developer. To compare the effort of supporting fault tolerance, a manual implementation is also carried out, and the fault-tolerance method is tested with a fault injection tool that emulates fault scenarios and injects faults randomly.
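The communication and synchronization rules described above are what make compile-time checking possible. The following is a minimal sketch of that kind of static check, assuming an SDF-like model in which every channel declares fixed token production and consumption rates; the task names and rates are hypothetical examples, not taken from the thesis.

```python
# Minimal sketch of a compile-time consistency check for a dataflow task graph.
# Assumes an SDF-like rule: each channel has fixed production/consumption rates,
# so a repetition vector must satisfy prod * reps[src] == cons * reps[dst].
# Task names and rates are made-up examples, not from the thesis.
import math
from fractions import Fraction

# (source task, destination task, tokens produced per firing, tokens consumed per firing)
channels = [
    ("camera", "detect", 1, 1),
    ("detect", "encode", 2, 1),
    ("encode", "send", 1, 4),
]

def repetition_vector(channels):
    """Return the number of firings per task if the rates are consistent, else None.

    Single-pass sketch: assumes the channel list grows the graph connectedly.
    """
    reps = {}
    for src, dst, prod, cons in channels:
        if src not in reps and dst not in reps:
            reps[src] = Fraction(1)              # start a new connected component
        if src in reps and dst not in reps:
            reps[dst] = reps[src] * prod / cons
        elif dst in reps and src not in reps:
            reps[src] = reps[dst] * cons / prod
        elif reps[src] * prod != reps[dst] * cons:
            return None                          # inconsistent rates: a buffer would grow or starve
    scale = math.lcm(*(r.denominator for r in reps.values()))
    return {task: int(r * scale) for task, r in reps.items()}

print(repetition_vector(channels))  # {'camera': 2, 'detect': 2, 'encode': 4, 'send': 1}
```

A rate inconsistency detected here corresponds to the kind of software error that such a model lets the tool reject before any platform-specific code is generated.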
Our fault injection tool, which was used to test our fault-tolerance method, is another contribution of this thesis. Emulating fault scenarios by intentionally injecting faults is a common way to test and verify the robustness of a system. To emulate faults on an embedded system, we present a run-time fault injection framework that can inject faults into both the kernel and application layers of Linux-based systems. For injecting faults at the kernel layer, two complementary techniques are used: one is based on the Kernel GNU Debugger, and the other uses hardware breakpoints supported by the ARM architecture. For application-level fault injection, a GDB-based method is used to inject faults into a remote application. The viability of the proposed fault injection tool is demonstrated by real-life experiments on an ODROID-XU4 system.
Table of Contents
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Contribution 6
  1.3 Dissertation Organization 8
Chapter 2 Background 9
  2.1 HOPES: Hope of Parallel Embedded Software 9
    2.1.1 Software Development Procedure 9
    2.1.2 Components of HOPES 12
  2.2 Universal Execution Model 13
    2.2.1 Task Graph Specification 13
    2.2.2 Dataflow Specification of an Application 15
    2.2.3 Task Code Specification and Generic APIs 21
    2.2.4 Meta-data Specification 23
Chapter 3 Program Synthesis for Parallel and Distributed Embedded Systems 24
  3.1 Motivational Example 24
  3.2 Program Synthesis Overview 26
  3.3 Program Synthesis from Hierarchically-mixed Models 30
  3.4 Platform Code Synthesis 33
  3.5 Communication Code Synthesis 36
  3.6 Experiments 40
    3.6.1 Development Cost of Supporting New Platforms and Networks 40
    3.6.2 Program Synthesis for the Surveillance System Example 44
    3.6.3 Remote GPU-accelerated Deep Learning Example 46
  3.7 Document Generation 48
  3.8 Related Works 49
Chapter 4 Model Transformation for Fault-tolerant Code Synthesis 56
  4.1 Fault-tolerant Code Synthesis Techniques 56
  4.2 Applying Fault Tolerance Techniques in HOPES 61
  4.3 Experiments 62
    4.3.1 Development Cost of Applying Fault Tolerance 62
    4.3.2 Fault Tolerance Experiments 62
  4.4 Random Fault Injection Experiments 65
  4.5 Related Works 68
Chapter 5 Fault Injection Framework for Linux-based Embedded Systems 70
  5.1 Background 70
    5.1.1 Fault Injection Techniques 70
    5.1.2 Kernel GNU Debugger 71
    5.1.3 ARM Hardware Breakpoint 72
  5.2 Fault Injection Framework 74
    5.2.1 Overview 74
    5.2.2 Architecture 75
    5.2.3 Fault Injection Techniques 79
    5.2.4 Implementation 83
  5.3 Experiments 90
    5.3.1 Experiment Setup 90
    5.3.2 Performance Comparison of Two Fault Injection Methods 90
    5.3.3 Bit-flip Fault Experiments 92
    5.3.4 eMMC Controller Fault Experiments 94
Chapter 6 Conclusion 97
Bibliography 99
Abstract in Korean 108
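The GDB-based, application-level injection described in the abstract can be approximated with stock gdb in batch mode. The sketch below is not the thesis's tool; it assumes a locally running victim process built with debug symbols, a global integer variable to corrupt, and ptrace permission to attach. The PID, variable name, and bit position are hypothetical.

```python
# Minimal sketch of GDB-based fault injection into a running process (not the
# thesis's tool). Attaches gdb in batch mode, XORs one bit of a named variable,
# then detaches so the target keeps running with the corrupted value.
# Requires ptrace permission (e.g., same user, permissive kernel.yama.ptrace_scope).
import subprocess

def flip_bit(pid: int, variable: str, bit: int) -> None:
    """Flip `bit` of global `variable` in process `pid` using gdb's batch mode."""
    args = [
        "gdb",
        "-batch",                      # run the -ex commands, then exit
        "-p", str(pid),                # attach to the victim process
        "-ex", f"set var {variable} = {variable} ^ (1 << {bit})",
        "-ex", "detach",
    ]
    subprocess.run(args, check=True)

if __name__ == "__main__":
    # Hypothetical usage: corrupt bit 3 of `sensor_value` in process 4242.
    flip_bit(4242, "sensor_value", 3)
```

For a remote application, one way to issue the same commands is against a gdbserver running in multi-process mode on the target, using "target extended-remote host:port" and "attach <pid>" instead of attaching with -p.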

    GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP

    Full text link
    Full detector simulation was among the largest CPU consumers in all CERN experiment software stacks for the first two runs of the Large Hadron Collider (LHC). In the early 2010s, the projections were that simulation demands would scale linearly with the luminosity increase, compensated only partially by an increase in computing resources. Extending fast simulation approaches to more use cases, covering a larger fraction of the simulation budget, is only part of the solution because of intrinsic precision limitations. The remainder corresponds to speeding up the simulation software by several factors, which is out of reach using simple optimizations on the current code base. In this context, the GeantV R&D project was launched, aiming to redesign the legacy particle transport codes so that they benefit from fine-grained parallelism features such as vectorization, as well as from increased code and data locality. This paper presents in detail the results and achievements of this R&D, as well as the conclusions and lessons learnt from the beta prototype. Comment: 34 pages, 26 figures, 24 tables
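To make the fine-grained parallelism idea concrete, here is a toy sketch, in no way GeantV code, contrasting a per-particle propagation loop with a "basketized" update that advances a whole batch of particles in one vectorized step. It assumes field-free, straight-line transport and made-up particle data.

```python
# Toy illustration of basketized, vectorized particle transport (not GeantV code).
# A "basket" of particles is advanced in one NumPy call instead of one particle
# at a time, which is the kind of data-level parallelism GeantV aimed to exploit.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
pos = rng.uniform(-1.0, 1.0, size=(n, 3))            # particle positions
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit direction vectors

def propagate_scalar(pos, dirs, step):
    out = pos.copy()
    for i in range(len(pos)):                         # one particle per iteration
        out[i] = pos[i] + step * dirs[i]
    return out

def propagate_basket(pos, dirs, step):
    return pos + step * dirs                          # whole basket in one vector op

if __name__ == "__main__":
    step = 0.1
    a = propagate_scalar(pos[:1000], dirs[:1000], step)   # small slice: scalar loop is slow
    b = propagate_basket(pos, dirs, step)
    assert np.allclose(a, b[:1000])
```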

    A Streamline Upwind/Petrov-Galerkin FEM based time-accurate solution of 3D time-domain Maxwell's equations for dispersive materials

    Get PDF
    Although most broadband frequency-response simulations are performed under the assumption of constant electromagnetic material parameters, this assumption is not valid for many materials found in nature. In this dissertation, the time-accurate solution of the 3D time-domain Maxwell's equations for dispersive materials in a Streamline Upwind/Petrov-Galerkin framework is investigated. For this purpose, the permittivity associated with a material is expressed as a matrix, enabling the solution of anisotropic material models with multiple poles; here, diagonally isotropic models with up to two poles are investigated. Near-field to far-field transformations are implemented to enable the solution of open-boundary problems such as radiation patterns and radar cross sections, and a unified perfectly matched layer is implemented to efficiently terminate the computational region. The numerical solution of these equations is tightly coupled and compared against a loosely coupled approach to improve efficiency. An alternative diagonal stabilization matrix is proposed, implemented, and compared with a non-sparse stabilization matrix derived from the flux Jacobians. Along with this new stabilization parameter, scalability is improved by coupling the perfectly matched layer equations with Maxwell's equations. Further efficiency gains are achieved by allowing a variable number of equations to be solved throughout the domain.
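As a concrete instance of the dispersive-material description above, the sketch below evaluates a diagonally isotropic relative permittivity with up to two poles, eps_r(omega) = eps_inf + sum_p d_eps_p / (1 + j*omega*tau_p), and assembles it as a 3x3 matrix. Debye-type poles and the numerical pole values are assumptions for illustration only, not parameters from the dissertation.

```python
# Illustrative evaluation of a diagonally isotropic, two-pole Debye permittivity
# as a 3x3 matrix (pole parameters are made up, not from the dissertation).
# eps_r(omega) = eps_inf + sum_p d_eps_p / (1 + 1j * omega * tau_p)
import numpy as np

def debye_permittivity(omega, eps_inf, poles):
    """poles: list of (delta_eps, tau) pairs; returns a 3x3 complex matrix."""
    eps = eps_inf + sum(d / (1.0 + 1j * omega * tau) for d, tau in poles)
    return eps * np.eye(3)          # diagonally isotropic: same value on the diagonal

if __name__ == "__main__":
    poles = [(3.0, 8.1e-12), (1.2, 1.0e-10)]   # (delta_eps, relaxation time in seconds)
    for f in (1e9, 5e9, 20e9):                  # a few microwave frequencies in Hz
        eps = debye_permittivity(2 * np.pi * f, eps_inf=2.0, poles=poles)
        print(f"{f / 1e9:5.1f} GHz  eps_xx = {eps[0, 0]:.3f}")
```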

    Doctor of Philosophy

    Get PDF
    Recent trends in high performance computing present larger and more diverse computers using multicore nodes, possibly with accelerators and/or coprocessors, and reduced memory. These changes pose formidable challenges for application code to attain scalability. Software frameworks that execute machine-independent application code using a runtime system that shields users from architectural complexities offer a portable solution for easy programming. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. However, the original Uintah code had limited scalability, as tasks were run in a predefined order based solely on static analysis of the task graph and used only the message passing interface (MPI) for parallelism. By using a new hybrid multithread and MPI runtime system, this research has made it possible for Uintah to scale to 700K central processing unit (CPU) cores when solving challenging fluid-structure interaction problems. Those problems often involve moving objects with adaptive mesh refinement and thus highly variable and unpredictable work patterns. This research has also demonstrated the ability to run capability jobs on heterogeneous systems with Nvidia graphics processing unit (GPU) accelerators or Intel Xeon Phi coprocessors. The new runtime system for Uintah executes directed acyclic graphs of computational tasks in a scalable, asynchronous, and dynamic fashion on the multicore CPUs and/or accelerators/coprocessors of a node. Uintah's clear separation between application and runtime code has led to scalability increases without significant changes to application code. This research concludes that the adaptive directed acyclic graph (DAG)-based approach provides a very powerful abstraction for solving challenging multiscale multiphysics engineering problems. Excellent scalability with regard to both the different processors and communication performance is achieved on some of the largest and most powerful computers available today.
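A minimal sketch of the dynamic DAG-execution idea follows; it is not Uintah's runtime. Tasks become ready as soon as their predecessors finish and are dispatched to a worker pool, so the execution order is decided at run time rather than by a static schedule. The task names and dependency graph are hypothetical.

```python
# Minimal sketch of dynamic DAG-based task execution (not Uintah's runtime):
# a task is submitted to a worker pool as soon as all of its inputs are done.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical task graph: task -> set of tasks it depends on.
deps = {
    "interpolate": set(),
    "solve_fluid": {"interpolate"},
    "solve_solid": {"interpolate"},
    "exchange": {"solve_fluid", "solve_solid"},
    "advance": {"exchange"},
}

def run_task(name):
    print("running", name)
    return name

def execute(deps, workers=4):
    pending = {task: set(d) for task, d in deps.items()}   # unfinished prerequisites
    running = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def launch_ready():
            for task in [t for t, d in pending.items() if not d]:
                del pending[task]
                running.add(pool.submit(run_task, task))

        launch_ready()
        while running:
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                finished = future.result()
                for d in pending.values():
                    d.discard(finished)                     # retire the finished task
            launch_ready()

if __name__ == "__main__":
    execute(deps)
```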

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    The basic concept of the single-core architecture in general-purpose processors fits well with a sequential programming model. The integration of multiple cores on a single chip has allowed processors to run parts of a program in parallel. However, exploiting the enormous parallelism available in many high-performance applications and their data is difficult using general-purpose multicores alone. The emergence of streaming accelerators and the corresponding programming models has improved this situation by providing architectures oriented to processing data streams. The basic idea behind the design of these architectures responds to the need to process huge data sets. These high-performance, stream-oriented devices allow fast processing of data through the efficient use of parallel computation and inter-process communication. Stream-oriented accelerators, like other processors, consist of various microarchitectural components such as memory structures, compute units, control units, I/O channels, and I/O controls. However, the data-streaming requirements add some special features and impose other restrictions that affect performance. These devices generally offer a large number of computational resources, but they force the data sets to be reorganized in parallel, maximizing independence, in order to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task: the structure of an algorithm may have to be changed as a whole, or the algorithm may even have to be rewritten from scratch. However, all these efforts to reorder an application's data access patterns may not be very useful for achieving optimal performance, because of possible microarchitectural limitations of the target platform regarding the hardware prefetch mechanisms, the size and granularity of the local storage, and the flexibility to lay out data serially inside the local storage. The limitations of a general-purpose streaming platform for data prefetching, storage, and the other procedures needed to organize and maintain data as parallel, independent streams could be removed by employing microarchitecture-level techniques, including the use of application-specific customized memories in the front-end of a streaming architecture. The objective of this thesis is to present architectural explorations of streaming accelerators with customized memory designs. In general, the thesis covers three main aspects of such accelerators, which can be classified as: i) design of application-specific accelerators with customized memory layouts, ii) template-based design of accelerators with customized memories, and iii) design space explorations for stream-oriented devices with standard and customized memories. The thesis concludes with the conceptual proposal of a Blacksmith Streaming Architecture (BSArc). The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end that uses a GPU as the back-end, making it possible to maximize the exploitation of an application's data locality and data-level parallelism while providing a higher data flow to the back-end. We consider that the design of these processors with specialized memories should be provided by application domain experts in the form of templates.
The basic concept behind the architecture of a general-purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high-performance applications and the corresponding data is hard to exploit with these general-purpose multicores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data through efficient parallel computation and streaming-based communication. Throughput-oriented streaming accelerators, like other processors, consist of numerous types of microarchitectural components, including memory structures, compute units, control units, I/O channels, and I/O controls. However, the throughput requirements add some special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources but require applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not an easy task: it may require changing the structure of an algorithm as a whole, or even writing a new algorithm from scratch for the target application. Even then, all these efforts to rearrange application data access patterns may not be very helpful in achieving optimal performance, because of possible microarchitectural constraints of the target platform on the hardware prefetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints of a general-purpose streaming platform on prefetching, storing, and maneuvering data to arrange and maintain it as parallel and independent streams could be removed by employing microarchitecture-level design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal of a Blacksmith Streaming Architecture (BSArc).
Blacksmith computing allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while providing a powerful, throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device is provided by application domain experts in the form of templates. These templates are adjustable according to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, a simulation framework supports the architectural explorations, gives insight into the proposal, and predicts potential performance benefits for such an architecture.
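To illustrate the data-marshaling problem the thesis targets, here is a small, purely illustrative sketch that rearranges an array-of-structures record layout into structure-of-arrays streams and splits them into fixed-size tiles, the kind of reshaping a customized front-end memory would perform before feeding a streaming back-end. The field names and tile size are made up.

```python
# Illustrative only: rearranging data into independent, parallel streams.
# An array-of-structures layout is converted to structure-of-arrays streams and
# split into fixed-size tiles, mimicking what a customized front-end memory
# would do before feeding a streaming back-end.
import numpy as np

# Hypothetical records: (x, y, z, mass) per element, interleaved in memory (AoS).
aos = np.arange(32, dtype=np.float32).reshape(8, 4)

# Structure-of-arrays: one contiguous stream per field, independent of the others.
soa = {name: np.ascontiguousarray(aos[:, i]) for i, name in enumerate("xyzm")}

# Tiling: split each stream into fixed-size chunks that fit the local storage.
TILE = 4
tiles = {name: [col[i:i + TILE] for i in range(0, len(col), TILE)]
         for name, col in soa.items()}

print(soa["x"])        # contiguous stream of x values
print(tiles["m"][0])   # first tile of the mass stream
```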

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as the final publication of the COST Action IC1406 "High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)" project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predicting and analysing natural and complex systems in science and engineering. As their level of abstraction rises to provide a better discernment of the domain at hand, their representation becomes increasingly demanding in computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. It is then arguably required to have a seamless interaction of High Performance Computing with Modelling and Simulation in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    A mixed-signal computer architecture and its application to power system problems

    Get PDF
    Radical changes are taking place in the landscape of modern power systems. This massive shift in the way the system is designed and operated has been termed the advent of the "smart grid". One of its implications is a strong market pull for faster power system analysis computing. This work is concerned in particular with transient simulation, which is one of the most demanding power system analyses. This refers to the imitation of the operation of the real-world system over time, for time scales that cover the majority of slow electromechanical transient phenomena. The general mathematical formulation of the simulation problem includes a set of non-linear differential algebraic equations (DAEs). The algebraic part of this set includes heavy linear algebra computations related to the admittance matrix of the topology. These computations are a critical factor in the overall performance of a transient simulator. This work proposes the use of analog electronic computing as a means of exceeding the performance barriers of conventional digital computers for the linear algebra operations. Analog computing is integrated into the framework of a power system transient simulator, yielding significant computational performance benefits to the latter. Two hybrid, analog and digital computers are presented. The first prototype has been implemented using reconfigurable hardware. In its core, analog computing is used for linear algebra operations, while pipelined digital resources on a field programmable gate array (FPGA) handle all remaining computations. The properties of the analog hardware are thoroughly examined, with special attention to accuracy and timing. The application of the platform to the transient analysis of power system dynamics showed a speedup of two orders of magnitude against conventional software solutions. The second prototype is proposed as a future conceptual architecture that would overcome the limitations of the already implemented hardware while retaining its virtues. The design space of this future architecture has been thoroughly explored with the help of a software emulator. For one possible suggested implementation, speedups of four orders of magnitude against software solvers have been observed for the linear algebra operations.
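To make the role of the admittance-matrix algebra concrete, here is a minimal sketch, not the thesis's solver, of one backward-Euler step of a tiny linear network C dv/dt + G v = i(t). The repeated nodal solve (G + C/h) v_next = i_next + (C/h) v_n is the kind of linear-algebra kernel the thesis offloads to analog hardware; the component values and network are made up.

```python
# Minimal sketch (not the thesis's solver): backward-Euler transient step of a
# tiny linear network  C dv/dt + G v = i(t).  The nodal solve below is the kind
# of admittance-matrix linear algebra the thesis accelerates in analog hardware.
import numpy as np

G = np.array([[2.0, -1.0],        # nodal conductance matrix in siemens (made-up values)
              [-1.0, 1.5]])
C = np.diag([1e-3, 2e-3])         # nodal capacitance matrix in farads
h = 1e-4                          # time step in seconds

def step(v, i_next):
    """Advance the node voltages by one backward-Euler step."""
    A = G + C / h                       # constant for a fixed step size
    b = i_next + (C / h) @ v
    return np.linalg.solve(A, b)        # the performance-critical linear solve

v = np.zeros(2)
for n in range(5):
    v = step(v, i_next=np.array([1.0, 0.0]))   # 1 A injected at node 1
    print(n, v)
```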