309 research outputs found

    HPC Accelerators with 3D Memory

    Invited paper, published in the conference proceedings by IEEE Society Press. Pages 320 to 328. ISBN: 978-1-5090-3593-9. DOI: 10.1109/CSE-EUC-DCABES-2016.203. After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have conquered the top500 and green500 lists, providing us with unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, like Xeon Phi Knights Landing by Intel and the Pascal architecture by Nvidia. This paper reviews the hardware features of those new HPC accelerators and unveils their potential performance for scientific applications, with an emphasis on the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to roadmaps already announced. Universidad de Málaga. Campus de Excelencia Internacional Andalucia Tec

    Programming MPSoC platforms: Road works ahead

    This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer's viewpoint, the classical sequential von Neumann programming model needs to be overcome at the same time. Efficient utilization of MPSoC HW resources demands radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for the successful deployment of heterogeneous embedded MPSoCs. On the other hand, at least for the coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code, which demands a more pragmatic, gradual tool and code replacement strategy.

    A Scheduling Technique for SDF/L Graphs on Heterogeneous Multi-Core Processors

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2021. Ha Soonhoi. Although dataflow models are known to excel at exploiting the task-level parallelism of an application, it is difficult to use them to exploit data-level parallelism. Data-level parallelism can be represented well with loop structures, but these structures are not explicitly specified in most existing dataflow models. The SDF/L model was introduced to overcome this shortcoming by specifying loop structures explicitly in a hierarchical fashion. To the best of our knowledge, however, scheduling of an SDF/L graph onto heterogeneous processors has not been considered in any previous work. In this dissertation, we introduce a technique for scheduling an application represented by the SDF/L model onto heterogeneous processors. In the proposed method, we explore the mapping of tasks using an evolutionary meta-heuristic and schedule hierarchically in a bottom-up fashion, creating parallel loop schedules at lower levels first and then reusing them when constructing the schedule at a higher level. To verify the efficiency of the proposed scheduling methodology, we apply it to benchmark examples and randomly generated SDF/L graphs.
    Contents: Chapter 1 Introduction; Chapter 2 Related Work (2.1 SDF Scheduling with Data-level Parallelism, 2.2 Hierarchical Scheduling); Chapter 3 Problem and Challenges (3.1 Notations and Problem Description, 3.2 Challenges); Chapter 4 Proposed Methodology (4.1 Mapping Exploration, 4.2 Priority Assignment and List Scheduling Heuristic, 4.3 Hierarchical Scheduling, 4.4 Complexity); Chapter 5 Experiments (5.1 Benchmarks, 5.2 Randomly Generated Graphs); Chapter 6 Conclusions; Bibliography; Korean Abstract.
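The abstract above mentions a list scheduling heuristic applied once a task-to-processor mapping has been chosen. The following is only a minimal sketch of generic list scheduling under a fixed mapping, not the thesis's SDF/L algorithm: the `Task` record, the `list_schedule` function, and the assumption that the priority order is already a valid topological order are all hypothetical and introduced purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical task record (not the SDF/L data structure from the thesis).
struct Task {
    int processor;           // index of the processor this task is mapped to
    int exec_time;           // execution time on that processor
    std::vector<int> preds;  // indices of tasks that must finish first
};

// Places tasks one by one in the given priority order, which is assumed
// to be a valid topological order of the dependency graph. Each task
// starts as soon as its processor is free and all predecessors are done.
// Returns the makespan (finish time of the last task).
int list_schedule(const std::vector<Task>& tasks,
                  const std::vector<int>& priority_order,
                  int num_processors) {
    std::vector<int> proc_free(num_processors, 0);  // next free time per processor
    std::vector<int> finish(tasks.size(), 0);       // finish time per task
    int makespan = 0;
    for (int id : priority_order) {
        const Task& t = tasks[id];
        int start = proc_free[t.processor];
        for (int p : t.preds) {
            start = std::max(start, finish[p]);
        }
        finish[id] = start + t.exec_time;
        proc_free[t.processor] = finish[id];
        makespan = std::max(makespan, finish[id]);
    }
    return makespan;
}
```

In an evolutionary mapping search of the kind the abstract describes, a function like this could plausibly serve as the fitness evaluation: each candidate mapping fills in the `processor` and `exec_time` fields, and the resulting makespan ranks the candidate.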

    Hardware/Software Co-design for Multicore Architectures

    Transferred from Doria

    High Performance Computing via High Level Synthesis

    As more and more powerful integrated circuits appear on the market, more and more applications, with very different requirements and workloads, make use of the available computing power. This thesis is devoted in particular to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. OpenCL (Open Computing Language), developed by the Khronos Group, enables developers to program CPUs, GPUs and, recently, FPGAs using functionally portable (but sadly not performance-portable) source code, which creates new possibilities and challenges for both research and industry. FPGAs have always been used for mid-size designs and ASIC prototyping thanks to their energy-efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches have been developed and deployed to address this issue and narrow the gap between software and hardware in the FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers.
Moreover, FPGA usage in data centers is already growing, regardless of and in addition to their use as computational accelerators, because FPGAs can be used as high-performance, low-power and secure switches inside data centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications onto FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written in C-based programming languages, e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through a wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application-dependent hardware and let designers benefit from the flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level, visual process used to generate implementations that solve mathematical problems through a varied set of IP blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development, and hardware design, to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MathWorks, is a model-based design tool for the simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program microcontrollers and FPGAs. This PhD thesis presents a study that uses the automatic code generators of Simulink to target Xilinx FPGAs with both HDL and C/C++ code, in order to demonstrate the capabilities and challenges of the high-level synthesis process. To do so, firstly, the digital signal processing unit of a real-time radar application is developed using Simulink blocks.
Secondly, the generated C-based model is used for the high-level synthesis process, and finally the implementation cost of HLS is compared to that of traditional HDL synthesis using the Xilinx tool chain. As an alternative to the model-based design approach, this work also presents an analysis of FPGA programming via high-level synthesis techniques for computationally intensive algorithms, and demonstrates the importance of HLS by comparing the performance-per-watt of GPUs (NVIDIA) and FPGAs (Xilinx) manufactured in the same node while running standard OpenCL benchmarks. We conclude that generating high-quality RTL from an OpenCL model requires a stronger hardware background than the MBD approach. However, the availability of fast and broad design space exploration and the portability of OpenCL code, e.g. to CPUs and GPUs, motivate FPGA industry leaders to provide users with an OpenCL software development environment that promises FPGA programming in a CPU/GPU-like fashion. Our experiments, through extensive design space exploration (DSE), suggest that FPGAs have higher performance-per-watt than two high-end GPUs manufactured in the same technology (28 nm). Moreover, FPGAs with more available resources and a more modern process (20 nm) can outperform the tested GPUs while consuming much less power, at the cost of more expensive devices.
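The abstract above refers to HLS directives that pipeline loops and partition memories into multiple banks. As a rough illustration only (this code does not come from the thesis), the sketch below annotates a dot product with the `PIPELINE` and `ARRAY_PARTITION` pragmas of the Xilinx Vivado/Vitis HLS tools. The function name `dot_product`, the length `N`, and the partition factor of 4 are assumptions chosen for the example; an ordinary C++ compiler ignores the pragmas (possibly with a warning), so the function can still be checked in software.

```cpp
#include <cassert>

constexpr int N = 64;  // hypothetical vector length, chosen for illustration

// Dot product written in an HLS-friendly style. The pragmas are hints for
// a Xilinx HLS tool and have no effect on the computed result.
int dot_product(const int a[N], const int b[N]) {
    int a_buf[N];
    int b_buf[N];
    // Ask the tool to split each local buffer cyclically over 4 banks so
    // that several elements can be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=a_buf cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b_buf cyclic factor=4 dim=1

    // Copy the inputs into the on-chip buffers.
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        a_buf[i] = a[i];
        b_buf[i] = b[i];
    }

    // Accumulate the products; the pipeline directive requests that a new
    // loop iteration start every clock cycle (initiation interval of 1).
    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += a_buf[i] * b_buf[i];
    }
    return acc;
}
```

Changing the partition factor or the initiation interval and re-running synthesis is one small instance of the design space exploration the abstract describes: the C++ source stays the same while the generated hardware trades area for throughput.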

    Memory Consistency and Cache Coherency in Network-on-Chip Based Multi-Core Systems

    The complexity of modern Systems-on-Chip (SoCs) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly also to communication characteristics. With scalability challenges in mind, Networks-on-Chip (NoCs) are already the de facto standard for the communication backbone of SoC systems. As such, those systems increasingly target parallel execution of user-defined, real-time applications, while the computer engineering community aims at hiding the underlying platform-specific characteristics and providing the user with platform-independent services. Shared memory services are quite often a crucial property of such systems; therefore, providing a coherent view, ensuring memory consistency, and still achieving the desired system performance is a huge challenge for scientists nowadays. With the invention of 3D integration, and the opportunity to stack memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing. In this thesis, the concept of a scalable coherency protocol that dynamically adapts to system inputs and shared resources is presented. The protocol's ingredients, structure and internal module interactions are described in detail. The conceptual idea of this protocol, influenced by widely accepted best practices in bus-based systems as well as in other NoC systems, is implemented for one particular type of NoC platform: XhiNoC (extendable Hierarchical Network-on-Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is confirmed by simulation-based experimental results.