1,323 research outputs found

    GPU peer-to-peer techniques applied to a cluster interconnect

    Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data-transmission times, primarily by avoiding staging through host memory, they require specific hardware features that are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, along with some issues raised while employing it in a higher-level API such as MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they benefit from the GPU peer-to-peer method.
    Comment: paper accepted to CASS 201
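    As general background (the standard CUDA peer-to-peer API, not the paper's FPGA interconnect), a minimal sketch of the mechanism the abstract builds on: a direct GPU-to-GPU copy over PCIe versus the staged path through host memory that it avoids. Device ordinals and buffer sizes are illustrative assumptions.

        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            const size_t bytes = 1 << 20;              // 1 MiB, illustrative
            int p2p = 0;
            cudaDeviceCanAccessPeer(&p2p, 0, 1);       // can GPU 0 reach GPU 1 directly?

            float *src = nullptr, *dst = nullptr;
            cudaSetDevice(0); cudaMalloc(&src, bytes);
            cudaSetDevice(1); cudaMalloc(&dst, bytes);

            if (p2p) {
                cudaSetDevice(0);
                cudaDeviceEnablePeerAccess(1, 0);      // map GPU 1 into GPU 0's address space
                cudaMemcpyPeer(dst, 1, src, 0, bytes); // direct copy across the PCIe bus
            } else {
                float *bounce = nullptr;               // the staged path the paper avoids
                cudaMallocHost(&bounce, bytes);        // pinned host staging buffer
                cudaMemcpy(bounce, src, bytes, cudaMemcpyDeviceToHost);
                cudaMemcpy(dst, bounce, bytes, cudaMemcpyHostToDevice);
                cudaFreeHost(bounce);
            }
            printf("peer access 0->1: %s\n", p2p ? "yes" : "no");
            return 0;
        }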

    Design and management of image processing pipelines within CPS: Acquired experience towards the end of the FitOptiVis ECSEL Project

    Cyber-Physical Systems (CPSs) are dynamic, reactive systems that interact with processes, the environment and, sometimes, humans. They are often distributed, with sensors and actuators, and are characterized as smart, adaptive and predictive, reacting in real time. Image- and video-processing pipelines are a prime source of environmental information, allowing such systems to make better decisions based on what they see. In FitOptiVis, we are therefore developing novel methods and tools to integrate complex image- and video-processing pipelines. FitOptiVis aims to deliver a reference architecture for describing and optimizing quality and resource management for imaging and video pipelines in CPSs, at both design time and run time. The architecture is concretized in low-power, high-performance, smart components, and in methods and tools for combined design-time and run-time multi-objective optimization and adaptation within system and environment constraints.
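    Purely as an illustration of the run-time quality/resource adaptation described above (not the FitOptiVis reference architecture; all names and the policy are hypothetical), a pipeline stage might expose a set-point that a run-time manager adjusts against a timing budget:

        #include <algorithm>
        #include <cstdio>

        // Hypothetical set-point: output resolution and frame rate of one stage.
        struct SetPoint { int width = 640, height = 480, fps = 30; };

        class PipelineStage {
        public:
            // Degrade the frame rate when over budget; restore it when there is slack.
            void adapt(double measured_ms, double budget_ms) {
                if (measured_ms > budget_ms)
                    sp_.fps = std::max(sp_.fps / 2, 5);
                else if (measured_ms < 0.5 * budget_ms)
                    sp_.fps = std::min(sp_.fps * 2, 60);
            }
            const SetPoint& set_point() const { return sp_; }
        private:
            SetPoint sp_;
        };

        int main() {
            PipelineStage stage;
            stage.adapt(45.0, 33.3);   // measured 45 ms against a 33.3 ms budget: degrade
            std::printf("adapted fps: %d\n", stage.set_point().fps);
            return 0;
        }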

    ํด๋ผ์šฐ๋“œ ์ปดํ“จํŒ… ํ™˜๊ฒฝ๊ธฐ๋ฐ˜์—์„œ ์ˆ˜์น˜ ๋ชจ๋ธ๋ง๊ณผ ๋จธ์‹ ๋Ÿฌ๋‹์„ ํ†ตํ•œ ์ง€๊ตฌ๊ณผํ•™ ์ž๋ฃŒ์ƒ์„ฑ์— ๊ด€ํ•œ ์—ฐ๊ตฌ

    Thesis (Ph.D.) -- Seoul National University Graduate School : College of Natural Sciences, School of Earth and Environmental Sciences, 2022. 8. Yang-Ki Cho.
    To investigate changes and phenomena on Earth, many scientists use high-resolution model results based on numerical models, or develop and apply machine learning-based prediction models trained on observed data. As information technology advances, a practical methodology is needed for generating earth science data through local and global high-resolution numerical modeling and machine learning. This study proposes data generation and processing using high-resolution numerical earth science models and machine learning-based prediction models in a cloud environment. To verify the reproducibility and portability of a high-resolution numerical ocean model implemented on cloud computing, I simulated and analyzed the performance of a numerical ocean model at various resolutions in a model domain including the Northwest Pacific Ocean, the East Sea, and the Yellow Sea. Containerization made it possible to respond to changes in diverse infrastructure environments and to achieve computational reproducibility effectively. Data augmentation of subsurface temperature data was performed with generative models to prepare large datasets for training a model that predicts the vertical temperature distribution in the ocean; the augmentation compensated for observed data that are relatively scarce compared to satellite datasets. In addition to observational data, HYCOM datasets were used for performance comparison, and the distribution of the augmented data was similar to that of the input data. An ensemble method combining stand-alone predictive models improved prediction performance over a model based only on the existing observed data. Data synthesis required large amounts of computational resources and was performed in a cloud-based graphics processing unit (GPU) environment. High-resolution numerical ocean model simulation, predictive model development, and the data generation method can improve predictive capabilities in ocean science, and the cloud-based numerical modeling and generative models used in this study can be applied broadly across the earth sciences.

    Contents:
    1. General Introduction
    2. Performance of numerical ocean modeling on cloud computing
        2.1. Introduction
        2.2. Cloud Computing
            2.2.1. Cloud computing overview
            2.2.2. Commercial cloud computing services
        2.3. Numerical model for performance analysis of commercial clouds
            2.3.1. High Performance Linpack Benchmark
            2.3.2. Benchmark Sustainable Memory Bandwidth and Memory Latency
            2.3.3. Numerical Ocean Model
            2.3.4. Deployment of Numerical Ocean Model and Benchmark Packages on Cloud Clusters
        2.4. Simulation results
            2.4.1. Benchmark simulation
            2.4.2. Ocean model simulation
        2.5. Analysis of ROMS performance on commercial clouds
            2.5.1. Performance of ROMS according to H/W resources
            2.5.2. Performance of ROMS according to grid size
        2.6. Summary
    3. Reproducibility of numerical ocean model on the cloud computing
        3.1. Introduction
        3.2. Containerization of numerical ocean model
            3.2.1. Container virtualization
            3.2.2. Container-based architecture for HPC
            3.2.3. Container-based architecture for hybrid cloud
        3.3. Materials and Methods
            3.3.1. Comparison of traditional and container based HPC cluster workflows
            3.3.2. Model domain and datasets for numerical simulation
            3.3.3. Building the container image and registration in the repository
            3.3.4. Configuring a numeric model execution cluster
        3.4. Results and Discussion
            3.4.1. Reproducibility
            3.4.2. Portability and Performance
        3.5. Conclusions
    4. Generative models for the prediction of ocean temperature profile
        4.1. Introduction
        4.2. Materials and Methods
            4.2.1. Model domain and datasets for predicting the subsurface temperature
            4.2.2. Model architecture for predicting the subsurface temperature
            4.2.3. Neural network generative models
            4.2.4. Prediction Models
            4.2.5. Accuracy
        4.3. Results and Discussion
            4.3.1. Data Generation
            4.3.2. Ensemble Prediction
            4.3.3. Limitations of this study and future works
        4.4. Conclusion
    5. Summary and conclusion
    6. References
    7. Abstract (in Korean)
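    As a purely illustrative sketch of the ensemble step described in the abstract above, the snippet below averages the subsurface-temperature predictions of several stand-alone models at a few depth levels; the three "model" outputs are stubbed placeholder values, not results from the thesis.

        #include <array>
        #include <cstdio>

        int main() {
            constexpr int kDepths = 4;                     // illustrative depth levels
            // Stubbed predictions from three stand-alone models (degC per depth level).
            std::array<std::array<double, kDepths>, 3> preds = {{
                {18.2, 15.1, 11.4, 8.9},                   // e.g., observation-trained model
                {18.6, 15.4, 11.0, 9.2},                   // e.g., augmented-data model
                {18.4, 14.9, 11.2, 9.0},                   // e.g., HYCOM-trained model
            }};
            for (int d = 0; d < kDepths; ++d) {
                double sum = 0.0;
                for (const auto& p : preds) sum += p[d];   // combine the stand-alone models
                std::printf("depth level %d: ensemble mean %.2f degC\n",
                            d, sum / preds.size());
            }
            return 0;
        }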

    A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH

    This paper explores the design and implementation of BORPH, an operating system designed for FPGA-based reconfigurable computers. Hardware designs execute as normal UNIX processes under BORPH, with access to standard OS services such as file system support. Hardware and software components of user designs may therefore run as communicating processes within BORPH's runtime environment. The familiar, language-independent UNIX kernel interface facilitates easy design reuse and rapid application development. To develop hardware designs, a Simulink-based design flow that integrates with BORPH is employed. The performance of BORPH on two on-chip systems implemented on a BEE2 platform is compared. © 2008 ACM.
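    The key abstraction in BORPH, hardware designs running as normal UNIX processes reachable through standard OS services, can be pictured with ordinary POSIX file I/O. The sketch below is illustrative only: the path and register layout are hypothetical assumptions, not BORPH's actual interface, while open/read/write are the standard POSIX calls.

        #include <fcntl.h>
        #include <unistd.h>
        #include <cstdio>
        #include <cstdint>

        int main() {
            // Hypothetical path: a hardware process exposing its registers as a file.
            const char* regs = "/proc/hwproc/1234/regs";
            int fd = open(regs, O_RDWR);
            if (fd < 0) { perror("open"); return 1; }

            uint32_t cmd = 0x1;                  // hypothetical "start" command word
            write(fd, &cmd, sizeof cmd);         // plain write(2) into the design

            uint32_t status = 0;
            read(fd, &status, sizeof status);    // plain read(2) back out
            printf("hardware status: 0x%08x\n", (unsigned)status);
            close(fd);
            return 0;
        }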

    A multi-port 10GbE PCIe NIC featuring UDP offload and GPUDirect capabilities

    NaNet-10 is a four-port 10GbE PCIe Network Interface Card designed for low-latency real-time operations with GPU systems. To this end, the design includes a UDP offload module for fast, clock-cycle-deterministic handling of the transport-layer protocol, plus a GPUDirect P2P/RDMA engine for low-latency communication with NVIDIA Tesla GPU devices. A dedicated Multi-Stream module can optionally process input UDP streams before data is delivered through PCIe DMA to its destination devices, reorganizing data from different streams to optimize the subsequent computation. NaNet-10 will be integrated into the CERN NA62 experiment to assess the suitability of GPGPU systems as real-time triggers; results and lessons learned from this activity are reported herein.
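    To make the data path concrete, here is a hedged host-side sketch of the GPUDirect receive path the abstract describes; the CUDA calls are standard, while nanet_register_gpu_buffer() is a stub standing in for the card's driver interface, whose real API the abstract does not give.

        #include <cuda_runtime.h>
        #include <cstdio>
        #include <cstdint>

        // Stub only: an assumed name, not the real NaNet-10 driver entry point.
        extern "C" int nanet_register_gpu_buffer(void* dev_ptr, size_t bytes) {
            (void)dev_ptr; (void)bytes;
            return 0;   // pretend the NIC will now DMA UDP payloads into this buffer
        }

        int main() {
            const size_t bytes = 64 * 1024;   // illustrative receive-ring size
            uint8_t* gpu_ring = nullptr;
            cudaMalloc(&gpu_ring, bytes);     // destination buffer lives in GPU memory

            // GPUDirect P2P/RDMA path: the NIC writes datagram payloads straight
            // into gpu_ring over PCIe, with no host bounce buffer in between.
            nanet_register_gpu_buffer(gpu_ring, bytes);

            // The conventional path this avoids: recv() into host memory, then one
            // cudaMemcpyHostToDevice per datagram, adding a copy and latency.
            std::printf("registered %zu-byte GPU ring\n", bytes);
            cudaFree(gpu_ring);
            return 0;
        }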

    Hierarchical Agent-based Adaptation for Self-Aware Embedded Computing Systems

    Transferred from Doria.

    Using SMT to accelerate nested virtualization

    IaaS datacenters offer virtual machines (VMs) to their clients, who in turn sometimes deploy their own virtualized environments, thereby running a VM inside a VM. This is known as nested virtualization. VMs are intrinsically slower than bare-metal execution, as they often trap into their hypervisor to perform tasks like operating virtual I/O devices. Each VM trap requires loading and storing dozens of registers to switch between the VM and hypervisor contexts, thereby incurring costly runtime overheads. Nested virtualization further magnifies these overheads, as every VM trap in a traditional virtualized environment triggers at least twice as many traps. We propose to leverage the replicated thread execution resources in simultaneous multithreaded (SMT) cores to alleviate the overheads of VM traps in nested virtualization. Our proposed architecture introduces a simple mechanism to colocate different VMs and hypervisors on separate hardware threads of a core, and replaces the costly context switches of VM traps with simple thread stall and resume events. More concretely, as each thread in an SMT core has its own register set, trapping between VMs and hypervisors does not involve costly context switches, but simply requires the core to fetch instructions from a different hardware thread. Furthermore, our inter-thread communication mechanism allows a hypervisor to directly access and manipulate the registers of its subordinate VMs, given that they both share the same in-core physical register file. A model of our architecture shows up to 2.3× and 2.6× better I/O latency and bandwidth, respectively. We also show a software-only prototype of the system using existing SMT architectures, with up to 1.3× and 1.5× better I/O latency and bandwidth, respectively, and 1.2-2.2× speedups on various real-world applications.
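    The abstract's argument can be made concrete with a toy cost model: nested virtualization at least doubles the trap count, while the SMT scheme replaces each full register context switch with a cheaper thread stall/resume. All cycle costs below are assumed, illustrative numbers, not measurements from the paper.

        #include <cstdio>

        int main() {
            const double ctx_switch = 1000.0; // save/restore dozens of registers (assumed)
            const double smt_stall  = 100.0;  // stall/resume a sibling thread (assumed)

            // Single-level virtualization: one guest->hypervisor->guest round trip.
            double single = 2 * ctx_switch;
            // Nested: the L1 hypervisor's own exits trap to L0, so each guest trap
            // triggers at least twice as many context switches.
            double nested = 2 * single;
            // SMT colocation: each transition becomes a thread stall/resume instead,
            // since every hardware thread keeps its own private register set.
            double nested_smt = 4 * smt_stall;

            printf("single-level trap : %.0f cycles\n", single);
            printf("nested trap       : %.0f cycles\n", nested);
            printf("nested trap (SMT) : %.0f cycles\n", nested_smt);
            return 0;
        }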