
    Quantifying the Effect of Matrix Structure on Multithreaded Performance of the SpMV Kernel

    Sparse matrix-vector multiplication (SpMV) is the core operation in many common network and graph analytics, but poor performance of the SpMV kernel handicaps these applications. This work quantifies the effect of matrix structure on SpMV performance, using Intel's VTune tool for the Sandy Bridge architecture. Two types of sparse matrices are considered: finite difference (FD) matrices, which are structured, and R-MAT matrices, which are unstructured. Analysis of cache behavior and prefetcher activity reveals that the SpMV kernel performs far worse with R-MAT matrices than with FD matrices, due to the difference in matrix structure. To address the problems caused by unstructured matrices, novel architecture improvements are proposed. Comment: 6 pages, 7 figures. IEEE HPEC 201
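    To make the kernel under study concrete, below is a minimal sketch of a multithreaded CSR-format SpMV routine, assuming OpenMP is available; the function and variable names are illustrative, not taken from the paper. The indirect gather on x[col_idx[k]] is where structured FD matrices stay cache- and prefetcher-friendly while unstructured R-MAT matrices do not.

```cpp
// Minimal sketch of a multithreaded CSR SpMV kernel (y = A*x), assuming OpenMP.
// Irregular column indices (as in R-MAT matrices) make the x[] accesses hard
// for caches and prefetchers; banded FD matrices keep them local.
#include <cstddef>
#include <vector>

void spmv_csr(const std::vector<std::size_t>& row_ptr,   // size nrows+1
              const std::vector<std::size_t>& col_idx,   // size nnz
              const std::vector<double>& val,             // size nnz
              const std::vector<double>& x,               // size ncols
              std::vector<double>& y)                     // size nrows
{
    const std::size_t nrows = row_ptr.size() - 1;
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::size_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];  // gather on x: locality depends on matrix structure
        y[i] = sum;
    }
}
```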

    Elastic and cost-effective data carrier architecture for smart contract in blockchain

    Smart contracts, which help developers deploy decentralized and secure blockchain applications, are one of the most promising technologies for the modern Internet of Things (IoT) ecosystem. However, Ethereum smart contracts lack the ability to communicate with the outside IoT environment. To enable smart contracts to fetch off-chain data, this paper proposes a data carrier architecture that is cost-effective and elastic for blockchain-enabled IoT environments. Three components, namely Mission Manager, Task Publisher and Worker, are presented in the data carrier architecture to interact with the contract developer, the smart contract, the Ethereum node and off-chain data sources. Selective solutions are also proposed for filtering smart contract events and decoding event logs to fit different requirements. The evaluation results and discussions show that the proposed system decreases deployment cost by about 20 USD on average for every smart contract, and that it is more efficient and elastic than the Oraclize Oracle data carrier service. This work was supported by the National Natural Science Foundation of China (Grant No. 61702102), the Natural Science Foundation of Fujian Province, China (Grant No. 2018J05100), the Foundation for Distinguished Young Scholars of Fujian Agriculture and Forestry University (Grant No. xjq201809), and in part by the MOST of Taiwan (Grant No. 107-2623-E-009-006-D). Liu, X.; Muhammad, K.; Lloret, J.; Chen, Y.; Yuan, S. (2019). Elastic and cost-effective data carrier architecture for smart contract in blockchain. Future Generation Computer Systems 100:590-599. https://doi.org/10.1016/j.future.2019.05.042
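    As a rough illustration of the event-filtering step that sits between the smart contract and the off-chain side, the sketch below matches emitted logs against subscribed event signatures before any off-chain fetch would be triggered. The Event structure, topic strings, and fetch step are hypothetical stand-ins, not the paper's actual interfaces or an Ethereum client API.

```cpp
// Conceptual sketch of a Worker-style event filter: keep only logs whose event
// signature (topic0) the data carrier has subscribed to. All names and values
// here are illustrative placeholders.
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct Event {
    std::string contract_address;  // emitting contract
    std::string topic0;            // hash of the event signature
    std::string data;              // ABI-encoded payload (left opaque here)
};

bool wants(const std::unordered_set<std::string>& subscribed, const Event& e) {
    return subscribed.count(e.topic0) > 0;  // selective filtering by event signature
}

int main() {
    // The Worker subscribes only to a hypothetical "request off-chain data" event.
    std::unordered_set<std::string> subscribed = {"0xRequestDataTopicHash"};

    std::vector<Event> block_logs = {
        {"0xContractA", "0xRequestDataTopicHash", "<encoded request>"},
        {"0xContractB", "0xSomeOtherTopicHash",   "<irrelevant>"},
    };

    for (const auto& e : block_logs)
        if (wants(subscribed, e))
            std::cout << "fetch off-chain data for " << e.contract_address << "\n";
}
```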

    Improving GPU cache hierarchy performance with a fetch and replacement cache

    In the last few years, GPGPU computing has become one of the most popular computing paradigms in high-performance computers due to its excellent performance-to-power ratio. The memory requirements of GPGPU applications differ widely from those of their CPU counterparts. The number of memory accesses is several orders of magnitude higher in GPU applications than in CPU applications, and they present disparate access patterns. Because of this, large and highly associative Last-Level Caches (LLCs) bring much lower performance gains in GPUs than in CPUs. This paper presents a novel approach to managing LLC misses that efficiently improves LLC hit ratio, memory-level parallelism, and miss latencies in GPU systems. The proposed approach leverages a small additional Fetch and Replacement Cache (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. Fetched blocks are then swapped with the victim blocks to be replaced in the LLC, after which the eviction of victim blocks is performed from the FRC. This management approach improves performance for three main reasons: (i) the lifetime of blocks being replaced is increased, (ii) the main memory path is unclogged on long bursts of LLC misses, and (iii) the average L2 miss latency is reduced. Experimental results show that our proposal increases performance (OPC) by over 25% in most of the studied applications, reaching improvements of up to 150% in some applications.
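    A toy software model of the FRC management just described is sketched below: on an LLC miss the incoming block is staged in an FRC entry, and only after the fetch completes is it swapped with the LLC victim, whose eviction then proceeds from the FRC. The structure names and the flat map standing in for set associativity are illustrative assumptions, not the paper's hardware design.

```cpp
// Toy model of the Fetch and Replacement Cache (FRC) idea: an incoming block is
// staged in the FRC until memory returns it, then swapped with the LLC victim,
// and the victim is retired from the FRC afterwards.
#include <cstdint>
#include <deque>
#include <unordered_map>

struct Block { std::uint64_t tag; bool dirty; };

struct LLC {
    std::unordered_map<std::uint64_t, Block> lines;  // tag -> block (associativity abstracted away)
};

struct FRC {
    std::deque<Block> staging;   // blocks in flight or pending eviction
    std::size_t capacity;
};

// On an LLC miss: allocate an FRC entry instead of an LLC line, so the victim
// keeps living in the LLC while the fetch is outstanding (reason (i) above).
bool on_llc_miss(FRC& frc, std::uint64_t miss_tag) {
    if (frc.staging.size() >= frc.capacity) return false;   // stall only when the FRC is full
    frc.staging.push_back({miss_tag, false});
    return true;
}

// When the fetch completes: swap the fetched block with the chosen victim,
// then retire (write back or drop) the victim from the FRC.
void on_fill(FRC& frc, LLC& llc, std::uint64_t fetched_tag, std::uint64_t victim_tag) {
    Block victim = llc.lines[victim_tag];
    llc.lines.erase(victim_tag);
    llc.lines[fetched_tag] = {fetched_tag, false};
    (void)victim;                                            // victim write-back modeled as a no-op here
    if (!frc.staging.empty()) frc.staging.pop_front();
}
```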

    Wrong-Path-Aware Entangling Instruction Prefetcher

    © 2023 IEEE. This is the accepted version of a work published in IEEE Transactions on Computers, made available under the CC-BY 4.0 license (http://creativecommons.org/licenses/by/4.0/); the final published version is at DOI 10.1109/TC.2023.3337308. Instruction prefetching is instrumental for guaranteeing a high flow of instructions through the processor front end for applications whose working set does not fit in the lower-level caches. Examples of such applications are server workloads, whose instruction footprints are constantly growing. There are two main techniques to mitigate this problem: fetch-directed prefetching (or a decoupled front end) and instruction cache (L1I) prefetching. This work extends the state-of-the-art Entangling prefetcher to avoid training during wrong-path execution. Our Entangling wrong-path-aware prefetcher is equipped with microarchitectural techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrong-path-aware solution. Next, we propose two microarchitectural optimizations able to further increase performance benefits by 1.8% on average. All this is achieved with just 304 bytes of additional storage. Finally, we study the interplay between the L1I prefetcher and a decoupled front end. Our analysis shows that, due to pollution caused by wrong-path instructions, the degree of decoupling cannot be increased without bound without negative effects on the energy-delay product (EDP). Furthermore, the closer to ideal the L1I prefetcher is, the less decoupling is required. For example, our Entangling prefetcher reaches an optimal EDP with a decoupling degree of 64 instructions.
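    One way to picture wrong-path-aware training is sketched below: training events observed while a branch is unresolved are buffered and committed only if the branch turns out to have been on the correct path. This is a conceptual sketch under assumed names; the actual Entangling prefetcher state and its filtering mechanisms are more involved.

```cpp
// Minimal sketch of "don't train on the wrong path": prefetcher training updates
// are buffered per in-flight branch and committed only once the branch resolves
// on the correct path. Structures and names are illustrative.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TrainingEvent { std::uint64_t trigger_pc; std::uint64_t target_line; };

class WrongPathAwareTrainer {
    std::unordered_map<std::uint64_t, std::vector<TrainingEvent>> pending_;  // branch id -> buffered events
public:
    // Buffer training observed while 'branch_id' is unresolved (speculative fetch).
    void observe(std::uint64_t branch_id, TrainingEvent ev) { pending_[branch_id].push_back(ev); }

    // Branch resolved: commit buffered training only if execution was on the correct path.
    template <typename Prefetcher>
    void resolve(std::uint64_t branch_id, bool correct_path, Prefetcher& pf) {
        auto it = pending_.find(branch_id);
        if (it == pending_.end()) return;
        if (correct_path)
            for (const auto& ev : it->second) pf.train(ev.trigger_pc, ev.target_line);
        pending_.erase(it);   // wrong-path events are simply dropped
    }
};
```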

    SysOptic: A Fine-Grained Monitoring System for Virtual Machines Based on PMU

    Modern cloud data centers frequently experience complex failure manifestations; failures have become the norm, not the exception. This is a key challenge, since designing reliable systems rests on monitoring system status and detecting anomalies at runtime. The Performance Monitoring Unit (PMU) on the CPU can obtain fine-grained monitoring data by adopting an interrupt sampling method based on hardware events. However, profilers in virtual machines fail to receive PMU-relevant information directly due to the limited capabilities of PMU virtualization. In this paper, we present SysOptic, a fine-grained monitoring system based on PMU virtualization. First, we propose a method to expose PMU information and ensure the visibility of such information at the virtual machine level. Second, to maximize PMU reusability, SysOptic supports PMU sharing and simultaneous monitoring across multiple virtual machines. Furthermore, we also describe how to synchronize interrupts from physical machines to virtual machines by injecting interrupts. Experimental results show that, with the aid of SysOptic, profiling tools in virtual machines are able to perceive the existence of the PMU and collect monitoring data. The additional overhead incurred by SysOptic is at most 9.8%.
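    For context, the snippet below shows the kind of hardware-event counting the PMU provides on bare-metal Linux via the perf_event_open system call; inside a guest this only works when the hypervisor virtualizes the PMU, which is the gap SysOptic targets. The counter choice and the toy workload are illustrative.

```cpp
// Count retired instructions for a small workload using the Linux PMU interface.
// This is the bare-metal view of what a guest-side profiler needs to see.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // hardware event: retired instructions
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0 /*this thread*/, -1 /*any cpu*/, -1, 0);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (volatile int i = 0; i < 1000000; ++i) {}   // workload to measure
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        std::printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
}
```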

    Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport

    Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy. Its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. We have implemented the Dose Planning Method (DPM) Monte Carlo dose calculation package (Sempau et al., Phys. Med. Biol. 45 (2000) 2263-2291) on GPU architecture under the CUDA platform. The implementation has been tested against the original sequential DPM code on CPU in phantoms with water-lung-water or water-bone-water slab geometry. A 20 MeV mono-energetic electron point source or a 6 MV photon point source is used in our validation. The results demonstrate adequate accuracy of our GPU implementation for both electron and photon beams in the radiotherapy energy range. Speedup factors of about 5.0 to 6.6 have been observed using an NVIDIA Tesla C1060 GPU card against a 2.27 GHz Intel Xeon CPU. Comment: 13 pages, 3 figures, and 1 table. Paper revised; figures updated.
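    As a purely conceptual illustration of the slab validation setup, the toy sketch below transports photons through a water-lung-water stack with a first-order attenuation model and deposits all energy at the first interaction site. The coefficients, step size, and energy are placeholders; the actual DPM code models coupled electron-photon physics, samples free paths exactly, and runs on the GPU under CUDA.

```cpp
// Toy CPU-side photon transport in a 1D water-lung-water slab phantom.
// All physical constants here are illustrative placeholders, not DPM physics.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const double slab_cm[3] = {5.0, 5.0, 5.0};     // water | lung | water thicknesses
    const double mu_cm[3]   = {0.05, 0.01, 0.05};  // placeholder attenuation coefficients (1/cm)
    const double depth_cm = slab_cm[0] + slab_cm[1] + slab_cm[2];
    const int    nbins = 30, nphotons = 100000;
    const double e0_mev = 6.0;                     // illustrative photon energy

    std::vector<double> dose(nbins, 0.0);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    auto material = [&](double z) { return z < slab_cm[0] ? 0 : (z < slab_cm[0] + slab_cm[1] ? 1 : 2); };

    for (int p = 0; p < nphotons; ++p) {
        double z = 0.0;
        const double dz = 0.05;                    // cm, small transport step
        while (z < depth_cm) {
            // First-order interaction probability over this step; a real code
            // samples the free path from the exponential distribution instead.
            if (u(rng) < mu_cm[material(z)] * dz) {
                dose[static_cast<int>(z / depth_cm * nbins)] += e0_mev;  // deposit locally and stop
                break;
            }
            z += dz;
        }
    }
    for (int b = 0; b < nbins; ++b)
        std::printf("depth %5.2f cm : %g MeV\n", (b + 0.5) * depth_cm / nbins, dose[b]);
}
```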

    Twinkle: A fast resource provisioning mechanism for internet services

    A key benefit of Amazon EC2-style cloud computing services is the ability to instantiate a large number of virtual machines (VMs) on the fly during flash crowd events. Most existing research focuses on policy decisions such as when and where to start a VM for an application. In this paper, we study a different problem: how can the VMs, and the applications inside them, be brought up as quickly as possible? This problem has not been solved satisfactorily in existing cloud services. We develop a fast-start technique for cloud applications by restoring previously created VM snapshots of fully initialized applications. We propose a set of optimizations, including working set estimation, demand prediction, and free page avoidance, that allow an application to start running with only partially loaded memory, yet without a noticeable performance penalty during its subsequent execution. We implement our system, called Twinkle, in the Xen hypervisor and employ the two-dimensional page walks supported by the latest virtualization technology. We use the RUBiS and TPC-W benchmarks to evaluate its performance under flash crowd and failover scenarios. The results indicate that Twinkle can provision VMs and restore QoS significantly faster than current approaches.
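    The working-set optimization can be pictured with the small sketch below: pages touched shortly after a profiled restore are recorded, and only that set is loaded eagerly on the next fast start, with the remaining pages faulted in on demand. The class and its interface are hypothetical illustrations, not Twinkle's Xen implementation.

```cpp
// Illustrative working-set estimator for partial snapshot loading: remember
// which guest pages were touched during a profiling window, then preload only
// those pages on the next restore (the rest are demand-paged).
#include <cstdint>
#include <unordered_set>
#include <vector>

class WorkingSetEstimator {
    std::unordered_set<std::uint64_t> hot_pages_;  // pages touched during the profiling window
public:
    void record_access(std::uint64_t guest_page) { hot_pages_.insert(guest_page); }

    // Pages to load eagerly when restoring the snapshot; everything else is
    // brought in lazily on first access.
    std::vector<std::uint64_t> preload_set() const {
        return {hot_pages_.begin(), hot_pages_.end()};
    }
};
```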