Variable-based multi-module data caches for clustered VLIW processors
Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories is to reduce their supply voltage and/or increase their threshold voltage, at the expense of a longer access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. The division is done on a per-variable basis, so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be configured either as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques to distribute variables between the two cache modules and to generate code accordingly. We have explored several cache configurations using the Mediabench suite and observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay squared. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.
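The abstract does not detail the compiler's distribution heuristic; as a rough illustration of the idea, the sketch below greedily steers profiled hot, latency-critical variables to the fast power-hungry module and the rest to the slow power-aware one. The `Variable` fields, the `assign_modules` helper, and the slot-count capacity model are hypothetical stand-ins, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Variable:
    name: str
    accesses: int           # profiled access count (assumed available)
    on_critical_path: bool  # profiled latency criticality (assumed)

def assign_modules(variables, fast_slots):
    """Greedy, illustrative heuristic: the hottest latency-critical
    variables go to the fast power-hungry module (up to its capacity);
    everything else goes to the slow power-aware module."""
    ranked = sorted(variables,
                    key=lambda v: (v.on_critical_path, v.accesses),
                    reverse=True)
    return ranked[:fast_slots], ranked[fast_slots:]

variables = [Variable("pixel_buf", 90_000, True),
             Variable("coeff_tab", 40_000, True),
             Variable("log_buf", 500, False)]
fast, slow = assign_modules(variables, fast_slots=2)
print([v.name for v in fast], [v.name for v in slow])
```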
Practical Fine-grained Privilege Separation in Multithreaded Applications
An inherent security limitation with the classic multithreaded programming
model is that all the threads share the same address space and, therefore, are
implicitly assumed to be mutually trusted. This assumption, however, does not
hold for many modern multithreaded applications that involve multiple
principals which do not fully trust each other. It remains challenging to
retrofit the classic multithreaded programming model so that security and
privilege separation can be achieved in multi-principal applications.
This paper proposes ARBITER, a run-time system and a set of security
primitives, aimed at fine-grained and data-centric privilege separation in
multithreaded applications. While enforcing effective isolation among
principals, ARBITER still allows flexible sharing and communication between
threads so that the multithreaded programming paradigm can be preserved. To
realize controlled sharing in a fine-grained manner, we created a novel
abstraction named ARBITER Secure Memory Segment (ASMS) and corresponding OS
support. Programmers express security policies by labeling data and principals
via ARBITER's API following a unified model. We ported a widely used,
in-memory database application (memcached) to the ARBITER system, changing
only around 100 LOC. Experiments indicate that the security-enhanced version
of the application incurs an average runtime overhead of only 5.6%.
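The exact label semantics are not given in the abstract; the following is a minimal DIFC-style sketch of how labeled secure memory segments might gate access by principal, assuming a simple subset rule. The class names (`Principal`, `SecureSegment`) and the rule itself are illustrative stand-ins, not ARBITER's actual ASMS API.

```python
class Principal:
    """A security principal holding a set of label tags."""
    def __init__(self, name, labels):
        self.name = name
        self.labels = set(labels)

class SecureSegment:
    """Toy stand-in for an ASMS: memory tagged with a label set."""
    def __init__(self, label):
        self.label = set(label)
        self.data = {}

    def _check(self, principal):
        # Illustrative rule: access requires holding every tag on
        # the segment (not ARBITER's actual policy semantics).
        if not self.label <= principal.labels:
            raise PermissionError(f"{principal.name} lacks {self.label}")

    def read(self, principal):
        self._check(principal)
        return dict(self.data)

    def write(self, principal, key, value):
        self._check(principal)
        self.data[key] = value

worker = Principal("worker", {"cache"})
auditor = Principal("auditor", {"cache", "secret"})
seg = SecureSegment({"secret"})
seg.write(auditor, "token", "xyz")  # allowed: auditor holds "secret"
try:
    seg.read(worker)                # denied: worker lacks "secret"
except PermissionError as e:
    print(e)
```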
2D Proactive Uplink Resource Allocation Algorithm for Event-Based MTC Applications
We propose a two-dimensional (2D) proactive uplink resource allocation
(2D-PURA) algorithm that aims to reduce the delay/latency in event-based
machine-type communications (MTC) applications. Specifically, when an event of
interest occurs at a device, it tends to spread to the neighboring devices.
Consequently, when a device has data to send to the base station (BS), its
neighbors are highly likely to transmit soon afterwards. Thus, we propose to cluster
devices in the neighborhood around the event, also referred to as the
disturbance region, into rings based on the distance from the original event.
To reduce the uplink latency, we then proactively allocate resources for these
rings. To evaluate the proposed algorithm, we analytically derive the mean
uplink delay, the proportion of resource conservation due to successful
allocations, and the proportion of uplink resource wastage due to unsuccessful
allocations for the 2D-PURA algorithm. Numerical results demonstrate that the
proposed method reduces the mean uplink delay by over 16.5 and 27 percent,
compared with the 1D algorithm and the standard method, respectively.
Comment: 6 pages, 6 figures. Published in the 2018 IEEE Wireless Communications
and Networking Conference (WCNC).
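As a toy illustration of the ring-clustering step, the sketch below buckets devices into rings by their distance from the event and pre-allocates uplink slots inner-ring first, reflecting that the disturbance spreads outward. The fixed `ring_width` and `slots_per_ring` parameters are assumptions for illustration, not the paper's allocation rule.

```python
import math

def ring_index(device_xy, event_xy, ring_width):
    """Cluster a device into a ring by its distance from the event."""
    return int(math.dist(device_xy, event_xy) // ring_width)

def proactive_schedule(devices, event_xy, ring_width, slots_per_ring):
    """Group devices into rings around the disturbance region and
    pre-allocate uplink slots ring by ring, inner rings first, since
    devices nearer the event are expected to transmit sooner."""
    rings = {}
    for dev_id, xy in devices.items():
        rings.setdefault(ring_index(xy, event_xy, ring_width), []).append(dev_id)
    schedule = {}
    for k in sorted(rings):
        for i, dev_id in enumerate(rings[k]):
            schedule[dev_id] = k * slots_per_ring + (i % slots_per_ring)
    return schedule

devices = {"d1": (1.0, 0.0), "d2": (4.5, 0.0), "d3": (0.5, 2.0)}
print(proactive_schedule(devices, event_xy=(0.0, 0.0),
                         ring_width=2.0, slots_per_ring=4))
```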
SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors
The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its
wide use in biological sequence database search. Unfortunately, the high
sensitivity comes at the expense of quadratic time complexity, which makes the
algorithm computationally demanding for large databases. In this paper, we
present SWAPHI, the first parallelized algorithm employing Xeon Phi
coprocessors to accelerate SW protein database search. SWAPHI is designed based
on the scale-and-vectorize approach, i.e. it boosts alignment speed by
effectively utilizing both the coarse-grained parallelism from the many
co-processing cores (scale) and the fine-grained parallelism from the 512-bit
wide single instruction, multiple data (SIMD) vectors within each core
(vectorize). By searching against the large UniProtKB/TrEMBL protein database,
SWAPHI achieves a performance of up to 58.8 billion cell updates per second
(GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors.
Furthermore, it demonstrates good parallel scalability across varying numbers
of coprocessors, and is also superior to both SWIPE on 16 high-end CPU cores
and BLAST+ on 8 cores when using four coprocessors, with maximum speedups of
1.52 and 1.86, respectively. SWAPHI is written in C++ (with a set of SIMD
intrinsics) and is freely available at http://swaphi.sourceforge.net.
Comment: A short version of this paper has been accepted by the IEEE ASAP 2014
conference.
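The quadratic cost arises from filling a dynamic-programming matrix with one cell update per residue pair; GCUPS counts precisely these updates. Below is a plain scalar reference scoring kernel, using a linear gap penalty for brevity (SWAPHI itself uses vectorized affine-gap kernels in C++), meant only to make the cell-update workload concrete.

```python
def sw_score(q, s, match=2, mismatch=-1, gap=-1):
    """Scalar Smith-Waterman local-alignment score with a linear gap
    penalty.  Fills len(q) x len(s) cells: this quadratic work is what
    SWAPHI distributes across Xeon Phi cores (scale) and 512-bit SIMD
    lanes (vectorize)."""
    rows, cols = len(q) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if q[i-1] == s[j-1] else mismatch)
            # Local alignment: scores are clamped at zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(sw_score("HEAGAWGHEE", "PAWHEAE"))  # small protein-like example
```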
A Survey of Techniques for Improving Security of GPUs
The graphics processing unit (GPU), although a powerful performance booster,
also has many security vulnerabilities. Due to these, the GPU can act as a
safe haven for stealthy malware and as the weakest "link" in the security "chain".
In this paper, we present a survey of techniques for analyzing and improving
GPU security. We classify the works on key attributes to highlight their
similarities and differences. Beyond informing users and researchers about
GPU security techniques, this survey aims to increase their awareness of GPU
security vulnerabilities and potential countermeasures.
Performance of distributed mechanisms for flow admission in wireless ad hoc networks
Given a wireless network where some pairs of communication links interfere
with each other, we study sufficient conditions for determining whether a given
set of minimum bandwidth quality-of-service (QoS) requirements can be
satisfied. We are especially interested in algorithms which have low
communication overhead and low processing complexity. The interference in the
network is modeled using a conflict graph whose vertices correspond to the
communication links in the network. Two links are adjacent in this graph if and
only if they interfere with each other due to being in the same vicinity and
hence cannot be simultaneously active. The problem of scheduling the
transmission of the various links is then essentially a fractional, weighted
vertex coloring problem, for which upper bounds on the fractional chromatic
number are sought using only localized information. We recall some distributed
algorithms for this problem, and then assess their worst-case performance. Our
results on this fundamental problem imply that for some well known classes of
networks and interference models, the performance of these distributed
algorithms is within a bounded factor away from that of an optimal, centralized
algorithm. The performance bounds are simple expressions in terms of graph
invariants. It is seen that the induced star number of a network plays an
important role in the design and performance of such networks.
Comment: 21 pages, submitted. Journal version of arXiv:0906.378
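To make the flavor of such localized tests concrete, here is one standard conservative sufficient condition (not necessarily the exact condition analyzed in the paper): admit the demands if, at every link, the link's own demand plus the demands of its conflict-graph neighbors fits within one unit of time. Each link can verify this using only information from its neighbors, so no global coordination is needed.

```python
def locally_admissible(demand, neighbors):
    """Sufficient (conservative) schedulability test: for every link v,
    the total demand over its closed neighborhood N[v] in the conflict
    graph must fit in one unit of time.  Purely local information."""
    return all(demand[v] + sum(demand[u] for u in neighbors[v]) <= 1
               for v in demand)

# Conflict graph: a 5-cycle of links, each demanding 1/3 of the time.
neighbors = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
demand = {i: 1 / 3 for i in range(5)}
print(locally_admissible(demand, neighbors))  # True: 3 * (1/3) <= 1
```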
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Modern GPUs require an enormous register file (RF) to store the context of
thousands of active threads. It consumes considerable energy and contains
multiple large banks to provide enough throughput. Thus, an RF caching mechanism
can significantly improve the performance and energy consumption of GPUs by
avoiding reads from the large banks, which consume significant energy and may
cause port conflicts.
This paper introduces an energy-efficient RF caching mechanism called Malekeh
that repurposes an existing component in GPUs' RF to operate as a cache in
addition to its original functionality. In this way, Malekeh minimizes the
overhead of adding an RF cache to GPUs. In addition, Malekeh leverages an issue
scheduling policy that utilizes the reuse distance of the values in the RF
cache and is controlled by a dynamic algorithm. The goal is to adapt the issue
policy to the runtime program characteristics to maximize the GPU's performance
and the hit ratio of the RF cache. The reuse distance is approximated by the
compiler using profiling and is used at run time by the proposed caching
scheme. We show that Malekeh reduces the number of reads to the RF banks by
46.4% and the dynamic energy of the RF by 28.3%. Moreover, it improves
performance by 6.1% while adding only 2KB of extra storage per core to the
baseline 256KB RF, which represents a negligible overhead of 0.78%.
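As a rough sketch of the underlying idea, the toy simulation below uses compiler-style reuse distances to decide which registers a small RF cache should keep, counting how many bank reads are avoided. The per-register static reuse distance and the farthest-reuse eviction rule are simplifying assumptions for illustration, not Malekeh's actual mechanism.

```python
def simulate_rf_cache(trace, reuse_dist, capacity):
    """Count RF-bank reads avoided by a tiny register-file cache.
    `trace` is the sequence of register ids read by the issue stage;
    `reuse_dist[r]` is a compiler-profiled estimate of the distance to
    r's next reuse.  On eviction, drop the register whose next reuse
    is predicted to be farthest away."""
    cache, bank_reads, hits = set(), 0, 0
    for r in trace:
        if r in cache:
            hits += 1        # served by the RF cache, no bank access
        else:
            bank_reads += 1  # must read the large, power-hungry bank
            if len(cache) >= capacity:
                cache.remove(max(cache,
                                 key=lambda x: reuse_dist.get(x, 10**9)))
            cache.add(r)
    return hits, bank_reads

trace = ["r1", "r2", "r1", "r3", "r1", "r2"]
reuse = {"r1": 2, "r2": 4, "r3": 9}
print(simulate_rf_cache(trace, reuse, capacity=2))  # (hits, bank reads)
```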