Variable-based multi-module data caches for clustered VLIW processors
Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories is to reduce their supply voltage and/or increase their threshold voltage, at the expense of a longer access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. The division is done on a per-variable basis, so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be configured either as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques to distribute variables between the two cache modules and to generate code accordingly. We have explored several cache configurations using the Mediabench suite and observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay squared. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.
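The abstract does not detail the compiler's distribution heuristic; as a rough illustration of the idea, the sketch below greedily steers profiled hot, latency-critical variables to the fast power-hungry module and the rest to the slow power-aware one. The `Variable` fields, the `assign_modules` helper, and the slot-count capacity model are hypothetical stand-ins, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Variable:
    name: str
    accesses: int           # profiled access count (assumed available)
    on_critical_path: bool  # profiled latency criticality (assumed)

def assign_modules(variables, fast_slots):
    """Greedy, illustrative heuristic: the hottest latency-critical
    variables go to the fast power-hungry module (up to its capacity);
    everything else goes to the slow power-aware module."""
    ranked = sorted(variables,
                    key=lambda v: (v.on_critical_path, v.accesses),
                    reverse=True)
    return ranked[:fast_slots], ranked[fast_slots:]

variables = [Variable("pixel_buf", 90_000, True),
             Variable("coeff_tab", 40_000, True),
             Variable("log_buf", 500, False)]
fast, slow = assign_modules(variables, fast_slots=2)
print([v.name for v in fast], [v.name for v in slow])
```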
Practical Fine-grained Privilege Separation in Multithreaded Applications
An inherent security limitation with the classic multithreaded programming
model is that all the threads share the same address space and, therefore, are
implicitly assumed to be mutually trusted. This assumption, however, does not
hold for many modern multithreaded applications that involve multiple
principals which do not fully trust each other. It remains challenging to
retrofit the classic multithreaded programming model so that security and
privilege separation can be achieved in multi-principal applications.
This paper proposes ARBITER, a run-time system and a set of security
primitives, aimed at fine-grained and data-centric privilege separation in
multithreaded applications. While enforcing effective isolation among
principals, ARBITER still allows flexible sharing and communication between
threads so that the multithreaded programming paradigm can be preserved. To
realize controlled sharing in a fine-grained manner, we created a novel
abstraction named ARBITER Secure Memory Segment (ASMS) and corresponding OS
support. Programmers express security policies by labeling data and principals
via ARBITER's API following a unified model. We ported a widely used,
in-memory database application (memcached) to the ARBITER system, changing
only around 100 LOC. Experiments indicate that the security-enhanced version
of the application incurs an average runtime overhead of only 5.6%.
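The exact label semantics are not given in the abstract; the following is a minimal DIFC-style sketch of how labeled secure memory segments might gate access by principal, assuming a simple subset rule. The class names (`Principal`, `SecureSegment`) and the rule itself are illustrative stand-ins, not ARBITER's actual ASMS API.

```python
class Principal:
    """A security principal holding a set of label tags."""
    def __init__(self, name, labels):
        self.name = name
        self.labels = set(labels)

class SecureSegment:
    """Toy stand-in for an ASMS: memory tagged with a label set."""
    def __init__(self, label):
        self.label = set(label)
        self.data = {}

    def _check(self, principal):
        # Illustrative rule: access requires holding every tag on
        # the segment (not ARBITER's actual policy semantics).
        if not self.label <= principal.labels:
            raise PermissionError(f"{principal.name} lacks {self.label}")

    def read(self, principal):
        self._check(principal)
        return dict(self.data)

    def write(self, principal, key, value):
        self._check(principal)
        self.data[key] = value

worker = Principal("worker", {"cache"})
auditor = Principal("auditor", {"cache", "secret"})
seg = SecureSegment({"secret"})
seg.write(auditor, "token", "xyz")  # allowed: auditor holds "secret"
try:
    seg.read(worker)                # denied: worker lacks "secret"
except PermissionError as e:
    print(e)
```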
2D Proactive Uplink Resource Allocation Algorithm for Event-Based MTC Applications
We propose a two-dimensional (2D) proactive uplink resource allocation
(2D-PURA) algorithm that aims to reduce the delay/latency in event-based
machine-type communications (MTC) applications. Specifically, when an event of
interest occurs at a device, it tends to spread to the neighboring devices.
Consequently, when a device has data to send to the base station (BS), its
neighbors are highly likely to transmit soon afterwards. Thus, we propose to cluster
devices in the neighborhood around the event, also referred to as the
disturbance region, into rings based on the distance from the original event.
To reduce the uplink latency, we then proactively allocate resources for these
rings. To evaluate the proposed algorithm, we analytically derive the mean
uplink delay, the proportion of resource conservation due to successful
allocations, and the proportion of uplink resource wastage due to unsuccessful
allocations for the 2D-PURA algorithm. Numerical results demonstrate that the
proposed method reduces the mean uplink delay by over 16.5 and 27 percent,
compared with the 1D algorithm and the standard method, respectively.
Comment: 6 pages, 6 figures. Published in the 2018 IEEE Wireless Communications
and Networking Conference (WCNC).
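As a toy illustration of the ring-clustering step, the sketch below buckets devices into rings by their distance from the event and pre-allocates uplink slots inner-ring first, reflecting that the disturbance spreads outward. The fixed `ring_width` and `slots_per_ring` parameters are assumptions for illustration, not the paper's allocation rule.

```python
import math

def ring_index(device_xy, event_xy, ring_width):
    """Cluster a device into a ring by its distance from the event."""
    return int(math.dist(device_xy, event_xy) // ring_width)

def proactive_schedule(devices, event_xy, ring_width, slots_per_ring):
    """Group devices into rings around the disturbance region and
    pre-allocate uplink slots ring by ring, inner rings first, since
    devices nearer the event are expected to transmit sooner."""
    rings = {}
    for dev_id, xy in devices.items():
        rings.setdefault(ring_index(xy, event_xy, ring_width), []).append(dev_id)
    schedule = {}
    for k in sorted(rings):
        for i, dev_id in enumerate(rings[k]):
            schedule[dev_id] = k * slots_per_ring + (i % slots_per_ring)
    return schedule

devices = {"d1": (1.0, 0.0), "d2": (4.5, 0.0), "d3": (0.5, 2.0)}
print(proactive_schedule(devices, event_xy=(0.0, 0.0),
                         ring_width=2.0, slots_per_ring=4))
```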
SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors
The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its
wide use in biological sequence database search. Unfortunately, the high
sensitivity comes at the expense of quadratic time complexity, which makes the
algorithm computationally demanding for large databases. In this paper, we
present SWAPHI, the first parallelized algorithm employing Xeon Phi
coprocessors to accelerate SW protein database search. SWAPHI is designed based
on the scale-and-vectorize approach, i.e. it boosts alignment speed by
effectively utilizing both the coarse-grained parallelism from the many
co-processing cores (scale) and the fine-grained parallelism from the 512-bit
wide single instruction, multiple data (SIMD) vectors within each core
(vectorize). By searching against the large UniProtKB/TrEMBL protein database,
SWAPHI achieves a performance of up to 58.8 billion cell updates per second
(GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors.
Furthermore, it demonstrates good parallel scalability across varying numbers
of coprocessors, and is also superior to both SWIPE on 16 high-end CPU cores
and BLAST+ on 8 cores when using four coprocessors, with maximum speedups of
1.52 and 1.86, respectively. SWAPHI is written in C++ (with a set of SIMD
intrinsics) and is freely available at http://swaphi.sourceforge.net.
Comment: A short version of this paper has been accepted by the IEEE ASAP 2014
conference.
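The quadratic cost arises from filling a dynamic-programming matrix with one cell update per residue pair; GCUPS counts precisely these updates. Below is a plain scalar reference scoring kernel, using a linear gap penalty for brevity (SWAPHI itself uses vectorized affine-gap kernels in C++), meant only to make the cell-update workload concrete.

```python
def sw_score(q, s, match=2, mismatch=-1, gap=-1):
    """Scalar Smith-Waterman local-alignment score with a linear gap
    penalty.  Fills len(q) x len(s) cells: this quadratic work is what
    SWAPHI distributes across Xeon Phi cores (scale) and 512-bit SIMD
    lanes (vectorize)."""
    rows, cols = len(q) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if q[i-1] == s[j-1] else mismatch)
            # Local alignment: scores are clamped at zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(sw_score("HEAGAWGHEE", "PAWHEAE"))  # small protein-like example
```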
A Survey of Techniques for Improving Security of GPUs
The graphics processing unit (GPU), although a powerful performance booster,
also has many security vulnerabilities. Due to these, the GPU can act as a
safe haven for stealthy malware and as the weakest "link" in the security "chain".
In this paper, we present a survey of techniques for analyzing and improving
GPU security. We classify the works on key attributes to highlight their
similarities and differences. Beyond informing users and researchers about
GPU security techniques, this survey aims to increase their awareness of GPU
security vulnerabilities and potential countermeasures.
Performance of distributed mechanisms for flow admission in wireless ad hoc networks
Given a wireless network where some pairs of communication links interfere
with each other, we study sufficient conditions for determining whether a given
set of minimum bandwidth quality-of-service (QoS) requirements can be
satisfied. We are especially interested in algorithms which have low
communication overhead and low processing complexity. The interference in the
network is modeled using a conflict graph whose vertices correspond to the
communication links in the network. Two links are adjacent in this graph if and
only if they interfere with each other due to being in the same vicinity and
hence cannot be simultaneously active. The problem of scheduling the
transmission of the various links is then essentially a fractional, weighted
vertex coloring problem, for which upper bounds on the fractional chromatic
number are sought using only localized information. We recall some distributed
algorithms for this problem, and then assess their worst-case performance. Our
results on this fundamental problem imply that for some well known classes of
networks and interference models, the performance of these distributed
algorithms is within a bounded factor away from that of an optimal, centralized
algorithm. The performance bounds are simple expressions in terms of graph
invariants. It is seen that the induced star number of a network plays an
important role in the design and performance of such networks.
Comment: 21 pages, submitted. Journal version of arXiv:0906.378
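To make the flavor of such localized tests concrete, here is one standard conservative sufficient condition (not necessarily the exact condition analyzed in the paper): admit the demands if, at every link, the link's own demand plus the demands of its conflict-graph neighbors fits within one unit of time. Each link can verify this using only information from its neighbors, so no global coordination is needed.

```python
def locally_admissible(demand, neighbors):
    """Sufficient (conservative) schedulability test: for every link v,
    the total demand over its closed neighborhood N[v] in the conflict
    graph must fit in one unit of time.  Purely local information."""
    return all(demand[v] + sum(demand[u] for u in neighbors[v]) <= 1
               for v in demand)

# Conflict graph: a 5-cycle of links, each demanding 1/3 of the time.
neighbors = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
demand = {i: 1 / 3 for i in range(5)}
print(locally_admissible(demand, neighbors))  # True: 3 * (1/3) <= 1
```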
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Modern GPUs require an enormous register file (RF) to store the context of
thousands of active threads. It consumes considerable energy and contains
multiple large banks to provide enough throughput. Thus, an RF caching mechanism
can significantly improve the performance and energy consumption of GPUs by
avoiding reads from the large banks, which consume significant energy and may
cause port conflicts.
This paper introduces an energy-efficient RF caching mechanism called Malekeh
that repurposes an existing component in GPUs' RF to operate as a cache in
addition to its original functionality. In this way, Malekeh minimizes the
overhead of adding an RF cache to GPUs. In addition, Malekeh leverages an issue
scheduling policy that utilizes the reuse distance of the values in the RF
cache and is controlled by a dynamic algorithm. The goal is to adapt the issue
policy to the runtime program characteristics to maximize the GPU's performance
and the hit ratio of the RF cache. The reuse distance is approximated by the
compiler using profiling and is used at run time by the proposed caching
scheme. We show that Malekeh reduces the number of reads to the RF banks by
46.4% and the dynamic energy of the RF by 28.3%. Moreover, it improves
performance by 6.1% while adding only 2KB of extra storage per core to the
baseline 256KB RF, which represents a negligible overhead of 0.78%.
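As a rough sketch of the underlying idea, the toy simulation below uses compiler-style reuse distances to decide which registers a small RF cache should keep, counting how many bank reads are avoided. The per-register static reuse distance and the farthest-reuse eviction rule are simplifying assumptions for illustration, not Malekeh's actual mechanism.

```python
def simulate_rf_cache(trace, reuse_dist, capacity):
    """Count RF-bank reads avoided by a tiny register-file cache.
    `trace` is the sequence of register ids read by the issue stage;
    `reuse_dist[r]` is a compiler-profiled estimate of the distance to
    r's next reuse.  On eviction, drop the register whose next reuse
    is predicted to be farthest away."""
    cache, bank_reads, hits = set(), 0, 0
    for r in trace:
        if r in cache:
            hits += 1        # served by the RF cache, no bank access
        else:
            bank_reads += 1  # must read the large, power-hungry bank
            if len(cache) >= capacity:
                cache.remove(max(cache,
                                 key=lambda x: reuse_dist.get(x, 10**9)))
            cache.add(r)
    return hits, bank_reads

trace = ["r1", "r2", "r1", "r3", "r1", "r2"]
reuse = {"r1": 2, "r2": 4, "r3": 9}
print(simulate_rf_cache(trace, reuse, capacity=2))  # (hits, bank reads)
```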