Fissile Locks
Classic test-and-set (TS) mutual exclusion locks are simple and enjoy high
performance and low latency of ownership transfer under light or no contention.
However, they do not scale gracefully under high contention and do not provide
any admission order guarantees. Such concerns led to the development of
scalable queue-based locks, such as the recent Compact NUMA-Aware (CNA) lock, a
variant of the popular queue-based MCS lock. CNA scales well under load and
provides certain admission guarantees, but has more complicated lock handover
operations than TS and incurs higher latencies at low contention. We propose
Fissile locks, which capture the most desirable properties of both TS and CNA.
A Fissile lock consists of two underlying locks: a TS lock, which serves as a
fast path, and a CNA lock, which serves as a slow path. The key feature of
Fissile locks is the ability of threads on the fast path to bypass threads
enqueued on the slow path, and acquire the lock with less overhead than CNA.
Bypass is bounded (by a tunable parameter) to avoid starvation and ensure
long-term fairness. The result is a highly scalable NUMA-aware lock with
progress guarantees that performs like TS at low contention and like CNA at
high contention.
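The bounded-bypass policy described above can be illustrated with a small single-threaded scheduling model (a hypothetical sketch of the admission logic only; the `FissileAdmission` class and `BYPASS_BOUND` name are invented for illustration, and the real lock implements this with atomic TS and CNA operations):

```python
from collections import deque

# Hypothetical model of a Fissile lock's admission policy: fast-path (TS-style)
# arrivals may bypass the slow-path (CNA-style) queue, but only BYPASS_BOUND
# times in a row before a queued waiter must be served, bounding starvation.
BYPASS_BOUND = 3

class FissileAdmission:
    def __init__(self, bound=BYPASS_BOUND):
        self.bound = bound
        self.bypasses = 0          # consecutive fast-path grants
        self.slow_queue = deque()  # waiters parked on the slow path

    def enqueue_slow(self, tid):
        self.slow_queue.append(tid)

    def acquire(self, fast_tid):
        """Decide who gets the lock when `fast_tid` arrives on the fast path."""
        if self.slow_queue and self.bypasses >= self.bound:
            # Bypass budget exhausted: hand over to the oldest slow-path
            # waiter; the fast-path arrival joins the queue instead.
            self.bypasses = 0
            self.enqueue_slow(fast_tid)
            return self.slow_queue.popleft()
        self.bypasses += 1
        return fast_tid

adm = FissileAdmission()
for t in ("A", "B"):
    adm.enqueue_slow(t)
grants = [adm.acquire(f"fast{i}") for i in range(5)]
print(grants)  # the first `bound` grants go to the fast path, then waiter "A"
```

With the bound set to 3, three fast-path threads acquire the lock cheaply before the oldest enqueued waiter is served, capturing the fairness/latency trade-off the abstract describes.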
PolyShard: Coded Sharding Achieves Linearly Scaling Efficiency and Security Simultaneously
Today's blockchain designs suffer from a trilemma claiming that no blockchain
system can simultaneously achieve decentralization, security, and performance
scalability. For current blockchain systems, as more nodes join the network,
the efficiency of the system (computation, communication, and storage) stays
constant at best. A leading idea for enabling blockchains to scale efficiency
is the notion of sharding: different subsets of nodes handle different portions
of the blockchain, thereby reducing the load for each individual node. However,
existing sharding proposals achieve efficiency scaling by compromising on trust
- corrupting the nodes in a given shard will lead to the permanent loss of the
corresponding portion of data. In this paper, we settle the trilemma by
demonstrating a new protocol for coded storage and computation in blockchains.
In particular, we propose PolyShard: a ``polynomially coded sharding'' scheme
that achieves the information-theoretic upper bounds on storage efficiency,
system throughput, and trust, thus enabling a truly scalable
system. We provide simulation results that numerically demonstrate the
performance improvement over the state of the art, and the scalability of the
PolyShard system. Finally, we discuss potential enhancements, and highlight
practical considerations in building such a system.
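The coding idea can be sketched with a toy Reed-Solomon-style example (illustrative only, not the PolyShard protocol itself; the field size `P` and the node evaluation points are assumptions): K data shards define a degree-(K-1) polynomial over a prime field, each node stores one evaluation, and any K coded shards recover every original shard, so corrupting one shard does not destroy its portion of the data:

```python
# Toy polynomially-coded storage sketch over GF(P), P prime (assumed here).
P = 2_147_483_647  # a Mersenne prime, chosen only for this example

def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial through (xs, ys) at x, over GF(P)."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat's little theorem)
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data_shards, node_points):
    """Store, at each node point, one evaluation of the data polynomial."""
    k = len(data_shards)
    xs = list(range(1, k + 1))  # uncoded shards sit at points 1..K
    return [lagrange_eval(xs, data_shards, x) for x in node_points]

data = [42, 7, 19]         # K = 3 uncoded shards
nodes = [10, 11, 12, 13]   # N = 4 storage nodes (one node of redundancy)
coded = encode(data, nodes)

# Any K = 3 coded shards recover every original shard by re-interpolation:
recovered = [lagrange_eval(nodes[:3], coded[:3], x) for x in (1, 2, 3)]
print(recovered)  # [42, 7, 19]
```

Because the recovery works from any K of the N evaluations, the redundancy is spread across all nodes rather than concentrated in a single shard's replicas, which is the intuition behind the efficiency/trust claim.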
clusterNOR: A NUMA-Optimized Clustering Framework
Clustering algorithms are iterative and have complex data access patterns
that result in many small random memory accesses. The performance of parallel
implementations suffers from synchronous barriers at each iteration and skewed
workloads. We rethink the parallelization of clustering for modern non-uniform
memory access (NUMA) architectures to maximize independent, asynchronous computation.
We eliminate many barriers, reduce remote memory accesses, and maximize cache
reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR)
extensible parallel framework that provides algorithmic building blocks. The
system is generic; we demonstrate nine modern clustering algorithms that have
simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external
memory, and (iii) distributed memory execution, enabling computation for
varying memory and hardware budgets. For algorithms that rely on Euclidean
distance, clusterNOR defines an updated Elkan's triangle inequality pruning
algorithm that uses asymptotically less memory so that it works on
billion-point data sets. clusterNOR extends and expands the scope of the 'knor'
library for k-means clustering by generalizing underlying principles, providing
a uniform programming interface and expanding the scope to hierarchical and
linear algebraic classes of algorithms. The compound effect of our
optimizations is an order of magnitude improvement in speed over other
state-of-the-art solutions, such as Spark's MLlib and Apple's Turi.
Comment: arXiv admin note: Journal version of arXiv:1606.0890
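The triangle-inequality pruning the abstract mentions can be sketched as follows (an illustrative rendering of the idea behind Elkan's test, not clusterNOR's memory-reduced variant; the function names are invented): if the distance between the assigned center and another center is at least twice the point's distance to its assigned center, the other center cannot be closer, so its distance is never computed.

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def assign_pruned(point, centers, assigned):
    """Return (best center index, number of point-center distances computed)."""
    d_assigned = dist(point, centers[assigned])
    best, best_d, computed = assigned, d_assigned, 1
    for j, c in enumerate(centers):
        if j == assigned:
            continue
        # Triangle inequality: d(x, c) >= d(c_a, c) - d(x, c_a), so if
        # d(c_a, c) >= 2 * d(x, c_a) then c cannot beat c_a -- skip it.
        # (Center-to-center distances are cached across points in practice.)
        if dist(centers[assigned], c) >= 2 * d_assigned:
            continue  # pruned
        computed += 1
        d = dist(point, c)
        if d < best_d:
            best, best_d = j, d
    return best, computed

centers = [(0.0, 0.0), (10.0, 0.0), (0.5, 0.5)]
best, computed = assign_pruned((0.1, 0.1), centers, assigned=0)
print(best, computed)  # both rival centers are pruned; 1 distance computed
```

The memory saving in clusterNOR's variant comes from tracking fewer per-point bounds than Elkan's original algorithm, which keeps one lower bound per point per center.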
PUF-RLA: A PUF-based Reliable and Lightweight Authentication Protocol employing Binary String Shuffling
Physically unclonable functions (PUFs) can be employed for device
identification, authentication, secret key storage, and other security tasks.
However, PUFs are susceptible to modeling attacks if a number of PUFs'
challenge-response pairs (CRPs) are exposed to the adversary. Furthermore, many
of the embedded devices requiring authentication have stringent resource
constraints and thus require a lightweight authentication mechanism. We propose
PUF-RLA, a PUF-based lightweight, highly reliable authentication scheme
employing binary string shuffling. The proposed scheme enhances the reliability
of PUF as well as alleviates the resource constraints by employing error
correction in the server instead of the device without compromising the
security. The proposed PUF-RLA is robust against brute force, replay, and
modeling attacks. In PUF-RLA, we introduce an inexpensive yet secure stream
authentication scheme inside the device which authenticates the server before
the underlying PUF can be invoked. This prevents an adversary from brute
forcing the device's PUF to acquire CRPs, essentially locking out the device
from unauthorized model generation. Additionally, we introduce a
lightweight CRP obfuscation mechanism involving XOR and shuffle operations.
Results and security analysis verify that the PUF-RLA is secure against brute
force, replay, and modeling attacks, and provides ~99% reliable authentication.
In addition, PUF-RLA provides a reduction of 63% and 74% for look-up tables
(LUTs) and register count, respectively, in FPGA compared to a recently
proposed approach while providing additional authentication advantages.Comment: Published in the 2019 IEEE International Conference on Computer
Design (ICCD
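The XOR-and-shuffle style of CRP obfuscation the abstract describes can be sketched as follows (a hypothetical construction for illustration; the concrete PUF-RLA mechanism and its keying may differ): a shared seed drives both a keystream XOR and a bit permutation, and the verifier inverts both steps to recover the response.

```python
import random

def obfuscate(bits, seed):
    """Mask a bit string with a seeded XOR keystream, then permute it."""
    rng = random.Random(seed)
    key = [rng.randrange(2) for _ in bits]   # XOR mask
    perm = list(range(len(bits)))
    rng.shuffle(perm)                        # bit permutation
    masked = [b ^ k for b, k in zip(bits, key)]
    return [masked[perm[i]] for i in range(len(bits))]

def deobfuscate(obf, seed):
    """Regenerate the same key and permutation, then undo both steps."""
    rng = random.Random(seed)
    key = [rng.randrange(2) for _ in obf]
    perm = list(range(len(obf)))
    rng.shuffle(perm)
    masked = [0] * len(obf)
    for i, p in enumerate(perm):
        masked[p] = obf[i]                   # undo the shuffle
    return [b ^ k for b, k in zip(masked, key)]

response = [1, 0, 1, 1, 0, 0, 1, 0]
assert deobfuscate(obfuscate(response, seed=7), seed=7) == response
```

Both operations cost only XOR gates and wiring on the device side, which is consistent with the lightweight LUT/register footprint the abstract reports.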
Compact NUMA-Aware Locks
Modern multi-socket architectures exhibit non-uniform memory access (NUMA)
behavior, where access by a core to data cached locally on a socket is much
faster than access to data cached on a remote socket. Prior work offers several
efficient NUMA-aware locks that exploit this behavior by keeping the lock
ownership on the same socket, thus reducing remote cache misses and
inter-socket communication. Virtually all those locks, however, are
hierarchical in their nature, thus requiring space proportional to the number
of sockets. The increased memory cost renders NUMA-aware locks unsuitable for
systems that are conscious of the space requirements of their synchronization
constructs, with the Linux kernel being the chief example.
In this work, we present a compact NUMA-aware lock that requires only one
word of memory, regardless of the number of sockets in the underlying machine.
The new lock is a variant of an efficient (NUMA-oblivious) MCS lock, and
inherits its performant features, such as local spinning and a single atomic
instruction in the acquisition path. Unlike MCS, the new lock organizes waiting
threads in two queues, one composed of threads running on the same socket as
the current lock holder, and another composed of threads running on other
sockets.
We integrated the new lock in the Linux kernel's qspinlock, one of the major
synchronization constructs in the kernel. Our evaluation using both user-space
and kernel benchmarks shows that the new lock has the single-thread performance
of MCS, but significantly outperforms the latter under contention, achieving a
level of performance similar to that of other state-of-the-art NUMA-aware locks
that require substantially more space.
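The two-queue handover policy can be modeled with a small scheduling sketch (scheduling logic only, with invented names; the real lock encodes both queues through MCS-style nodes reachable from one lock word, not Python containers): waiters on the holder's socket are preferred at release, keeping the lock local and cutting cross-socket traffic.

```python
from collections import deque

class CNAHandover:
    """Toy model of CNA's local-first handover between two waiter queues."""
    def __init__(self, holder_socket):
        self.holder_socket = holder_socket
        self.main = deque()       # waiters on the holder's socket
        self.secondary = deque()  # waiters on remote sockets

    def arrive(self, tid, socket):
        q = self.main if socket == self.holder_socket else self.secondary
        q.append((tid, socket))

    def release(self):
        """Pick the next holder: local waiters first, then the remote queue."""
        queue = self.main if self.main else self.secondary
        if not queue:
            return None
        tid, socket = queue.popleft()
        if socket != self.holder_socket:
            # The lock migrates to a new socket; CNA also bounds this local
            # preference so remote waiters cannot starve.
            self.holder_socket = socket
        return tid

lock = CNAHandover(holder_socket=0)
for tid, sock in [("t1", 1), ("t2", 0), ("t3", 1), ("t4", 0)]:
    lock.arrive(tid, sock)
order = [lock.release() for _ in range(4)]
print(order)  # same-socket waiters t2 and t4 are served before t1 and t3
```

Serving t2 and t4 back-to-back is exactly the batching that reduces inter-socket cache-line bouncing relative to plain MCS FIFO order.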
The Anatomy of Large-Scale Distributed Graph Algorithms
The increasing complexity of the software/hardware stack of modern
supercomputers results in an explosion of parameters. Performance analysis
becomes a truly experimental science, even more challenging in the presence of
massive irregularity and data dependency. We analyze how the existing body of
research handles the experimental aspect in the context of distributed graph
algorithms (DGAs). We distinguish algorithm-level contributions, often
prioritized by authors, from runtime-level concerns that are harder to place.
We show that the runtime is such an integral part of DGAs that experimental
results are difficult to interpret and extrapolate without understanding the
properties of the runtime used. We argue that in order to gain understanding
about the impact of runtimes, more information needs to be gathered. To begin
this process, we provide an initial set of recommendations for describing DGA
results based on our analysis of the current state of the field.
The Design and Implementation of the Wave Transactional Filesystem
This paper introduces the Wave Transactional Filesystem (WTF), a novel,
transactional, POSIX-compatible filesystem based on a new file slicing API that
enables efficient file transformations. WTF provides transactional access to a
distributed filesystem, eliminating the possibility of inconsistencies across
multiple files. Further, the file slicing API enables applications to construct
files from the contents of other files without having to rewrite or relocate
data. Combined, these enable a new class of high-performance applications.
Experiments show that WTF can qualitatively outperform the industry-standard
HDFS distributed filesystem, up to a factor of four in a sorting benchmark, by
reducing I/O costs. Microbenchmarks indicate that the new features of WTF
impose only a modest overhead on top of the POSIX-compatible API.
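The slicing idea can be sketched with a minimal extent-based file model (the `SlicedFile` class and its methods are invented for illustration and are not WTF's actual API): a file is a list of extents referencing byte ranges of other files, so concatenating or rearranging content is a metadata operation that never copies the underlying data.

```python
class SlicedFile:
    """Toy extent-based file: content is composed of slices of other files."""
    def __init__(self):
        self.extents = []  # (source bytes, offset, length) triples

    def append_slice(self, source: bytes, offset: int, length: int):
        # Metadata-only: records a reference, does not copy the bytes.
        self.extents.append((source, offset, length))

    def read(self) -> bytes:
        # Data is materialized only at read time.
        return b"".join(src[off:off + ln] for src, off, ln in self.extents)

a = b"hello, world"
b = b"transactional filesystem"
out = SlicedFile()
out.append_slice(a, 0, 5)    # "hello"
out.append_slice(b, 0, 13)   # "transactional"
print(out.read())            # b'hellotransactional'
```

In a sort-style workload this is what saves I/O: the sorted output can be assembled from slices of intermediate runs instead of rewriting the data, matching the HDFS comparison in the abstract.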
DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters
The increasing complexity of deep neural networks (DNNs) has made it
challenging to exploit existing large-scale data processing pipelines for
handling massive data and parameters involved in DNN training. Distributed
computing platforms and GPGPU-based acceleration provide a mainstream solution
to this computational challenge. In this paper, we propose DeepSpark, a
distributed and parallel deep learning framework that exploits Apache Spark on
commodity clusters. To support parallel operations, DeepSpark automatically
distributes workloads and parameters to Caffe/Tensorflow-running nodes using
Spark, and iteratively aggregates training results by a novel lock-free
asynchronous variant of the popular elastic averaging stochastic gradient
descent based update scheme, effectively complementing the synchronized
processing capabilities of Spark. DeepSpark is an on-going project, and the
current release is available at http://deepspark.snu.ac.kr
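The update rule behind the aggregation step can be shown on a scalar toy problem (a sketch of the classic elastic averaging SGD step that DeepSpark's lock-free asynchronous variant builds on, not DeepSpark's own code; the learning rate and elasticity values are arbitrary): each worker takes an SGD step plus a pull toward a shared center variable, and the center is pulled symmetrically toward the worker.

```python
def easgd_step(x_worker, x_center, grad, lr=0.1, rho=0.05):
    """One elastic averaging SGD step for a single worker (scalar toy)."""
    elastic = rho * (x_worker - x_center)
    x_worker = x_worker - lr * grad - elastic  # worker: SGD + elastic pull
    x_center = x_center + elastic              # center: symmetric pull
    return x_worker, x_center

# Minimize f(x) = (x - 3)^2 with one worker; grad f = 2 * (x - 3).
x_w, x_c = 0.0, 0.0
for _ in range(200):
    x_w, x_c = easgd_step(x_w, x_c, grad=2.0 * (x_w - 3.0))
print(x_w, x_c)  # both approach the minimizer 3.0
```

Because the center only exchanges the elastic term with each worker, workers need not synchronize with one another, which is what makes an asynchronous, lock-free realization on Spark natural.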
Instantly Obsoleting the Address-code Associations: A New Principle for Defending Advanced Code Reuse Attack
Fine-grained Address Space Randomization has been considered as an effective
protection against code reuse attacks such as ROP/JOP. However, it only employs
a one-time randomization, and such a limitation has been exploited by recent
just-in-time ROP and side channel ROP, which collect gadgets on-the-fly and
dynamically compile them for malicious purposes. To defeat these advanced code
reuse attacks, we propose a new defense principle: instantly obsoleting the
address-code associations. We instantiate this principle with a novel
technique called virtual space page table remapping and implement the
technique in a system called CHAMELEON. CHAMELEON periodically re-randomizes the
locations of code pages on-the-fly. A set of techniques are proposed to achieve
our goal, including iterative instrumentation that instruments a
to-be-protected binary program to generate a re-randomization compatible
binary, runtime virtual page shuffling, and function reordering and instruction
rearranging optimizations. We have tested CHAMELEON with over a hundred binary
programs. Our experiments show that CHAMELEON can defeat all of our tested
exploits by both preventing the exploit from gathering sufficient gadgets, and
blocking the gadgets' execution. The re-randomization interval is a tunable
parameter and can be set as short as 100ms, 10ms, or 1ms. The experimental
results show that CHAMELEON introduces on average 11.1%, 12.1%, and 12.9%
performance overhead for these intervals, respectively.
Comment: 23 pages, 4 figures
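The principle can be modeled with a toy page remapper (a mechanism sketch with invented names, not CHAMELEON's implementation): code pages keep stable logical ids, but the virtual page each id maps to is reshuffled every interval, so a gadget address harvested before the reshuffle goes stale.

```python
import random

class PageRemapper:
    """Toy model: periodically reshuffle which virtual page holds each code page."""
    def __init__(self, n_pages, seed=0):
        self.rng = random.Random(seed)
        self.pages = list(range(n_pages))          # logical code-page ids
        self.mapping = {p: p for p in self.pages}  # logical -> virtual page

    def rerandomize(self):
        virt = self.pages[:]
        self.rng.shuffle(virt)                     # new random permutation
        self.mapping = dict(zip(self.pages, virt))

    def address_of(self, logical_page, offset, page_size=4096):
        return self.mapping[logical_page] * page_size + offset

m = PageRemapper(n_pages=64, seed=1)
gadget = m.address_of(5, 0x20)  # attacker harvests this address on-the-fly...
m.rerandomize()                 # ...but the re-randomization interval expires
fresh = m.address_of(5, 0x20)
print(gadget == fresh)          # almost certainly False after a reshuffle
```

A just-in-time ROP chain compiled against `gadget` now points at whatever code currently occupies that virtual page, which is why both gadget gathering and gadget execution fail in the reported experiments.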
knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library
k-means is one of the most influential and utilized machine learning
algorithms. Its computation limits the performance and scalability of many
statistical analysis and machine learning tasks. We rethink and optimize
k-means in terms of modern NUMA architectures to develop a novel
parallelization scheme that delays and minimizes synchronization barriers. The
\textit{k-means NUMA Optimized Routine} (\textsf{knor}) library has (i)
in-memory (\textsf{knori}), (ii) distributed memory (\textsf{knord}), and (iii)
semi-external memory (\textsf{knors}) modules that radically improve the
performance of k-means for varying memory and hardware budgets. \textsf{knori}
boosts performance for single machine datasets by an order of magnitude or
more. \textsf{knors} improves the scalability of k-means on a memory budget
using SSDs. \textsf{knors} scales to billions of points on a single machine,
using a fraction of the resources that distributed in-memory systems require.
\textsf{knord} retains \textsf{knori}'s performance characteristics, while
scaling in-memory through distributed computation in the cloud. \textsf{knor}
modifies Elkan's triangle inequality pruning algorithm such that we utilize it
on billion-point datasets without the significant memory overhead of the
original algorithm. We demonstrate \textsf{knor} outperforms distributed
commercial products like H2O, Turi (formerly Dato, GraphLab) and Spark's
MLlib by more than an order of magnitude on billion-point datasets.
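The barrier-minimizing parallelization the abstract describes can be sketched in miniature (illustrative only; names are invented and knor's NUMA-aware implementation works over threads pinned to sockets, not Python lists): each "thread" accumulates local per-cluster sums and counts over its own partition with no shared state, and the iteration synchronizes exactly once to merge the partials and move the centers.

```python
def kmeans_iteration(partitions, centers):
    """One k-means step: per-partition local accumulation, single merge."""
    k, dim = len(centers), len(centers[0])
    merged_sum = [[0.0] * dim for _ in range(k)]
    merged_cnt = [0] * k
    for part in partitions:                  # each partition = one "thread"
        local_sum = [[0.0] * dim for _ in range(k)]
        local_cnt = [0] * k
        for p in part:                       # no sharing inside the hot loop
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            local_cnt[j] += 1
            for d in range(dim):
                local_sum[j][d] += p[d]
        for j in range(k):                   # the single merge point ("barrier")
            merged_cnt[j] += local_cnt[j]
            for d in range(dim):
                merged_sum[j][d] += local_sum[j][d]
    return [tuple(s / merged_cnt[j] for s in merged_sum[j])
            if merged_cnt[j] else centers[j]
            for j, _ in enumerate(merged_sum)]

parts = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0), (10.0, 10.0)]]
new_centers = kmeans_iteration(parts, [(0.0, 0.0), (8.0, 8.0)])
print(new_centers)  # [(0.5, 0.5), (9.5, 9.5)]
```

Keeping all writes thread-local until one merge per iteration is what lets a NUMA-aware layout avoid remote memory traffic during the dominant assignment phase.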