
    Fissile Locks

    Classic test-and-test-and-set (TS) mutual exclusion locks are simple and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and provide no admission-order guarantees. Such concerns led to the development of scalable queue-based locks, such as the recent Compact NUMA-aware (CNA) lock, a variant of the popular queue-based MCS lock. CNA scales well under load and provides certain admission guarantees, but its lock handover is more complicated than that of TS, and it incurs higher latency at low contention. We propose Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock, which serves as a fast path, and a CNA lock, which serves as a slow path. The key feature of Fissile locks is the ability of threads on the fast path to bypass threads enqueued on the slow path and acquire the lock with less overhead than CNA. Bypass is bounded (by a tunable parameter) to avoid starvation and ensure long-term fairness. The result is a highly scalable NUMA-aware lock with progress guarantees that performs like TS at low contention and like CNA at high contention.
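The fast-path/slow-path split with a bounded bypass budget can be sketched as follows. This is an illustrative model, not the authors' implementation: the class name, the `bypass_limit` parameter, and the use of a plain lock as a stand-in for the CNA slow-path queue are all assumptions.

```python
import threading

class FissileLock:
    """Sketch of a Fissile-style lock: a test-and-test-and-set (TS) fast path
    in front of a queue-based slow path, with a tunable bound on bypass."""

    def __init__(self, bypass_limit=64):
        self._ts = threading.Lock()    # the TS word: whoever holds it owns the lock
        self._slow = threading.Lock()  # stand-in for the CNA slow-path queue
        self._bypasses = 0             # fast-path acquisitions since the last handover
        self._limit = bypass_limit     # tunable bound that prevents starvation

    def acquire(self):
        # Fast path: bypass any queued waiters while the budget allows.
        if self._bypasses < self._limit and self._ts.acquire(blocking=False):
            self._bypasses += 1        # benign race: the bound is a heuristic
            return
        # Slow path: wait in the queue, then take the TS word unconditionally.
        self._slow.acquire()
        self._ts.acquire()
        self._bypasses = 0             # a queued handover resets the bypass budget
        self._slow.release()

    def release(self):
        self._ts.release()
```

Threads keep taking the fast path until the budget is spent; after that they fall back to the queue, which guarantees every queued thread eventually acquires the lock.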

    PolyShard: Coded Sharding Achieves Linearly Scaling Efficiency and Security Simultaneously

    Today's blockchain designs suffer from a trilemma, which claims that no blockchain system can simultaneously achieve decentralization, security, and performance scalability. In current blockchain systems, as more nodes join the network, the efficiency of the system (computation, communication, and storage) stays constant at best. A leading idea for enabling blockchains to scale efficiency is sharding: different subsets of nodes handle different portions of the blockchain, thereby reducing the load on each individual node. However, existing sharding proposals achieve efficiency scaling by compromising on trust: corrupting the nodes in a given shard leads to the permanent loss of the corresponding portion of data. In this paper, we settle the trilemma by demonstrating a new protocol for coded storage and computation in blockchains. In particular, we propose PolyShard, a "polynomially coded sharding" scheme that achieves the information-theoretic upper bounds on storage efficiency and system throughput, as well as on trust, thus enabling a truly scalable system. We provide simulation results that numerically demonstrate the performance improvement over the state of the art and the scalability of the PolyShard system. Finally, we discuss potential enhancements and highlight practical considerations in building such a system.
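The core of polynomially coded storage can be illustrated with Reed-Solomon-style encoding over a prime field: K uncoded shards define a degree-(K-1) polynomial, each of N nodes stores one evaluation, and any K surviving evaluations recover everything. This is a minimal sketch of the coding idea only, not PolyShard's protocol; the field choice and function names are assumptions.

```python
P = 2_147_483_647  # prime modulus; all shard values live in GF(P) (illustrative)

def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num = den = 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(shards, n):
    # Shard k is the polynomial's value at point k+1; node i stores the value at i.
    xs = list(range(1, len(shards) + 1))
    return [lagrange_eval(xs, shards, x) for x in range(1, n + 1)]

def decode(points, values, k):
    # Any K surviving (point, value) pairs recover every original shard.
    return [lagrange_eval(points, values, x) for x in range(1, k + 1)]
```

Because the code is systematic (the first K nodes hold the raw shards), reads are cheap in the common case, while losing up to N-K nodes costs no data.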

    clusterNOR: A NUMA-Optimized Clustering Framework

    Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffers from synchronous barriers at each iteration and from skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory access (NUMA) architectures to maximize independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and maximize cache reuse. We implement the Clustering NUMA Optimized Routines (clusterNOR) extensible parallel framework, which provides algorithmic building blocks. The system is generic: we demonstrate nine modern clustering algorithms with simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external-memory, and (iii) distributed-memory execution, enabling computation for varying memory and hardware budgets. For algorithms that rely on Euclidean distance, clusterNOR provides an updated version of Elkan's triangle inequality pruning algorithm that uses asymptotically less memory, so that it works on billion-point data sets. clusterNOR extends the 'knor' library for k-means clustering by generalizing its underlying principles, providing a uniform programming interface, and expanding its scope to hierarchical and linear-algebraic classes of algorithms. The compound effect of our optimizations is an order-of-magnitude improvement in speed over other state-of-the-art solutions, such as Spark's MLlib and Apple's Turi. (arXiv admin note: journal version of arXiv:1606.0890.)
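The triangle-inequality pruning that Elkan's algorithm builds on can be sketched in a single assignment pass. This is a simplified illustration of the bound, not clusterNOR's code: if a point's distance to its current best center c is at most half of c's distance to every other center, no other center can be closer, so the remaining distance computations are skipped.

```python
import math

def assign_with_pruning(points, centers):
    """One k-means assignment pass using a simplified Elkan-style bound:
    d(x, c) <= 0.5 * d(c, c') implies d(x, c') >= d(x, c), so center c'
    need never be examined. Returns labels and the number of skipped
    distance computations."""
    # Half the distance from each center to its nearest other center.
    half_min = [0.5 * min(math.dist(c, o) for j, o in enumerate(centers) if j != i)
                for i, c in enumerate(centers)]
    labels, skipped = [], 0
    for x in points:
        best, best_d = 0, math.dist(x, centers[0])
        for j in range(1, len(centers)):
            if best_d <= half_min[best]:
                skipped += len(centers) - j  # no remaining center can win
                break
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped
```

The bound follows from the triangle inequality: d(x, c') >= d(c, c') - d(x, c) >= 2*d(x, c) - d(x, c) = d(x, c). The full Elkan algorithm additionally keeps per-point upper and lower bounds across iterations, which is where the memory cost the abstract mentions comes from.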

    PUF-RLA: A PUF-based Reliable and Lightweight Authentication Protocol employing Binary String Shuffling

    Physically unclonable functions (PUFs) can be employed for device identification, authentication, secret key storage, and other security tasks. However, PUFs are susceptible to modeling attacks if a number of a PUF's challenge-response pairs (CRPs) are exposed to the adversary. Furthermore, many of the embedded devices requiring authentication have stringent resource constraints and thus require a lightweight authentication mechanism. We propose PUF-RLA, a PUF-based lightweight, highly reliable authentication scheme employing binary string shuffling. The proposed scheme enhances the reliability of the PUF and alleviates the device's resource constraints by performing error correction in the server instead of the device, without compromising security. PUF-RLA is robust against brute-force, replay, and modeling attacks. In PUF-RLA, we introduce an inexpensive yet secure stream authentication scheme inside the device that authenticates the server before the underlying PUF can be invoked. This prevents an adversary from brute-forcing the device's PUF to acquire CRPs, essentially locking the device against unauthorized model generation. Additionally, we introduce a lightweight CRP obfuscation mechanism involving XOR and shuffle operations. Results and security analysis verify that PUF-RLA is secure against brute-force, replay, and modeling attacks, and provides ~99% reliable authentication. In addition, PUF-RLA reduces look-up table (LUT) and register counts in FPGA by 63% and 74%, respectively, compared to a recently proposed approach, while providing additional authentication advantages. (Published in the 2019 IEEE International Conference on Computer Design (ICCD).)
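An XOR-and-shuffle obfuscation of a response can be sketched as below. This is a hypothetical construction in the spirit of the mechanism the abstract names, not the PUF-RLA protocol itself: the key derivation via HMAC, the domain-separation labels, and the 32-byte pad limit are all assumptions.

```python
import hashlib
import hmac
import random

def _pad_and_perm(key: bytes, nonce: bytes, n: int):
    # Both sides derive the same keyed pad and the same byte permutation.
    # Pad is truncated from one SHA-256 output, so n must be <= 32 here.
    pad = hmac.new(key, nonce + b"pad", hashlib.sha256).digest()[:n]
    perm = list(range(n))
    random.Random(hmac.new(key, nonce + b"perm", hashlib.sha256).digest()).shuffle(perm)
    return pad, perm

def obfuscate(response: bytes, key: bytes, nonce: bytes) -> bytes:
    pad, perm = _pad_and_perm(key, nonce, len(response))
    mixed = bytes(r ^ p for r, p in zip(response, pad))
    return bytes(mixed[i] for i in perm)        # shuffle the byte order

def deobfuscate(obf: bytes, key: bytes, nonce: bytes) -> bytes:
    pad, perm = _pad_and_perm(key, nonce, len(obf))
    mixed = bytearray(len(obf))
    for out_pos, src in enumerate(perm):        # invert the permutation
        mixed[src] = obf[out_pos]
    return bytes(m ^ p for m, p in zip(mixed, pad))
```

Both operations are cheap (XORs and table lookups), which is the property that makes this style of obfuscation attractive on resource-constrained devices.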

    Compact NUMA-Aware Locks

    Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on its socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication. Virtually all those locks, however, are hierarchical in nature, requiring space proportional to the number of sockets. The increased memory cost renders NUMA-aware locks unsuitable for systems that are conscious of the space requirements of their synchronization constructs, with the Linux kernel being the chief example. In this work, we present a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is a variant of the efficient (NUMA-oblivious) MCS lock and inherits its performant features, such as local spinning and a single atomic instruction in the acquisition path. Unlike MCS, the new lock organizes waiting threads in two queues: one composed of threads running on the same socket as the current lock holder, and another composed of threads running on other sockets. We integrated the new lock into the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel. Our evaluation using both user-space and kernel benchmarks shows that the new lock matches the single-thread performance of MCS but significantly outperforms it under contention, achieving performance similar to other state-of-the-art NUMA-aware locks that require substantially more space.
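The two-queue handover policy can be modeled with plain queues. This is a simplified sketch of the idea, not the kernel implementation: real CNA does this scan inside the MCS queue nodes and bounds how long remote waiters can be deferred, which is omitted here.

```python
from collections import deque, namedtuple

Waiter = namedtuple("Waiter", "name socket")

def cna_handover(queue, holder_socket):
    """Simplified CNA-style handover: walk the wait queue, diverting waiters
    on remote sockets to a secondary queue, and hand the lock to the first
    waiter on the holder's socket. If none exists, the secondary queue's
    head takes the lock. Returns (next holder, remaining queue)."""
    main, secondary = deque(queue), deque()
    while main:
        w = main.popleft()
        if w.socket == holder_socket:
            # Remaining locals keep their order; diverted remote waiters
            # rejoin at the tail so they are still served eventually.
            return w, list(main) + list(secondary)
        secondary.append(w)
    return secondary.popleft(), list(secondary)
```

Keeping ownership on one socket amortizes the cost of migrating the lock's cache line across sockets over many local handovers.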

    The Anatomy of Large-Scale Distributed Graph Algorithms

    The increasing complexity of the software/hardware stack of modern supercomputers results in an explosion of parameters. Performance analysis becomes a truly experimental science, made even more challenging by massive irregularity and data dependency. We analyze how the existing body of research handles the experimental aspect in the context of distributed graph algorithms (DGAs). We show that the runtime is such an integral part of DGAs that experimental results are difficult to interpret and extrapolate without understanding the properties of the runtime used. We argue that, in order to understand the impact of runtimes, more information needs to be gathered. To begin this process, we provide an initial set of recommendations for describing DGA results, based on our analysis of the current state of the field.

    The Design and Implementation of the Wave Transactional Filesystem

    This paper introduces the Wave Transactional Filesystem (WTF), a novel, transactional, POSIX-compatible filesystem based on a new file-slicing API that enables efficient file transformations. WTF provides transactional access to a distributed filesystem, eliminating the possibility of inconsistencies across multiple files. Further, the file-slicing API enables applications to construct files from the contents of other files without having to rewrite or relocate data. Combined, these enable a new class of high-performance applications. Experiments show that WTF can outperform the industry-standard HDFS distributed filesystem, by up to a factor of four in a sorting benchmark, by reducing I/O costs. Microbenchmarks indicate that the new features of WTF impose only a modest overhead on top of the POSIX-compatible API.
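The zero-copy composition that file slicing enables can be illustrated with a toy model. This is not WTF's actual API; the class and method names are invented for illustration: a file is an ordered list of (backing, offset, length) slices, so building a new file from pieces of others copies no data until the file is read.

```python
class SlicedFile:
    """Toy model of slice-based file construction: appending a slice records
    a reference to existing bytes rather than copying them."""

    def __init__(self):
        self._slices = []  # list of (backing bytes, offset, length)

    def append_slice(self, backing: bytes, offset: int, length: int):
        # O(1): no bytes are moved, only a reference is recorded.
        self._slices.append((backing, offset, length))

    def read(self) -> bytes:
        # Data is materialized only here, at read time.
        return b"".join(b[o:o + n] for b, o, n in self._slices)
```

A sort, for example, can emit its output as slices of already-sorted runs instead of rewriting them, which is the kind of I/O saving the HDFS comparison above points at.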

    DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters

    The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling the massive data and parameters involved in DNN training. Distributed computing platforms and GPGPU-based acceleration provide a mainstream solution to this computational challenge. In this paper, we propose DeepSpark, a distributed and parallel deep learning framework that exploits Apache Spark on commodity clusters. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/TensorFlow-running nodes using Spark, and iteratively aggregates training results using a novel lock-free, asynchronous variant of the popular elastic averaging stochastic gradient descent (EASGD) update scheme, effectively complementing the synchronized processing capabilities of Spark. DeepSpark is an ongoing project, and the current release is available at http://deepspark.snu.ac.kr.
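One round of the elastic averaging SGD family the abstract builds on can be sketched as follows. This shows the base update rule only, under assumed hyperparameter names (`lr`, `rho`, `alpha`); it is not DeepSpark's asynchronous variant or its code.

```python
def easgd_step(worker, center, grad, lr=0.05, rho=0.1, alpha=0.05):
    """One elastic-averaging SGD exchange: the worker takes a gradient step
    plus an elastic pull toward the shared center variable, and the center
    drifts toward the worker. Each worker can exchange with the center
    independently, which is what removes the global synchronization barrier."""
    new_worker = [w - lr * g - lr * rho * (w - c)
                  for w, g, c in zip(worker, grad, center)]
    new_center = [c + alpha * (w - c) for w, c in zip(worker, center)]
    return new_worker, new_center
```

Because the elastic term only pulls workers toward the center rather than forcing equality, stale exchanges degrade convergence gracefully instead of blocking it.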

    Instantly Obsoleting the Address-code Associations: A New Principle for Defending Advanced Code Reuse Attack

    Fine-grained address space randomization has been considered an effective protection against code reuse attacks such as ROP/JOP. However, it employs only a one-time randomization, a limitation exploited by recent just-in-time ROP and side-channel ROP, which collect gadgets on the fly and dynamically compile them for malicious purposes. To defeat these advanced code reuse attacks, we propose a new defense principle: instantly obsoleting the address-code associations. We have instantiated this principle with a novel technique called virtual space page table remapping and implemented the technique in a system called CHAMELEON. CHAMELEON periodically re-randomizes the locations of code pages on the fly. A set of techniques is proposed to achieve our goal, including iterative instrumentation that rewrites a to-be-protected binary into a re-randomization-compatible one, runtime virtual page shuffling, and function-reordering and instruction-rearranging optimizations. We have tested CHAMELEON with over a hundred binary programs. Our experiments show that CHAMELEON defeats all of our tested exploits, both by preventing the exploit from gathering sufficient gadgets and by blocking gadget execution. The re-randomization interval is a tunable parameter and can be set as short as 100 ms, 10 ms, or 1 ms; CHAMELEON introduces on average 11.1%, 12.1%, and 12.9% performance overhead for these settings, respectively. (23 pages, 4 figures.)
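The effect of periodic code-page shuffling can be modeled abstractly. This toy model is not CHAMELEON's page-table mechanism: it just permutes which virtual address each code page occupies, so any address an attacker harvested before the shuffle no longer points at the code they observed.

```python
import random

def reshuffle_layout(layout, rng):
    """Toy model of virtual page shuffling: 'layout' maps code page ids to
    virtual addresses. Reshuffling permutes the addresses among the pages,
    obsoleting previously leaked address-code associations."""
    addrs = list(layout.values())
    rng.shuffle(addrs)
    return dict(zip(layout.keys(), addrs))
```

Run every interval (100 ms in the paper's default setting), this bounds the window in which a harvested gadget address remains usable.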

    knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library

    k-means is one of the most influential and widely used machine learning algorithms. Its computation limits the performance and scalability of many statistical-analysis and machine learning tasks. We rethink and optimize k-means for modern NUMA architectures, developing a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine (knor) library has (i) in-memory (knori), (ii) distributed-memory (knord), and (iii) semi-external-memory (knors) modules that radically improve the performance of k-means for varying memory and hardware budgets. knori boosts performance for single-machine datasets by an order of magnitude or more. knors improves the scalability of k-means on a memory budget using SSDs, scaling to billions of points on a single machine with a fraction of the resources that distributed in-memory systems require. knord retains knori's performance characteristics while scaling in-memory computation through distribution in the cloud. knor modifies Elkan's triangle inequality pruning algorithm so that it can be used on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate that knor outperforms distributed commercial products such as H2O, Turi (formerly Dato, GraphLab), and Spark's MLlib by more than an order of magnitude for datasets of 10^7 to 10^9 points.