9 research outputs found

    Scaling Kernel Speedup to Application-Level Performance with CGRAs: Stream Program

    While accelerators often generate impressive speedup at the kernel level, that speedup often does not scale to application-level performance improvement, for several reasons. In this paper we identify key factors impacting the application-level performance of CGRA (Coarse-Grained Reconfigurable Architecture) accelerators, using stream programs as the target application. As a practical remedy, we also propose a low-cost architecture extension focusing on the nested loops that appear very frequently in stream programs. We also present a detailed application-level performance evaluation for the full StreamIt benchmark applications, which suggests that CGRAs can realistically accelerate stream applications by 3.6–4.0 times on average, compared to software-only execution on a typical mobile processor.
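    The gap between kernel-level and application-level speedup is essentially Amdahl's law; a short illustration with hypothetical numbers (the kernel-time fraction f and kernel speedup s below are chosen for illustration, not taken from the paper):

        % Application-level speedup when a fraction f of execution time is
        % spent in kernels the CGRA accelerates by a factor s (hypothetical values).
        \[
          S_{\text{app}} = \frac{1}{(1 - f) + f/s},
          \qquad
          f = 0.9,\ s = 10 \;\Rightarrow\; S_{\text{app}} = \frac{1}{0.19} \approx 5.3
        \]
        % Even a 10x kernel speedup yields only about 5.3x end to end.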

    Automatic Collapsing of Non-Rectangular Loops

    Loop collapsing is a well-known loop transformation that combines perfectly nested loops into one single loop. It makes it possible to exploit the whole amount of parallelism exhibited by the collapsed loops and provides a perfect load balancing of iterations among the parallel threads. However, in current implementations of this loop optimization, such as those of the OpenMP language, automatic loop collapsing is limited to loops with constant bounds that define rectangular iteration spaces, although load imbalance is a particularly crucial issue with non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime schedule introduces some unavoidable execution-time overhead, while preventing exploitation of the entire parallelism of all the parallel loops. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal or parallelepiped. Our solution is based on original mathematical results addressing the inversion of a multivariate polynomial that defines a ranking of the integer points contained in a convex polyhedron. We show on a set of non-rectangular loop nests that our technique generates parallel OpenMP code that outperforms the original parallel loop nests, parallelized using either the "static" or the "dynamic" option of the OpenMP schedule clause.
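    A minimal C/OpenMP sketch of the idea for a triangular iteration space; this is an illustration under our own assumptions, not the paper's code generator, and the function name and loop body are placeholders:

        /* Collapsing the triangular nest
         *     for (i = 0; i < N; ++i)
         *         for (j = 0; j <= i; ++j)
         * into a single parallel loop over k = 0 .. N*(N+1)/2 - 1.
         * The ranking polynomial is rank(i, j) = i*(i+1)/2 + j; inverting it
         * recovers (i, j) from k via the quadratic formula. */
        #include <math.h>

        void collapsed_triangular(long n, double *a)
        {
            long total = n * (n + 1) / 2;           /* points in the triangle */

            #pragma omp parallel for schedule(static)
            for (long k = 0; k < total; ++k) {
                /* invert the ranking: largest i with i*(i+1)/2 <= k */
                long i = (long)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
                while (i * (i + 1) / 2 > k)        --i;   /* guard against rounding */
                while ((i + 1) * (i + 2) / 2 <= k) ++i;
                long j = k - i * (i + 1) / 2;

                a[k] = (double)(i * j);             /* placeholder loop body */
            }
        }

    With a static schedule, each thread gets an equal share of the N(N+1)/2 points, which is exactly the load balance a dynamic schedule would otherwise have to recover at runtime.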

    Loop Rolling for Code Size Reduction


    Batchman and Robin: Batched and Non-batched Branching for Interactive ZK

    Vector Oblivious Linear Evaluation (VOLE) supports fast and scalable interactive Zero-Knowledge (ZK) proofs. Despite recent improvements to VOLE-based ZK, compiling proof statements to a control-flow-oblivious form (e.g., a circuit) continues to lead to expensive proofs. One useful setting where this inefficiency stands out is when the statement is a disjunction of clauses L1 ∨ · · · ∨ LB. Typically, ZK requires paying the price to handle all B branches. Prior works have shown how to avoid this price in communication, but not in computation. Our main result, Batchman, is asymptotically and concretely efficient VOLE-based ZK for batched disjunctions, i.e., statements containing R repetitions of the same disjunction. This is crucial for, e.g., emulating CPU steps in ZK. Our prover and verifier complexity is only O(RB + R|C| + B|C|), where |C| is the maximum circuit size of the B branches. Prior works' computation scales as RB|C|. For non-batched disjunctions, we also construct a VOLE-based ZK protocol, Robin, which is (only) communication efficient. For small fields and statistical security parameter λ, this protocol's communication improves over the previous state of the art (Mac'n'Cheese, Baum et al., CRYPTO'21) by up to a factor of λ. Our implementation outperforms the prior state of the art. E.g., we achieve up to a 6× improvement over Mac'n'Cheese (Boolean, single disjunction), and for arithmetic batched disjunctions our experiments show we improve over QuickSilver (Yang et al., CCS'21) by up to 70× and over AntMan (Weng et al., CCS'22) by up to 36×.
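    To make the asymptotic gap concrete, a back-of-the-envelope comparison with purely illustrative parameters (R, B and |C| below are chosen for illustration and are not taken from the paper):

        % Illustrative parameters only: R = 1000 repetitions, B = 32 branches,
        % |C| = 10^5 gates per branch.
        \[
          \underbrace{R\,B\,|C|}_{\text{prior works}} = 3.2\times 10^{9}
          \qquad\text{vs.}\qquad
          \underbrace{RB + R|C| + B|C|}_{\text{Batchman}} \approx 1.0\times 10^{8}
        \]
        % i.e. roughly a 30x reduction in prover/verifier work for these values.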

    Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single-processor era has been replaced by multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because of the flexibility and portability that would be needed to optimize for each target individually, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct and complementary ways to tackle this problem: providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions that automatically tune parameters for a restricted domain. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software engineering effort spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain, stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and a robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
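    A minimal C sketch of the kind of stencil such a framework abstracts, with the tile size exposed as the parameter an auto-tuner would search over per device; the function name, TILE value and blocking scheme are illustrative assumptions, not the thesis' API:

        #include <stddef.h>

        /* One Jacobi sweep of the 5-point Laplacian stencil over an nx x ny
         * grid, blocked by TILE in both dimensions.  TILE is exactly the kind
         * of parameter a domain-specific auto-tuner would pick per device. */
        #define TILE 32   /* hypothetical auto-tuned value */

        void stencil_sweep(size_t nx, size_t ny, const double *in, double *out)
        {
            for (size_t ii = 1; ii < nx - 1; ii += TILE)
                for (size_t jj = 1; jj < ny - 1; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < nx - 1; ++i)
                        for (size_t j = jj; j < jj + TILE && j < ny - 1; ++j)
                            out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                                      in[(i + 1) * ny + j] +
                                                      in[i * ny + j - 1] +
                                                      in[i * ny + j + 1]);
        }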

    New (Zero-Knowledge) Arguments and Their Applications to Verifiable Computation

    We study the problem of argument systems, where a computationally weak verifier outsources the execution of a computation to a powerful but untrusted prover, while being able to validate that the result was computed correctly through a proof generated by the prover. In addition, the zero-knowledge property guarantees that the proof leaks no information about the prover's potential secret input. Existing efficient zero-knowledge arguments with sublinear verification time require an expensive preprocessing phase that depends on a particular computation, and they incur a large overhead in prover time and prover memory consumption. This thesis proposes new constructions for zero-knowledge arguments that overcome the above problems. The new constructions require only a one-time preprocessing and can be used to validate any computation later. They also reduce the overhead in prover time and memory by orders of magnitude. We apply our new constructions to build a verifiable database system and verifiable RAM programs, leading to significant improvements over prior work.