9 research outputs found

    Scaling Kernel Speedup to Application-Level Performance with CGRAs: Stream Program

    While accelerators often generate impressive speedup at the kernel level, that speedup often does not scale to application-level performance improvement, for several reasons. In this paper we identify key factors impacting the application-level performance of CGRA (Coarse-Grained Reconfigurable Architecture) accelerators, using stream programs as the target application. As a practical remedy, we also propose a low-cost architecture extension focusing on the nested loops that appear very frequently in stream programs. We also present a detailed application-level performance evaluation for the full StreamIt benchmark applications, which suggests that CGRAs can realistically accelerate stream applications by 3.6–4.0 times on average, compared to software-only execution on a typical mobile processor.
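    The gap between kernel-level and application-level speedup is essentially Amdahl's law; a short illustration with hypothetical numbers (the kernel-time fraction f and kernel speedup s below are chosen for illustration, not taken from the paper):

        % Application-level speedup when a fraction f of execution time is
        % spent in kernels the CGRA accelerates by a factor s (hypothetical values).
        \[
          S_{\text{app}} = \frac{1}{(1 - f) + f/s},
          \qquad
          f = 0.9,\ s = 10 \;\Rightarrow\; S_{\text{app}} = \frac{1}{0.19} \approx 5.3
        \]
        % Even a 10x kernel speedup yields only about 5.3x end to end.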

    Automatic Collapsing of Non-Rectangular Loops

    Loop collapsing is a well-known loop transformation that combines perfectly nested loops into one single loop. It makes it possible to exploit the whole amount of parallelism exhibited by the collapsed loops and provides a perfect load balancing of iterations among the parallel threads. However, in current implementations of this loop optimization, such as those of the OpenMP language, automatic loop collapsing is limited to loops with constant bounds that define rectangular iteration spaces, although load imbalance is a particularly crucial issue with non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime schedule introduces some unavoidable execution-time overhead, while preventing exploitation of the entire parallelism of all the parallel loops. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal or parallelepiped. Our solution is based on original mathematical results addressing the inversion of a multivariate polynomial that defines a ranking of the integer points contained in a convex polyhedron. We show on a set of non-rectangular loop nests that our technique generates parallel OpenMP code that outperforms the original parallel loop nests, parallelized using either the "static" or the "dynamic" option of the OpenMP schedule clause.
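    A minimal C/OpenMP sketch of the idea for a triangular iteration space; this is an illustration under our own assumptions, not the paper's code generator, and the function name and loop body are placeholders:

        /* Collapsing the triangular nest
         *     for (i = 0; i < N; ++i)
         *         for (j = 0; j <= i; ++j)
         * into a single parallel loop over k = 0 .. N*(N+1)/2 - 1.
         * The ranking polynomial is rank(i, j) = i*(i+1)/2 + j; inverting it
         * recovers (i, j) from k via the quadratic formula. */
        #include <math.h>

        void collapsed_triangular(long n, double *a)
        {
            long total = n * (n + 1) / 2;           /* points in the triangle */

            #pragma omp parallel for schedule(static)
            for (long k = 0; k < total; ++k) {
                /* invert the ranking: largest i with i*(i+1)/2 <= k */
                long i = (long)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
                while (i * (i + 1) / 2 > k)        --i;   /* guard against rounding */
                while ((i + 1) * (i + 2) / 2 <= k) ++i;
                long j = k - i * (i + 1) / 2;

                a[k] = (double)(i * j);             /* placeholder loop body */
            }
        }

    With a static schedule, each thread gets an equal share of the N(N+1)/2 points, which is exactly the load balance a dynamic schedule would otherwise have to recover at runtime.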

    Loop Rolling for Code Size Reduction


    Batchman and Robin: Batched and Non-batched Branching for Interactive ZK

    Vector Oblivious Linear Evaluation (VOLE) supports fast and scalable interactive Zero-Knowledge (ZK) proofs. Despite recent improvements to VOLE-based ZK, compiling proof statements to a control-flow-oblivious form (e.g., a circuit) continues to lead to expensive proofs. One useful setting where this inefficiency stands out is when the statement is a disjunction of clauses L1 ∨ · · · ∨ LB. Typically, ZK requires paying the price to handle all B branches. Prior works have shown how to avoid this price in communication, but not in computation. Our main result, Batchman, is asymptotically and concretely efficient VOLE-based ZK for batched disjunctions, i.e., statements containing R repetitions of the same disjunction. This is crucial for, e.g., emulating CPU steps in ZK. Our prover and verifier complexity is only O(RB + R|C| + B|C|), where |C| is the maximum circuit size of the B branches. Prior works' computation scales as RB|C|. For non-batched disjunctions, we also construct a VOLE-based ZK protocol, Robin, which is (only) communication efficient. For small fields and statistical security parameter λ, this protocol's communication improves over the previous state of the art (Mac'n'Cheese, Baum et al., CRYPTO'21) by up to a factor of λ. Our implementation outperforms the prior state of the art. E.g., we achieve up to a 6× improvement over Mac'n'Cheese (Boolean, single disjunction), and for arithmetic batched disjunctions our experiments show we improve over QuickSilver (Yang et al., CCS'21) by up to 70× and over AntMan (Weng et al., CCS'22) by up to 36×.
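    To make the asymptotic gap concrete, a back-of-the-envelope comparison with purely illustrative parameters (R, B and |C| below are chosen for illustration and are not taken from the paper):

        % Illustrative parameters only: R = 1000 repetitions, B = 32 branches,
        % |C| = 10^5 gates per branch.
        \[
          \underbrace{R\,B\,|C|}_{\text{prior works}} = 3.2\times 10^{9}
          \qquad\text{vs.}\qquad
          \underbrace{RB + R|C| + B|C|}_{\text{Batchman}} \approx 1.0\times 10^{8}
        \]
        % i.e. roughly a 30x reduction in prover/verifier work for these values.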

    Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single-processor era has been replaced by multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because of the flexibility and portability that would be needed to optimize for each target individually, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct and complementary ways to tackle this problem: providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions that automatically tune parameters for a restricted domain. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software engineering effort spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain, stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and a robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
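    A minimal C sketch of the kind of stencil such a framework abstracts, with the tile size exposed as the parameter an auto-tuner would search over per device; the function name, TILE value and blocking scheme are illustrative assumptions, not the thesis' API:

        #include <stddef.h>

        /* One Jacobi sweep of the 5-point Laplacian stencil over an nx x ny
         * grid, blocked by TILE in both dimensions.  TILE is exactly the kind
         * of parameter a domain-specific auto-tuner would pick per device. */
        #define TILE 32   /* hypothetical auto-tuned value */

        void stencil_sweep(size_t nx, size_t ny, const double *in, double *out)
        {
            for (size_t ii = 1; ii < nx - 1; ii += TILE)
                for (size_t jj = 1; jj < ny - 1; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < nx - 1; ++i)
                        for (size_t j = jj; j < jj + TILE && j < ny - 1; ++j)
                            out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                                      in[(i + 1) * ny + j] +
                                                      in[i * ny + j - 1] +
                                                      in[i * ny + j + 1]);
        }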

    New (Zero-Knowledge) Arguments and Their Applications to Verifiable Computation

    We study the problem of argument systems, where a computationally weak verifier outsources the execution of a computation to a powerful but untrusted prover, while being able to validate that the result was computed correctly through a proof generated by the prover. In addition, the zero-knowledge property guarantees that the proof leaks no information about the prover's potential secret input. Existing efficient zero-knowledge arguments with sublinear verification time require an expensive preprocessing phase that depends on a particular computation, and they incur a large overhead in prover time and prover memory consumption. This thesis proposes new constructions for zero-knowledge arguments that overcome the above problems. The new constructions require only a one-time preprocessing and can be used to validate any computation later. They also reduce the overhead in prover time and memory by orders of magnitude. We apply our new constructions to build a verifiable database system and verifiable RAM programs, leading to significant improvements over prior work.