7,637 research outputs found
Practical Run-time Checking via Unobtrusive Property Caching
The use of annotations, referred to as assertions or contracts, to describe
program properties for which run-time tests are to be generated, has become
frequent in dynamic programing languages. However, the frameworks proposed to
support such run-time testing generally incur high time and/or space overheads
over standard program execution. We present an approach for reducing this
overhead that is based on the use of memoization to cache intermediate results
of check evaluation, avoiding repeated checking of previously verified
properties. Compared to approaches that reduce checking frequency, our proposal
has the advantage of being exhaustive (i.e., all tests are checked at all
points) while still being much more efficient than standard run-time checking.
Compared to the limited previous work on memoization, it performs the task
without requiring modifications to data structure representation or checking
code. While the approach is general and system-independent, we present it for
concreteness in the context of the Ciao run-time checking framework, which
allows us to provide an operational semantics with checks and caching. We also
report on a prototype implementation and provide some experimental results that
support that using a relatively small cache leads to significant decreases in
run-time checking overhead.Comment: 30 pages, 1 table, 170 figures; added appendix with plots; To appear
in Theory and Practice of Logic Programming (TPLP), Proceedings of ICLP 201
CommCSL: Proving Information Flow Security for Concurrent Programs using Abstract Commutativity
Information flow security ensures that the secret data manipulated by a
program does not influence its observable output. Proving information flow
security is especially challenging for concurrent programs, where operations on
secret data may influence the execution time of a thread and, thereby, the
interleaving between different threads. Such internal timing channels may
affect the observable outcome of a program even if an attacker does not observe
execution times. Existing verification techniques for information flow security
in concurrent programs attempt to prove that secret data does not influence the
relative timing of threads. However, these techniques are often restrictive
(for instance because they disallow branching on secret data) and make strong
assumptions about the execution platform (ignoring caching, processor
instructions with data-dependent runtime, and other common features that affect
execution time). In this paper, we present a novel verification technique for
secure information flow in concurrent programs that lifts these restrictions
and does not make any assumptions about timing behavior. The key idea is to
prove that all mutating operations performed on shared data commute, such that
different thread interleavings do not influence its final value. Crucially,
commutativity is required only for an abstraction of the shared data that
contains the information that will be leaked to a public output. Abstract
commutativity is satisfied by many more operations than standard commutativity,
which makes our technique widely applicable. We formalize our technique in
CommCSL, a relational concurrent separation logic with support for
commutativity-based reasoning, and prove its soundness in Isabelle/HOL. We
implemented CommCSL in HyperViper, an automated verifier based on the Viper
verification infrastructure, and demonstrate its ability to verify challenging
examples
DIA: A complexity-effective decoding architecture
Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need for decoding them again, improving processor performance. However, introducing such special--purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from the memory instead of the original nondecoded instructions. Our results show that using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements, while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.Peer ReviewedPostprint (published version
Stable Model Counting and Its Application in Probabilistic Logic Programming
Model counting is the problem of computing the number of models that satisfy
a given propositional theory. It has recently been applied to solving inference
tasks in probabilistic logic programming, where the goal is to compute the
probability of given queries being true provided a set of mutually independent
random variables, a model (a logic program) and some evidence. The core of
solving this inference task involves translating the logic program to a
propositional theory and using a model counter. In this paper, we show that for
some problems that involve inductive definitions like reachability in a graph,
the translation of logic programs to SAT can be expensive for the purpose of
solving inference tasks. For such problems, direct implementation of stable
model semantics allows for more efficient solving. We present two
implementation techniques, based on unfounded set detection, that extend a
propositional model counter to a stable model counter. Our experiments show
that for particular problems, our approach can outperform a state-of-the-art
probabilistic logic programming solver by several orders of magnitude in terms
of running time and space requirements, and can solve instances of
significantly larger sizes on which the current solver runs out of time or
memory.Comment: Accepted in AAAI, 201
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs whose programming is conventionally a RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve the FPGA
programmability, it still leaves programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels
SAMIE-LSQ: set-associative multiple-instruction entry load/store queue
The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for the processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction entry LSQ (SAMIE-LSQ,) that achieves high performance with small energy requirements. The SAMIE-LSQ classifies the memory instructions (loads and stores) based on the address to be accessed, and groups those instructions accessing the same cache line in the same entry. Our approach relies on the fact that many in-flight memory instructions access the same cache lines. Each SAMIE-LSQ entry has space for several memory instructions accessing the same cache line. This arrangement has a number of advantages. First, it significantly reduces the address comparison activity needed for memory disambiguation since there are less addresses to be compared. It also reduces the activity in the data TLB, the cache tag and cache data arrays. This is achieved by caching the cache line location and address translation in the corresponding SAMIE-LSQ entry once the access of one of the instructions in an entry is performed, so instructions that share an entry can reuse the translation, avoid the tag check and get the data directly from the concrete cache way without checking the others. Besides, the delay of the proposed scheme is lower than that required by a conventional LSQ. We show that the SAMIE-LSQ saves 82% dynamic energy for the load/store queue, 42% for the LI data cache and 73% for the data TLB, with a negligible impact on performance (0.6%)Peer ReviewedPostprint (published version
- …