2,379 research outputs found
Late allocation and early release of physical registers
The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and releasing are conservatively done, the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once registers are not used any more. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers having a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where the register file pressure is usually high.Peer ReviewedPostprint (published version
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via
arxi
HAL/S-360 compiler system specification
A three phase language compiler is described which produces IBM 360/370 compatible object modules and a set of simulation tables to aid in run time verification. A link edit step augments the standard OS linkage editor. A comprehensive run time system and library provide the HAL/S operating environment, error handling, a pseudo real time executive, and an extensive set of mathematical, conversion, I/O, and diagnostic routines. The specifications of the information flow and content for this system are also considered
Formal change impact analyses for emulated control software
Processor emulators are a software tool for allowing legacy computer programs to be executed on a modern processor. In the past emulators have been used in trivial applications such as maintenance of video games. Now, however, processor emulation is being applied to safety-critical control systems, including military avionics. These applications demand utmost guarantees of correctness, but no verification techniques exist for proving that an emulated system preserves the original system’s functional and timing properties. Here we show how this can be done by combining concepts previously used for reasoning about real-time program compilation, coupled with an understanding of the new and old software architectures. In particular, we show how both the old and new systems can be given a common semantics, thus allowing their behaviours to be compared directly
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
Advanced software techniques for space shuttle data management systems Final report
Airborne/spaceborn computer design and techniques for space shuttle data management system
Partitioning SKA Dataflows for Optimal Graph Execution
Optimizing data-intensive workflow execution is essential to many modern
scientific projects such as the Square Kilometre Array (SKA), which will be the
largest radio telescope in the world, collecting terabytes of data per second
for the next few decades. At the core of the SKA Science Data Processor is the
graph execution engine, scheduling tens of thousands of algorithmic components
to ingest and transform millions of parallel data chunks in order to solve a
series of large-scale inverse problems within the power budget. To tackle this
challenge, we have developed the Data Activated Liu Graph Engine (DALiuGE) to
manage data processing pipelines for several SKA pathfinder projects. In this
paper, we discuss the DALiuGE graph scheduling sub-system. By extending
previous studies on graph scheduling and partitioning, we lay the foundation on
which we can develop polynomial time optimization methods that minimize both
workflow execution time and resource footprint while satisfying resource
constraints imposed by individual algorithms. We show preliminary results
obtained from three radio astronomy data pipelines.Comment: Accepted in HPDC ScienceCloud 2018 Worksho
- …