21,953 research outputs found
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
Recommended from our members
Percolation scheduling with resource constraints
This paper presents a new approach to resource-constrained compiler extraction of fine-grain parallelism, targeted towards VLIW supercomputers, and in particular, the IBM VLIW (Very Large Instruction Word) processor. The algorithms described integrate resource limitations into Percolation Scheduling—a global parallelization technique—to deal with resource constraints, without sacrificing the generality and completeness of Percolation Scheduling in the process. This is in sharp contrast with previous approaches which either applied only to conditional-free code, or drastically limited the parallelization process by imposing relatively local heuristic resource constraints early in the scheduling process
Design of multimedia processor based on metric computation
Media-processing applications, such as signal processing, 2D and 3D graphics
rendering, and image compression, are the dominant workloads in many embedded
systems today. The real-time constraints of those media applications have
taxing demands on today's processor performances with low cost, low power and
reduced design delay. To satisfy those challenges, a fast and efficient
strategy consists in upgrading a low cost general purpose processor core. This
approach is based on the personalization of a general RISC processor core
according the target multimedia application requirements. Thus, if the extra
cost is justified, the general purpose processor GPP core can be enforced with
instruction level coprocessors, coarse grain dedicated hardware, ad hoc
memories or new GPP cores. In this way the final design solution is tailored to
the application requirements. The proposed approach is based on three main
steps: the first one is the analysis of the targeted application using
efficient metrics. The second step is the selection of the appropriate
architecture template according to the first step results and recommendations.
The third step is the architecture generation. This approach is experimented
using various image and video algorithms showing its feasibility
Scratchpad Sharing in GPUs
GPGPU applications exploit on-chip scratchpad memory available in the
Graphics Processing Units (GPUs) to improve performance. The amount of thread
level parallelism present in the GPU is limited by the number of resident
threads, which in turn depends on the availability of scratchpad memory in its
streaming multiprocessor (SM). Since the scratchpad memory is allocated at
thread block granularity, part of the memory may remain unutilized. In this
paper, we propose architectural and compiler optimizations to improve the
scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad
under-utilization by launching additional thread blocks in each SM. These
thread blocks use unutilized scratchpad and also share scratchpad with other
resident blocks. To improve the performance of scratchpad sharing, we propose
Owner Warp First (OWF) scheduling that schedules warps from the additional
thread blocks effectively. The performance of this approach, however, is
limited by the availability of the shared part of scratchpad.
We propose compiler optimizations to improve the availability of shared
scratchpad. We describe a scratchpad allocation scheme that helps in allocating
scratchpad variables such that shared scratchpad is accessed for short
duration. We introduce a new instruction, relssp, that when executed, releases
the shared scratchpad. Finally, we describe an analysis for optimal placement
of relssp instructions such that shared scratchpad is released as early as
possible.
We implemented the hardware changes using the GPGPU-Sim simulator and
implemented the compiler optimizations in Ocelot framework. We evaluated the
effectiveness of our approach on 19 kernels from 3 benchmarks suites: CUDA-SDK,
GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an
average improvement of 19% and maximum improvement of 92.17% compared to the
baseline approach
GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumer in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010's, the projections were that simulation demands would
scale linearly with luminosity increase, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
corresponds to speeding-up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents extensively the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype.Comment: 34 pages, 26 figures, 24 table
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201
- …