41,741 research outputs found
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
A Linear First-Order Functional Intermediate Language for Verified Compilers
We present the linear first-order intermediate language IL for verified
compilers. IL is a functional language with calls to a nondeterministic
environment. We give IL terms a second, imperative semantic interpretation and
obtain a register transfer language. For the imperative interpretation we
establish a notion of live variables. Based on live variables, we formulate a
decidable property called coherence ensuring that the functional and the
imperative interpretation of a term coincide. We formulate a register
assignment algorithm for IL and prove its correctness. The algorithm translates
a functional IL program into an equivalent imperative IL program. Correctness
follows from the fact that the algorithm reaches a coherent program after
consistently renaming local variables. We prove that the maximal number of live
variables in the initial program bounds the number of different variables in
the final coherent program. The entire development is formalized in Coq.Comment: Addressed comments from reviewers (ITP 2015): (1) Added discussion of
a paper in related work (2) Added definition of renamed-apart in appendix (3)
Formulation changes in a coupe of place
A formally verified compiler back-end
This article describes the development and formal verification (proof of
semantic preservation) of a compiler back-end from Cminor (a simple imperative
intermediate language) to PowerPC assembly code, using the Coq proof assistant
both for programming the compiler and for proving its correctness. Such a
verified compiler is useful in the context of formal methods applied to the
certification of critical software: the verification of the compiler guarantees
that the safety properties proved on the source code hold for the executable
compiled code as well
A unified modulo scheduling and register allocation technique for clustered processors
This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on a type of resource for another resource are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for the SPECfp95 is 36% for a 4-cluster configuration.Peer ReviewedPostprint (published version
Maximizing resource utilization by slicing of superscalar architecture
Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors make use of several processing resources to achieve this kind of throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage; This thesis work proposes a dynamic scheme to increase efficiency of execution stage by a methodology called block slicing. Implementing this concept in a wide, superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required for the implementation of the proposed scheme is designed and assessed in terms of cost and delay. Performance measures of speed-up, throughput and efficiency have been evaluated for the resulting pipeline and analyzed
Recommended from our members
Silicon compilation
Silicon compilation is a term used for many different purposes. In this paper we define silicon compilation as a mapping from some higher level description into layout. We define the basic issues in structural and behavioral silicon compilation and some possible solutions to those issues. Finally, we define the concept of an intelligent silicon compiler in which the compiler evaluates the quality of the generated design and attempts to improve it if it is not satisfactory
Recommended from our members
VSS : a VHDL synthesis system
This report describes a register transfer synthesis system that allows a designer to interact with the design process. The designer can modify the compiled design by changing the input description, selecting optimization and mapping strategies, or graphically changing the generated design schematic. The VHDL language is used for input and output descriptions. An intermediate representation which incorporates signal typing and component attributes simplifies compilation and facilitates design optimization. The compilation process consists of two phases. First, a design composed of generic components is synthesized from the input description. Second, this design is translated into components from a particular library by a mapper and optimized by a logic optimizer. Redesign to new technologies can be accomplished by changing only the component library
Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version
Variable-based multi-module data caches for clustered VLIW processors
Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories consists of reducing their supply voltage and/or increase their threshold voltage at an expense in access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. Such division is done on a variable basis so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be set up as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques in order to distribute variables between the two cache modules and generate code accordingly. We have explored several cache configurations using the Mediabench suite and we have observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.Peer ReviewedPostprint (published version
- …