Analyzing Conflict Freedom For Multi-threaded Programs With Time Annotations
Avoiding access conflicts is a major challenge in the design of
multi-threaded programs. In the context of real-time systems, the absence of
conflicts can be guaranteed by ensuring that no two potentially conflicting
accesses are ever scheduled concurrently. In this paper, we analyze programs
that carry time annotations specifying the time for executing each statement.
We propose a technique for verifying that a multi-threaded program with time
annotations is free of access conflicts. In particular, we generate constraints
that reflect the possible schedules for executing the program and the required
properties. We then invoke an SMT solver in order to verify that no execution
gives rise to concurrent conflicting accesses. Otherwise, we obtain a trace
that exhibits the access conflict. Comment: http://journal.ub.tu-berlin.de/eceasst/article/view/97
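As a rough illustration of the property being verified, the sketch below checks conflict freedom by brute force rather than by generating SMT constraints. It assumes straight-line threads, a fixed start time per thread, and statements that run back to back for exactly their annotated duration; the `Stmt` type, the variable names, and the example programs are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Stmt:
    duration: int                      # time annotation for this statement
    reads: frozenset = frozenset()     # variables read
    writes: frozenset = frozenset()    # variables written


def intervals(start, stmts):
    """Yield (begin, end, stmt) for statements executed back to back."""
    t = start
    for s in stmts:
        yield t, t + s.duration, s
        t += s.duration


def conflicts(a, b):
    """Two statements conflict if they share a variable and one writes it."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def find_conflict(threads):
    """threads: list of (start_time, [Stmt]). Return a witness pair or None."""
    timed = [list(intervals(start, stmts)) for start, stmts in threads]
    for t1, t2 in combinations(range(len(timed)), 2):
        for b1, e1, s1 in timed[t1]:
            for b2, e2, s2 in timed[t2]:
                overlap = b1 < e2 and b2 < e1   # half-open intervals intersect
                if overlap and conflicts(s1, s2):
                    return (t1, b1, e1), (t2, b2, e2)
    return None


# Hypothetical example: the write to x occupies [0, 2).
writer = [Stmt(2, writes=frozenset({"x"})), Stmt(3)]
late_reader = [Stmt(2), Stmt(1, reads=frozenset({"x"}))]   # reads x in [2, 3)
early_reader = [Stmt(1, reads=frozenset({"x"}))]           # reads x in [0, 1)

safe = find_conflict([(0, writer), (0, late_reader)])
racy = find_conflict([(0, writer), (0, early_reader)])
```

Here `safe` is `None` because the read starts only after the write has finished, while `racy` holds the two overlapping intervals, which plays the role of the trace mentioned in the abstract. An SMT encoding would instead leave start times and durations symbolic and ask the solver whether any admissible schedule makes two conflicting intervals overlap.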
Non-consistent dual register files to reduce register pressure
The continuous growth in instruction-level parallelism offered by microprocessors requires a large register file and a large number of ports to access it. This paper presents the non-consistent dual register file, an alternative implementation and management of the register file. Non-consistent dual register files support the bandwidth demands and the high register requirements while penalizing neither access time nor implementation cost. The proposal is evaluated for software pipelined loops and compared against a unified register file. Empirical results show improvements in performance and a noticeable reduction in the density of memory traffic due to a reduction in spill code. Spill code can in general increase the minimum initiation interval and decrease loop performance. Additional improvements can be obtained when the operations are scheduled with the proposed register file organization in mind. Peer Reviewed. Postprint (published version
Selective Vectorization for Short-Vector Instructions
Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. A key aspect of selective vectorization is its ability to manage transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x
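The balancing idea can be sketched with a simple resource bound. The code below is an assumption-laden illustration, not the paper's algorithm: it treats all operations as interchangeable, ignores operand transfer between scalar and vector units (which the paper does account for), and picks the number of vectorized operations that minimizes the resource-constrained initiation interval for a group of `vl` loop iterations. The function names and the machine parameters are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def res_mii(n_ops, k_vector, vl, scalar_units, vector_units):
    """Resource bound on the initiation interval for vl iterations.

    k_vector of the n_ops operations are vectorized (one vector opcode
    covers vl iterations); the rest are issued vl times on scalar units.
    """
    scalar_work = (n_ops - k_vector) * vl
    vector_work = k_vector
    return max(ceil_div(scalar_work, scalar_units),
               ceil_div(vector_work, vector_units))


def best_split(n_ops, vl, scalar_units, vector_units):
    """Number of operations to vectorize that minimizes the resource bound."""
    return min(
        range(n_ops + 1),
        key=lambda k: (res_mii(n_ops, k, vl, scalar_units, vector_units), k),
    )


# Hypothetical machine: 3 scalar units, 1 vector unit, vector length 4.
all_scalar = res_mii(12, 0, 4, 3, 1)
all_vector = res_mii(12, 12, 4, 3, 1)
k = best_split(12, 4, 3, 1)
selective = res_mii(12, k, 4, 3, 1)
```

With these made-up numbers, leaving everything scalar gives a bound of 16 cycles per four iterations, vectorizing everything gives 12, and vectorizing 7 of the 12 operations gives 7, which is the kind of gap that motivates using both kinds of units at once.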
Harvesting graphics power for MD simulations
We discuss an implementation of molecular dynamics (MD) simulations on a
graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code
on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms
suitable for short-ranged and long-ranged interactions, and a congruential
shift random number generator are presented. The performance of the GPU is
compared to that of its main processor counterpart. We achieve speedups of up to 80,
40 and 150 fold, respectively. With the newest generation of GPUs one can run
standard MD simulations at 10^7 flops/$. Comment: 12 pages, 5 figures. Submitted to Mol. Si
Clustered VLIW architecture based on queue register files
Institute for Computing Systems Architecture
Instruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. In spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons, the benefits of having parallel FUs can be undermined, which has motivated the investigation of alternative machine designs.
This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture
Streamroller: A Unified Compilation and Synthesis System for Streaming Applications.
The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power, and high performance implementations of these applications in the form of custom
hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and map these applications efficiently onto multicore architectures.
This dissertation presents a unified methodology, referred to as Streamroller, that can be applied for the problem of scheduling stream programs to multicore architectures and to the problem of automatic synthesis of
custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, like
limited memory and explicit DMAs are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor.
Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which improves cost by
efficiently sharing hardware between multiple kernels.
Finally, a method to improve the integer linear program formulations used in the schedulers that exploits symmetry in the solution space is
presented. Symmetry-breaking constraints are added to the formulation, and the performance of the solver is evaluated.
Ph.D. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61662/1/kvman_1.pd
Implementation and Improvement of a Swing Modulo Scheduler for VLIW Architecture
Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. 백윤흥.
For VLIW architectures, the compiler is in charge of statically scheduling instructions, since this kind of architecture has no hardware for hazard detection. Thus, instruction scheduling techniques for VLIW architectures have a critical influence on both the correctness of parallel execution and the effective utilization of hardware resources. Software pipelining is a popular instruction scheduling technique that enables overlapped execution of successive loop iterations. We implemented a compiler module, a swing modulo scheduler, to achieve software pipelining for a target VLIW architecture. Experiments on a set of multimedia applications show that the swing modulo scheduler achieves a speed-up of up to 2.6 times over the basic list scheduling implementation.
1. Introduction
2. Background
2.1 Very Long Instruction Word (VLIW) Architecture
2.2 Instruction Scheduling for VLIW Architecture
2.3 Software Pipelining for VLIW Architecture
2.4 LLVM Compiler Infrastructure
3. Swing Modulo Scheduling
3.1 Build Data Dependence Graphs
3.2 Calculate Minimum Initiation Interval (MII)
3.3 Analysis and Computation
3.4 Order Nodes
3.5 Schedule Nodes
4. Implementation and Improvement
4.1 Preprocess Basic Blocks
4.2 Build Scheduling Graphs
4.3 Find or Build Basic Induction Variables
4.4 Calculate Resource MII
4.5 Find All Circuits for Calculating Recurrence MII
4.6 Break Anti-dependences
4.7 Compute Partial Order
4.8 Compute Final Order
4.9 Construct Prologue, Kernel and Epilogue
4.10 Check Register Pressure
4.11 Adjust Loop Iteration Count
5. Experimental Results
5.1 Environment
5.2 Performance
5.3 Effectiveness
6. Conclusion and Future Work
Reference
Maste
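The outline lists separate steps for the resource MII and the recurrence MII. A minimal sketch of the standard textbook definitions follows; it is not the thesis's LLVM implementation. It assumes the dependence circuits have already been enumerated and are passed in as lists of `(latency, distance)` edges with a positive total distance, and all names and numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def resource_mii(op_counts, unit_counts):
    """Largest ceil(ops / units) over all resource kinds used by the loop."""
    return max(ceil_div(n, unit_counts[kind]) for kind, n in op_counts.items())


def recurrence_mii(circuits):
    """Largest ceil(total latency / total iteration distance) over circuits.

    Each circuit is a list of (latency, distance) dependence edges and is
    assumed to have a total distance of at least 1.
    """
    bound = 1
    for edges in circuits:
        latency = sum(lat for lat, _ in edges)
        distance = sum(dist for _, dist in edges)
        bound = max(bound, ceil_div(latency, distance))
    return bound


def minimum_ii(op_counts, unit_counts, circuits):
    """MII is the larger of the resource and recurrence bounds."""
    return max(resource_mii(op_counts, unit_counts), recurrence_mii(circuits))


# Hypothetical loop body and machine.
ops = {"alu": 5, "mem": 3}
units = {"alu": 2, "mem": 1}
circuits = [
    [(2, 0), (1, 0), (1, 1)],   # latency 4 carried over 1 iteration
    [(3, 0), (2, 2)],           # latency 5 carried over 2 iterations
]
mii = minimum_ii(ops, units, circuits)
```

In this example the resource bound is 3 and the recurrence bound is 4, so scheduling would start at an initiation interval of 4 and increase it only if no valid schedule is found.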
Optimization Modulo Theories with Linear Rational Costs
In the contexts of automated reasoning (AR) and formal verification (FV),
important decision problems are effectively encoded into Satisfiability Modulo
Theories (SMT). In the last decade efficient SMT solvers have been developed
for several theories of practical interest (e.g., linear arithmetic, arrays,
bit-vectors). Surprisingly, little work has been done to extend SMT to deal
with optimization problems; in particular, we are not aware of any previous
work on SMT solvers able to produce solutions which minimize cost functions
over arithmetical variables. This is unfortunate, since some problems of
interest require this functionality.
In the work described in this paper we start filling this gap. We present and
discuss two general procedures for leveraging SMT to handle the minimization of
linear rational cost functions, combining SMT with standard minimization
techniques. We have implemented the procedures within the MathSAT SMT solver.
Due to the absence of competitors in the AR, FV and SMT domains, we have
experimentally evaluated our implementation against state-of-the-art tools for
the domain of linear generalized disjunctive programming (LGDP), which is
closest in spirit to our domain, on sets of problems which have been previously
proposed as benchmarks for the latter tools. The results show that our tool is
very competitive with, and often outperforms, these tools on these problems,
clearly demonstrating the potential of the approach. Comment: Submitted in January 2014 to ACM Transactions on Computational Logic,
currently under revision. arXiv admin note: text overlap with arXiv:1202.140
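One standard minimization technique that can be wrapped around a satisfiability check is bisection on a cost bound. The sketch below illustrates only that idea and is not the procedure implemented in MathSAT: `is_sat_with_cost_at_most` is a hypothetical stand-in for a solver call on the formula conjoined with `cost <= bound`, the toy oracle replaces a real SMT solver, and the result is accurate only up to the chosen tolerance.

```python
from fractions import Fraction


def minimize(is_sat_with_cost_at_most, lower, upper, tol):
    """Bisection on the cost bound.

    Assumes lower <= minimum cost. Returns a bound within tol of the
    minimum, or None if the formula is unsatisfiable even at `upper`.
    """
    if not is_sat_with_cost_at_most(upper):
        return None
    while upper - lower > tol:
        mid = (lower + upper) / 2
        if is_sat_with_cost_at_most(mid):
            upper = mid      # a solution with cost <= mid exists
        else:
            lower = mid      # every solution costs more than mid
    return upper


# Toy stand-in for the solver: the formula is a disjunction of two boxes,
# (1 <= x <= 3 and 2 <= y <= 4) or (0 <= x <= 1 and 5 <= y <= 6),
# and the cost function is x + y.
BOXES = [((1, 3), (2, 4)), ((0, 1), (5, 6))]


def toy_oracle(bound):
    """True if some box contains a point with x + y <= bound."""
    return any(x_lo + y_lo <= bound for (x_lo, _), (y_lo, _) in BOXES)


result = minimize(toy_oracle, Fraction(0), Fraction(20), Fraction(1, 1000))
```

The cheapest point of the toy formula is x = 1, y = 2 with cost 3, and `result` lands within one thousandth above that value. Using `Fraction` keeps the arithmetic exact, in keeping with the rational costs discussed in the abstract.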
Automatic Design of Efficient Application-centric Architectures.
As the market for embedded devices continues to grow, the demand for high
performance, low cost, and low power computation grows as well. Many embedded
applications perform computationally intensive tasks such as processing streaming
video or audio, wireless communication, or speech recognition and must be
implemented within tight power budgets. Typically, general-purpose
processors are not able to meet these performance and power requirements.
Custom hardware in the form of loop accelerators is often used to execute the
compute-intensive portions of these applications because they can achieve significantly
higher levels of performance and power efficiency.
Automated hardware synthesis from high level specifications is a key technology
used in designing these accelerators, because the resulting hardware is correct by
construction, easing verification and greatly decreasing time-to-market in the quickly
evolving embedded domain. In this dissertation, a compiler-directed approach is used
to design a loop accelerator from a C specification and a throughput requirement. The
compiler analyzes the loop and generates a virtual architecture containing sufficient
resources to sustain the required throughput. Next, a software pipelining scheduler
maps the operations in the loop to the virtual architecture. Finally, the accelerator
datapath is derived from the resulting schedule.
In this dissertation, synthesis of different types of loop accelerators is investigated.
First, the system for synthesizing single loop accelerators is detailed. In particular, a
scheduler is presented that is aware of the effects of its decisions on the resulting hardware,
and attempts to minimize hardware cost. Second, synthesis of multifunction
loop accelerators, or accelerators capable of executing multiple loops, is presented.
Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce
overall cost. Finally, synthesis of post-programmable accelerators is presented,
allowing changes to be made to the software after an accelerator has been created.
The tradeoffs between the flexibility, cost, and energy efficiency of these different
types of accelerators are investigated. Automatically synthesized loop accelerators
are capable of achieving order-of-magnitude gains in performance, area efficiency,
and power efficiency over processors, and programmable accelerators allow software
changes while maintaining highly efficient levels of computation.
Ph.D. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd
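The step that derives a virtual architecture from a throughput requirement can be approximated with a simple counting argument. The sketch below assumes the throughput is given as an initiation interval in cycles per loop iteration and that each function unit can start one operation per cycle; it ignores recurrences, registers, and memory ports, and the function name and operation mix are hypothetical rather than taken from the dissertation.

```python
def size_virtual_architecture(op_counts, initiation_interval):
    """Minimum function units per kind needed to start one loop iteration
    every `initiation_interval` cycles, assuming fully pipelined units."""
    if initiation_interval < 1:
        raise ValueError("initiation interval must be at least 1 cycle")
    return {
        kind: -(-count // initiation_interval)   # ceiling division
        for kind, count in op_counts.items()
    }


# Hypothetical loop body: 7 adds, 3 multiplies, 4 loads per iteration.
loop_ops = {"add": 7, "mul": 3, "load": 4}
fast = size_virtual_architecture(loop_ops, 2)
slow = size_virtual_architecture(loop_ops, 4)
```

Relaxing the requirement from one iteration every 2 cycles to one every 4 cycles shrinks the datapath from 4 adders, 2 multipliers, and 2 load units to 2, 1, and 1, which shows how the throughput target drives hardware cost.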
Putting Operational Techniques to the Test: A Syntactic Theory for Behavioral Verilog
We present a syntactic theory for the behavioral subset of the Verilog Hardware Description Language. Due to the complexity of the language, the construction of this theory represents a serious test of the suitability of syntactic operational techniques for reasoning about industrial languages. Overall, we have found that these techniques are rather robust but with a few caveats. Our theory formalizes the simulation cycle explicitly, exposes a number of ambiguities and inconsistencies in the language reference manual (LRM), and is the most accurate known description of this subset of Verilog, with respect to the LRM. The syntactic theory has been used to automatically derive a simulator for Verilog