406 research outputs found

    Analyzing Conflict Freedom For Multi-threaded Programs With Time Annotations

    Avoiding access conflicts is a major challenge in the design of multi-threaded programs. In the context of real-time systems, the absence of conflicts can be guaranteed by ensuring that no two potentially conflicting accesses are ever scheduled concurrently. In this paper, we analyze programs that carry time annotations specifying the time for executing each statement. We propose a technique for verifying that a multi-threaded program with time annotations is free of access conflicts. In particular, we generate constraints that reflect the possible schedules for executing the program and the required properties. We then invoke an SMT solver in order to verify that no execution gives rise to concurrent conflicting accesses. Otherwise, we obtain a trace that exhibits the access conflict. Comment: http://journal.ub.tu-berlin.de/eceasst/article/view/97
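As a rough illustration of the property being checked (not the paper's SMT encoding), the sketch below assumes a much simpler setting: every thread is a straight-line list of statements, all threads start at time 0, and each statement runs for exactly its annotated duration. Under those assumptions each access occupies a fixed time interval, and a conflict is two overlapping accesses to the same variable from different threads, at least one of them a write. The names `Stmt`, `access_intervals`, and `find_conflicts` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stmt:
    duration: int               # time annotation: units needed to execute the statement
    var: Optional[str] = None   # shared variable accessed, if any
    write: bool = False         # True for a write access, False for a read


def access_intervals(thread: List[Stmt]) -> List[Tuple[int, int, str, bool]]:
    """Return (start, end, var, write) for every access, with end exclusive."""
    t = 0
    out = []
    for s in thread:
        if s.var is not None:
            out.append((t, t + s.duration, s.var, s.write))
        t += s.duration
    return out


def find_conflicts(threads: List[List[Stmt]]) -> List[Tuple[int, int, str, int]]:
    """Return (thread_i, thread_j, var, time) for each pair of conflicting accesses."""
    conflicts = []
    per_thread = [access_intervals(t) for t in threads]
    for (i, a), (j, b) in combinations(enumerate(per_thread), 2):
        for s1, e1, v1, w1 in a:
            for s2, e2, v2, w2 in b:
                overlap = s1 < e2 and s2 < e1
                if v1 == v2 and (w1 or w2) and overlap:
                    # Report the first instant at which both accesses are active.
                    conflicts.append((i, j, v1, max(s1, s2)))
    return conflicts
```

With fixed start times the check reduces to interval overlap. The abstract's approach generates constraints over the possible schedules and hands them to a solver, which this sketch does not attempt.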

    Non-consistent dual register files to reduce register pressure

    The continuous growth in instruction-level parallelism offered by microprocessors requires a large register file and a large number of ports to access it. This paper presents the non-consistent dual register file, an alternative implementation and management of the register file. Non-consistent dual register files support the bandwidth demands and the high register requirements, penalizing neither access time nor implementation cost. The proposal is evaluated for software pipelined loops and compared against a unified register file. Empirical results show improvements in performance and a noticeable reduction in the density of memory traffic due to a reduction of the spill code. The spill code can in general increase the minimum initiation interval and decrease loop performance. Additional improvements can be obtained when the operations are scheduled with the proposed register file organization in mind. Peer Reviewed. Postprint (published version

    Selective Vectorization for Short-Vector Instructions

    Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. A key aspect of selective vectorization is its ability to manage transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x

    Harvesting graphics power for MD simulations

    We discuss an implementation of molecular dynamics (MD) simulations on a graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPUs is compared to that of their main processor counterpart. We achieve speedups of up to 80, 40 and 150 fold, respectively. With the newest generation of GPUs one can run standard MD simulations at 10^7 flops/$. Comment: 12 pages, 5 figures. Submitted to Mol. Si

    Clustered VLIW architecture based on queue register files

    Institute for Computing Systems Architecture. Instruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. In spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons, the benefits of having many parallel FUs can be undermined, which has motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy.
We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture

    Streamroller : A Unified Compilation and Synthesis System for Streaming Applications.

    The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power, and high performance implementations of these applications in the form of custom hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and map these applications efficiently onto multicore architectures. This dissertation presents a unified methodology, referred to as Streamroller, that can be applied for the problem of scheduling stream programs to multicore architectures and to the problem of automatic synthesis of custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, like limited memory and explicit DMAs are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor. Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which improves cost by efficiently sharing hardware between multiple kernels. Finally, a method to improve the integer linear program formulations used in the schedulers that exploits symmetry in the solution space is presented.
Symmetry-breaking constraints are added to the formulation, and the performance of the solver is evaluated. Ph.D. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61662/1/kvman_1.pd

    Implementation and Improvement of a Swing Modulo Scheduler for VLIW Architecture

    Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. 백윤흥.
    For VLIW architectures, the compiler is in charge of statically scheduling instructions, since this kind of architecture has no hardware for hazard detection. Thus, instruction scheduling techniques for VLIW architectures critically influence both the correctness of parallel execution and the effective utilization of hardware resources. Software pipelining is a popular instruction scheduling technique that enables overlapped execution of successive loop iterations. We implemented a compiler module, a swing modulo scheduler, to achieve software pipelining for a target VLIW architecture. Experiments on a set of multimedia applications show that the swing modulo scheduler achieves up to a 2.6 times speed-up over the basic list scheduling implementation.
    Contents:
    1. Introduction
    2. Background
    2.1 Very Long Instruction Word (VLIW) Architecture
    2.2 Instruction Scheduling for VLIW Architecture
    2.3 Software Pipelining for VLIW Architecture
    2.4 LLVM Compiler Infrastructure
    3. Swing Modulo Scheduling
    3.1 Build Data Dependence Graphs
    3.2 Calculate Minimum Initiation Interval (MII)
    3.3 Analysis and Computation
    3.4 Order Nodes
    3.5 Schedule Nodes
    4. Implementation and Improvement
    4.1 Preprocess Basic Blocks
    4.2 Build Scheduling Graphs
    4.3 Find or Build Basic Induction Variables
    4.4 Calculate Resource MII
    4.5 Find All Circuits for Calculating Recurrence MII
    4.6 Break Anti-dependences
    4.7 Compute Partial Order
    4.8 Compute Final Order
    4.9 Construct Prologue, Kernel and Epilogue
    4.10 Check Register Pressure
    4.11 Adjust Loop Iteration Count
    5. Experimental Results
    5.1 Environment
    5.2 Performance
    5.3 Effectiveness
    6. Conclusion and Future Work
    Reference
    Maste
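The outline above lists separate steps for the resource MII and the recurrence MII. A minimal sketch of the textbook lower bound on the initiation interval follows; it is not taken from the thesis. It assumes each operation occupies one functional unit for one cycle, and that the dependence circuits have already been enumerated as (total latency, total iteration distance) pairs. The function names and input shapes are hypothetical.

```python
from math import ceil
from typing import Dict, Iterable, Tuple


def resource_mii(op_counts: Dict[str, int], unit_counts: Dict[str, int]) -> int:
    """ResMII: the busiest resource class bounds how often an iteration can start."""
    return max(
        (ceil(op_counts[r] / unit_counts[r]) for r in op_counts),
        default=1,
    )


def recurrence_mii(circuits: Iterable[Tuple[int, int]]) -> int:
    """RecMII: each dependence circuit needs latency / distance cycles per iteration.

    Every circuit is (total latency, total iteration distance), with distance > 0.
    """
    return max((ceil(lat / dist) for lat, dist in circuits), default=1)


def minimum_ii(op_counts, unit_counts, circuits) -> int:
    """MII is the larger of the resource and recurrence bounds."""
    return max(resource_mii(op_counts, unit_counts), recurrence_mii(circuits))
```

For example, five ALU operations on two ALUs and three memory operations on one memory port give a resource bound of 3, while a circuit with latency 4 and distance 1 raises the overall bound to 4. A modulo scheduler typically starts its search at this bound and increases the interval until a valid schedule is found.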

    Optimization Modulo Theories with Linear Rational Costs

    In the contexts of automated reasoning (AR) and formal verification (FV), important decision problems are effectively encoded into Satisfiability Modulo Theories (SMT). In the last decade efficient SMT solvers have been developed for several theories of practical interest (e.g., linear arithmetic, arrays, bit-vectors). Surprisingly, little work has been done to extend SMT to deal with optimization problems; in particular, we are not aware of any previous work on SMT solvers able to produce solutions which minimize cost functions over arithmetical variables. This is unfortunate, since some problems of interest require this functionality. In the work described in this paper we start filling this gap. We present and discuss two general procedures for leveraging SMT to handle the minimization of linear rational cost functions, combining SMT with standard minimization techniques. We have implemented the procedures within the MathSAT SMT solver. Due to the absence of competitors in the AR, FV and SMT domains, we have experimentally evaluated our implementation against state-of-the-art tools for the domain of linear generalized disjunctive programming (LGDP), which is closest in spirit to our domain, on sets of problems which have been previously proposed as benchmarks for the latter tools. The results show that our tool is very competitive with, and often outperforms, these tools on these problems, clearly demonstrating the potential of the approach. Comment: Submitted in January 2014 to ACM Transactions on Computational Logic, currently under revision. arXiv admin note: text overlap with arXiv:1202.140
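One standard way to turn a satisfiability check into a minimizer is bisection over a cost bound. The sketch below illustrates that general idea only; it is not the procedures implemented in MathSAT. It assumes a hypothetical oracle `check(bound)` that returns the cost of some solution with cost at most `bound`, or `None` if there is none, and that `lower` is a valid lower bound on every feasible cost. Costs are `Fraction` values to stay within exact rational arithmetic.

```python
from fractions import Fraction
from typing import Callable, Optional


def minimize(
    check: Callable[[Fraction], Optional[Fraction]],
    lower: Fraction,
    upper: Fraction,
    tolerance: Fraction,
) -> Optional[Fraction]:
    """Bisection on the cost bound; the result is within `tolerance` of the minimum."""
    best = check(upper)
    if best is None:
        return None              # no solution at all below the initial upper bound
    lo, hi = lower, best
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        found = check(mid)
        if found is None:
            lo = mid             # no solution has cost <= mid
        else:
            hi = found           # found <= mid, so the interval at least halves
            best = found
    return best
```

Each oracle call stands in for one solver invocation with an added constraint of the form cost <= bound. Using the returned cost rather than `mid` as the new upper bound tightens the interval faster when the solver happens to return a cheap solution.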

    Automatic Design of Efficient Application-centric Architectures.

    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators is often used to execute the compute-intensive portions of these applications because it can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost.
Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation. Ph.D. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Putting Operational Techniques to the Test: A Syntactic Theory for Behavioral Verilog

    We present a syntactic theory for the behavioral subset of the Verilog Hardware Description Language. Due to the complexity of the language, the construction of this theory represents a serious test of the suitability of syntactic operational techniques for reasoning about industrial languages. Overall, we have found that these techniques are rather robust but with a few caveats. Our theory formalizes the simulation cycle explicitly, exposes a number of ambiguities and inconsistencies in the language reference manual (LRM), and is the most accurate known description of this subset of Verilog, with respect to the LRM. The syntactic theory has been used to automatically derive a simulator for Verilog