
    Scheduling and Allocation of Non-Manifest Loops on Hardware Graph-Models

    We address the problem of scheduling non-manifest data-dependent periodic loops for high-throughput DSP applications based on a streaming data model. In contrast to manifest loops, non-manifest data-dependent loops are loops where the number of iterations needed to perform a calculation is data dependent and hence not known at compile time. For manifest loops, static scheduling techniques have been devised that produce near-optimal schedules. Due to the lack of exact run-time execution knowledge of non-manifest loops, these static scheduling techniques are not suitable for scheduling DSP algorithms with non-manifest loops embedded in them. We consider the case where (a) a priori knowledge of the data distribution and (b) the worst-case execution time of the non-manifest loop are known, and a constraint on the total execution time has been given. Under these conditions, dynamic schedules of the non-manifest data-dependent loops within the DSP algorithm are possible. We show how to construct hardware that dynamically schedules these non-manifest loops. The sliding-window execution of the constructed hardware, that is, the execution of a non-manifest loop as the data streams through it, guarantees real-time performance in the worst case: the situation in which each non-manifest loop requires its maximum number of iterations.
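
    As a rough illustration of the timing argument in this abstract, the hedged C sketch below (not the paper's hardware; all constants and the distribution are assumed) checks whether a non-manifest loop scheduled for its worst case still meets a total execution-time constraint, and estimates its average cost from an assumed a priori data distribution.

        /* Hypothetical sketch, not the paper's hardware scheduler: check whether a
         * non-manifest loop, provisioned for its worst case, meets a given total
         * execution-time constraint, and estimate its expected cost from an
         * assumed a priori distribution over iteration counts. */
        #include <stdio.h>

        #define MAX_ITERS 8   /* assumed worst-case iteration count of the loop */

        int main(void) {
            double cycles_per_iter = 3.0;   /* assumed latency of one iteration (cycles) */
            double deadline_cycles = 32.0;  /* assumed total execution-time constraint */

            /* assumed probability that the loop needs exactly k+1 iterations */
            double p[MAX_ITERS] = {0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.01, 0.01};

            double worst_case = MAX_ITERS * cycles_per_iter;
            double expected = 0.0;
            for (int k = 0; k < MAX_ITERS; k++)
                expected += (k + 1) * cycles_per_iter * p[k];

            printf("worst case: %.1f cycles, expected: %.2f cycles\n", worst_case, expected);
            printf("meets deadline in the worst case: %s\n",
                   worst_case <= deadline_cycles ? "yes" : "no");
            return 0;
        }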

    Fast and Precise Symbolic Analysis of Concurrency Bugs in Device Drivers

    © 2015 IEEE. Concurrency errors, such as data races, make device drivers notoriously hard to develop and debug without automated tool support. We present Whoop, a new automated approach that statically analyzes drivers for data races. Whoop is empowered by symbolic pairwise lockset analysis, a novel analysis that can soundly detect all potential races in a driver. Our analysis avoids reasoning about thread interleavings and thus scales well. Exploiting the race-freedom guarantees provided by Whoop, we achieve a sound partial-order reduction that significantly accelerates Corral, an industrial-strength bug-finder for concurrent programs. Using the combination of Whoop and Corral, we analyzed 16 drivers from the Linux 4.0 kernel, achieving 1.5-20× speedups over standalone Corral.
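
    For readers unfamiliar with lockset reasoning, the small C sketch below shows the classic lockset idea (Eraser-style), not Whoop's actual symbolic pairwise analysis: each access to a shared variable intersects the variable's candidate lockset with the locks held at that access, and an empty intersection flags a potential race.

        /* Minimal sketch of the classic lockset idea, not Whoop's symbolic analysis:
         * locks are bits in a mask; each access intersects the variable's candidate
         * lockset with the locks currently held, and an empty set flags a race. */
        #include <stdio.h>

        #define LOCK_A 0x1
        #define LOCK_B 0x2

        static unsigned candidate = ~0u;   /* candidate lockset for one shared variable */

        static void access_shared(unsigned held) {
            candidate &= held;             /* intersect with the locks currently held */
            if (candidate == 0)
                printf("potential data race: no common lock protects this variable\n");
        }

        int main(void) {
            access_shared(LOCK_A | LOCK_B);  /* driver entry point 1 holds A and B */
            access_shared(LOCK_A);           /* entry point 2 holds only A */
            access_shared(LOCK_B);           /* entry point 3 holds only B -> empty set */
            return 0;
        }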

    High-level synthesis of fine-grained weakly consistent C concurrency

    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal of software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared-memory synchronisation is typically serialised via locks to ensure correct memory behaviour. Locks are safety resources that ensure exclusive access to shared memory, eliminating data races and providing synchronisation guarantees for programmers. As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone, and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.
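
    To make the distinction between weak and SC atomics concrete, the generic C11 sketch below (not taken from the thesis or its benchmarks) shows release/acquire message passing, which imposes fewer ordering constraints than the default sequentially consistent accesses while still guaranteeing that the payload is visible once the flag is observed.

        /* Generic C11 illustration of weakly consistent atomics: a release store
         * paired with an acquire load synchronises the payload without requiring
         * full sequential consistency. */
        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        static int payload;              /* plain shared data */
        static atomic_int ready = 0;     /* flag guarding the payload */

        static void *producer(void *arg) {
            payload = 42;
            /* release: writes before this store become visible to an acquire load */
            atomic_store_explicit(&ready, 1, memory_order_release);
            return NULL;
        }

        static void *consumer(void *arg) {
            /* acquire: once the flag is seen, the payload write is guaranteed visible */
            while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
                ;  /* spin */
            printf("payload = %d\n", payload);
            return NULL;
        }

        int main(void) {
            pthread_t p, c;
            pthread_create(&c, NULL, consumer, NULL);
            pthread_create(&p, NULL, producer, NULL);
            pthread_join(p, NULL);
            pthread_join(c, NULL);
            return 0;
        }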

    CellSs : Scheduling techniques to better exploit memory hierarchy

    Cell Superscalar's (CellSs) main goal is to provide a simple, flexible and easy programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at the task level. The CellSs environment is based on a source-to-source compiler that translates annotated C or Fortran code and a runtime library tailored for the Cell/B.E. that takes care of the concurrent execution of the application. The first efforts for task scheduling in CellSs derived from very simple heuristics. This paper presents new scheduling techniques developed for CellSs with the purpose of improving application performance. Additionally, the design of a new scheduling algorithm is detailed and the algorithm is evaluated. The CellSs scheduler takes into account an extension of the memory hierarchy for the Cell/B.E., with a cache memory shared between the SPEs. All new scheduling techniques have been evaluated, showing better behaviour of our system.
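
    As a hedged illustration of the kind of locality-aware decision the paper describes (this is a hypothetical heuristic, not the actual CellSs scheduler), the C sketch below picks the ready task whose input blocks are most already resident in the SPE-shared cache.

        /* Hypothetical locality-aware heuristic in the spirit of the paper, not the
         * actual CellSs scheduler: among ready tasks, pick the one with the most
         * input blocks already resident in the shared cache. */
        #include <stdio.h>

        #define NTASKS  3
        #define NBLOCKS 8

        /* ready_inputs[t][b] == 1 if ready task t reads data block b */
        static const int ready_inputs[NTASKS][NBLOCKS] = {
            {1, 1, 0, 0, 0, 0, 0, 0},
            {0, 0, 1, 1, 1, 0, 0, 0},
            {0, 1, 0, 0, 0, 0, 1, 0},
        };

        /* in_cache[b] == 1 if block b currently lives in the shared cache */
        static const int in_cache[NBLOCKS] = {0, 1, 1, 1, 0, 0, 1, 0};

        int main(void) {
            int best = -1, best_hits = -1;
            for (int t = 0; t < NTASKS; t++) {
                int hits = 0;
                for (int b = 0; b < NBLOCKS; b++)
                    hits += ready_inputs[t][b] && in_cache[b];
                if (hits > best_hits) { best_hits = hits; best = t; }
            }
            printf("schedule task %d next (%d cached input blocks)\n", best, best_hits);
            return 0;
        }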

    Further sub-cycle and multi-cycle scheduling support for Bluespec Verilog

    Bluespec is a hardware description language where all behaviour is expressed in rules that execute atomically. The standard compilation semantics for Bluespec enforce a particular mapping between rule firing and hardware clock cycles, such as a register being updated by at most one rule firing in any clock cycle. Also, the standard compiler does not introduce any additional state, such as credit-based or round-robin arbiters, to guarantee fairness between rules over time. On the other hand, many useful hardware resources, such as complex ALUs and synchronous RAMs, are pipelined. Unlike typical high-level synthesis tools, in standard Bluespec such resources cannot be invoked using infix operators in expressions such as A[e] or e1*e2, since binding to specific instances and multi-clock-cycle schedules are required. In this paper we extend the reference semantics of Bluespec to decouple it from clock cycles, allowing multiple updates to a register within one clock cycle and automatic instantiation of arbiters for multi-clock-cycle behaviour. We describe the new semantic packing rules as extensions of our standard compilation rules, and we report early results from an open-source, fully-functional implementation.
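
    To give a feel for the fairness arbiters mentioned above, the hedged C sketch below models a generic round-robin arbiter in software (a hypothetical illustration, not the paper's implementation or generated hardware): each cycle it grants one of the rules whose guard is ready, starting the search just after the previous grant so every rule eventually gets a turn.

        /* Hypothetical software model of a round-robin arbiter, not the paper's
         * implementation: each cycle, grant one ready rule, starting the search
         * just after the last grant so no ready rule is starved. */
        #include <stdio.h>

        #define NRULES 4

        static int last_grant = NRULES - 1;

        /* returns the index of the granted rule, or -1 if nothing is ready */
        static int arbitrate(const int ready[NRULES]) {
            for (int i = 1; i <= NRULES; i++) {
                int r = (last_grant + i) % NRULES;
                if (ready[r]) { last_grant = r; return r; }
            }
            return -1;
        }

        int main(void) {
            int ready[NRULES] = {1, 1, 0, 1};   /* rules 0, 1 and 3 want to fire */
            for (int cycle = 0; cycle < 4; cycle++)
                printf("cycle %d: grant rule %d\n", cycle, arbitrate(ready));
            return 0;
        }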