
    Scalable framework for heterogeneous clustering of commodity FPGAs

    A combination of parallelism exploitation and application-specific hardware is increasingly being used to address the computational requirements of a diverse and extensive set of application areas. These targeted applications have specific computational requirements that often cannot be implemented optimally on general-purpose processors and have the potential to experience substantial speedup on dedicated hardware. While general parallelism has been exploited at various levels for decades, the advent of heterogeneous cluster computing has allowed applications to be accelerated by intelligently mapping computational tasks to well-suited hardware. This trend has continued with the use of dedicated ASIC and FPGA coprocessors to off-load particularly intensive computations. With the inclusion of embedded microprocessors into otherwise reconfigurable FPGA fabric, it has become feasible to construct a heterogeneous cluster composed of application-specific hardware resources that can be programmatically treated as fully functional and independent cluster nodes via a standard message passing interface. The contribution of this thesis is the development of such a framework for organizing heterogeneous reconfigurable FPGA computing elements into clusters that enable the development of complex systems delivering on the promise of parallel reconfigurable hardware. The framework includes a fully featured message passing interface implementation for seamless communication and synchronization among nodes running in an embedded Linux operating system environment while managing hardware accelerators through device driver abstractions and standard APIs. A set of application case studies deployed on a test platform of Xilinx Virtex-4 and Virtex-5 FPGAs demonstrates functionality, elucidates performance characteristics, and promotes future research and development efforts.
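
    As a rough illustration of the kind of message-passing program such cluster nodes are meant to run, the following minimal C/MPI sketch exchanges a value between two nodes; it assumes only the standard MPI API and is not code from the thesis.

    /* Minimal sketch: a point-to-point exchange of the kind a standard
       message passing interface enables between FPGA cluster nodes.
       Assumes a stock MPI installation; not code from the thesis. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, payload = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            payload = 42;  /* e.g. a work descriptor for a remote accelerator */
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node %d of %d received %d\n", rank, size, payload);
        }

        MPI_Finalize();
        return 0;
    }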

    Investigating data throughput and partial dynamic reconfiguration in a commodity FPGA cluster framework

    There are many computational kernels where parallelism can be exploited in application-specific hardware, yielding significant speedup over a general-purpose processor based solution. Commodity cluster computing technologies have been combined with FPGA coprocessors, resulting in even greater performance capability through the exploitation of multiple levels of parallelism. One particularly economical solution, both in terms of cost and power consumption, is to cluster hybrid FPGAs with commodity network interconnects. Hybrid FPGAs combine embedded microprocessors with reconfigurable hardware resources on a single chip, offering lower power consumption and cost compared to a traditional I/O bus FPGA coprocessor solution. While there is a lot of promise in using commodity hybrid FPGAs in a cluster configuration, the design flow and performance characteristics of such systems are currently a limiting factor to the range of applications that could benefit from such a system. The contribution of this thesis is a framework for clustering commodity FPGAs which integrates high-speed DMA data transfers with a flexible FPGA resource sharing scheme enabled through partial reconfiguration. The framework includes an embedded Linux operating system, with a custom device driver to manage data transfers and hardware reconfiguration. User space tools for cluster computing, including ssh and MPI, are deployed, allowing tasks to be split among nodes in the cluster. Performance analysis is performed with a homogeneous cluster composed of four Virtex-5 FXT based FPGA boards. The results demonstrate the advantages over previous work in terms of data throughput and reconfiguration, as well as promoting future research efforts.
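
    The abstract does not show the driver interface, but a user-space interaction with such a driver might look like the hypothetical sketch below; the device path, ioctl numbers, and request layout are invented for illustration and are not the thesis API.

    /* Hypothetical sketch: queue a DMA transfer to an accelerator, then
       request partial reconfiguration with a different bitstream.
       All names and ioctl codes here are assumptions, not the actual driver. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct dma_req {                 /* assumed request descriptor */
        const void   *buf;
        unsigned long len;
    };

    #define FPGA_IOC_DMA_WRITE _IOW('f', 1, struct dma_req)  /* assumed */
    #define FPGA_IOC_RECONFIG  _IOW('f', 2, char[64])        /* assumed */

    int main(void)
    {
        static int samples[1024];
        struct dma_req req = { samples, sizeof samples };
        int fd = open("/dev/fpga0", O_RDWR);          /* assumed device node */

        if (fd < 0) { perror("open"); return 1; }
        if (ioctl(fd, FPGA_IOC_DMA_WRITE, &req) < 0)        /* push input data */
            perror("dma write");
        if (ioctl(fd, FPGA_IOC_RECONFIG, "filter.bit") < 0) /* swap accelerator */
            perror("reconfig");

        close(fd);
        return 0;
    }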

    Description and Optimization of Abstract Machines in a Dialect of Prolog

    In order to achieve competitive performance, abstract machines for Prolog and related languages end up being large and intricate, and incorporate sophisticated optimizations, both at the design and at the implementation levels. At the same time, efficiency considerations make it necessary to use low-level languages in their implementation. This makes them laborious to code, optimize, and, especially, maintain and extend. Writing the abstract machine (and ancillary code) in a higher-level language can help tame this inherent complexity. We show how the semantics of most basic components of an efficient virtual machine for Prolog can be described using (a variant of) Prolog. These descriptions are then compiled to C and assembled to build a complete bytecode emulator. Thanks to the high level of the language used and its closeness to Prolog, the abstract machine description can be manipulated using standard Prolog compilation and optimization techniques with relative ease. We also show how, by applying program transformations selectively, we obtain abstract machine implementations whose performance can match and even exceed that of state-of-the-art, highly-tuned, hand-crafted emulators. Comment: 56 pages, 46 figures, 5 tables. To appear in Theory and Practice of Logic Programming (TPLP).
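
    For readers unfamiliar with bytecode emulators, the C that such a description ultimately compiles down to is essentially a dispatch loop over opcodes; the skeleton below is a generic, invented illustration of that shape, not the paper's generated code.

    /* Generic bytecode-dispatch skeleton; opcodes and handlers are invented
       for illustration and are far simpler than a real WAM-style emulator. */
    #include <stdint.h>
    #include <stdio.h>

    enum opcode { OP_PUSH, OP_ADD, OP_HALT };

    static int run(const uint8_t *pc)
    {
        int stack[64], sp = 0;
        for (;;) {
            switch (*pc++) {
            case OP_PUSH: stack[sp++] = *pc++;              break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_HALT: return stack[sp - 1];
            }
        }
    }

    int main(void)
    {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
        printf("%d\n", run(prog));   /* prints 5 */
        return 0;
    }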

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the application-level programmer and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
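
    The coarse-grained pattern described, one host thread per GPU device, can be sketched in plain C with POSIX threads as below; run_on_device() is a placeholder for the CUDA or OpenCL calls (device selection, data transfer, kernel launch) an actual application would make.

    /* Sketch of one-host-thread-per-GPU; run_on_device() stands in for the
       real CUDA/OpenCL work and is an assumption, not code from the paper. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_GPUS 2

    static void run_on_device(int device_id)
    {
        /* placeholder: select device_id and launch the data-parallel kernel */
        printf("host thread driving GPU %d\n", device_id);
    }

    static void *worker(void *arg)
    {
        run_on_device((int)(long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_GPUS];
        for (long i = 0; i < NUM_GPUS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_GPUS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }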

    PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy. At the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architecture support. PyDac has been implemented on both a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers.
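
    PyDac itself is Python-based and its API is not shown in the abstract; the C sketch below illustrates only the two-level divide-and-conquer idea it describes (split work recursively, compute leaf tasks imperatively) and is not PyDac code.

    /* Language-agnostic illustration of the divide-and-conquer task model:
       a range is split until it is small enough to run as an imperative
       leaf task.  Invented example, not the PyDac API. */
    #include <stdio.h>

    #define LEAF_SIZE 4   /* below this, run the task directly */

    static long leaf_sum(const int *a, int n)     /* low level: imperative task */
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    static long dac_sum(const int *a, int n)      /* high level: divide and conquer */
    {
        if (n <= LEAF_SIZE)
            return leaf_sum(a, n);
        /* in a PyDac-style runtime the two halves would become parallel tasks */
        return dac_sum(a, n / 2) + dac_sum(a + n / 2, n - n / 2);
    }

    int main(void)
    {
        int data[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        printf("%ld\n", dac_sum(data, 10));   /* prints 55 */
        return 0;
    }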

    A Light-Weight Virtual Machine Monitor for Blue Gene/P

    Versatile Object-oriented Real-Time Operating System

    Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 79-80). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. As computer software has become more complex in response to increasing demands and greater levels of abstraction, so have computer operating systems. In order to achieve the desired level of functionality, operating systems have become less flexible and overly complex. The additional complexity and abstraction introduced often leads to less efficient use of hardware and increased hardware requirements. In embedded systems with limited hardware resources, efficient resource use is extremely important to the system's functionality. Therefore, operating system functionality not useful for the embedded system's applications is detrimental to the system. Component-based software provides a way to achieve both the efficient application-specific functionality required in embedded systems and the ability to extend this functionality to other applications. This thesis presents a component-based operating system, VORTOS, the Versatile Object-oriented Real-Time Operating System. VORTOS uses a virtual machine to abstract the hardware, eliminating the need for further portability abstractions within the operating system and application-level components. The simple modular component architecture allows both the operating system and user applications to be extremely flexible by allowing them to utilize the particular components required, without sacrificing performance. by Rusty Lee. M.Eng. and S.B.
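
    As a generic illustration of what "component-based" means here, a component typically exports a small table of entry points that the rest of the system composes; the interface below is invented for illustration and is not VORTOS's actual API.

    /* Invented sketch of a component interface: the OS assembles a system
       from such tables, linking in only the components an application needs. */
    #include <stdio.h>

    struct os_component {
        const char *name;
        int  (*init)(void);        /* bring the component up   */
        void (*shutdown)(void);    /* take the component down  */
    };

    static int  timer_init(void)     { printf("timer up\n");   return 0; }
    static void timer_shutdown(void) { printf("timer down\n"); }

    static const struct os_component timer = { "timer", timer_init, timer_shutdown };

    int main(void)
    {
        if (timer.init() == 0)
            timer.shutdown();
        return 0;
    }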

    Cycle-accurate modeling of multicore processors on FPGAs

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (pages 169-176). We present a novel modeling methodology which enables the generation of a high-performance, cycle-accurate simulator from a cycle-level specification of the target design. We describe Arete, a full-system multicore processor simulator developed using our modeling methodology. We provide details on Arete's resource-efficient and high-performance implementation on multiple FPGA platforms, and the architectural experiments performed using it. We present clear evidence that the use of simplified models in architectural studies can lead to wrong conclusions. Through two experiments performed using both cycle-accurate and simplified models, we show that in one case there are substantial quantitative and qualitative differences in the results, while in the other the results match quite well. by Asif Imtiaz Khan. Ph.D.

    Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models

    The upcoming many-core architectures require software developers to exploit concurrency to utilize available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology to design instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to inform the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared-memory and one for non-shared-memory concurrency. From our experimental results, we derived a list of requirements for a full-grown experimental environment for further research.
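
    A hypothetical flavor of such an instruction set extension is sketched below; the opcodes and their semantics are invented to illustrate the idea of first-class concurrency operations in a VM and are not the extensions evaluated in the paper.

    /* Invented opcodes for shared-memory (atomic update) and non-shared-memory
       (message send/receive) concurrency, plus a trivial dispatch stub. */
    #include <stdio.h>

    enum vm_opcode {
        OP_HALT,
        OP_SPAWN,       /* start a new activity                          */
        OP_SEND,        /* non-shared-memory model: post a message       */
        OP_RECEIVE,     /* non-shared-memory model: wait for a message   */
        OP_ATOMIC_ADD   /* shared-memory model: atomic read-modify-write */
    };

    static void execute(const enum vm_opcode *pc)
    {
        for (; *pc != OP_HALT; pc++) {
            switch (*pc) {
            case OP_SPAWN:      printf("spawn activity\n");  break;
            case OP_SEND:       printf("send message\n");    break;
            case OP_RECEIVE:    printf("receive message\n"); break;
            case OP_ATOMIC_ADD: printf("atomic add\n");      break;
            default:                                         break;
            }
        }
    }

    int main(void)
    {
        const enum vm_opcode prog[] = { OP_SPAWN, OP_SEND, OP_RECEIVE, OP_HALT };
        execute(prog);
        return 0;
    }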