584 research outputs found

    Custom-Enabled System Architectures for High End Computing

    The US Federal Government has convened a major committee to determine future directions for government-sponsored high end computing system acquisitions and enabling research. The High End Computing Revitalization Task Force (HECRTF) was inaugurated in 2003, involving all Federal agencies for which high end computing is critical to meeting mission goals. As part of the HECRTF agenda, a multi-day, community-wide workshop was conducted involving experts from academia, industry, and the national laboratories and centers to provide the broadest perspective on important issues related to the HECRTF purview. Among the most critical issues in establishing future directions are the relative merits of commodity-based systems, such as clusters and MPPs, versus custom system architecture strategies. This paper presents a perspective on the importance and value of the custom architecture approach in meeting future US requirements in supercomputing. The contents of this paper reflect the ideas of the participants of the working group chartered to explore custom-enabled system architectures for high end computing. As in any such consensus presentation, while this paper captures the key ideas and tradeoffs, it does not exactly match the viewpoint of any single contributor, and there remains much room for constructive disagreement and refinement of the essential conclusions.

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration, with particular reference to the research needs of NASA/LaRC (NASA Langley Research Center) for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for the architectural design of computing systems meant for planetary missions were also identified.

    A Memory-Centric Customizable Domain-Specific FPGA Overlay for Accelerating Machine Learning Applications

    Low-latency inferencing is of paramount importance to a wide range of real-time and user-facing Machine Learning (ML) applications. Field Programmable Gate Arrays (FPGAs) offer unique advantages in delivering low-latency, energy-efficient accelerators for inferencing. Unfortunately, creating machine learning accelerators in FPGAs is not easy, requiring the use of vendor-specific CAD tools and low-level digital and hardware microarchitecture design knowledge that the majority of ML researchers do not possess. The continued refinement of High Level Synthesis (HLS) tools can reduce but not eliminate the need for hardware-specific design knowledge. The designs these tools produce can also use FPGA resources inefficiently, which ultimately limits the performance of the neural network. This research investigated a new FPGA-based, software-hardware codesigned overlay architecture that opens the advantages of FPGAs to the broader ML user community. As an overlay, the proposed design allows rapid coding and deployment of different ML network configurations and different data widths, eliminating the prior barrier of needing to resynthesize each design. It also makes code portable across different FPGA families. The proposed overlay design is a Single-Instruction-Multiple-Data (SIMD) Processor-In-Memory (PIM) architecture developed as a programmable overlay for FPGAs. In contrast to point designs, it can be programmed to implement different types of machine learning algorithms. The overlay architecture integrates bit-serial Arithmetic Logic Units (ALUs) with distributed Block RAMs (BRAMs). The PIM design increases the size of arithmetic operations and on-chip storage capacity. User-visible inference latencies are reduced by exploiting concurrent accesses to network parameters (weights and biases) and partial results stored throughout the distributed BRAMs.
    Run-time performance comparisons show that the proposed design achieves a speedup compared to HLS-based or custom-tuned equivalent designs. Notably, the proposed design is programmable, allowing rapid design space exploration without the need to resynthesize when changing ML algorithms on the FPGA.
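
    As a rough illustration of the bit-serial SIMD idea mentioned in the abstract above, the following Python sketch adds two vectors one bit position at a time, with every lane receiving the same operation and keeping its own carry. It is not the overlay's actual design: the function name `bit_serial_add`, the list-based lanes, and the unsigned wrap-around behavior are assumptions made only for this example.

```python
def bit_serial_add(a_vals, b_vals, width):
    """Lane-wise unsigned addition, processing one bit position per step.

    Hypothetical model: each list element stands for one SIMD lane, and
    `width` is the configurable data width in bits.
    """
    if len(a_vals) != len(b_vals):
        raise ValueError("operand vectors must have the same number of lanes")

    lanes = len(a_vals)
    carry = [0] * lanes   # one carry bit per lane, kept between steps
    result = [0] * lanes

    for bit in range(width):           # one step per bit position
        for lane in range(lanes):      # in hardware, all lanes would run at once
            a = (a_vals[lane] >> bit) & 1
            b = (b_vals[lane] >> bit) & 1
            c = carry[lane]
            result[lane] |= (a ^ b ^ c) << bit     # full-adder sum bit
            carry[lane] = (a & b) | (c & (a ^ b))  # full-adder carry out

    return result  # values wrap modulo 2**width


print(bit_serial_add([3, 5, 255], [4, 6, 1], width=8))
```

    In this model the number of steps grows with `width` rather than with the number of lanes, which is one way to picture how a single design could serve different data widths without being rebuilt.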

    An empirical evaluation of High-Level Synthesis languages and tools for database acceleration

    High Level Synthesis (HLS) languages and tools are emerging as the most promising technique to make FPGAs more accessible to software developers. Nevertheless, picking the most suitable HLS framework for a certain class of algorithms depends on requirements such as area and throughput, as well as on programmer experience. In this paper, we explore the different trade-offs present when using a representative set of HLS tools in the context of Database Management Systems (DBMS) acceleration. More specifically, we conduct an empirical analysis of four representative frameworks (Bluespec SystemVerilog, Altera OpenCL, LegUp and Chisel) that we utilize to accelerate commonly used database algorithms such as sorting, the median operator, and hash joins. Through our implementation experience and empirical results for database acceleration, we conclude that the selection of the most suitable HLS framework depends on a set of orthogonal characteristics, which we highlight for each HLS framework. Peer Reviewed. Postprint (author’s final draft).