    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 202
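
The "seven nested loops" and "loop blocking" mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: the loop names (B, K, C, Y, X, R, S), the helper functions, and the tile sizes `tk` and `tc` are all assumptions chosen for illustration, and the code is plain Python rather than Halide. It shows only that reordering and blocking the loops changes the traversal order, and hence the locality, without changing the result.

```python
# Illustrative sketch only: names, shapes, and tile sizes are assumptions.
# dims = (B, K, C, Y, X, R, S): batch, output channels, input channels,
# output rows, output columns, filter rows, filter columns.

def make_output(B, K, Y, X):
    """Allocate a zero-filled output tensor indexed as out[b][k][y][x]."""
    return [[[[0] * X for _ in range(Y)] for _ in range(K)] for _ in range(B)]

def conv_naive(inp, wgt, dims):
    """Direct convolution written as seven nested loops."""
    B, K, C, Y, X, R, S = dims
    out = make_output(B, K, Y, X)
    for b in range(B):
        for k in range(K):
            for c in range(C):
                for y in range(Y):
                    for x in range(X):
                        for r in range(R):
                            for s in range(S):
                                out[b][k][y][x] += inp[b][c][y + r][x + s] * wgt[k][c][r][s]
    return out

def conv_blocked(inp, wgt, dims, tk=2, tc=2):
    """Same computation with the K and C loops blocked into tiles.

    The two outer loops walk over tiles; the inner loops stay within one
    tile, so the weights for that tile are reused before moving on.
    """
    B, K, C, Y, X, R, S = dims
    out = make_output(B, K, Y, X)
    for k0 in range(0, K, tk):              # tile loop over output channels
        for c0 in range(0, C, tc):          # tile loop over input channels
            for b in range(B):
                for k in range(k0, min(k0 + tk, K)):
                    for c in range(c0, min(c0 + tc, C)):
                        for y in range(Y):
                            for x in range(X):
                                for r in range(R):
                                    for s in range(S):
                                        out[b][k][y][x] += inp[b][c][y + r][x + s] * wgt[k][c][r][s]
    return out

def demo_tensors(dims):
    """Build small deterministic integer tensors for the given dims."""
    B, K, C, Y, X, R, S = dims
    inp = [[[[(b + 2 * c + 3 * i + 5 * j) % 7 for j in range(X + S - 1)]
             for i in range(Y + R - 1)] for c in range(C)] for b in range(B)]
    wgt = [[[[(k + c + r + s) % 3 - 1 for s in range(S)]
             for r in range(R)] for c in range(C)] for k in range(K)]
    return inp, wgt
```

Calling `conv_blocked(inp, wgt, dims, tk=2, tc=3)` on tensors from `demo_tensors(dims)` should give the same output as `conv_naive(inp, wgt, dims)`. Integer data is used so that the comparison is exact regardless of accumulation order.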

    Advances in computational modelling for personalised medicine after myocardial infarction

    Myocardial infarction (MI) is a leading cause of premature morbidity and mortality worldwide. Determining which patients will experience heart failure and sudden cardiac death after an acute MI is notoriously difficult for clinicians. The extent of heart damage after an acute MI is informed by cardiac imaging, typically using echocardiography or, sometimes, cardiac magnetic resonance (CMR). These scans provide complex data sets that are only partially exploited by clinicians in daily practice, implying potential for improved risk assessment. Computational modelling of left ventricular (LV) function can bridge the gap towards personalised medicine using cardiac imaging in post-MI patients. Several novel biomechanical parameters have theoretical prognostic value and may be useful to reflect the biomechanical effects of novel preventive therapy for adverse remodelling post-MI. These parameters include myocardial contractility (regional and global), stiffness and stress. Further, the parameters can be delineated spatially to correspond with infarct pathology and the remote zone. While these parameters hold promise, there are challenges for translating MI modelling into clinical practice, including model uncertainty, validation and verification, as well as time-efficient processing. More research is needed to (1) simplify imaging with CMR in post-MI patients, while preserving diagnostic accuracy and patient tolerance, and (2) assess and validate novel biomechanical parameters against established prognostic biomarkers, such as LV ejection fraction and infarct size. Accessible software packages with minimal user interaction are also needed. Translating benefits to patients will be achieved through a multidisciplinary approach including clinicians, mathematicians, statisticians and industry partners.

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS.

    A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment
