42 research outputs found

    OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

    Full text link
    There is interest in exploring hybrid OpenSHMEM + X programming models to extend the applicability of the OpenSHMEM interface to more hardware architectures. We present a hybrid OpenCL + OpenSHMEM programming model for device-level programming of architectures like the Adapteva Epiphany many-core RISC array processor. The Epiphany architecture comprises a 2D array of low-power RISC cores with minimal uncore functionality connected by a 2D mesh Network-on-Chip (NoC). It offers high computational energy efficiency for integer and floating-point calculations as well as parallel scalability. The Epiphany-III is available as a coprocessor in platforms that also utilize an ARM CPU host. OpenCL provides good functionality for a co-design programming model in which the host CPU offloads parallel work to a coprocessor. However, the OpenCL memory model is inconsistent with the Epiphany memory architecture and lacks support for inter-core communication. We propose a hybrid programming model in which OpenSHMEM provides a better solution by replacing the non-standard OpenCL extensions previously introduced to achieve high performance on the Epiphany architecture. We demonstrate the proposed programming model with a matrix-matrix multiplication benchmark based on Cannon's algorithm, showing that the hybrid model addresses the deficiencies of using OpenCL alone to achieve good benchmark performance. Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologies.
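
    As a rough illustration of the device-side programming style such a hybrid model targets, the sketch below expresses the per-core Cannon's-algorithm loop with standard OpenSHMEM calls (shmem_putmem, shmem_barrier_all). The 4×4 core grid, the 16×16 tile size, and all variable names are placeholder assumptions; they are not taken from the paper or its Epiphany implementation.

```c
/* Illustrative sketch: per-core Cannon's-algorithm loop over OpenSHMEM.
 * Grid size P, tile edge N, and the initial-skew step (omitted) are
 * assumptions, not the paper's implementation. */
#include <shmem.h>
#include <string.h>

#define N 16                       /* assumed per-core tile edge       */
#define P 4                        /* assumed 4x4 grid of 16 PEs/cores */

/* Symmetric tiles: remotely accessible, resident in each core's local RAM */
static float A[N*N], B[N*N], C[N*N];
static float A_next[N*N], B_next[N*N];

static void tile_mac(float *c, const float *a, const float *b)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i*N + j] += a[i*N + k] * b[k*N + j];
}

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int row  = me / P, col = me % P;
    int left = row * P + (col + P - 1) % P;    /* A tiles shift left */
    int up   = ((row + P - 1) % P) * P + col;  /* B tiles shift up   */

    /* ... initial skew of A and B omitted for brevity ... */

    for (int step = 0; step < P; step++) {
        tile_mac(C, A, B);                     /* local multiply-accumulate */

        /* Push local tiles to the neighbouring cores over the on-chip NoC */
        shmem_putmem(A_next, A, sizeof(A), left);
        shmem_putmem(B_next, B, sizeof(B), up);
        shmem_barrier_all();                   /* puts complete everywhere  */

        memcpy(A, A_next, sizeof(A));
        memcpy(B, B_next, sizeof(B));
        shmem_barrier_all();                   /* safe to overwrite *_next  */
    }

    shmem_finalize();
    return 0;
}
```

    In the hybrid model described above, the OpenCL side running on the ARM host would be responsible for launching a device kernel of roughly this shape on the Epiphany coprocessor, while OpenSHMEM handles the inter-core tile exchanges.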

    An integrated general practice and pharmacy-based intervention to promote the use of appropriate preventive medications among individuals at high cardiovascular disease risk: protocol for a cluster randomized controlled trial

    Get PDF
    Background: Cardiovascular diseases (CVD) are responsible for significant morbidity, premature mortality, and economic burden. Despite established evidence that supports the use of preventive medications among patients at high CVD risk, treatment gaps remain. Building on prior evidence and a theoretical framework, a complex intervention has been designed to address these gaps among high-risk, under-treated patients in the Australian primary care setting. This intervention comprises a general practice quality improvement tool incorporating clinical decision support and audit/feedback capabilities; availability of a range of CVD polypills (fixed-dose combinations of two blood pressure lowering agents, a statin ± aspirin) for prescription when appropriate; and access to a pharmacy-based program to support long-term medication adherence and lifestyle modification. Methods: Following a systematic development process, the intervention will be evaluated in a pragmatic cluster randomized controlled trial including 70 general practices for a median period of 18 months. The 35 general practices in the intervention group will work with a nominated partner pharmacy, whereas those in the control group will provide usual care without access to the intervention tools. The primary outcome is the proportion of patients at high CVD risk who were inadequately treated at baseline who achieve target blood pressure (BP) and low-density lipoprotein cholesterol (LDL-C) levels at the study end. The outcomes will be analyzed using data from electronic medical records, utilizing a validated extraction tool. Detailed process and economic evaluations will also be performed. Discussion: The study intends to establish evidence about an intervention that combines technological innovation with team collaboration between patients, pharmacists, and general practitioners (GPs) for CVD prevention. Trial registration: Australian New Zealand Clinical Trials Registry ACTRN1261600023342

    Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models

    Get PDF
    The upcoming many-core architectures require software developers to exploit concurrency to utilize the available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstractions for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose on VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology for designing instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to inform the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared-memory and one for non-shared-memory concurrency. From our experimental results, we derived a list of requirements for a full-fledged experimental environment for further research.
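
    To make the idea of instruction-level concurrency support concrete, the toy stack-machine dispatch loop below treats message passing as bytecode operations (OP_SEND, OP_RECV) rather than library calls, so that the VM, not the guest program, decides how they map onto the platform. The opcodes, the one-slot channel, and the pthread backing are invented for illustration and do not reflect the paper's actual instruction set extensions.

```c
/* Illustrative toy only: a stack-machine dispatch loop whose instruction set
 * includes message-passing opcodes. Opcode names, the one-slot channel, and
 * the pthread backing are assumptions, not the extensions from the paper. */
#include <pthread.h>
#include <stdio.h>

typedef enum { OP_PUSH, OP_ADD, OP_SEND, OP_RECV, OP_PRINT, OP_HALT } opcode;
typedef struct { int code[16]; } program;

/* One-slot channel standing in for the VM's non-shared-memory primitive */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int slot, slot_full;

static void chan_send(int v)
{
    pthread_mutex_lock(&lock);
    while (slot_full) pthread_cond_wait(&cond, &lock);
    slot = v; slot_full = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

static int chan_recv(void)
{
    pthread_mutex_lock(&lock);
    while (!slot_full) pthread_cond_wait(&cond, &lock);
    int v = slot; slot_full = 0;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return v;
}

/* Because SEND/RECV are instructions, the VM chooses how they are realized
 * (threads here, hardware channels or mailboxes on a many-core chip). */
static void *run(void *arg)
{
    const int *pc = ((program *)arg)->code;
    int stack[8], sp = 0;
    for (;;) {
        switch (*pc++) {
        case OP_PUSH:  stack[sp++] = *pc++;            break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_SEND:  chan_send(stack[--sp]);         break;
        case OP_RECV:  stack[sp++] = chan_recv();      break;
        case OP_PRINT: printf("%d\n", stack[--sp]);    break;
        case OP_HALT:  return NULL;
        }
    }
}

int main(void)
{
    program producer = {{ OP_PUSH, 20, OP_PUSH, 22, OP_ADD, OP_SEND, OP_HALT }};
    program consumer = {{ OP_RECV, OP_PRINT, OP_HALT }};
    pthread_t t;
    pthread_create(&t, NULL, run, &producer);
    run(&consumer);                 /* prints 42 */
    pthread_join(t, NULL);
    return 0;
}
```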

    Tiny but Mighty: Designing and Realizing Scalable Latency Tolerance for Manycore SoCs

    Get PDF
    Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytics applications where indirect memory accesses (IMAs) challenge the memory hierarchy. Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification. We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls and enabling greater memory-level parallelism (MLP). In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications under full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques, respectively.
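
    The decoupling idea can be illustrated in software with two threads: an "access" side that streams the irregular gathers data[index[i]] into a bounded queue, and a "compute" side that consumes already-fetched values. In the paper's design the access side is played by MAPLE in hardware behind the NoC; the queue, array names, sizes, and pthread realization below are assumptions for illustration only.

```c
/* Software analogy of decoupled access/execute for indirect memory accesses.
 * All names and sizes are illustrative assumptions, not the MAPLE API. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N    (1 << 20)
#define QCAP 1024                        /* bounded in-flight window */

static int    index_arr[N];
static double data_arr[N];

static double queue[QCAP];
static long   head, tail;                /* monotonic counters, mutex-protected */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void q_push(double v)
{
    pthread_mutex_lock(&m);
    while (tail - head == QCAP) pthread_cond_wait(&c, &m);
    queue[tail++ % QCAP] = v;
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}

static double q_pop(void)
{
    pthread_mutex_lock(&m);
    while (tail == head) pthread_cond_wait(&c, &m);
    double v = queue[head++ % QCAP];
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
    return v;
}

/* Access side: issues the latency-bound gathers and runs ahead of compute */
static void *access_side(void *arg)
{
    (void)arg;
    for (long i = 0; i < N; i++)
        q_push(data_arr[index_arr[i]]);
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++) {
        index_arr[i] = rand() % N;       /* irregular access pattern */
        data_arr[i]  = 1.0;
    }

    pthread_t t;
    pthread_create(&t, NULL, access_side, NULL);

    /* Compute side: consumes values that have already been fetched */
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += q_pop();
    pthread_join(t, NULL);

    printf("sum = %.1f\n", sum);         /* expect N * 1.0 */
    return 0;
}
```

    A mutex-based software queue like this would not beat the baselines reported in the abstract; the point is only the division of labour that MAPLE moves into hardware reachable over the NoC.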

    Architecting Dependable Many-Core Processors Using Core-Level Dynamic Redundancy

    No full text

    A Scalability-Aware Kernel Executive for Many-Core Operating Systems

    No full text

    Scaling OpenSHMEM for Massively Parallel Processor Arrays

    No full text