OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture
There is interest in exploring hybrid OpenSHMEM + X programming models to
extend the applicability of the OpenSHMEM interface to more hardware
architectures. We present a hybrid OpenCL + OpenSHMEM programming model for
device-level programming for architectures like the Adapteva Epiphany many-core
RISC array processor. The Epiphany architecture comprises a 2D array of
low-power RISC cores with minimal uncore functionality connected by a 2D mesh
Network-on-Chip (NoC). The Epiphany architecture offers high computational
energy efficiency for integer and floating point calculations as well as
parallel scalability. The Epiphany-III is available as a coprocessor in
platforms that also utilize an ARM CPU host. OpenCL provides good functionality
for supporting a co-design programming model in which the host CPU offloads
parallel work to a coprocessor. However, the OpenCL memory model is
inconsistent with the Epiphany memory architecture and lacks support for
inter-core communication. We propose a hybrid programming model in which
OpenSHMEM replaces the non-standard OpenCL extensions that were previously
introduced to achieve high performance on the Epiphany architecture. We
demonstrate the proposed programming model for matrix-matrix
multiplication based on Cannon's algorithm, showing that the hybrid model
addresses the deficiencies of using OpenCL alone and achieves good benchmark
performance.
Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third Workshop on OpenSHMEM and
Related Technologies
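The abstract above builds its benchmark on Cannon's algorithm, which distributes an n×n matrix product over a p×p grid of cores using only nearest-neighbor circular shifts, a natural fit for a 2D mesh NoC like Epiphany's. As a minimal illustrative sketch (not the paper's OpenCL + OpenSHMEM implementation), the grid can be simulated serially in Python with NumPy; on real hardware each block would live in one core's local memory and the shifts would be inter-core transfers.

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Multiply A @ B via Cannon's algorithm on a simulated p x p core grid.

    Executed serially here for illustration; in the hybrid model each block
    resides in one core's local memory and shifts become NoC transfers.
    """
    n = A.shape[0]
    assert n % p == 0, "matrix order must be divisible by grid size"
    b = n // p
    # Partition A, B into p x p blocks; C accumulates the result blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # Initial skew: shift row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        # Each grid position multiplies its currently resident blocks.
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Circular shift: A blocks move left by one, B blocks move up by one.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)
```

After the initial skew, position (i, j) holds A block (i, (i+j) mod p) and B block ((i+j) mod p, j), so the p shift-and-multiply rounds accumulate every term of C_ij = Σ_k A_ik B_kj while each core communicates only with its immediate neighbors.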
An integrated general practice and pharmacy-based intervention to promote the use of appropriate preventive medications among individuals at high cardiovascular disease risk: protocol for a cluster randomized controlled trial
Background: Cardiovascular diseases (CVD) are responsible for significant morbidity, premature mortality, and economic burden. Despite established evidence that supports the use of preventive medications among patients at high CVD risk, treatment gaps remain. Building on prior evidence and a theoretical framework, a complex intervention has been designed to address these gaps among high-risk, under-treated patients in the Australian primary care setting. This intervention comprises a general practice quality improvement tool incorporating clinical decision support and audit/feedback capabilities; availability of a range of CVD polypills (fixed-dose combinations of two blood pressure lowering agents, a statin ± aspirin) for prescription when appropriate; and access to a pharmacy-based program to support long-term medication adherence and lifestyle modification.
Methods: Following a systematic development process, the intervention will be evaluated in a pragmatic cluster randomized controlled trial including 70 general practices for a median period of 18 months. The 35 general practices in the intervention group will work with a nominated partner pharmacy, whereas those in the control group will provide usual care without access to the intervention tools. The primary outcome is the proportion of patients at high CVD risk who were inadequately treated at baseline who achieve target blood pressure (BP) and low-density lipoprotein cholesterol (LDL-C) levels at the study end. The outcomes will be analyzed using data from electronic medical records, utilizing a validated extraction tool. Detailed process and economic evaluations will also be performed.
Discussion: The study intends to establish evidence about an intervention that combines technological innovation with team collaboration between patients, pharmacists, and general practitioners (GPs) for CVD prevention.
Trial registration: Australian New Zealand Clinical Trials Registry ACTRN1261600023342
Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
inform the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a full-grown experimental
environment for further research.
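The idea of lifting concurrency operations into a VM instruction set can be illustrated with a toy bytecode interpreter. The opcodes below (SPAWN, SEND, RECV) are hypothetical names invented for this sketch, not the instruction-set extensions the abstract describes; the point is only that task creation and message passing appear as first-class VM instructions rather than library calls.

```python
import queue
import threading

def run(program, pc=0, chan=None, stack=None):
    """Tiny stack-machine interpreter whose instruction set includes
    concurrency ops: SPAWN starts a new interpreter thread at a target pc,
    SEND/RECV pass values over a channel (a non-shared-memory style op)."""
    stack = [] if stack is None else stack
    threads = []
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "SPAWN":
            # The VM itself, not a library, creates the concurrent task.
            t = threading.Thread(target=run, args=(program, arg, chan))
            t.start()
            threads.append(t)
        elif op == "SEND":
            chan.put(stack.pop())
        elif op == "RECV":
            stack.append(chan.get())
        elif op == "HALT":
            for t in threads:
                t.join()
            return stack

# Program: the spawned task computes 2 + 3 and sends the result back.
prog = [
    ("SPAWN", 3),    # 0: start a child interpreter at pc 3
    ("RECV", None),  # 1: block until the child's result arrives
    ("HALT", None),  # 2: join children and return the stack
    ("PUSH", 2),     # 3: child task starts here
    ("PUSH", 3),
    ("ADD", None),
    ("SEND", None),
    ("HALT", None),
]
result = run(prog, chan=queue.Queue())  # result == [5]
```

Exposing such operations at the instruction-set level is what lets the VM, rather than each language runtime, choose the concrete mapping (threads, actors, event loops) for an abstract concurrency model.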
Tiny but Mighty: Designing and Realizing Scalable Latency Tolerance for Manycore SoCs
Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy.
Decades of prior art have proposed hardware and software mechanisms to mitigate IMA latency, but they fail to analyze real-chip considerations, especially when used in SoCs and manycores. In this paper, we revisit many of these techniques while taking into account manycore integration and verification.
We present the first system implementation of latency tolerance hardware that provides significant speedups without requiring any memory hierarchy or processor tile modifications. This is achieved through a Memory Access Parallel-Load Engine (MAPLE), integrated through the Network-on-Chip (NoC) in a scalable manner. Our hardware-software co-design allows programs to perform long-latency memory accesses asynchronously from the core, avoiding pipeline stalls and enabling greater memory-level parallelism (MLP).
In April 2021 we taped out a manycore chip that includes tens of MAPLE instances for efficient data supply. MAPLE demonstrates a full RTL implementation of out-of-core latency-mitigation hardware, with virtual memory support and automated compilation targeting it. This paper evaluates MAPLE integrated with a dual-core FPGA prototype running applications with full SMP Linux, and demonstrates geomean speedups of 2.35× and 2.27× over software-based prefetching and decoupling, respectively. Compared to state-of-the-art hardware, it provides geomean speedups of 1.82× and 1.72× over prefetching and decoupling techniques.
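The decoupling idea behind MAPLE, separating the issuing of long-latency indirect loads from the compute that consumes them, can be sketched in software. The sketch below is a rough analogy only: Python threads stand in for MAPLE's NoC-attached load engine, and a bounded queue stands in for its hardware buffer; the function and parameter names are invented for this example.

```python
import queue
import threading

def gather_sum_decoupled(data, indices, depth=16):
    """Decoupled access/execute sketch for an indirect-access kernel
    (sum of data[indices[i]]): an 'access' thread streams loaded values
    through a bounded queue while the 'execute' loop consumes them.

    The queue depth plays the role of the hardware buffer that determines
    how much memory latency can be tolerated.
    """
    q = queue.Queue(maxsize=depth)

    def access():
        for i in indices:
            q.put(data[i])   # the long-latency indirect load, issued early
        q.put(None)          # end-of-stream marker

    threading.Thread(target=access, daemon=True).start()

    total = 0
    while (v := q.get()) is not None:
        total += v           # compute proceeds as each value arrives
    return total
```

In hardware the access side runs far ahead of the core, so many loads are in flight at once; that overlap, rather than any per-load speedup, is the source of the memory-level parallelism the abstract describes.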
