105 research outputs found
Recommended from our members
A SIMD architecture for hard real-time systems
Emerging safety-critical systems require high-performance data-parallel architectures and, problematically, ones that can guarantee tight and safe worst-case execution times. Given the complexity of existing architectures like GPUs, it is unlikely that sufficiently accurate models and algorithms for timing analysis will emerge in the foreseeable future. This motivates a clean-slate approach to designing a real-time data-parallel architecture.
In this work I present Sim-D: a wide-SIMD architecture for hard real-time systems. Similar to GPUs, Sim-D performs hardware strip-mining to schedule the work for a compute kernel in entities called work-groups. Sim-D schedules the work for each work-group as a sequence of uninterruptible access- and execute program phases, interleaving the phases of two work-groups. By providing performance isolation between the memory- and compute resources, the execution time of each phase can be tightly bound through static analysis.
I present a predictable closed-page DRAM controller that processes requests for large 1D- and 2D blocks of data, as well as indirect indexed transfers. These large transfers coalesce the data requests of a whole work-group. For a linear 4KiB transfer over a 64-bit data bus, the utilisation provably exceeds 78% for DDR4-3200AA DRAM. For 2D blocks, a well-chosen tiling configuration can achieve near-similar efficiency. I show that bounds on the execution time of indexed transfers are pessimistic by nature, but propose a novel snoopy indexed transfer mechanism that permits more reasonable bounds when the buffer size is limited.
Finally, I present a worst-case execution time calculation algorithm for Sim-D. This algorithm is paired with two hardware work-group scheduling policies that deterministically reduce run-time variance. The worst-case execution time analysis algorithm combines static control flow analysis with a simulation-based cost model for execution and DRAM transfers. Its key novelty is the addition of a stage that considers work-group scheduling effects. I show that the work-group scheduling policies degrade performance on average by 8.9%, but permit the calculation of worst-case execution time bounds that are tight within 14.3% on average for benchmarks that avoid inefficient indexed transfers
Automated Compilation Framework for Scratchpad-based Real-Time Systems
ScratchPad Memory (SPM) is highly adopted in real-time systems as it exhibits a predictable behaviour. SPM is software-managed by explicitly inserting instructions to move code and data transfers between the SPM and the main memory. However, it is a tedious job to decide how to manage the SPM and to manually modify the code to insert memory transfers. Hence, an automated compilation tool is essential to efficiently utilize the SPM. Another key problem with SPM is the latency suffered by the system due to memory transfers. Hiding this latency is important for high-performance systems. In this thesis, we address the problems of managing SPM and reducing the impact of memory latency. To realize the automation of our work, we develop a compilation framework based on the LLVM compiler to analyze and transform the program code. We exploit our framework to improve the performance of the execution of single and multi-tasks in real-time systems. For the single task execution, Worst-Case Execution Time (WCET) is of great importance to assure correct and safe behaviour of the system. So, we propose a WCET-driven allocation technique for data SPM that employs software prefetching to efficiently manage the SPM and to overlap the memory transfer and the task execution in a predictable way. On the other hand, multi-tasking requires the system to be schedulable such that all the tasks can meet their timing requirements. However, executing multiple tasks on a multi-processor platform suffers from the contention of the accesses to the shared main memory. To avoid the contention, several scheduling techniques adopted the 3-phase execution model which executes the task as a sequence of memory and computation phases. This provides the means to avoid the contention as well as to hide the memory latency by using a Direct Memory Access (DMA) engine. Executing memory transfers using the DMA allows overlapping the memory transfers with the computations on the processor. Using the 3-phase model in systems with limited sizes of local SPM may necessitate a segmentation of the task. Automating the segmentation process is necessary especially for systems with large task sets. Hence, we propose a set of efficient segmentation algorithms that follow the 3-phase execution model. The application of these algorithms shows a significant improvement in the system schedulability. For our segmentation algorithms to be more applicable, we extend the 3-phase model to allow programs with multiple paths represented as conditional Directed Acyclic Graphs (DAGs), unlike the previous works that targeted sequential programs. We also introduce a multi-steaming model to exploit the benefits of prefetching by overlapping the memory and computation phases of the same task, which was not allowed in the previous approaches. By combining the automated compilation with the proposed algorithms, we are able to achieve our goal to efficiently manage data SPM in real-time systems
FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
Attention mechanisms form the backbone of state-of-the-art machine learning
models for a variety of tasks. Deploying them on deep neural network (DNN)
accelerators, however, is prohibitively challenging especially under long
sequences, as this work identifies. This is due to operators in attention
layers exhibiting limited reuse opportunities and quadratic growth in memory
footprint, leading to severe memory-boundedness. To address this, we introduce
a new attention-tailored dataflow, termed FLAT, which identifies fusion
opportunities within the attention layer, and implements an on-chip
memory-aware interleaved execution and tiling mechanism. FLAT increases the
effective memory bandwidth by efficiently utilizing the high-bandwidth,
low-capacity on-chip buffer and thus achieves better run time and compute
resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup
and 49% and 42% of energy reduction comparing to baseline execution over
state-of-the-art edge and cloud accelerators
A Survey on Cache Management Mechanisms for Real-Time Embedded Systems
© ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, {48, 2, (November 2015)} http://doi.acm.org/10.1145/2830555Multicore processors are being extensively used by real-time systems, mainly because of their demand for
increased computing power. However, multicore processors have shared resources that affect the predictability
of real-time systems, which is the key to correctly estimate the worst-case execution time of tasks. One of
the main factors for unpredictability in a multicore processor is the cache memory hierarchy. Recently, many
research works have proposed different techniques to deal with caches in multicore processors in the context
of real-time systems. Nevertheless, a review and categorization of these techniques is still an open topic and
would be very useful for the real-time community. In this article, we present a survey of cache management
techniques for real-time embedded systems, from the first studies of the field in 1990 up to the latest research
published in 2014. We categorize the main research works and provide a detailed comparison in terms of
similarities and differences. We also identify key challenges and discuss future research directions.King Saud University
NSER
A novel access pattern-based multi-core memory architecture
Increasingly High-Performance Computing (HPC) applications run on heterogeneous multi-core platforms. The basic reason of the growing popularity of these architectures is their low power consumption, and high throughput oriented nature. However, this throughput imposes a requirement on the data to be supplied in a high throughput manner for the multi-core system.
This results in the necessity of an efficient management of on-chip and off-chip memory data transfers, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming widely dominant for a range of application domains including the scientific, image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management can exploit.
The software based approaches using general purpose caches and on-chip memories are beneficial to some extent. However, the task of efficient data management for the throughput oriented devices could be improved by providing hardware mechanisms that exploit the knowledge of access patterns in memory management and scheduling of accesses for a heterogeneous multi-core architecture.
The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. In general, the thesis covers four main aspects of memory system in this research.
These aspects can be categorized as: i) Uni-core Memory System for Regular Data Pattern. ii) Multi-core Memory System for Regular Data Pattern. iii) Uni-core Memory System for Irregular Data Pattern. and iv) Multi-core Memory System for Irregular Data Pattern.Les aplicacions de computació d'alt rendiment (HPC) s'executen cada vegada més en plataformes heterogènies de múltiples nuclis. El motiu bà sic de la creixent popularitat d'aquestes arquitectures és el seu baix consum i la seva natura orientada a alt throughput. No obstant, aquest thoughput imposa el requeriment de que les dades es proporcionin al sistema també amb alt throughput. Això resulta en la necessitat de gestionar eficientment les trasferències de memòria (dins i fora del chip), un repte significatiu. Els patrons de transferències de memòria regulars però complexos aixà com els irregulars són cada vegada més dominants per a diversos dominis d'aplicacions, incloent el cientÃfic i el processat d'imagte i senyals. Aquests accessos a dades poden ser organitzats en patrons independents que un gestor de memòria eficient pot explotar. Els mètodes basats en programari emprant memòries cau de propòsit general i memòries al chip són beneficioses fins a cert punt. No obstant, la tasca de gestionar eficientment les transferències de dades per a dispositius orientats a throughput pot ser millorada oferint mecanismes hardware que explotin el coneixement dels patrons d'accés de les aplicacions, aixà com la planificació dels accessos a una arquitectura de múltiples nuclis. Aquesta tesis està enfocada a explorar una arquitectura de memòria novedosa per a processadors de múltiples nuclis, basada en els patrons d'accés. En general, la recerca de la tesis cobreix quatres aspectes principals del sistema de memòria. Aquests aspectes són: i) sistema de memòria per a un únic nucli amb patrons regulars, ii) sistema de memòria per a múltiples nuclis amb patrons regulars, iii) sistema de memòria per a un únic nucli amb patrons irregulars, iv) sistema de memòria per a múltiples nuclis amb patrons irregulars
Automatic performance optimisation of parallel programs for GPUs via rewrite rules
Graphics Processing Units (GPUs) are now commonplace in computing systems and are the
most successful parallel accelerators. Their performance is orders of magnitude higher than
traditional Central Processing Units (CPUs) making them attractive for many application domains
with high computational demands. However, achieving their full performance potential
is extremely hard, even for experienced programmers, as it requires specialised software tailored
for specific devices written in low-level languages such as OpenCL. Differences in device
characteristics between manufacturers and even hardware generations often lead to large performance
variations when different optimisations are applied. This inevitably leads to code that
is not performance portable across different hardware.
This thesis demonstrates that achieving performance portability is possible using LIFT, a
functional data-parallel language which allows programs to be expressed at a high-level in a
hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation
space using a set of well-defined rewrite rules to transform programs seamlessly between
different high-level algorithmic forms before translating them to a low-level OpenCL-specific
form.
The first contribution of this thesis is the development of techniques to compile functional
LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL
code. Producing efficient code is non-trivial as many performance sensitive details such as
memory allocation, array accesses or synchronisation are not explicitly represented in the functional
LIFT language. The thesis shows that the newly developed techniques are essential for
achieving performance on par with manually optimised code for GPU programs with the exact
same complex optimisations applied.
The second contribution of this thesis is the presentation of techniques that enable the
LIFT compiler to perform complex optimisations that usually require from tens to hundreds of
individual rule applications by grouping them as macro-rules that cut through the optimisation
space. Using matrix multiplication as an example, starting from a single high-level program
the compiler automatically generates highly optimised and specialised implementations for
desktop and mobile GPUs with very different architectures achieving performance portability.
The final contribution of this thesis is the demonstration of how low-level and GPU-specific
features are extracted directly from the high-level functional LIFT program, enabling building
a statistical performance model that makes accurate predictions about the performance of differently
optimised program variants. This performance model is then used to drastically speed
up the time taken by the optimisation space exploration by ranking the different variants based
on their predicted performance.
Overall, this thesis demonstrates that performance portability is achievable using LIFT
Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms
Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art.
As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques
to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level
- …