Overview of Swallow: A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems
We present Swallow, a scalable many-core architecture, with a current
configuration of 480 x 32-bit processors.
Swallow is an open-source architecture, designed from the ground up to
deliver scalable increases in usable computational power to allow
experimentation with many-core applications and the operating systems that
support them.
Scalability is enabled by the creation of a tile-able system with a
low-latency interconnect, featuring an attractive communication-to-computation
ratio and the use of a distributed memory configuration.
We analyse the energy efficiency and the computational and communication
performance of Swallow. The system provides 240 GIPS, with each core consuming
71–193 mW depending on workload. Power consumption per instruction is lower
than that of almost all systems of comparable scale.
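The quoted figures can be turned into a rough energy-per-instruction estimate. The sketch below is a back-of-envelope calculation, not the authors' methodology. It assumes that the 240 GIPS aggregate is spread evenly over all 480 cores and that each core draws the quoted 71–193 mW while sustaining that rate. Neither assumption is stated in the abstract.

```python
# Back-of-envelope energy per instruction for Swallow.
# ASSUMPTIONS (not stated in the abstract): the 240 GIPS aggregate is spread
# evenly over all 480 cores, and each core draws the quoted power while
# sustaining that rate.
TOTAL_GIPS = 240       # aggregate throughput quoted for the whole system
CORES = 480            # current Swallow configuration
POWER_MW = (71, 193)   # per-core power range, depending on workload

def energy_per_instruction_nj(power_mw: float, core_gips: float) -> float:
    """Energy per instruction in nanojoules: power (W) / rate (instructions/s)."""
    joules = (power_mw / 1e3) / (core_gips * 1e9)
    return joules * 1e9

core_gips = TOTAL_GIPS / CORES  # 0.5 GIPS, i.e. 500 MIPS per core
low_nj, high_nj = (energy_per_instruction_nj(p, core_gips) for p in POWER_MW)
print(f"{core_gips * 1e3:.0f} MIPS/core, {low_nj:.3f}-{high_nj:.3f} nJ/instruction")
# 500 MIPS/core, 0.142-0.386 nJ/instruction
```

Under these assumptions, each core would spend roughly 0.14–0.39 nJ per instruction. If cores run below the peak rate, the real per-instruction cost would be higher.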
We also show how the use of a distributed operating system (nOS) allows the
easy creation of scalable software to exploit Swallow's potential. Finally, we
show two use case studies: modelling neurons and the overlay of shared memory
on a distributed memory system.
Comment: An open source release of the Swallow system design and code will
follow, and references to these will be added at a later date.
Contention in multicore hardware shared resources: Understanding of the state of the art
The real-time systems community has over the years devoted considerable attention to the impact on execution timing that arises from contention on access to hardware shared resources. The relevance of this problem has been accentuated with the arrival of multicore processors. The state of the art on the subject shows considerable diversity in the understanding of the problem and in the “approach” to solving it. This sparseness makes it difficult for any reader to form a coherent picture of the problem and solution space. This paper draws a tentative taxonomy in which each known approach to the problem can be categorised based on its specific goals and assumptions.
Postprint (published version)
Integration and validation of embedded flight software on space-qualified multicore architectures
In recent decades, the importance of software on space missions has increased notably, reflecting the need to integrate advanced on-board functionality. Multicore processors have recently been introduced to host critical high-performance applications, and validating software on them has become significantly more complex than on single-core architectures. Avionics took a big step forward with the publication of the CAST-32A paper. However, the ECSS-E-ST-40C software engineering standard used by the European Space Agency (ESA) still provides no validation support for multicore processors. Standardised guidelines for developing software on such platforms are therefore expected to become a recurring topic in the industry as it meets the demands of future space exploration missions.
Operating System Contribution to Composable Timing Behaviour in High-Integrity Real-Time Systems
The development of High-Integrity Real-Time Systems has a high footprint in terms of human, material and schedule costs. Factoring functional, reusable logic in the application favors incremental development and contains costs. Yet achieving incrementality in the timing behavior is a much harder problem. Complex features at all levels of the execution stack, aimed at boosting average-case performance, exhibit timing behavior that is highly dependent on execution history. This wrecks time composability and, with it, incrementality.
Our goal here is to restore time composability to the execution stack, working bottom up across it. We first characterize time composability without making assumptions about the system architecture or the software deployed on it. Later, we focus on the role played by the real-time operating system in our pursuit. Initially we consider single-core processors. By becoming less permissive about the admissible hardware features, we devise solutions that restore a convincing degree of time composability.
To show what can be done in practice, we developed TiCOS, an ARINC-compliant kernel, and re-designed ORK+, a kernel for Ada Ravenscar runtimes. In that work, we added support for limited preemption to ORK+, a first in the landscape of real-world kernels. Our implementation allows resource sharing to coexist with limited-preemptive scheduling, which extends the state of the art.
We then turn our attention to multicore architectures, first considering partitioned systems, for which we achieve results close to those obtained for single-core processors. Subsequently, we shy away from the over-provisioning of those systems and consider less restrictive uses of homogeneous multiprocessors, where the scheduling algorithm is key to high schedulable utilization. To that end we single out RUN, a promising baseline, and extend it to SPRINT. SPRINT supports sporadic task sets and hence better matches real-world industrial needs.
To corroborate our results, we present findings from real-world case studies from the avionics industry.
Programmer-transparent efficient parallelism with skeletons
Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant complexity at the software level, to the detriment of programmer productivity. To
produce correct and efficient code, programmers not only have to manage synchronisation and communication but also need to be aware of low-level hardware details. The problem is foreseeably becoming worse because systems are increasingly parallel and
heterogeneous.
Building on earlier work, this thesis further investigates the contribution which
algorithmic skeletons can make towards solving this problem. Skeletons are high-level
abstractions for typical parallel computations. They hide low-level hardware details
from programmers and, in addition, encode information about the computations that
they implement, which runtime systems and library developers can use for automatic
optimisations. We present two novel case studies in this respect.
First, we provide scheduling flexibility on heterogeneous CPU + GPU systems in
a programmer-transparent way, similar to the freedom OS schedulers have on CPUs.
Thanks to the high-level nature of skeletons, we automatically switch between CPU and
GPU implementations of kernels. We use semantic information encoded in skeletons to
find points during execution at which switches can occur. In more detail, kernel iteration
spaces are processed in slices, and migration is considered on a slice-by-slice basis. We
show that slice-size choices that introduce negligible overheads can be learned by predictive models. We also show that in a simple deployment scenario, mid-kernel migration
achieves speedups of up to 1.30x, and 1.08x on average. Our mechanism introduces
a negligible overhead of 2.34% if a kernel does not actually migrate.
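Slice-by-slice migration relies on the fact that a skeleton such as a map has no dependences between iterations. Every slice boundary is therefore a safe point to switch implementation. The sketch below illustrates only that idea and is not the thesis's runtime. The kernel (`square_cpu`/`square_gpu`), `map_with_migration`, and the `pick_device` policy are hypothetical, and the "GPU" version is plain Python standing in for a device kernel launch.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical CPU and GPU implementations of one element-wise kernel.
# Both are plain Python here; a real runtime would launch device code for "gpu".
def square_cpu(xs: Sequence[float]) -> List[float]:
    return [x * x for x in xs]

def square_gpu(xs: Sequence[float]) -> List[float]:
    return [x * x for x in xs]  # stand-in for a GPU kernel launch

def map_with_migration(
    data: Sequence[float],
    impls: Dict[str, Callable[[Sequence[float]], List[float]]],
    slice_size: int,
    pick_device: Callable[[int], str],
) -> Tuple[List[float], List[str]]:
    """Run a map skeleton over `data` in slices of `slice_size` elements.

    A map has no dependences between iterations, so every slice boundary
    is a safe point to switch implementation. `pick_device` is consulted
    once per slice with the slice's start index. Returns the results and
    the device chosen for each slice.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    results: List[float] = []
    devices: List[str] = []
    for start in range(0, len(data), slice_size):
        device = pick_device(start)  # the only point where migration can happen
        devices.append(device)
        results.extend(impls[device](data[start:start + slice_size]))
    return results, devices

# Usage: the (hypothetical) scheduler grants the GPU only to slices starting
# before index 4, so the kernel moves to the CPU at the boundary at index 6.
impls = {"cpu": square_cpu, "gpu": square_gpu}
out, used = map_with_migration(list(range(10)), impls, 3,
                               lambda start: "gpu" if start < 4 else "cpu")
print(out)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(used)  # ['gpu', 'gpu', 'cpu', 'cpu']
```

The slice size trades responsiveness against per-slice dispatch cost. Smaller slices let the scheduler react sooner but add more decision points. This is the trade-off the abstract says predictive models can learn.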
Second, we propose skeletons to simplify the programming of parallel hard real-time systems. We combine information encoded in task farms with analysis of real-time
user code to automatically choose thread counts and an optimisation parameter
related to farm-internal communication. Both parameters are chosen so that real-time
deadlines are met with minimum resource usage. We show that our approach achieves
a 1.22x speedup over unoptimised code, selects the best parameter settings in 83% of
cases, and never chooses parameters that cause deadline misses.
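The parameter selection described above can be pictured as a search: find the fewest workers, together with a communication setting, whose worst-case completion bound still meets the deadline. The sketch below is illustrative only. It assumes a simple hypothetical cost model in which items are dispatched in chunks, and each chunk costs a fixed overhead plus its items' WCETs. It also assumes the "communication parameter" is the chunk size. The thesis's actual analysis and parameter are not described in the abstract.

```python
import math
from typing import Iterable, Optional, Tuple

def farm_bound(n_items: int, workers: int, chunk: int,
               item_wcet: float, dispatch_overhead: float) -> float:
    """Hypothetical upper bound on the completion time of a task farm.

    Items are handed out in chunks of `chunk`; each chunk costs a fixed
    `dispatch_overhead` plus `chunk * item_wcet`. Pessimistically assume
    workers process chunks in lock-step rounds, so the bound is
    rounds * (cost of one full chunk).
    """
    chunks = math.ceil(n_items / chunk)
    rounds = math.ceil(chunks / workers)
    return rounds * (dispatch_overhead + chunk * item_wcet)

def choose_farm_params(n_items: int, deadline: float, item_wcet: float,
                       dispatch_overhead: float, max_workers: int,
                       chunk_sizes: Iterable[int]) -> Optional[Tuple[int, int, float]]:
    """Return (workers, chunk, bound) using the fewest workers that meet the
    deadline; among those, pick the chunk size with the smallest bound.
    Returns None if no configuration meets the deadline."""
    chunk_sizes = list(chunk_sizes)
    for workers in range(1, max_workers + 1):
        feasible = []
        for chunk in chunk_sizes:
            bound = farm_bound(n_items, workers, chunk, item_wcet, dispatch_overhead)
            if bound <= deadline:
                feasible.append((bound, chunk))
        if feasible:
            bound, chunk = min(feasible)
            return workers, chunk, bound
    return None

# Usage: 1000 items, 1 ms WCET each, 5 ms per-chunk overhead, 300 ms deadline.
print(choose_farm_params(1000, 300.0, 1.0, 5.0, 8, (10, 50, 100)))  # (4, 50, 275.0)
```

Iterating over worker counts from the smallest upward matches the stated goal of meeting deadlines with minimum resource usage. Because the bound is a safe over-approximation in this model, any configuration it accepts cannot miss the deadline under the model's assumptions.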
Real-Time Stream Processing in Embedded Systems
Modern real-time embedded systems often involve computationally intensive data processing algorithms to meet their application requirements. As a result, there has been an increase in the use of multiprocessor platforms. The stream processing programming model aims to facilitate the construction of concurrent data processing programs to exploit the parallelism available on these architectures. However, most current stream processing frameworks or languages are not designed for use in real-time systems, let alone systems that might also have hard real-time control algorithms. This thesis contends that a generic architecture of a real-time stream processing infrastructure can be created to support predictable processing of both batched and live streaming data sources, and integrated with hard real-time control algorithms.
The thesis first reviews relevant stream processing techniques and identifies the open issues. Then a real-time stream processing task model and an architecture for supporting that model are proposed. An approach to the integration of stream processing tasks into a real-time environment that also has hard real-time components is presented. Data is processed in parallel using execution-time servers allocated to each core. An algorithm is presented for selecting the parameters of the servers that maximises their capacities (within an overall deadline) and ensures that hard real-time components remain schedulable. Response-time analysis is derived to guarantee that the real-time requirements for the stream processing activity are met. These are deadlines for batched data processing and latency for each data item for live data. A framework called SPRY is implemented to support the proposed real-time stream processing architecture. The framework supports fully-partitioned applications that are scheduled using fixed priority-based scheduling techniques. A case study based on a modified Generic Avionics Platform is given to demonstrate the overall approach. Finally, the evaluation shows that the presented approach provides better schedulability than alternative approaches.
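For background, the sketch below shows the classic response-time analysis for fixed-priority preemptive scheduling on a single core. It is the standard recurrence $R_i = C_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil C_j$, iterated to a fixed point. It is a swapped-in textbook baseline, not the analysis derived in the thesis, which additionally accounts for execution-time servers and stream latency. The task names and parameters are hypothetical. Modelling the stream-processing server as an ordinary periodic task is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Task:
    name: str
    wcet: int      # C: worst-case execution time
    period: int    # T: minimum inter-arrival time
    deadline: int  # D: relative deadline, assumed D <= T

def response_time(tasks: List[Task], i: int) -> Optional[int]:
    """Worst-case response time of tasks[i] under fixed-priority preemptive
    scheduling on one core. `tasks` is ordered by priority (index 0 highest).

    Iterates R = C_i + sum_{j < i} ceil(R / T_j) * C_j to a fixed point and
    returns None as soon as R exceeds the task's deadline.
    """
    task = tasks[i]
    r = task.wcet
    while True:
        # Interference from higher-priority releases within a window of length r
        # (integer ceiling division avoids floating-point rounding).
        interference = sum(-(-r // hp.period) * hp.wcet for hp in tasks[:i])
        r_next = task.wcet + interference
        if r_next > task.deadline:
            return None
        if r_next == r:
            return r
        r = r_next

# Usage: a hypothetical core with a hard control task, a stream-processing
# server modelled as a periodic task, and a lower-priority batch task.
tasks = [Task("control", 1, 4, 4), Task("stream_server", 2, 6, 6), Task("batch", 3, 12, 12)]
print([response_time(tasks, i) for i in range(len(tasks))])  # [1, 3, 10]
```

A task set is schedulable under this test when every call returns a value rather than `None`. An approach like the server-parameter selection described above could, for example, enlarge a server's budget only while this check still passes for the hard real-time tasks.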