On the Performance and Isolation of Asymmetric Microkernel Design for Lightweight Manycores by Penna, Pedro Henrique et al.
HAL Id: hal-02297637
https://hal.archives-ouvertes.fr/hal-02297637
Submitted on 22 Nov 2019
On the Performance and Isolation of Asymmetric
Microkernel Design for Lightweight Manycores
Pedro Henrique Penna, João Souto, Davidson Lima, Márcio Castro, François
Broquedis, Henrique Freitas, Jean-François Mehaut
To cite this version:
Pedro Henrique Penna, João Souto, Davidson Lima, Márcio Castro, François Broquedis, et al.. On
the Performance and Isolation of Asymmetric Microkernel Design for Lightweight Manycores. SBESC
2019 - IX Brazilian Symposium on Computing Systems Engineering, Nov 2019, Natal, Brazil. pp.1-31.
hal-02297637
On the Performance and Isolation of Asymmetric
Microkernel Design for Lightweight Manycores
Pedro Henrique Penna (1,2), João Vicente Souto (3), Davidson Francis Lima (2),
Márcio Castro (3), François Broquedis (4),
Henrique Freitas (2) and Jean-François Méhaut (1)
(1) Université Grenoble Alpes (UGA)
(2) Pontifícia Universidade Católica de Minas Gerais (PUC Minas)
(3) Universidade Federal de Santa Catarina (UFSC)
(4) Institut National Polytechnique de Grenoble (Grenoble INP)
SBESC ’19
Introduction
Lightweight (LW) Manycores – Overview
Hundreds of Lightweight Cores
Target MIMD computing workloads
Expose massive thread-level parallelism
Feature low-power consumption
Distributed Memory Architecture
Grants scalability
Delivers predictability
On-Chip Heterogeneity
Enables adaptability to diverse workloads
Uncovers high-energy efficiency
Rich On-Chip Interconnects
Offer quality of service
Allow asynchronous communications
Figure: Overview of a manycore. Compute clusters and I/O clusters, each with cores, SRAM, and a NoC interface; DMA engines; DRAM and devices.
It is a distributed architecture in a chip!
Pedro Henrique Penna et al. Asymmetric uKernel for LW Manycores SBESC ’19
Introduction
Lightweight (LW) Manycores – Challenges
Used in embedded computing and HPC
What about multi-application support?
High Density Circuit Integration
Heat dissipation
Dark silicon
Distributed Memory Architecture
Data tiling (small local memories)
Message passing
On-Chip Heterogeneity
Thread scheduling and placement
Rich On-Chip Interconnects
Network congestion
Security checking
Figure: Overview of a manycore.
Performance vs Programmability vs Portability
Introduction
Lightweight (LW) Manycores – Operating System Support
OSes enhance programmability and portability
Expose rich abstractions and APIs
Multiplex resources and ensure policies
How about commodity operating systems?
Ex: Linux, FreeBSD, Windows...
Pros: immediate support for a large body of existing software
Cons: symmetric design leads to cache interference (Wentzlaff and Agarwal 2009)
Cons: poor fine-grain lock scalability (Amdahl’s Law)
Cons: increasingly diverse hardware (Barbalace et al. 2015)
How about distributed operating systems?
Ex: microkernels and multikernels
Pros: modularity and scalability
Cons: lack support for rich on-chip interconnects (Dinechin et al. 2013)
Cons: do not cope with small local memories (Olofsson, Nordstrom, and Ul-Abdin 2014)
Existing OSes do not address lightweight manycores!
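The Amdahl's Law argument behind the lock-scalability point can be made concrete with a short sketch. The function name and the 5% serial fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 5% of the time is serialized behind kernel locks.
    for cores in (1, 16, 256):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Under this assumed 5% serial fraction the speedup stays below 20x regardless of the core count, which is why lock contention matters more as the number of cores grows.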
Introduction
Goal and Contributions
Long-Term Goal: Propose an OS for LW Manycores
Deliver portability across multiple platforms
Expose a POSIX-compliant interface
Provide flexible view of the platform
Embrace a multikernel OS structure
Rely on asymmetric microkernels as building blocks
Figure: Performance, Portability, and Programmability.
Goal of This Work: Assess an Asymmetric Microkernel Design for LW Manycores
Microkernel Structure: improves flexibility and portability
Asymmetric Design: delivers scalability
Scientific Contribution: Insights on Kernel Construction for LW Manycores
Quantitative results on performance and isolation of the assessed design
Discussion on co-design aspects between the OS kernel and the hardware
Technical Contribution: Nanvix Microkernel
Open source asymmetric microkernel for LW manycores
Supports multiple bare-metal platforms (MPPA-256, RISC-V, OpenRISC)
Presentation Outline
1 Introduction
Context and Motivation
Target Problem
Goal and Contributions
2 The Nanvix OS
Overview
System Structure
Microkernel Overview
3 Experimental Results
Evaluation Methodology
Microbenchmarks
Synthetic Benchmarks
4 Conclusions
The Nanvix Operating System
Overview – The Nanvix Project
Re-Engineered Version of Nanvix for LW Manycores
Home-grown OS
4 Professors (Brazil and France)
1 PhD, 2 MSc and 2 BSc Students
9 past contributors
UGA, PUC Minas, UFSC and Grenoble INP
Project Guidelines
Be Open: invite others to collaborate
Be Permissive: enable free adaptability
Design Principles
Be Portable: run on multiple architectures
Be Scalable: embrace distributed configuration
Be Flexible: expose multiple APIs
Figure: Bingo, our mascot.
https://github.com/nanvix
The Nanvix Operating System
System Structure
Architectural Model
Cores are grouped into clusters
Each cluster has its own physical address space
Intra-Cluster communication: shared memory
Inter-Cluster communication: NoC
Three-Layer Multikernel
Kernels
One instance on each cluster
Provide minimum abstractions
Ensure policies and security
System Servers
Run on top of kernels at user-level
Provide traditional abstractions
Collaboratively implement subsystems
Runtime Libraries
Run alongside user applications
Interface with system servers
Expose standard APIs (e.g., POSIX)
Figure: The multikernel OS structure. Legend: idle core, kernel core, service core, application A, application B.
The Nanvix Operating System
Microkernel Overview
Microkernel Design
Asymmetric: runs on a dedicated core of a cluster
Small: provides only essential abstractions (about 5k LoC)
Portable: supports MPPA-256, RISC-V and OpenRISC based manycores
Thread Management System
Non-interruptible kernel threads
Sleep/wakeup primitives
Exception handling forwarding
Thread checkpointing
Memory Management System
Single address space
Two-level paging scheme
IPC Facility
Inter-cluster synchronization
Inter-cluster communication
Figure: An overview of the Nanvix kernel. Components: Kernel Call Interface; Thread System, Memory System, Device System, and IPC Facility; Hardware Abstraction Layer. Targets: OpTiMSoC (OpenRISC), MPPA-256 (Bostan), HERO (RISC-V).
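To illustrate what a two-level paging scheme involves, here is a minimal sketch of a two-level page walk. The 32-bit layout (10-bit directory index, 10-bit table index, 4 kB pages) and all names are assumptions made for the example; the slides state only that the scheme has two levels.

```python
PAGE_SHIFT = 12                      # assumed 4 kB pages
TABLE_BITS = 10                      # assumed 1024 entries per table
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_MASK = (1 << TABLE_BITS) - 1


def split(vaddr: int):
    """Split a virtual address into (directory index, table index, offset)."""
    offset = vaddr & (PAGE_SIZE - 1)
    table_index = (vaddr >> PAGE_SHIFT) & INDEX_MASK
    dir_index = (vaddr >> (PAGE_SHIFT + TABLE_BITS)) & INDEX_MASK
    return dir_index, table_index, offset


def map_page(page_dir: dict, vaddr: int, frame: int) -> None:
    """Install a page-to-frame mapping, creating the page table on demand."""
    dir_index, table_index, _ = split(vaddr)
    page_dir.setdefault(dir_index, {})[table_index] = frame


def translate(page_dir: dict, vaddr: int) -> int:
    """Walk both levels; raise LookupError to model a page fault."""
    dir_index, table_index, offset = split(vaddr)
    try:
        frame = page_dir[dir_index][table_index]
    except KeyError:
        raise LookupError(f"page fault at {vaddr:#x}") from None
    return (frame << PAGE_SHIFT) | offset


if __name__ == "__main__":
    page_dir = {}
    map_page(page_dir, 0x00403ABC, 0x25)
    print(hex(translate(page_dir, 0x00403ABC)))
```

Second-level tables are created only for the regions actually mapped, which helps keep the footprint small when local memory is scarce.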
The Nanvix Operating System
Experimental Results – Microkernel
Microbenchmark Experiments
Assess asymmetric design
L-Kcall: performance of local kernel calls
R-Kcall: performance of remote kernel calls
Synthetic Benchmark Experiments
Evaluate performance on representative use cases
Fork-Join: scalability for fork-join programming model
KNoise: kernel interference on application execution
Experimental Platform
Compute Cluster of MPPA-256 Bostan (16 cores, nocc 2 MB memory)
Kalray Accesscore 2.8.1 (Hypervisor 1.0, GCC 4.9.4 & Binutils 2.11.0)
Evaluation Methodology
Full factorial design for each experiment (70 configurations in total)
30 replicas for each experimental configuration (coefficient of variation < 1%)
Nanvix Microkernel 0a0088b built with the -O3 flag (128 kB memory footprint)
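The methodology above (full factorial design, replicas, coefficient of variation) can be sketched as follows. The factor names and levels are hypothetical and chosen only for illustration; they are not the 70 configurations used in the study.

```python
import itertools
import statistics


def full_factorial(factors: dict) -> list:
    """Enumerate every combination of factor levels (full factorial design)."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in itertools.product(*(factors[n] for n in names))]


def coefficient_of_variation(samples) -> float:
    """Sample standard deviation divided by the mean of the replicas."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("c.o.v. is undefined for a zero mean")
    return statistics.stdev(samples) / abs(mean)


if __name__ == "__main__":
    # Hypothetical factors: thread count and presence of kernel noise.
    configs = full_factorial({"threads": range(1, 15), "noise": [False, True]})
    print(len(configs), "configurations")
    # Hypothetical replica timings for one configuration, in cycles.
    print(coefficient_of_variation([99, 100, 101]))
```

A configuration would be accepted once the coefficient of variation across its replicas falls below the chosen threshold (1% in the slides).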
The Nanvix Operating System
Microkernel – L-KCall Benchmark Results
Figure: Breakdown of local kernel calls in Nanvix. Cycles (0 to 800) for Total Cycles, Register Stalls, Branch Stalls, I-Cache Stalls, and D-Cache Stalls; series: Kalray Hypervisor and Nanvix.
About 164 cycles are required for a mode switch (i.e., user to kernel)
Low kernel interference
Complex execution flow does not disrupt the branch unit
D-Cache is not heavily impacted (few capacity misses)
I-Cache and I-Fetch units perform poorly
Working set size is small enough (less than 8 kB)
I-Cache stalls account for 52% of time
Register file stalls account for 73% of time (bad code generation)
The Nanvix Operating System
Microkernel – R-KCall Benchmark Results
Figure: Breakdown of remote kernel calls in Nanvix. Cycles (0 to 4500) for Total Cycles, Register Stalls, Branch Stalls, I-Cache Stalls, and D-Cache Stalls; series: Kalray Hypervisor and Nanvix.
Overheads in inter-core synchronization do matter
D-Cache stalls account for 38% of time
Hardware cache coherency is not supported
Hardware lacks selective cache line invalidation
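To show where inter-core synchronization enters a remote kernel call, the sketch below models the asymmetric design with a dedicated "kernel core" thread that serves requests posted by user threads. This is an illustrative model and not the Nanvix implementation: the class, its method names, and the use of Python queues are assumptions. On hardware without cache coherence, a real kernel would also have to invalidate or flush caches around the shared request and reply data, which the queues hide here.

```python
import queue
import threading


class KernelCore:
    """Stands in for the dedicated kernel core: it alone executes kernel calls."""

    def __init__(self, handlers):
        self._handlers = handlers
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while True:
            request = self._requests.get()
            if request is None:              # shutdown sentinel
                break
            name, args, reply = request
            try:
                reply.put((True, self._handlers[name](*args)))
            except Exception as exc:         # forward failures to the caller
                reply.put((False, exc))

    def remote_kcall(self, name, *args):
        """Called from a user core: post the request, then block for the reply."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, args, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


if __name__ == "__main__":
    kernel = KernelCore({"add": lambda a, b: a + b})   # hypothetical kernel call
    print(kernel.remote_kcall("add", 2, 3))
    kernel.shutdown()
```

Every call pays for two hand-offs (request and reply) between cores, so the cost of that synchronization is part of every remote kernel call.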
The Nanvix Operating System
Microkernel – KNoise Benchmark Results
Figure: Kernel noise scalability in Nanvix. Efficiency (FLOPs/core), from 0.90 to 1.00, versus number of threads (1 to 14); series: Without Noise and With Noise.
Linear scalability for user applications
Constant overhead of 0.5% per thread (bad-case scenario)
The Nanvix Operating System
Microkernel – Fork-Join Benchmark Results
Figure: Fork-Join scalability in Nanvix. Cycles (×10^3), from 0 to 80, versus number of threads (1 to 14); series: Fork and Join.
Linear scalability for fork-join
Performance gap of 1.5× due to asynchronous resource release
Thread recycling is not implemented yet
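The shape of a fork-join microbenchmark can be sketched as below, timing the fork and join phases separately. This is a host-side illustration using Python threads and wall-clock nanoseconds, not the benchmark from the study, which reports cycles on the MPPA-256; the function name and the no-op workload are assumptions.

```python
import threading
import time


def fork_join(num_threads: int, work=lambda: None):
    """Spawn `num_threads` workers, then join them; return (fork_ns, join_ns)."""
    threads = [threading.Thread(target=work) for _ in range(num_threads)]

    start = time.perf_counter_ns()
    for t in threads:
        t.start()                    # fork phase
    forked = time.perf_counter_ns()
    for t in threads:
        t.join()                     # join phase
    joined = time.perf_counter_ns()

    return forked - start, joined - forked


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        fork_ns, join_ns = fork_join(n)
        print(n, fork_ns, join_ns)
```

Plotting both phases against the thread count is what makes a gap between fork and join cost visible.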
Conclusions on Our Research
Motivation, Goal and Contribution
Motivation: Existing OSes Do Not Address LW manycores
Distributed architecture with small local memories
On-chip heterogeneity
Rich on-chip interconnect
Long-Term Goal: Propose an OS for LW Manycores
Deliver portability across multiple platforms
Provide a POSIX interface and a flexible view of the platform
Embrace a multikernel OS structure
Rely on asymmetric microkernels as building blocks
Goal of This Work: Assess an Asymmetric Microkernel Design for LW Manycores
Scientific Contribution: Insights on Kernel Construction for LW Manycores
Asymmetric microkernel design delivers performance isolation and scalability
I-Cache and I-Fetch units are a hotspot for improvement
D-Cache coherence or selective cache invalidation may push performance further
Technical Contribution: Nanvix Microkernel
Asymmetric microkernel for LW manycores (MPPA-256, RISC-V, OpenRISC)
Thank You!
References I
Antonio Barbalace et al. “Popcorn: Bridging the Programmability Gap in
Heterogeneous-ISA Platforms”. In: European Conf. on Computer Systems.
Bordeaux, France, Apr. 2015, pp. 1–16. ISBN: 978-1-4503-3238-5. DOI:
10.1145/2741948.2741962.
Benoit de Dinechin et al. “A Clustered Manycore Processor Architecture for
Embedded and Accelerated Applications”. In:
Int. Conf. on High Performance Extreme Computing. Waltham, USA, 2013,
pp. 1–6. ISBN: 978-1-4799-1365-7. DOI: 10.1109/HPEC.2013.6670342.
Andreas Olofsson, Tomas Nordstrom, and Zain Ul-Abdin. “Kickstarting
High-Performance Energy-Efficient Manycore Architectures with Epiphany”. In:
Asilomar Conf. on Signals, Systems and Computers. 2014, pp. 1719–1726. ISBN:
978-1-4799-8297-4. DOI: 10.1109/ACSSC.2014.7094761.
David Wentzlaff and Anant Agarwal. “Factored Operating Systems (FOS): The
Case for a Scalable Operating System for Multicores”. In:
ACM SIGOPS Operating Systems Review 43.2 (Apr. 2009), pp. 76–85. ISSN:
0163-5980. DOI: 10.1145/1531793.1531805.
The Nanvix Operating System
Microkernel – Overview
Figure: A detailed view of the Nanvix kernel.
HAL: Interrupts, Exceptions, Core, Execution Context, Memory, Cache, MMU/TLB, Events, NoC, Traps
Device System: Memory Mapped I/O, Port Mapped I/O
Communication Facility: Inter-Core Comm., Inter-Cluster Comm.
Thread System: Thread Sync., Thread Mgmt.
Memory System: Paging Module, Frame Module
Kernel Call Interface: mmio, portio, intctl, sync, signals, mailbox, portal, alarm, save, load, spawn, wakeup, sleep, join, exit, pgmap, pgumap, pgctl
Targets: OpTiMSoC OpenRISC (mor1kx), MPPA-256 Bostan (k1), HERO RISC-V (ri5cy)
