21 research outputs found
Lasagne: a static binary translator for weak memory model architectures
Funding: This work was supported by a UK RISE Grant. The emergence of new architectures creates a recurring challenge: ensuring that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a program's binary is automatically translated from one architecture to another while preserving its original semantics. However, these SBT tools have limited support for various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from mismatches between the memory consistency models specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne's intermediate representation (IR) and formally prove mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number of fences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.
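The fence-insertion idea in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not Lasagne's verified mapping: the tuple-based IR, the fence names "rm", "ww" and "full", and the function names are hypothetical. The sketch places a fence after every load and before every store so that the load-load, load-store and store-store orderings guaranteed by x86 survive on a weaker model, then applies one simple simplification that replaces a read fence immediately followed by a write fence with a single full fence, which is at least as strong as the pair.

```python
from typing import List, Tuple

# Hypothetical toy IR: ("ld", addr), ("st", addr) or ("fence", kind).
Op = Tuple[str, str]


def insert_fences(ops: List[Op]) -> List[Op]:
    """Add a fence after each load and before each store (illustrative rule)."""
    out: List[Op] = []
    for kind, addr in ops:
        if kind == "ld":
            out.append((kind, addr))
            out.append(("fence", "rm"))   # orders this load before later accesses
        elif kind == "st":
            out.append(("fence", "ww"))   # orders earlier stores before this store
            out.append((kind, addr))
        else:
            raise ValueError(f"unsupported operation: {kind}")
    return out


def merge_fence_pairs(ops: List[Op]) -> List[Op]:
    """Replace an adjacent ("rm", "ww") fence pair with one full fence."""
    out: List[Op] = []
    for op in ops:
        if out and out[-1] == ("fence", "rm") and op == ("fence", "ww"):
            out[-1] = ("fence", "full")   # a full fence subsumes both
        else:
            out.append(op)
    return out


program = [("ld", "flag"), ("st", "data")]
translated = insert_fences(program)
optimized = merge_fence_pairs(translated)
print(translated)
print(optimized)
```

In this example the translated sequence carries two fences and the simplified one carries a single fence. Whether a merged full fence is actually cheaper than the pair depends on the target hardware, so this only illustrates the counting of fences, not a performance claim.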
The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS
This paper analyzes the impact on application performance of the design and implementation choices made in two widely used open-source schedulers: ULE, the default FreeBSD scheduler, and CFS, the default Linux scheduler. We compare ULE and CFS in otherwise identical circumstances. We have ported ULE to Linux, and use it to schedule all threads that are normally scheduled by CFS. We compare the performance of a large suite of applications on the modified kernel running ULE and on the standard Linux kernel running CFS. The observed performance differences are solely the result of scheduling decisions, and do not reflect differences in other subsystems between FreeBSD and Linux. There is no overall winner. On many workloads the two schedulers perform similarly, but for some workloads there are significant and even surprising differences. ULE may cause starvation, even when executing a single application with identical threads, but this starvation may actually lead to better application performance for some workloads. The more complex load balancing mechanism of CFS reacts more quickly to workload changes, but ULE achieves better load balance in the long run.
Fewer Cores, More Hertz: Leveraging High-Frequency Cores in the OS Scheduler for Improved Application Performance
In modern server CPUs, individual cores can run at different frequencies, which allows for fine-grained control of the performance/energy tradeoff. Adjusting the frequency, however, incurs a high latency. We find that this can lead to a problem of frequency inversion, whereby the Linux scheduler places a newly active thread on an idle core that takes dozens to hundreds of milliseconds to reach a high frequency, just before another core already running at a high frequency becomes idle. In this paper, we first illustrate the significant performance overhead of repeated frequency inversion through a case study of scheduler behavior during the compilation of the Linux kernel on an 80-core Intel® Xeon-based machine. Following this, we propose two strategies to reduce the likelihood of frequency inversion in the Linux scheduler. When benchmarked over 60 diverse applications on the Intel® Xeon, the better performing strategy, S_move, improves performance by more than 5% (at most 56% with no energy overhead) for 23 applications, and worsens performance by more than 5% (at most 8%) for only 3 applications. On a 4-core AMD Ryzen we obtain performance improvements up to 56%.
Ordonnancement de fils d'exécution dans les systèmes d'exploitation multi-coeurs
In this thesis, we address the problem of schedulers for multi-core architectures from several perspectives: design (simplicity and correctness), performance improvement, and the development of application-specific schedulers. The contributions presented are summarized as follows:
- Ipanema, a domain-specific language dedicated to thread schedulers for multi-core architectures. We also implement a new abstraction in the Linux kernel that enables the dynamic addition of schedulers written in Ipanema.
- A series of performance and bug tracking tools. Thanks to these tools, we show that the Linux scheduler, CFS, suffers from a problem related to frequency management on modern processors. We propose a solution to this problem in the form of a patch submitted to the community. This patch significantly improves the performance of numerous applications.
- A scheduler model in the form of a "feature tree". We implement these features independently in order to offer a new, fully modular scheduler. This modular scheduler allows us to study exhaustively the different combinations of features, thus paving the way for the development of application-specific schedulers.
Risotto: a dynamic binary translator for weak memory model architectures
Dynamic Binary Translation (DBT) is a powerful approach to support cross-architecture emulation of unmodified binaries. However, DBT systems face correctness and performance challenges when emulating concurrent binaries from strong to weak memory consistency architectures. As a matter of fact, we report several translation errors in QEMU when emulating x86 binaries on Arm hosts. To address these challenges, we propose an end-to-end approach that provides correct and efficient emulation for weak memory model architectures. Our contributions are twofold. First, we formalize the memory model of QEMU's intermediate representation, and use it to propose formally verified mapping schemes to bridge the strong-on-weak memory consistency mismatch. Second, we implement these verified mappings in Risotto, a QEMU-based DBT system that optimizes memory fence placement while ensuring correctness. Risotto further improves performance via cross-architecture dynamic linking of native shared libraries and faster yet correct translation of compare-and-swap operations. We evaluate Risotto using multi-threaded benchmark suites and real-world applications, and show that Risotto improves the emulation performance by 6.7% on average over "erroneous" QEMU, while ensuring correctness.