21 research outputs found

    Lasagne: a static binary translator for weak memory model architectures

    Funding: This work was supported by a UK RISE Grant. The emergence of new architectures creates a recurring challenge: ensuring that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process in which a program’s binary is automatically translated from one architecture to another while preserving its original semantics. However, existing SBT tools have limited support for various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from mismatches between the memory consistency models specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. In this paper, we propose Lasagne, an end-to-end static binary translator with precise translation rules between x86 and Arm concurrency semantics. First, we propose a concurrency model for Lasagne’s intermediate representation (IR) and formally prove mappings between the IR and the two architectures. The memory ordering is preserved by introducing fences in the translated code. Finally, we propose optimizations focused on raising the level of abstraction of memory address calculations and reducing the number of fences. Our evaluation shows that Lasagne reduces the number of fences by up to about 65%, with an average reduction of 45.5%, significantly reducing their runtime overhead.

    The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS

    This paper analyzes the impact on application performance of the design and implementation choices made in two widely used open-source schedulers: ULE, the default FreeBSD scheduler, and CFS, the default Linux scheduler. We compare ULE and CFS in otherwise identical circumstances. We have ported ULE to Linux, and use it to schedule all threads that are normally scheduled by CFS. We compare the performance of a large suite of applications on the modified kernel running ULE and on the standard Linux kernel running CFS. The observed performance differences are solely the result of scheduling decisions, and do not reflect differences in other subsystems between FreeBSD and Linux. There is no overall winner. On many workloads the two schedulers perform similarly, but for some workloads there are significant and even surprising differences. ULE may cause starvation, even when executing a single application with identical threads, but this starvation may actually lead to better application performance for some workloads. The more complex load balancing mechanism of CFS reacts more quickly to workload changes, but ULE achieves better load balance in the long run.

    Fewer Cores, More Hertz: Leveraging High-Frequency Cores in the OS Scheduler for Improved Application Performance

    In modern server CPUs, individual cores can run at different frequencies, which allows for fine-grained control of the performance/energy tradeoff. Adjusting the frequency, however, incurs a high latency. We find that this can lead to a problem of frequency inversion, whereby the Linux scheduler places a newly active thread on an idle core that takes dozens to hundreds of milliseconds to reach a high frequency, just before another core already running at a high frequency becomes idle. In this paper, we first illustrate the significant performance overhead of repeated frequency inversion through a case study of scheduler behavior during the compilation of the Linux kernel on an 80-core Intel® Xeon-based machine. Following this, we propose two strategies to reduce the likelihood of frequency inversion in the Linux scheduler. When benchmarked over 60 diverse applications on the Intel® Xeon, the better performing strategy, S_move, improves performance by more than 5% (at most 56% with no energy overhead) for 23 applications, and worsens performance by more than 5% (at most 8%) for only 3 applications. On a 4-core AMD Ryzen we obtain performance improvements up to 56%.

    Ordonnancement de fils d'exécution dans les systèmes d'exploitation multi-cœurs (Thread Scheduling in Multi-core Operating Systems)

    In this thesis, we address the problem of schedulers for multi-core architectures from several perspectives: design (simplicity and correctness), performance improvement, and the development of application-specific schedulers. The contributions presented are summarized as follows:
    - Ipanema, a domain-specific language dedicated to thread schedulers for multi-core architectures. We also implement a new abstraction in the Linux kernel that enables the dynamic addition of schedulers written in Ipanema.
    - A series of performance and bug tracking tools. Thanks to these tools, we show that the Linux scheduler, CFS, suffers from a problem related to frequency management on modern processors. We propose a solution to this problem in the form of a patch submitted to the community. This patch significantly improves the performance of numerous applications.
    - A scheduler model in the form of a “feature tree”. We implement these features independently in order to offer a new fully modular scheduler. This modular scheduler allows us to study exhaustively the different combinations of features, thus paving the way for the development of application-specific schedulers.

    Risotto: a dynamic binary translator for weak memory model architectures

    Dynamic Binary Translation (DBT) is a powerful approach to support cross-architecture emulation of unmodified binaries. However, DBT systems face correctness and performance challenges when emulating concurrent binaries from strong to weak memory consistency architectures. In fact, we report several translation errors in QEMU when emulating x86 binaries on Arm hosts. To address these challenges, we propose an end-to-end approach that provides correct and efficient emulation for weak memory model architectures. Our contributions are twofold. First, we formalize the memory model of QEMU’s intermediate representation, and use it to propose formally verified mapping schemes to bridge the strong-on-weak memory consistency mismatch. Second, we implement these verified mappings in Risotto, a QEMU-based DBT system that optimizes memory fence placement while ensuring correctness. Risotto further improves performance via cross-architecture dynamic linking of native shared libraries and faster yet correct translation of compare-and-swap operations. We evaluate Risotto using multi-threaded benchmark suites and real-world applications, and show that Risotto improves the emulation performance by 6.7% on average over “erroneous” QEMU, while ensuring correctness.