10 research outputs found

    Profilage continu et efficient de verrous pour Java pour les architectures multicœurs

    Today, the processing of large datasets is generally parallelised and performed on computers with many cores. However, locks can serialize the execution of these cores and hurt both latency and processing throughput. Spotting these lock contention issues in-vitro (i.e. during the development phase) is complex because it is difficult to reproduce a production environment, to create a realistic workload representative of the context in which the software will be used, and to test every possible deployment configuration in which the software will run. This thesis introduces Free Lunch, a lock profiler that diagnoses phases of high lock contention in-vivo (i.e. during the operational phase). Free Lunch is designed around a new metric, the Critical Section Pressure (CSP), which evaluates the impact of lock contention on overall thread progress. Free Lunch is integrated directly into the HotSpot JVM in order to minimize overhead, and regularly reports the CSP during execution so that transient lock-related issues can be detected. Free Lunch is evaluated over 31 benchmarks from DaCapo 9.12, SPECjvm08 and SPECjbb2005, and over the Cassandra database. We were able to pinpoint phases of lock contention in 6 applications, some of which were not detected by existing profilers. With this information, we improved the performance of Xalan by 15% by rewriting a single line of code, and identified a phase of high lock contention in Cassandra during the replay of transactions after a node crash. Free Lunch has never degraded performance by more than 6%, which makes it suitable for continuous deployment in an operational environment.
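
    The abstract does not give the exact formula for the CSP, so the sketch below only illustrates one plausible reading of the metric: the fraction of the threads' cumulative running time that is spent blocked while acquiring a contended lock. The measurement strategy and every name in it are illustrative assumptions, not the Free Lunch implementation (which lives inside HotSpot); C and pthreads stand in for the Java synchronization being profiled.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic long long blocked_ns;   /* accumulated time spent waiting for the lock */

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Wrap lock acquisition: only the time spent blocked counts toward the CSP-like ratio. */
static void profiled_lock(void)
{
    if (pthread_mutex_trylock(&lock) == 0)
        return;                                     /* uncontended: no blocking time */
    long long t0 = now_ns();
    pthread_mutex_lock(&lock);
    atomic_fetch_add(&blocked_ns, now_ns() - t0);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        profiled_lock();
        for (volatile int j = 0; j < 100; j++) ;    /* critical section work */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t t[NTHREADS];
    long long start = now_ns();

    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);

    long long elapsed = now_ns() - start;
    double csp = (double)blocked_ns / (double)(elapsed * NTHREADS);
    printf("CSP-like ratio: %.2f%%\n", 100.0 * csp);   /* build with: cc -pthread */
    return 0;
}
```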

    Real-time operating system support for multicore applications

    Doctoral thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2014. Modern multicore platforms feature multiple levels of cache memory placed between the processor and main memory to hide the latency of ordinary memory systems. The primary goal of this cache hierarchy is to improve average execution time (at the cost of predictability). The uncontrolled use of the cache hierarchy by real-time tasks may impact the estimation of their worst-case execution times (WCET), especially when real-time tasks access a shared cache level, causing contention for shared cache lines and increasing the application execution time. This contention in the shared cache may lead to deadline misses, which is intolerable particularly for hard real-time (HRT) systems. Shared cache partitioning is a well-known technique used in multicore real-time systems to isolate task workloads and to improve system predictability. Presently, the state-of-the-art studies that evaluate shared cache partitioning on multicore processors have two key limitations. First, the cache partitioning mechanism is typically implemented either in a simulated environment or in a general-purpose OS (GPOS), so the impact of kernel activities, such as interrupt handling and context switching, on the task partitions tends to be overlooked. Second, the evaluation is typically restricted to either a global or a partitioned scheduler, thereby failing to compare the performance of cache partitioning when tasks are scheduled by different schedulers. Furthermore, recent works have confirmed that OS implementation aspects, such as the choice of scheduling data structures and interrupt handling mechanisms, impact real-time schedulability as much as scheduling-theoretic aspects. However, these studies also used real-time patches applied to GPOSes, which affects the run-time overhead observed in these works and consequently the schedulability of real-time tasks. Additionally, current multicore scheduling algorithms do not consider scenarios where real-time tasks access the same cache lines due to true or false sharing, which also impacts the WCET. This thesis addresses the aforementioned problems with cache partitioning techniques and multicore real-time scheduling algorithms as follows. First, real-time multicore support is designed and implemented on top of an embedded operating system built from scratch. This support consists of several multicore real-time scheduling algorithms, such as global and partitioned EDF, and a cache partitioning mechanism based on page coloring. Second, a comparison in terms of schedulability ratio is presented, considering the run-time overhead of the implemented RTOS and of a GPOS patched with real-time extensions. In some cases, Global-EDF considering the overhead of the RTOS is superior to Partitioned-EDF considering the overhead of the patched GPOS, which clearly shows how different OSs impact hard real-time schedulers. Third, an evaluation of the cache partitioning impact on partitioned, clustered, and global real-time schedulers is performed. The results indicate that a lightweight RTOS does not compromise real-time guarantees, and that shared cache partitioning behaves differently depending on the scheduler and the tasks' working set size. Fourth, a task partitioning algorithm that assigns tasks to cores respecting their usage of cache partitions is proposed. The results show that by simply assigning tasks that share cache partitions to the same processor, it is possible to reduce contention for shared cache lines and to provide HRT guarantees. Finally, a two-phase multicore scheduler that provides HRT and soft real-time (SRT) guarantees is proposed. It is shown that by using information from hardware performance counters at run time, the RTOS can detect when best-effort tasks interfere with real-time tasks in the shared cache, and can then prevent best-effort tasks from interfering with real-time tasks. The results also show that the assignment of exclusive partitions to HRT tasks, together with the two-phase multicore scheduler, provides HRT and SRT guarantees even when best-effort tasks share partitions with real-time tasks.
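
    The page-coloring mechanism mentioned above rests on a small piece of address arithmetic: physical frames that index into the same slice of a physically indexed shared cache get the same "color", and giving disjoint colors to different tasks keeps them from evicting each other's lines. The sketch below shows that arithmetic for an invented cache geometry; the sizes and the 4 KiB page are illustrative assumptions, not the configuration evaluated in the thesis.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (8u * 1024 * 1024)   /* assumed 8 MiB shared last-level cache */
#define CACHE_WAYS  16u
#define LINE_SIZE   64u

int main(void)
{
    /* Number of cache sets, the byte span of one way, and how many page-sized
     * "colors" that span contains. Pages with the same color compete for the
     * same cache sets, so tasks given disjoint colors do not evict each other. */
    uint32_t sets       = CACHE_SIZE / (CACHE_WAYS * LINE_SIZE);
    uint32_t way_size   = sets * LINE_SIZE;
    uint32_t num_colors = way_size / PAGE_SIZE;

    uint64_t phys_addr = 0x12345000ULL;                 /* example frame address */
    uint32_t color     = (uint32_t)((phys_addr / PAGE_SIZE) % num_colors);

    printf("colors available: %u, color of frame 0x%llx: %u\n",
           num_colors, (unsigned long long)phys_addr, color);
    return 0;
}
```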

    Investigating tools and techniques for improving software performance on multiprocessor computer systems

    The availability of modern commodity multicore processors and multiprocessor computer systems has resulted in the widespread adoption of parallel computers in a variety of environments, ranging from the home to workstation and server environments in particular. Unfortunately, parallel programming is harder and requires more expertise than the traditional sequential programming model. The variety of tools and parallel programming models available to the programmer further complicates the issue. The primary goal of this research was to identify and describe a selection of parallel programming tools and techniques to aid novice parallel programmers in the process of developing efficient parallel C/C++ programs for the Linux platform. This was achieved by highlighting and describing the key concepts and hardware factors that affect parallel programming, providing a brief survey of commonly available software development tools and parallel programming models and libraries, and presenting structured approaches to software performance tuning and parallel programming. Finally, the performance of several parallel programming models and libraries was investigated, along with the programming effort required to implement solutions using the respective models. A quantitative research methodology was applied to the investigation of the performance and programming effort associated with the selected parallel programming models and libraries, which included automatic parallelisation by the compiler, Boost Threads, Cilk Plus, OpenMP, POSIX threads (Pthreads), and Threading Building Blocks (TBB). Additionally, the performance of the GNU C/C++ and Intel C/C++ compilers was examined. The results revealed that the choice of parallel programming model or library depends on the type of problem being solved and that there is no overall best choice for all classes of problem. However, the results also indicate that parallel programming models with higher levels of abstraction require less programming effort and provide similar performance compared to explicit threading models. The principal conclusion was that problem analysis and parallel design are important factors in the selection of the parallel programming model and tools, but that models with higher levels of abstraction, such as OpenMP and Threading Building Blocks, are favoured.
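
    As a concrete illustration of the abstraction-level point made above, the snippet below parallelises a simple reduction with OpenMP, one of the higher-level models the study compares against explicit threading. It is a generic example written for this summary, not code from the thesis; an equivalent Pthreads version would need explicit thread creation, per-thread partial sums and joins.

```c
#include <omp.h>
#include <stdio.h>

/* Parallel reduction expressed with a single OpenMP directive.
 * Build with: gcc -O2 -fopenmp sum.c */
int main(void)
{
    const long n = 100000000L;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += 1.0 / (double)(i + 1);

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```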

    Building A Scalable And High-Performance Key-Value Store System

    Contemporary web sites store and process very large amounts of data. To provide timely service to their users, they have adopted key-value (KV) stores, a simple but effective caching infrastructure placed atop the conventional databases that store these data, to boost performance. Examples are Facebook, Twitter and Amazon. Because little is known about realistic workloads outside of the companies that operate them, this dissertation provides a detailed workload study of Facebook's Memcached, one of the world's largest KV deployments. We analyze the Memcached workload from the perspective of server-side performance, request composition, caching efficacy, and key locality. The observations presented in this dissertation lead to several design insights and a new research direction for KV stores: Hippos, a high-throughput, low-latency, and energy-efficient KV-store implementation. Long considered memory-bound and network-bound, recent KV-store implementations on multicore servers grow increasingly CPU-bound instead. This limitation often leads to under-utilization of available bandwidth and poor energy efficiency, as well as long response times under heavy load. To address these issues, Hippos moves the KV store into the operating system's kernel and thus removes most of the overhead associated with the network stack and system calls. It uses the Netfilter framework to quickly handle UDP packets, removing the overhead of UDP-based GET requests almost entirely. Combined with lock-free multithreaded data access, Hippos removes several performance bottlenecks both internal and external to the KV-store application. Hippos is prototyped as a Linux loadable kernel module and evaluated against the ubiquitous Memcached using various micro-benchmarks and workloads from Facebook's production systems. The experiments show that Hippos provides some 20-200% throughput improvement on a 1 Gbps network (up to 590% improvement on a 10 Gbps network) and 5-20% power savings compared with Memcached.
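
    The description above hinges on intercepting UDP requests inside the kernel before they reach the socket layer. The skeleton below shows what such a Netfilter hook can look like on a reasonably recent Linux kernel; it is a hedged illustration of the mechanism, not Hippos code. The port number, the request handling and all error handling (non-linear skbs, IP options, reply generation) are deliberately elided.

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <net/net_namespace.h>

/* Intercept incoming UDP datagrams destined to the KV-store port before the
 * normal network stack sees them. A real implementation would parse the GET
 * request, look the key up in an in-kernel table and transmit the reply. */
static unsigned int kv_hook(void *priv, struct sk_buff *skb,
                            const struct nf_hook_state *state)
{
    struct iphdr *iph = ip_hdr(skb);
    struct udphdr *udph;

    if (!iph || iph->protocol != IPPROTO_UDP)
        return NF_ACCEPT;

    udph = (struct udphdr *)((unsigned char *)iph + iph->ihl * 4);
    if (ntohs(udph->dest) != 11211)        /* assumed memcached-style UDP port */
        return NF_ACCEPT;

    /* ... handle the request here and send the answer ... */
    return NF_DROP;                        /* datagram consumed in the kernel */
}

static struct nf_hook_ops kv_ops = {
    .hook     = kv_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init kv_init(void)  { return nf_register_net_hook(&init_net, &kv_ops); }
static void __exit kv_exit(void) { nf_unregister_net_hook(&init_net, &kv_ops); }

module_init(kv_init);
module_exit(kv_exit);
MODULE_LICENSE("GPL");
```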

    Performance analysis for parallel programs from multicore to petascale

    Cutting-edge science and engineering applications require petascale computing. Petascale computing platforms are characterized by both extreme parallelism (systems of hundreds of thousands to millions of cores) and hybrid parallelism (nodes with multicore chips). Consequently, to effectively use petascale resources, applications must exploit concurrency at both the node and system level --- a difficult problem. The challenge of developing scalable petascale applications is only partially aided by existing languages and compilers. As a result, manual performance tuning is often necessary to identify and resolve poor parallel and serial efficiency. Our thesis is that it is possible to achieve unique, accurate, and actionable insight into the performance of fully optimized parallel programs by measuring them with asynchronous-sampling-based call path profiles; attributing the resulting binary-level measurements to source code structure; analyzing measurements on-the-fly and postmortem to highlight performance inefficiencies; and presenting the resulting context-sensitive metrics in three complementary views. To support this thesis, we have developed several techniques for identifying performance problems in fully optimized serial, multithreaded and petascale programs. First, we describe how to attribute very precise (instruction-level) measurements to source-level static and dynamic contexts in fully optimized applications --- all for an average run-time overhead of a few percent. We then generalize this work with the development of logical call path profiling and apply it to work-stealing-based applications. Second, we describe techniques for pinpointing and quantifying parallel inefficiencies such as parallel idleness, parallel overhead and lock contention in multithreaded executions. Third, we show how to diagnose scalability bottlenecks in petascale applications by scaling our measurement, analysis and presentation tools to support large-scale executions. Finally, we provide a coherent framework for these techniques by sketching a unique and comprehensive performance analysis methodology. This work forms the basis of Rice University's HPCTOOLKIT performance tools.
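
    The core mechanism referred to above, asynchronous-sampling-based call path profiling, can be sketched in a few lines: arm a CPU-time timer and, on every SIGPROF tick, capture the current call path. The toy program below is illustrative only; HPCTOOLKIT uses its own binary analysis and stack unwinder and attributes samples to a calling-context tree, whereas this sketch relies on glibc's backtrace(), which is not strictly async-signal-safe, and merely counts ticks.

```c
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_FRAMES 64
static volatile sig_atomic_t samples;

/* Signal handler: capture the current call path on each profiling tick. */
static void on_prof_tick(int sig)
{
    void *frames[MAX_FRAMES];
    (void)sig;
    (void)backtrace(frames, MAX_FRAMES);   /* a real profiler records this path */
    samples++;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_prof_tick;
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* ~200 Hz sampling: deliver SIGPROF every 5 ms of consumed CPU time. */
    struct itimerval it = { { 0, 5000 }, { 0, 5000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;                 /* some work to profile */
    for (long i = 0; i < 200000000L; i++) x += (double)i * 0.5;

    printf("collected %d samples\n", (int)samples);
    return 0;
}
```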

    Instrumenting and analyzing platform-independent communication in applications

    The performance of microprocessors is limited by communication. This limitation, sometimes alluded to as the memory wall, refers to the hardware-level cost of communicating with memory. Recent studies have found that the promise of speedup from transistor scaling, or from employing heterogeneous processors such as GPUs, is diminished when such hardware communication costs are included. Based on the insight that hardware communication at run time is a manifestation of communication in software, this dissertation proposes that automatically capturing and classifying software-level communication is the first step in performing fast, early-stage design space exploration of future multicore systems. Software-level communication refers to the exchange of data between software entities such as functions, threads or basic blocks. Communication classification helps differentiate the first-time use from the reuse of communicated data, and distinguishes between communication external to a software entity and local communication within a software entity. We present Sigil, a novel tool that automatically captures and classifies software-level communication in an efficient way. Due to its platform-independent nature, software-level communication can be useful during the early-stage design of future multicore systems. Using the two different representations of output data that Sigil produces, we show that the measurement of software-level communication can be used to analyze i) function-level interaction in single-threaded programs to determine which specialized logic can be included in future heterogeneous multicore systems, and ii) thread-level interaction in multi-threaded programs to aid in chip multiprocessor (CMP) design space exploration.
    Ph.D., Electrical Engineering -- Drexel University, 201
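
    The classification described above can be reduced to a very small rule once every memory cell remembers which software entity wrote it last. The fragment below is an invented, simplified shadow-memory sketch meant only to make that rule concrete; Sigil itself instruments real programs and tracks considerably more state (for example, whether the communicated bytes are unique).

```c
#include <stdio.h>

enum { NO_WRITER = -1 };

enum comm_kind {
    COMM_FIRST_USE,   /* datum read before any tracked entity produced it */
    COMM_LOCAL,       /* produced and consumed by the same entity         */
    COMM_EXTERNAL     /* produced by one entity, consumed by another      */
};

struct shadow { int last_writer; };      /* one cell of shadow state */

static enum comm_kind classify_read(const struct shadow *s, int reader)
{
    if (s->last_writer == NO_WRITER) return COMM_FIRST_USE;
    if (s->last_writer == reader)    return COMM_LOCAL;
    return COMM_EXTERNAL;
}

static void record_write(struct shadow *s, int writer) { s->last_writer = writer; }

int main(void)
{
    struct shadow cell = { NO_WRITER };

    printf("%d\n", classify_read(&cell, 1));   /* first use */
    record_write(&cell, 1);
    printf("%d\n", classify_read(&cell, 1));   /* local     */
    printf("%d\n", classify_read(&cell, 2));   /* external  */
    return 0;
}
```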

    Analyse de performance de systèmes distribués et hétérogènes à l'aide de traçage noyau

    Cloud systems are increasingly used. These systems have a complex behavior, because they run on clusters of multicore computers. This composition of services is often heterogeneous, involving different technologies, libraries and programming environments. Interoperability is ensured using open protocols rather than standardizing runtime environments. The resulting configuration space grows exponentially with the number of parameters, and constantly changes in response to new needs and capacity adaptation. When a performance problem arises, it should be possible to quickly identify its cause in order to address it. However, performance problems can be intermittent and difficult to reproduce, and their cause can be a transient interaction between tasks or resources. The tools currently used to diagnose performance problems include resource utilization metrics, profiling tools, network monitoring, tracing tools, interactive debuggers and system logs. However, each component must be analyzed separately, or the user must manually correlate that information to try to deduce the root cause. Global observation of the execution of distributed systems is a major challenge for mastering their complexity and solving problems effectively. The main objective of this research is to obtain an analysis tool for understanding the overall performance of a distributed application. This type of analysis exists at the application level, but such analyses are specific to a runtime environment or a particular application domain.
To address this issue, we propose to use kernel tracing, which provides more abstract information about the execution of a program and is independent of the language or the libraries used. This research aims to determine whether the semantics of operating system events are effective for the performance analysis of such systems. The additional cost of tracing is an important issue because it must remain low, to avoid disturbing the system and to be useful in practice. We propose a new algorithm to find the waiting relationships between tasks and devices on a local computer. We established that the exact critical path of an application requires events which are not visible from the operating system. We therefore propose an approximation of the critical path, which we named the execution path. Previous approaches rely on system call analysis. However, such analysis must take into account the semantics of hundreds of system calls, which is not possible in general because the behavior of a system call depends on the state of the system at the time it executes. For example, the behavior of the read() system call is completely different when the file resides on a local disk rather than on a remote file server, and ioctl() is particularly problematic because its behavior is defined by the programmer. Tracing all system calls also increases the overhead, while most system calls do not change the flow of execution. Furthermore, the background kernel threads do not perform system calls and are not taken into account. Our approach relies instead on lower-level events, namely those of the scheduler and the interrupts. These events are independent of the semantics of system calls and take kernel threads into account. Tracing system calls is therefore optional, which helps reduce the overhead and simplifies the analysis. The benchmarks made with popular commercial software indicate that about 90% of the overhead is related to scheduling events. By producing scheduling cycles at the maximum frequency, we established that the average worst-case overhead is only 11%. Finally, we implemented an interactive graphical view showing the results of the analysis. With this tool, it was possible to quickly identify several performance and synchronization problems in actual applications. The tool also exposes the internal functioning and architecture of the program, which is useful for reverse engineering a complex system.
Secondly, we addressed the distributed dimension of the problem. The basic algorithm has been extended to support indirect network waits, while preserving its previous properties. The same algorithm can therefore be used for local or remote processes. We instrumented the kernel to match the TCP/IP packets sent and received between the machines involved in the processing. The algorithm takes into account the fact that reception and transmission of packets can occur asynchronously. The traces obtained on several systems do not have a common time base, as each machine has its own clock. The analysis requires that all traces share the same time reference and that exchanges appear in causal order; for this reason, the traces must first be synchronized. The algorithm was used to explore the behavior of different software architectures. We simulated various operating conditions (network delays, processing delays, retransmission, etc.) to validate the behavior and robustness of the technique. We verified that the result on a cluster of physical computers is the same as the one obtained when the services run inside virtual machines. The average overhead to trace Web requests is about 5%. The worst-case overhead, measured with the highest-frequency remote procedure call (an empty remote call), is approximately 18%.
To complete the analysis, we implemented use cases and software environments covering six different application domains, including a Django Web application, a Java-RMI server, a CIFS distributed file system, an Erlang service and an MPI parallel computation. As a secondary contribution, we proposed two improvements to the synchronization algorithm. The first is a pre-synchronization step that dramatically reduces the maximum memory consumption. The second improvement concerns the performance of the time transformation function. Time is represented in nanoseconds and the rate of change to apply must be very precise. The use of double-precision floating-point arithmetic is not accurate enough and produces event inversions, so expensive high-precision computation is required. Thanks to a simple factorization of the linear equation, most of the high-precision calculations were replaced by integer arithmetic of native register size, providing an average speed-up of approximately 65 times for the synchronization without affecting the precision of the result. The third area of research focuses on profiling the execution path segments. Sampling hardware performance counters allows efficient profiling of native code. One limitation concerns the interpreted code that may be found in a heterogeneous application: in this case, the native code running is the interpreter itself, and the link with the actual sources of the interpreted program is lost. We developed a technique to transfer the performance counter overflow event, delivered through the non-maskable interrupt of the processor, to an interpreter. The analysis of the interpreter state can then be performed in user space. To demonstrate the feasibility of the approach, we implemented the analysis module for Python. We compared the cost of the methods used to obtain the call stacks of the Python interpreter and of the interpreted code. These data are saved through LTTng-UST, whose time source is consistent with the kernel-mode traces, which allows the samples produced to be associated with the execution path. We validated the profile using a calibrated program. We measured less than 1% profile error, and this result is equivalent to the error rate of a deterministic profiler. The sampling period is a compromise between the overhead and the profile resolution. Our tests indicate that, for an execution path of 50 ms, a range of sampling rates exists that satisfies both a margin of error lower than 5% and an overhead of less than 10%.
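
    The integer-arithmetic time transformation mentioned above can be illustrated with a small fixed-point sketch. A synchronized timestamp is t' = a*t + b with a slope a very close to 1; factoring out the dominant term and expressing the residual slope as a scaled integer keeps the hot path in 64-bit arithmetic. The representation below (a 2^-32 fixed-point residual slope plus a reference point) is an assumed illustration of the idea, not the exact factorization used in the thesis.

```c
#include <stdint.h>
#include <stdio.h>

/* Fixed-point sketch of a clock transformation t' = a*t + b with a ~= 1.
 * The residual slope (a - 1) is stored as a signed integer scaled by 2^32,
 * so applying the transformation needs only 64-bit integer arithmetic. */
struct clock_xform {
    int64_t  slope_frac;  /* (a - 1) * 2^32; small because clock drift is small */
    int64_t  offset_ns;   /* additive offset b, in nanoseconds                  */
    uint64_t ref_ns;      /* reference timestamp, keeps (t - ref) small         */
};

static inline uint64_t xform_apply(const struct clock_xform *x, uint64_t t_ns)
{
    int64_t dt = (int64_t)(t_ns - x->ref_ns);
    /* dt * slope_frac stays within 64 bits as long as dt spans at most a few
       minutes of trace and the drift is a few hundred ppm; the >> 32 assumes
       an arithmetic right shift, which mainstream compilers provide. */
    return t_ns + (uint64_t)((dt * x->slope_frac) >> 32) + (uint64_t)x->offset_ns;
}

int main(void)
{
    /* Example: +100 ppm drift and a 2 microsecond offset. */
    struct clock_xform x = {
        .slope_frac = (int64_t)(0.0001 * 4294967296.0),
        .offset_ns  = 2000,
        .ref_ns     = 1000000000ULL,
    };
    uint64_t t = 1000000000ULL + 500000000ULL;          /* ref + 0.5 s */
    printf("transformed: %llu ns\n", (unsigned long long)xform_apply(&x, t));
    return 0;
}
```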

    Systems Support for Trusted Execution Environments

    Cloud computing has become a default choice for data processing by both large corporations and individuals due to its economy of scale and ease of system management. However, the question of trust and trustworthy computing inside cloud environments has long been neglected in practice, a problem further exacerbated by the proliferation of AI and its use for processing sensitive user data. Attempts to implement mechanisms for trustworthy computing in the cloud previously remained theoretical due to the lack of hardware primitives in commodity CPUs, while the combination of Secure Boot, TPMs, and virtualization has seen only limited adoption. The situation changed in 2016, when Intel introduced Software Guard Extensions (SGX) and its enclaves to x86 CPUs: for the first time, it became possible to build trustworthy applications relying on a commonly available technology. However, Intel SGX posed challenges to practitioners, who discovered the limitations of this technology, from the limited support for legacy applications and the integration of SGX enclaves into existing systems, to performance bottlenecks in communication, startup, and memory utilization. In this thesis, our goal is to enable trustworthy computing in the cloud by relying on these imperfect SGX primitives. To this end, we develop and evaluate solutions to issues stemming from the limited systems support of Intel SGX: we investigate mechanisms for runtime support of POSIX applications with SCONE, an efficient SGX runtime library developed with the performance limitations of SGX in mind. We further develop this topic with FFQ, a concurrent queue for SCONE's asynchronous system call interface. ShieldBox is our study of the interplay of kernel bypass and trusted execution technologies for NFV, which also tackles the problem of low-latency clocks inside enclaves. The last two systems, Clemmys and T-Lease, are built on the more recent SGXv2 ISA extension. In Clemmys, SGXv2 allows us to significantly reduce the startup time of SGX-enabled functions inside a Function-as-a-Service platform. Finally, in T-Lease we solve the problem of trusted time by introducing a trusted lease primitive for distributed systems. We evaluate all of these systems and show that they can be practically used in existing systems with minimal overhead, and can be combined with both legacy systems and other SGX-based solutions. In the course of the thesis, we enable trusted computing for individual applications, high-performance network functions, and a distributed computing framework, making the vision of trusted cloud computing a reality.
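
    A recurring mechanism above is handing system calls to the untrusted side asynchronously, so that enclave threads do not pay for an enclave exit on every call. The sketch below is an illustrative single-producer/single-consumer ring in plain C11 that conveys the idea; it is not the FFQ algorithm from the thesis, and all names and the syscall number are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring used to ship system-call requests from an enclave thread
 * to an untrusted helper thread that performs them outside the enclave. */
#define RING_SIZE 1024                      /* must be a power of two */

struct syscall_req { long nr; long args[6]; };

struct ring {
    _Atomic uint32_t head;                  /* advanced by the consumer */
    _Atomic uint32_t tail;                  /* advanced by the producer */
    struct syscall_req slots[RING_SIZE];
};

static bool ring_push(struct ring *r, const struct syscall_req *req)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE) return false;            /* full */
    r->slots[tail & (RING_SIZE - 1)] = *req;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, struct syscall_req *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) return false;                         /* empty */
    *out = r->slots[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

int main(void)
{
    static struct ring r;                                   /* zero-initialized */
    struct syscall_req in = { .nr = 1 /* hypothetical syscall number */ }, out;

    ring_push(&r, &in);
    return ring_pop(&r, &out) ? (int)out.nr : -1;           /* 1: round-tripped */
}
```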