6 research outputs found

    Resource Modification On Multicore Server With Kernel Bypass

    Technology is developing rapidly, marked by many innovations in both hardware and software. Multicore servers with a growing number of cores require efficient software. The kernel and hardware used to handle various operational needs have some limitations. These limitations stem from the high complexity of server workloads, such as a single socket descriptor, a single IRQ, and a lack of pooling, and they call for modifications. Kernel bypass is one method of overcoming the kernel's deficiencies. The modifications on this server combine an increase in throughput with a decrease in latency. To increase throughput, the driver is modified to hash the RX signal, and reception is extended with multiple IP receivers, multiple receiver threads, and multiple port listeners. To decrease latency, pooling principles are applied at either the kernel level or the program level. This combination of modifications makes the server more reliable, with an average throughput increase of 250.44% and a latency decrease of 65.83%.
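To make the two reported percentages easier to compare, the short sketch below converts them into multipliers on the baseline. It assumes the figures are relative changes against the unmodified server, which the abstract implies but does not state; the function names are illustrative only.

```python
def factor_after_increase(percent: float) -> float:
    """Multiplier on the baseline after a relative increase of `percent`."""
    return 1.0 + percent / 100.0


def factor_after_decrease(percent: float) -> float:
    """Multiplier on the baseline after a relative decrease of `percent`."""
    return 1.0 - percent / 100.0


# Figures quoted in the abstract; the baseline interpretation is an assumption.
throughput_factor = factor_after_increase(250.44)
latency_factor = factor_after_decrease(65.83)

print(f"throughput: {throughput_factor:.2f}x baseline")
print(f"latency:    {latency_factor:.2f}x baseline")
```

Under that reading, the modified server handles roughly three and a half times the baseline throughput at roughly one third of the baseline latency.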

    TABARNAC: Visualizing and Resolving Memory Access Issues on NUMA Architectures

    In modern parallel architectures, memory accesses represent a common bottleneck. Thus, optimizing the way applications access memory is an important way to improve performance and energy consumption. Memory accesses matter even more on NUMA machines, where the access time to data depends on its location in memory. Many efforts have been made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools cannot change the memory access pattern of the original application, so code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis. In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. Such visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.
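The notion of a "distribution of accesses by thread and by structure" can be illustrated with a small tally. This is only a sketch under stated assumptions: the trace format, the structure names, and the helper functions are hypothetical and are not taken from TABARNAC itself.

```python
from collections import Counter, defaultdict

# Hypothetical trace: one (thread_id, structure_name) pair per memory access.
trace = [
    (0, "A"), (0, "A"), (1, "A"), (2, "A"), (3, "A"),
    (0, "B"), (1, "B"), (2, "B"), (3, "B"),
]


def accesses_by_structure(records):
    """Count accesses per thread, grouped by data structure."""
    table = defaultdict(Counter)
    for thread_id, structure in records:
        table[structure][thread_id] += 1
    return table


def share_per_thread(counter):
    """Fraction of a structure's accesses issued by each thread."""
    total = sum(counter.values())
    return {thread: count / total for thread, count in sorted(counter.items())}


table = accesses_by_structure(trace)
for structure in sorted(table):
    print(structure, share_per_thread(table[structure]))
```

A structure whose accesses are spread evenly across threads, like the hypothetical `B` here, is the kind of pattern for which a single placement on one NUMA node cannot suit every thread.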

    TABARNAC: Tools for Analyzing Behavior of Applications Running on NUMA Architecture

    In modern parallel architectures, memory accesses represent a common bottleneck. Thus, optimizing the way applications access memory is an important way to improve performance and energy consumption. Memory accesses matter even more on NUMA machines, where the access time to data depends on its location in memory. Many efforts have been made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools cannot change the memory access pattern of the original application, so code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis. In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. Such visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.

    Moca: An Efficient Memory Trace Collection System

    In modern high-performance computing architectures, the memory subsystem is a common performance bottleneck. When optimizing an application, the developer has to study its memory access patterns and adapt the algorithms and data structures it uses accordingly. The objective is twofold. On the one hand, it is necessary to avoid misuses of the memory hierarchy, such as false sharing of cache lines or contention in a NUMA interconnect. On the other hand, it is essential to take advantage of the various cache levels and the hardware memory prefetcher. Still, most profiling tools focus on CPU metrics. The few that can provide an overview of the memory access patterns of an execution rely on hardware instrumentation mechanisms and have two drawbacks. First, they are based on sampling, whose precision is limited by hardware capabilities. Second, they trace only a subset of the memory accesses, usually the most frequent, without information about the others. In this study, we present Moca, an efficient tool for collecting complete spatiotemporal memory traces. It is based on a Linux kernel module and provides a coarse-grained trace of a superset of all the memory accesses performed by an application over its address space during its execution. The overhead of Moca is reasonable given that it collects complete traces, which are also more precise than those collected by comparable tools.

    Analysis of Performance Differences among Task-Parallel Runtime Systems Based on Scheduling Delays

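    Degree type: Doctorate by coursework (課程博士). Examination committee members: (Chair) Associate Professor Masashi Toyoda (豊田 正史), University of Tokyo; Professor Kenjiro Taura (田浦 健次朗), University of Tokyo; Associate Professor Hidetsugu Irie (入江 英嗣), University of Tokyo; Professor Kengo Nakajima (中島 研吾), University of Tokyo; Team Leader Mitsuhisa Sato (佐藤 三久), RIKEN; Associate Professor Rio Yokota (横田 理央), Tokyo Institute of Technology. University of Tokyo (東京大学)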