7 research outputs found

    Simulation of High-Performance Memory Allocators

    This study presents a single-core and a multi-core processor architecture for health monitoring systems, where slow biosignal events coexist with highly parallel computations. The single-core architecture is composed of a processing core (PC), an instruction memory (IM), and a data memory (DM), while the multi-core architecture consists of several PCs, an individual IM for each core, a shared DM, and an interconnection crossbar between the cores and the DM. These architectures are compared with respect to power vs. performance trade-offs for a multi-lead electrocardiogram signal-conditioning application exploiting near-threshold computing. The results show that the multi-core solution consumes 66% less power for high computation requirements (50.1 MOps/s), but 10.4% more power for low computation needs (681 kOps/s).
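
    As a toy illustration only (the study reports measured results, not a formula), the trade-off above can be read as a crossover rule: below some throughput the single-core design is cheaper, above it the multi-core design wins. The crossover value and all names in this C sketch are assumptions, not data from the study.

        #include <stdio.h>

        typedef enum { SINGLE_CORE, MULTI_CORE } arch_t;

        /* Pick the lower-power configuration for a given workload, assuming a
           single power crossover somewhere between the two measured operating
           points (681 kOps/s and 50.1 MOps/s); the exact value is hypothetical. */
        static arch_t pick_architecture(double ops_per_sec, double crossover) {
            return ops_per_sec > crossover ? MULTI_CORE : SINGLE_CORE;
        }

        int main(void) {
            const double crossover = 10e6;  /* assumed crossover throughput */
            printf("50.1 MOps/s -> %s\n",
                   pick_architecture(50.1e6, crossover) == MULTI_CORE ? "multi-core" : "single-core");
            printf("681 kOps/s  -> %s\n",
                   pick_architecture(681e3, crossover) == MULTI_CORE ? "multi-core" : "single-core");
            return 0;
        }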

    Memory-Driven Thread Scheduling on NUMA Architectures

    These runtime systems are intended to facilitate the efficient exploitation of "cluster of NUMA machines" architectures. The Runtime team has solid experience in exploiting multiprocessor machines, and its work has notably led to a library (named Marcel) that schedules a large number of lightweight processes in a portable way. To guide scheduling from the application, the programmer can form "bubbles" that encapsulate threads or other bubbles. These abstractions make it possible to group threads that share common characteristics, such as access to shared data (memory affinity). At run time, the scheduler can then use this information (bubble contents plus attributes attached to the bubbles) to place threads appropriately on the machine's processors. The Marcel platform also lets the programmer define the scheduling function or use one of the predefined policies. Currently, scheduling strategies can use affinity properties between threads, but no information about the location, volume, or access rate of the data is available. The goal of this thesis is therefore to study bubble-based thread scheduling in a context where such information is available. A first step is to extend the Marcel platform with mechanisms that let the programmer specify, for static or dynamically allocated data, weighted links between data segments and the bubbles accessing them (the weight typically representing an access rate). At run time, it must then be possible to determine at any moment, for a given bubble, the hierarchy of its "attraction basins" on the machine.
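
    A minimal C sketch of the proposed weighted data-to-bubble links may clarify the idea. All names here (bubble_t, bubble_attach_data, and so on) are hypothetical illustrations, not the actual Marcel API:

        #include <stddef.h>
        #include <stdlib.h>

        typedef struct bubble bubble_t;

        typedef struct data_link {
            void   *segment;            /* static or heap-allocated data segment */
            size_t  size;               /* segment length in bytes */
            double  weight;             /* typically an access rate */
            struct data_link *next;
        } data_link_t;

        struct bubble {
            data_link_t *links;         /* weighted links to the data this bubble touches */
            /* ... encapsulated threads and sub-bubbles elided ... */
        };

        /* Declare that `bubble` accesses `segment` with the given weight; the
           scheduler can later aggregate these links per NUMA node to rank the
           bubble's "attraction basins" on the machine. */
        void bubble_attach_data(bubble_t *bubble, void *segment, size_t size, double weight) {
            data_link_t *l = malloc(sizeof *l);
            l->segment    = segment;
            l->size       = size;
            l->weight     = weight;
            l->next       = bubble->links;
            bubble->links = l;
        }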

    Scalable locality-conscious multithreaded memory allocation


    High-Performance Concurrent Memory Allocation

    Memory management takes a sequence of program-generated allocation/deallocation requests and attempts to satisfy them within a fixed-size block of memory while minimizing the total amount of memory used. A general-purpose dynamic-allocation algorithm cannot anticipate future allocation requests, so its output is rarely optimal. However, memory allocators do take advantage of regularities in the allocation patterns of typical programs to produce excellent results, both in time and space (similar to LRU paging). In general, allocators use a number of similar techniques, each optimizing specific allocation patterns. Nevertheless, memory allocators are a series of compromises, occasionally with some static or dynamic tuning parameters to optimize specific program-request patterns. The goal of this thesis is to build a low-latency memory allocator for both kernel and user multi-threaded systems, which is competitive with the best current memory allocators, while extending the feature set of existing and new allocator routines. A new llheap memory allocator is created that achieves all of these goals, while maintaining and managing sticky allocation properties for zero-filled and aligned allocations without a performance loss. Hence, it becomes possible to use realloc frequently as a safe operation, rather than just occasionally, because it preserves sticky properties when enlarging storage requests. Furthermore, the ability to query sticky properties and information allows programmers to write safer programs, as it is possible to dynamically match allocation styles from unknown library routines that return allocations. The C allocation API is also extended with resize, an advanced realloc, aalloc, amemalign, and cmemalign, so programmers do not make mistakes writing these useful allocation operations. llheap is embedded into the uC++ and C-for-all runtime systems, both of which have user-level threading. The ability to use C-for-all's advanced type system (and possibly C++'s too) to combine advanced memory operations into one allocation routine using named arguments shows how far the allocation API can be pushed, which increases safety and greatly simplifies programmers' use of dynamic allocation. The llheap allocator also provides comprehensive statistics for all allocation operations, which are invaluable in understanding and debugging a program's dynamic behaviour. No other memory allocator examined in the thesis provides such comprehensive statistics gathering. As well, llheap provides a debugging mode where allocations are checked with internal pre/post conditions and invariants; it is extremely useful, especially for students. While not as powerful as the valgrind interpreter, it detects a large number of allocation mistakes. Finally, contention-free statistics gathering and debugging have a low enough cost to be used in production code. A micro-benchmark test-suite is begun for comparing allocators, rather than relying on a suite of arbitrary programs; it has been an interesting challenge. These micro-benchmarks have adjustment knobs to simulate allocation patterns otherwise hard-coded into arbitrary test programs. The existing memory allocators glibc, dlmalloc, hoard, jemalloc, ptmalloc3, rpmalloc, and tbmalloc, along with the new allocator llheap, are all compared using the new micro-benchmark test-suite.
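
    A hedged usage sketch of the extended allocation API may help. The routine names (aalloc, amemalign, cmemalign, resize) come from the abstract, but the signatures below are assumptions modelled on calloc and aligned_alloc, not llheap's documented interface:

        #include <stdlib.h>

        /* Assumed prototypes for the extended routines named in the abstract. */
        void *aalloc(size_t dim, size_t elemSize);                  /* array allocation (assumed signature) */
        void *amemalign(size_t align, size_t dim, size_t elemSize); /* aligned array allocation (assumed) */
        void *cmemalign(size_t align, size_t dim, size_t elemSize); /* aligned, zero-filled array (assumed) */
        void *resize(void *ptr, size_t size);                       /* resize without preserving data (assumed) */

        void example(void) {
            /* 64-byte-aligned, zero-filled array; both properties are "sticky". */
            double *v = cmemalign(64, 1024, sizeof(double));

            /* Because llheap preserves sticky properties when enlarging, the grown
               array stays aligned and its new tail stays zero-filled, which is what
               makes frequent use of realloc safe. */
            v = realloc(v, 2048 * sizeof(double));
            free(v);
        }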

    Contribution to the Design of Programming Environments Dedicated to High-Performance Scientific Computing

    In the field of intensive scientific computing, the quest for performance has to face the increasing complexity of parallel architectures. Nowadays, these machines exhibit a deep hierarchy of compute units and memories, which greatly complicates the design of efficient parallel applications. This thesis proposes a runtime environment for designing efficient parallel programs on top of clusters of multiprocessors. It features a programming model centred on collective communication and synchronization operations and on load balancing. The programming interface, named MPC, provides high-level paradigms that are implemented in an optimized way according to the underlying architecture. The environment is fully functional and used on the computing platform of the CEA/DAM (TERANOVA), and the evaluations presented in this document confirm the relevance of the chosen approach.
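
    The abstract does not detail MPC's interface, so as a generic illustration of a programming model centred on collective communication and synchronization, here is the equivalent pattern in standard MPI (this is not MPC code):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Each task contributes a partial result; the collective both
               synchronizes the tasks and combines their data. This is the kind
               of high-level operation such an environment can specialize for the
               underlying machine hierarchy. */
            double partial = (double)rank, total = 0.0;
            MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            if (rank == 0) printf("sum = %f\n", total);
            MPI_Finalize();
            return 0;
        }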

    Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

    This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus caches and by its significantly lower overheads in access time, energy consumption, area, and overall runtime. Dynamic data refers to all objects allocated at run time in a program, as opposed to static data objects, which are allocated at compile time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global, and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move dynamic data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack, and heap variables can share the same scratch-pad. When compared to placing all dynamic data in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3%, and the average power consumption by 26.7%, for a scratch-pad size fixed at 5% of total data size. Significant savings in runtime and energy across a large number of benchmarks were also observed when compared against cache memory organizations, showing our method's success under constrained SRAM sizes when dealing with dynamic data. Lastly, our method is able to minimize the profile-dependence issues which plague all similar allocation methods, through careful analysis of static and dynamic profile information.
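
    The following sketch illustrates the core mechanism described above, in which allocation sites are rewritten to prefer scratch-pad memory and compiler-chosen program points move cold objects out to DRAM. All names are hypothetical; this is not the thesis's implementation:

        #include <stddef.h>
        #include <stdlib.h>
        #include <string.h>

        #define SPM_SIZE (16 * 1024)
        static char   spm[SPM_SIZE];   /* stands in for the directly addressed SRAM */
        static size_t spm_top = 0;

        /* Allocation site rewritten by the compiler: place the object in the
           scratch-pad when space allows, otherwise fall back to DRAM. Note that
           there are no software-caching tags and no per-access address translation. */
        void *spm_malloc(size_t size) {
            if (spm_top + size <= SPM_SIZE) {
                void *p = &spm[spm_top];
                spm_top += size;
                return p;
            }
            return malloc(size);       /* DRAM fallback */
        }

        /* Inserted by the compiler where profiling says `obj` goes cold: copy it
           to DRAM and return the new address. (Reclaiming and reusing the vacated
           scratch-pad region is elided from this sketch.) */
        void *spm_evict(void *obj, size_t size) {
            void *dram = malloc(size);
            memcpy(dram, obj, size);
            return dram;
        }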

    A Locality-Improving Dynamic Memory Allocator

    In general-purpose applications, most data is dynamically allocated. The memory manager therefore plays a crucial role in application performance by determining the spatial locality of heap objects. Previous general-purpose allocators have focused on reducing fragmentation, while most locality-improving allocators have either focused on improving the locality of the allocator (not the application) or required programmer hints or profiling to guide object placement. We present a high-performance memory allocator called Vam that transparently improves both cache-level and page-level locality of the application while achieving low fragmentation. Over a range of large-footprint benchmarks, Vam improves application performance by an average of 4-8% versus the Lea (Linux) and FreeBSD allocators. When memory is scarce, Vam improves application performance by up to 2X compared to the FreeBSD allocator, and by over 10X compared to the Lea allocator.
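
    As a generic illustration of the kind of technique such locality-conscious allocators rely on (not Vam's actual design), the sketch below carves same-sized objects out of page-aligned chunks so that consecutive allocations share cache lines and pages:

        #include <stddef.h>
        #include <stdlib.h>

        #define PAGE_SIZE 4096

        typedef struct size_class {
            char  *chunk;    /* current page-aligned chunk */
            size_t used;     /* bytes handed out from the chunk */
            size_t objsize;  /* fixed object size for this class */
        } size_class_t;

        /* Bump-allocate from a page-aligned chunk; consecutive allocations of
           the same size land next to each other, improving spatial locality at
           both the cache and page level. */
        void *class_alloc(size_class_t *c) {
            if (c->chunk == NULL || c->used + c->objsize > PAGE_SIZE) {
                if (posix_memalign((void **)&c->chunk, PAGE_SIZE, PAGE_SIZE) != 0)
                    return NULL;
                c->used = 0;
            }
            void *p = c->chunk + c->used;
            c->used += c->objsize;
            return p;
        }

        /* Usage: size_class_t sc = { NULL, 0, 64 }; void *obj = class_alloc(&sc); */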