Search CORE

3 research outputs found

Linux kernel compaction through cold code swapping

Author: A. Milanova
B. Ford
B. Sutter De
C.-T. Lee
D. Citron
D. Ferrari
D. Gay
D.J. Hatfield
J.L. Hennessy
K. Pettis
N. Gloy
S. Bhatia
S. Debray
S.K. Debray
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

There is a growing trend to use general-purpose operating systems like Linux in embedded systems. Previous research focused on using compaction and specialization techniques to adapt a general-purpose OS to the memory-constrained environment, presented by most, embedded systems. However, there is still room for improvement: it has been shown that even after application of the aforementioned techniques more than 50% of the kernel code remains unexecuted under normal system operation. We introduce a new technique that reduces the Linux kernel code memory footprint, through on-demand code loading of infrequently executed code, for systems that support virtual memory. In this paper, we describe our general approach, and we study code placement algorithms to minimize the performance impact of the code loading. A code, size reduction of 68% is achieved, with a 2.2% execution speedup of the system-mode execution time, for a case study based on the MediaBench II benchmark suite

Crossref

Ghent University Academic Bibliography

On the programmability of multi-GPU computing systems

Author: Cabezas Rodríguez Javier
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2015
Field of study

Multi-GPU systems are widely used in High Performance Computing environments to accelerate scientific computations. This trend is expected to continue as integrated GPUs will be introduced to processors used in multi-socket servers and servers will pack a higher number of GPUs per node. GPUs are currently connected to the system through the PCI Express interconnect, which provides limited bandwidth (compared to the bandwidth of the memory in GPUs) and it often becomes a bottleneck for performance scalability. Current programming models present GPUs as isolated devices with their own memory, even if they share the host memory with the CPU. Programmers explicitly manage allocations in all GPU memories and use primitives to communicate data between GPUs. Furthermore, programmers are required to use mechanisms such as command queues and inter-GPU synchronization. This explicit model harms the maintainability of the code and introduces new sources for potential errors. The first proposal of this thesis is the HPE model. HPE builds a simple, consistent programming interface based on three major features. (1) All device address spaces are combined with the host address space to form a Unified Virtual Address Space. (2) Programs are provided with an Asymmetric Distributed Shared Memory system for all the GPUs in the system. It allows to allocate memory objects that can be accessed by any GPU or CPU. (3) Every CPU thread can request a data exchange between any two GPUs, through simple memory copy calls. Such a simple interface allows HPE to provide always the optimal implementation; eliminating the need for application code to handle different system topologies. Experimental results show improvements on real applications that range from 5% in compute-bound benchmarks to 2.6x in communication-bound benchmarks. HPE transparently implements sophisticated communication schemes that can deliver up to a 2.9x speedup in I/O device transfers. The second proposal of this thesis is a shared memory programming model that exploits the new GPU capabilities for remote memory accesses to remove the need for explicit communication between GPUs. This model turns a multi-GPU system into a shared memory system with NUMA characteristics. In order to validate the viability of the model we also perform an exhaustive performance analysis of remote memory accesses over PCIe. We show that the unique characteristics of the GPU execution model and memory hierarchy help to hide the costs of remote memory accesses. Results show that PCI Express 3.0 is able to hide the costs of up to a 10% of remote memory accesses depending on the access pattern, while caching of remote memory accesses can have a large performance impact on kernel performance. Finally, we introduce AMGE, a programming interface, compiler support and runtime system that automatically executes computations that are programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type for multidimensional arrays that allows for robust, transparent distribution of arrays across all GPU memories. The compiler extracts the dimensionality information from the type of each array, and is able to determine the access pattern in each dimension of the array. The runtime system uses the compiler-provided information to automatically choose the best computation and data distribution configuration to minimize inter-GPU communication and memory footprint. This model effectively frees programmers from the task of decomposing and distributing computation and data to exploit several GPUs. AMGE achieves almost linear speedups for a wide range of dense computation benchmarks on a real 4-GPU system with an interconnect with moderate bandwidth. We show that irregular computations can also benefit from AMGE, too.Los sistemas multi-GPU son muy comúnmente utilizados en entornos de computación de altas prestaciones para acelerar cálculos científicos. Esta tendencia continuará con la introducción de GPUs integradas en los procesadores de los servidores procesador y con una mayor densidad de GPUs por nodo. Las GPUs actualmente se contectan al sistema a través de una interconexión PCI Express, que provee un ancho de banda reducido (comparado con las memorias de las GPUs) y habitualmente se convierte en el cuello de botella para escalar el rendimiento. Los modelos de programación actuales exponen las GPUs como dispositivos aislados con su propia memoria, incluso si comparten la memoria física con la CPU. Los programadores manejan diferentes reservas en todas las memorias de GPU y usan primitivas para comunicar datos entre GPUs. Además, los programadores deben utilizar mecanismos como colas de comandos y sincronicación entre GPUs. Este modelo explícito empeora la programabilidad del código e introduce nuevas fuentes de errores potenciales. La primera propuesta de esta tesis es el modelo HPE. HPE construye una interfaz de programaci ón consistente basada en tres características principales. (1) Todos los espacios de direcciones de los dispositivos son combinados para formar un espacio de direcciones unificado. (2) Los programas usan un sistema asimétrico distribuido de memoria compartida para todas las GPUs del sistema, que permite declarar objetos de memoria que pueden ser accedidos por cualquier GPU o CPU. (3) Cada hilo de ejecución de la CPU puede lanzar un intercambio de datos entre dos GPUs a través de simples llamadas de copia de memoria. Esta interfaz simplificada permite a HPE usar la implementaci ón óptima; sinque la aplicación contemple diferentes topologías de sistema. Los resultados experimentales muestran mejoras en aplicaciones reales que van desde un 5% en aplicaciones limitadas por el cómputo a 2.6x aplicaciones imitadas por la comunicación. HPE implementa sofisticados esquemas de transferencia para dispositivos de E/S que proporcionan mejoras de rendimiento de 2.9x. La segunda propuesta de esta tesis es un modelo de programación basado en memoria compartida que aprovecha las nuevas capacidades acceso remoto de memoria de las GPUs para eliminar la comunicación explícita entre memorias de GPU. Este modelo convierte un sistema multi-GPU en un sistema de memoria compartida con características NUMA. Para validar la viabilidad del modelo realizamos un anlásis exhaustivo del rendimiento los accessos de memoria remotos sobre PCIe. Los resultados muestran que PCI Express 3.0 elimina los costes de hasta un 10% de accesos remotos, dependiendo en el patrón de acceso, mientras que guardar los accesos remotos en memorias cache tiene un gran inpacto en el rendimiento de las computaciones. Finalmente, presentamos AMGE, una interfaz de programación con soporte de compilación y un sistema que ejecuta, de forma automática, computaciones programadas para una única GPU en todas las GPUs del sistema. La interfaz de programación proporciona un tipo de datos para arreglos multidimensionales que permite una distribuci ón transparente y robusta de los datos en todas las memorias de GPU. El compilador extrae la información sobre la dimensionalidad de cada arreglo y puede determinar el patrón de acceso en cada dimensión de forma individual. El sistema utiliza, en tiempo de ejecución, la información del compilador para elegir la mejor descomposición de la computación y los datos para minimizar la comunicación entre GPUs y el uso de memoria. AMGE consigue mejoras de rendimiento que crecen de forma lineal con el número de GPUs para un amplio abanico de computaciones densas en un sistema real con 4 GPUs. También mostramos que las computaciones con patrones irregulares también se pueden beneficiar de AMGE

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Automated tailoring of system software stacks

Author: Ziegler Andreas
Publication venue: Hannover : Institutionelles Repositorium der Leibniz Universität
Publication date: 01/01/2023
Field of study

In many industrial sectors, device manufacturers are moving away from expensive special-purpose hardware units and consolidate their systems on commodity hardware. As part of this change, developers are enabled to run their applications on general-purpose operating systems like Linux, which already supports thousands of different devices out of the box and can be used in a wide range of target scenarios. Furthermore, the Linux ecosystem allows them to integrate existing implementations of standard functionality in the form of shared libraries. However, as the libraries and the Linux kernel are designed as generic building blocks in order to support as many applications as possible, they cannot make assumptions about specific use cases for a single-purpose device. This generality leads to unnecessary overheads in narrowly defined target scenarios, as unneeded components do not only take up space on the target system but have to be maintained over the lifetime of the device as well. While the Linux kernel provides a configuration system to disable unneeded functionality like device drivers, determining the required features from over 16000 options is an infeasible task. Even worse, most shared libraries cannot be customized even though only around 10 percent of their functions are ever used by applications. In this thesis, I present my approaches for the automated identification and removal of unnecessary components in all layers of the software stack. As the configuration system is an integral part of the Linux kernel, we embrace its presence and automatically generate custom-fitted configurations for observed target scenarios with the help of an extracted variability model. For the much more diverse realm of shared libraries, with different programming languages, build systems, and a lack of configurability, I demonstrate a different approach. By identifying individual functions as logically distinct units, we construct a symbol-level dependency graph across the applications and all their required libraries. We then remove unneeded code at the binary level and rearrange the remaining parts to take up minimal space in the binary file by formulating their placement as an optimization problem. To lower the number of unnecessary updates to unused components in a deployed system, I lastly present an automated method to determine the impact of software changes on a target scenario and provide guidance for developers on whether they need to update their systems. Applying these techniques to different target systems, I demonstrate that we can disable up to 87 percent of configuration options in a Debian Linux kernel, shrink the size of an embedded OpenWrt kernel by 59 percent, and speed up the boot process of the embedded system by 21 percent. As part of the shared library tailoring process, we can remove 13060 functions from all libraries in OpenWrt and reduce their total size by 31 percent. In the memcached Docker container, we identify 381 entirely unneeded shared libraries and shrink the container image size by 82 percent. An analysis of the development history of two large library projects over the course of more than two years further shows that between 68 and 82 percent of all changes are not required for an OpenWrt appliance, reducing the number of patch days by up to 69 percent. These results demonstrate the broad applicability of our automated methods for both the Linux kernel and shared libraries to a wide range of scenarios. From embedded systems to server applications, custom-tailored system software stacks contribute to the reduction of overheads in space and time

Institutionelles Repositorium der Leibniz Universität Hannover