13 research outputs found

    System on fabrics utilising distributed computing

    Get PDF
    The main vision of wearable computing is to make electronic systems an important part of everyday clothing in the future which will serve as intelligent personal assistants. Wearable devices have the potential to be wearable computers and not mere input/output devices for the human body. The present thesis focuses on introducing a new wearable computing paradigm, where the processing elements are closely coupled with the sensors that are distributed using Instruction Systolic Array (ISA) architecture. The thesis describes a novel, multiple sensor, multiple processor system architecture prototype based on the Instruction Systolic Array paradigm for distributed computing on fabrics. The thesis introduces new programming model to implement the distributed computer on fabrics. The implementation of the concept has been validated using parallel algorithms. A real-time shape sensing and reconstruction application has been implemented on this architecture and has demonstrated a physical design for a wearable system based on the ISA concept constructed from off-the-shelf microcontrollers and sensors. Results demonstrate that the real time application executes on the prototype ISA implementation thus confirming the viability of the proposed architecture for fabric-resident computing devices

    Exploiting data locality in cache-coherent NUMA systems

    Get PDF
    The end of Dennard scaling has caused a stagnation of the clock frequency in computers.To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in the systems, interconnecting many nodes, including multiple chips in the nodes and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors, it is much slower and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips. Still keeping a shared address space, where all processors can access the whole memory, solutions have come by integrating more memories: by using newer technologies like high-bandwidth memories (HBM) and non-volatile memories (NVM), by giving groups cores (like sockets, for example) faster access to some subset of the DRAM, or by combining many of these solutions. This has caused some heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours. Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes in the memory done from one CPU must be visible by the other CPUs and transparent for the programmer. These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems. The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high performance NUMA systems, and the performance counters collected in one machine are used to predict the best configurations in the other machine. The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers with a low hardware cost but giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results of the performance, the data transfer reduction and the energy costs. Finally, the third and last contribution consists in scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications in parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and provide a scheduling for the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated in real ccNUMA systems with multiple NUMA regions.La fi de la llei de Dennard ha provocat un estancament de la freqüència de rellotge dels computadors. Amb l'objectiu de superar aquest fet, durant les darreres dues dècades els fabricants han integrat més quantitat d'unitats de còmput als sistemes mitjançant la interconnexió de nodes diferents, la inclusió de múltiples xips als nodes i l'increment de nuclis de processador a cada xip. La rapidesa de la memòria principal no ha evolucionat amb el mateix factor que els processadors; és molt més lenta i hi ha la necessitat de proporcionar més ample de banda als processadors, especialment amb l'increment del nombre de nuclis i xips. Tot mantenint un adreçament compartit en el qual tots els processadors poden accedir a la memòria sencera, les solucions han estat al voltant de la integració de més memòries: amb tecnologies modernes com HBM (high-bandwidth memories) i NVM (non-volatile memories), fent que grups de nuclis (com sòcols sencers) tinguin accés més ràpid a una part de la DRAM o amb la combinació de solucions. Això ha provocat una heterogeneïtat en la velocitat d'accés a la memòria principal, en funció del nucli que sol·licita l'accés a una adreça en particular i la seva localització física, fet que provoca uns comportaments no uniformes en l'accés a la memòria (non-uniform memory access, NUMA). A més, sovint tenen memòries cau coherents (cache-coherent NUMA, ccNUMA), que implica que qualsevol canvi fet a la memòria des d'un nucli d'un processador ha de ser visible la resta de manera transparent. Aquests comportaments redueixen el rendiment de les aplicacions i suposen un repte. Per abordar el problema, a la tesi s'hi proposen solucions, a nivell de programari i maquinari, que milloren la localitat de dades als sistemes NUMA i, en conseqüència, el rendiment de les aplicacions en aquests sistemes. La primera contribució mostra que, quan es tenen en compte alhora la precàrrega d'adreces de memòria amb maquinari (hardware prefetching) i les decisions d'ubicació dels fils d'execució i les dades als sistemes NUMA, es poden trobar millors configuracions que quan es condieren per separat. Una combinació dels resultats de rendiment i dels comptadors disponibles al sistema s'utilitza per construir un model de rendiment per fer la predicció, tant per avançat com també en temps d'execució, de la millor configuració per aplicacions que no es troben al model. L'avaluació es du a terme a dos sistemes NUMA d'alt rendiment, i els comptadors mesurats en un sistema s'usen per predir les millors configuracions a l'altre sistema. La segona contribució es basa en la idea que el prefetching pot tenir un efecte considerable als sistemes NUMA i proposa un esquema de precàrrega a nivell de maquinari que té en compte els efectes NUMA. L'esquema és genèric i es pot aplicar als algorismes de precàrrega existents amb un cost de maquinari molt baix però amb molt bons resultats. S'avalua amb un simulador arquitectural acurat a nivell de cicle i proporciona resultats detallats del rendiment, la reducció de les comunicacions de dades i els costos energètics. La tercera i darrera contribució consisteix en algorismes de planificació per models de programació basats en tasques. Aquests simplifiquen la programabilitat de les aplicacions paral·leles i proveeixen informació molt útil al sistema en temps d'execució (runtime system) que en controla el funcionament. Amb aquesta informació es construeix un graf de dependències entre tasques (task dependency graph, TDG), un graf dirigit i acíclic que modela l'aplicació i en el qual els nodes són fragments de codi seqüencial (o tasques) i els arcs són les dependències de dades entre les tasques. Els algorismes de planificació proposats fan servir tècniques de particionat de grafs i proporcionen una planificació de les tasques del TDG que minimitza la comunicació de dades entre les diferents regions NUMA del sistema. Els resultats han estat avaluats en sistemes ccNUMA reals amb múltiples regions NUMA.El final de la ley de Dennard ha provocado un estancamiento de la frecuencia de reloj de los computadores. Con el objetivo de superar este problema, durante las últimas dos décadas los fabricantes han integrado más unidades de cómputo en los sistemas mediante la interconexión de nodos diferentes, la inclusión de múltiples chips en los nodos y el incremento de núcleos de procesador en cada chip. La rapidez de la memoria principal no ha evolucionado con el mismo factor que los procesadores; es mucho más lenta y hay la necesidad de proporcionar más ancho de banda a los procesadores, especialmente con el incremento del número de núcleos y chips. Aun manteniendo un sistema de direccionamiento compartido en el que todos los procesadores pueden acceder al conjunto de la memoria, las soluciones han oscilado alrededor de la integración de más memorias: usando tecnologías modernas como las memorias de alto ancho de banda (highbandwidth memories, HBM) y memorias no volátiles (non-volatile memories, NVM), haciendo que grupos de núcleos (como zócalos completos) tengan acceso más veloz a un subconjunto de la DRAM, o con la combinación de soluciones. Esto ha provocado una heterogeneidad en la velocidad de acceso a la memoria principal, en función del núcleo que solicita el acceso a una dirección de memoria en particular y la ubicación física de esta dirección, lo que provoca unos comportamientos no uniformes en el acceso a la memoria (non-uniform memory access, NUMA). Además, muchos de estos sistemas tienen memorias caché coherentes (cache-coherent NUMA, ccNUMA), lo que implica que cualquier cambio hecho en la memoria desde un núcleo de un procesador debe ser visible por el resto de procesadores de forma transparente para los programadores. Estos comportamientos NUMA reducen el rendimiento de las aplicaciones y pueden suponer un reto para los programadores. Para abordar dicho problema, en esta tesis se proponen soluciones, a nivel de software y hardware, que mejoran la localidad de datos en los sistemas NUMA y, en consecuencia, el rendimiento de las aplicaciones en estos sistemas informáticos. La primera contribución muestra que, cuando se tienen en cuenta a la vez la precarga de direcciones de memoria mediante hardware (o hardware prefetching ) y las decisiones de la ubicación de los hilos de ejecución y los datos en los sistemas NUMA, se pueden hallar mejores configuraciones que cuando se consideran ambos aspectos por separado. Con una combinación de los resultados de rendimiento y de los contadores disponibles en el sistema se construye un modelo de rendimiento, tanto por avanzado como en en tiempo de ejecución, de la mejor configuración para aplicaciones que no están incluidas en el modelo. La evaluación se realiza en dos sistemas NUMA de alto rendimiento, y los contadores medidos en uno de los sistemas se usan para predecir las mejores configuraciones en el otro sistema. La segunda contribución se basa en la idea de que el prefetching puede tener un efecto considerable en los sistemas NUMA y propone un esquema de precarga a nivel hardware que tiene en cuenta los efectos NUMA. Este esquema es genérico y se puede aplicar a diferentes algoritmos de precarga existentes con un coste de hardware muy bajo pero que proporciona muy buenos resultados. Dichos resultados se obtienen y evalúan mediante un simulador arquitectural preciso a nivel de ciclo y proporciona resultados detallados del rendimiento, la reducción de las comunicaciones de datos y los costes energéticos. Finalmente, la tercera y última contribución consiste en algoritmos de planificación para modelos de programación basados en tareas. Estos modelos simplifican la programabilidad de las aplicaciones paralelas y proveen información muy útil al sistema en tiempo de ejecución (runtime system) que controla su funcionamiento. Esta información se utiliza para construir un grafo de dependencias entre tareas (task dependency graph, TDG), un grafo dirigido y acíclico que modela la aplicación y en el ue los nodos son fragmentos de código secuencial, conocidos como tareas, y los arcos son las dependencias de datos entre las distintas tareas. Los algoritmos de planificación que se proponen usan técnicas e particionado de grafos y proporcionan una planificación de las tareas del TDG que minimiza la comunicación de datos entre las distintas regiones NUMA del sistema. Los resultados se han evaluado en sistemas ccNUMA reales con múltiples regiones NUMA.Postprint (published version

    Basic transformations in linear algebra for vector computing

    Get PDF

    Towards instantaneous performance analysis using coarse-grain sampled and instrumented data

    Get PDF
    Nowadays, supercomputers deliver an enormous amount of computation power; however, it is well-known that applications only reach a fraction of it. One limiting factor is the single processor performance because it ultimately dictates the overall achieved performance. Performance analysis tools help locating performance inefficiencies and their nature to ultimately improve the application performance. Performance tools rely on two collection techniques to invoke their performance monitors: instrumentation and sampling. Instrumentation refers to inject performance monitors into concrete application locations whereas sampling invokes the installed monitors to external events. Each technique has its advantages. The measurements obtained through instrumentation are directly associated to the application structure while sampling allows a simple way to determine the volume of measurements captured. However, the granularity of the measurements that provides valuable insight cannot be determined a priori. Should analysts study the performance of an application for the first time, they may consider using a performance tool and instrument every routine or use high-frequency sampling rates to provide the most detailed results. These approaches frequently lead to large overheads that impact the application performance and thus alter the measurements gathered and, therefore, mislead the analyst. This thesis introduces the folding mechanism that takes advantage of the repetitiveness found in many applications. The mechanism smartly combines metrics captured through coarse-grain sampling and instrumentation mechanisms to provide instantaneous metric reports within instrumented regions and without perturbing the application execution. To produce these reports, the folding processes metrics from different type of sources: performance and energy counters, source code and memory references. The process depends on their nature. While performance and energy counters represent continuous metrics, the source code and memory references refer to discrete values that point out locations within the application code or address space. This thesis evaluates and validates two fitting algorithms used in different areas to report continuous metrics: a Gaussian interpolation process known as Kriging and piece-wise linear regressions. The folding also takes benefit of analytical performance models to focus on a small set of performance metrics instead of exploring a myriad of performance counters. The folding also correlates the metrics with the source-code using two alternatives: using the outcome of the piece-wise linear regressions and a mechanism inspired by Multi-Sequence Alignment techniques. Finally, this thesis explores the applicability of the folding mechanism to captured memory references to detail which and how data objects are accessed. This thesis proposes an analysis methodology for parallel applications that focus on describing the most time-consuming computing regions. It is implemented on top of a framework that relies on a previously existing clustering tool and the folding mechanism. To show the usefulness of the methodology and the framework, this thesis includes the discussion of multiple first-time seen in-production applications. The discussions include high level of detail regarding the application performance bottlenecks and their responsible code. Despite many analyzed applications have been compiled using aggressive compiler optimization flags, the insight obtained from the folding mechanism has turned into small code transformations based on widely-known optimization techniques that have improved the performance in some cases. Additionally, this work also depicts power monitoring capabilities of recent processors and discusses the simultaneous performance and energy behavior on a selection of benchmarks and in-production applications.Actualment, els supercomputadors ofereixen una àmplia potència de càlcul però les aplicacions només en fan servir una petita fracció. Un dels factors limitants és el rendiment d'un processador, el qual dicta el rendiment en general. Les eines d'anàlisi de rendiment ajuden a localitzar els colls d'ampolla i la seva natura per a, eventualment, millorar el rendiment de l'aplicació. Les eines d'anàlisi de rendiment empren dues tècniques de recol·lecció de dades: instrumentació i mostreig. La instrumentació es refereix a la capacitat d'injectar monitors en llocs específics del codi mentre que el mostreig invoca els monitors quan ocórren esdeveniments externs. Cadascuna d'aquestes tècniques té les seves avantatges. Les mesures obtingudes per instrumentació s'associen directament a l'estructura de l'aplicació mentre que les obtingudes per mostreig permeten una forma senzilla de determinar-ne el volum capturat. Sigui com sigui, la granularitat de les mesures no es pot determinar a priori. Conseqüentment, si un analista vol estudiar el rendiment d'una aplicació sense saber-ne res, hauria de considerar emprar una eina d'anàlisi i instrumentar cadascuna de les rutines o bé emprar freqüències de mostreig altes per a proveir resultats detallats. En qualsevol cas, aquestes alternatives impacten en el rendiment de l'aplicació i per tant alterar les mètriques capturades, i conseqüentment, confondre a l'analista. Aquesta tesi introdueix el mecanisme anomenat folding, el qual aprofita la repetitibilitat existent en moltes aplicacions. El mecanisme combina intel·ligentment mètriques obtingudes mitjançant mostreig de gra gruixut i instrumentació per a proveir informes de mètriques instantànies dins de regions instrumentades sense pertorbar-ne l'execució. Per a produir aquests informes, el mecanisme processa les mètriques de diferents fonts: comptadors de rendiment i energia, codi font i referències de memoria. El procés depen de la natura de les dades. Mentre que les mètriques de rendiment i energia són valors continus, el codi font i les referències de memòria representen valors discrets que apunten ubicacions dins el codi font o l'espai d'adreces. Aquesta tesi evalua i valida dos algorismes d'ajust: un procés d'interpolació anomenat Kriging i una interpolació basada en regressions lineals segmentades. El mecanisme de folding també s'aprofita de models analítics de rendiment basats en comptadors hardware per a proveir un conjunt reduït de mètriques enlloc d'haver d'explorar una multitud de comptadors. El mecanisme també correlaciona les mètriques amb el codi font emprant dues alternatives: per un costat s'aprofita dels resultats obtinguts per les regressions lineals segmentades i per l'altre defineix un mecanisme basat en tècniques d'alineament de multiples seqüències. Aquesta tesi també explora l'aplicabilitat del mecanisme per a referències de memoria per a informar quines i com s'accessedeixen les dades de l'aplicació. Aquesta tesi proposa una metodología d'anàlisi per a aplicacions paral·leles centrant-se en descriure les regions de càlcul que consumeixen més temps. La metodología s'implementa en un entorn de treball que usa un mecanisme de clustering preexistent i el mecanisme de folding. Per a demostrar-ne la seva utilitat, aquesta tesi inclou la discussió de múltiples aplicacions analitzades per primera vegada. Les discussions inclouen un alt nivel de detall en referencia als colls d'ampolla de les aplicacions i de la seva natura. Tot i que moltes d'aquestes aplicacions s'han compilat amb opcions d'optimització agressives, la informació obtinguda per l'entorn de treball es tradueix en petites modificacions basades en tècniques d'optimització que permeten millorar-ne el rendiment en alguns casos. Addicionalment, aquesta tesi també reporta informació sobre el consum energètic reportat per processadors recents i discuteix el comportament simultani d'energia i rendiment en una selecció d'aplicacions sintètiques i aplicacions en producció

    Analyse des synchronisations dans un programme parallèle ordonnancé par vol de travail. Applications à la génération déterministe de nombres pseudo-aléatoires.

    Get PDF
    We present two contributions to the field of parallel programming.The first contribution is theoretical: we introduce SIPS analysis, a novel approach to estimate the number of synchronizations performed during the execution of a parallel algorithm.Based on the concept of logical clocks, it allows us: on one hand, to deliver new bounds for the number of synchronizations, in expectation; on the other hand, to design more efficient parallel programs by dynamic adaptation of the granularity.The second contribution is pragmatic: we present an efficient parallelization strategy for pseudorandom number generation, independent of the number of concurrent processes participating in a computation.As an alternative to the use of one sequential generator per process, we introduce a generic API called Par-R, which is designed and analyzed using SIPS.Its main characteristic is the use of a sequential generator that can perform a ``jump-ahead'' directly from one number to another on an arbitrary distance within the pseudorandom sequence.Thanks to SIPS, we show that, in expectation, within an execution scheduled by work stealing of a "very parallel" program (whose depth or critical path is subtle when compared to the work or number of operations), these operations are rare.Par-R is compared with the parallel pseudorandom number generator DotMix, written for the Cilk Plus dynamic multithreading platform.The theoretical overhead of Par-R compares favorably to DotMix's overhead, what is confirmed experimentally, while not requiring a fixed generator underneath.Nous présentons deux contributions dans le domaine de la programmation parallèle.La première est théorique : nous introduisons l'analyse SIPS, une approche nouvelle pour dénombrer le nombre d'opérations de synchronisation durant l'exécution d'un algorithme parallèle ordonnancé par vol de travail.Basée sur le concept d'horloges logiques, elle nous permet,: d'une part de donner de nouvelles majorations de coût en moyenne; d'autre part de concevoir des programmes parallèles plus efficaces par adaptation dynamique de la granularité.La seconde contribution est pragmatique: nous présentons une parallélisation générique d'algorithmes pour la génération déterministe de nombres pseudo-aléatoires, indépendamment du nombre de processus concurrents lors de l'exécution.Alternative à l'utilisation d'un générateur pseudo-aléatoire séquentiel par processus, nous introduisons une API générique, appelée Par-R qui est conçue et analysée grâce à SIPS.Sa caractéristique principale est d'exploiter un générateur séquentiel qui peut "sauter" directement d'un nombre à un autre situé à une distance arbitraire dans la séquence pseudo-aléatoire.Grâce à l'analyse SIPS, nous montrons qu'en moyenne, lors d'une exécution par vol de travail d'un programme très parallèle (dont la profondeur ou chemin critique est très petite devant le travail ou nombre d'opérations), ces opérations de saut sont rares.Par-R est comparé au générateur pseudo-aléatoire DotMix, écrit pour Cilk Plus, une extension de C/C++ pour la programmation parallèle par vol de travail.Le surcout théorique de Par-R se compare favorablement au surcoput de DotMix, ce qui apparait aussi expériemntalement.De plus, étant générique, Par-R est indépendant du générateur séquentiel sous-jacent

    A self-mobile skeleton in the presence of external loads

    Get PDF
    Multicore clusters provide cost-effective platforms for running CPU-intensive and data-intensive parallel applications. To effectively utilise these platforms, sharing their resources is needed amongst the applications rather than dedicated environments. When such computational platforms are shared, user applications must compete at runtime for the same resource so the demand is irregular and hence the load is changeable and unpredictable. This thesis explores a mechanism to exploit shared multicore clusters taking into account the external load. This mechanism seeks to reduce runtime by finding the best computing locations to serve the running computations. We propose a generic algorithmic data-parallel skeleton which is aware of its computations and the load state of the computing environment. This skeleton is structured using the Master/Worker pattern where the master and workers are distributed on the nodes of the cluster. This skeleton divides the problem into computations where all these computations are initiated by the master and coordinated by the distributed workers. Moreover, the skeleton has built-in mobility to implicitly move the parallel computations between two workers. This mobility is data mobility controlled by the application, the skeleton. This skeleton is not problem-specific and therefore it is able to execute different kinds of problems. Our experiments suggest that this skeleton is able to efficiently compensate for unpredictable load variations. We also propose a performance cost model that estimates the continuation time of the running computations locally and remotely. This model also takes the network delay, data size and the load state as inputs to estimate the transfer time of the potential movement. Our experiments demonstrate that this model takes accurate decisions based on estimates in different load patterns to reduce the total execution time. This model is problem-independent because it considers the progress of all current computations. Moreover, this model is based on measurements so it is not dependent on the programming language. Furthermore, this model takes into account the load state of the nodes on which the computation run. This state includes the characteristics of the nodes and hence this model is architecture-independent. Because the scheduling has direct impact on system performance, we support the skeleton with a cost-informed scheduler that uses a hybrid scheduling policy to improve the dynamicity and adaptivity of the skeleton. This scheduler has agents distributed over the participating workers to keep the load information up to date, trigger the estimations, and facilitate the mobility operations. On runtime, the skeleton co-schedules its computations over computational resources without interfering with the native operating system scheduler. We demonstrate that using a hybrid approach the system makes mobility decisions which lead to improved performance and scalability over large number of computational resources. Our experiments suggest that the adaptivity of our skeleton in shared environment improves the performance and reduces resource contention on nodes that are heavily loaded. Therefore, this adaptivity allows other applications to acquire more resources. Finally, our experiments show that the load scheduler has a low incurred overhead, not exceeding 0.6%, compared to the total execution time

    Enhancing productivity and performance portability of opencl applications on heterogeneous systems using runtime optimizations

    Get PDF
    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity and the unachievable flexibility and portability necessary to optimize for each target individually, heterogeneous systems remain typically vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non intrusive methods in the form of compiler tools and implementing efficient abstractions to automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent to tune application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library based approach is designed to provide a high level abstraction for complex problems in a specific domain, stencil computation. Using domain specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs and 3.6x on four

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Get PDF
    Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level

    Hardware / Software Architectural and Technological Exploration for Energy-Efficient and Reliable Biomedical Devices

    Get PDF
    Nowadays, the ubiquity of smart appliances in our everyday lives is increasingly strengthening the links between humans and machines. Beyond making our lives easier and more convenient, smart devices are now playing an important role in personalized healthcare delivery. This technological breakthrough is particularly relevant in a world where population aging and unhealthy habits have made non-communicable diseases the first leading cause of death worldwide according to international public health organizations. In this context, smart health monitoring systems termed Wireless Body Sensor Nodes (WBSNs), represent a paradigm shift in the healthcare landscape by greatly lowering the cost of long-term monitoring of chronic diseases, as well as improving patients' lifestyles. WBSNs are able to autonomously acquire biological signals and embed on-node Digital Signal Processing (DSP) capabilities to deliver clinically-accurate health diagnoses in real-time, even outside of a hospital environment. Energy efficiency and reliability are fundamental requirements for WBSNs, since they must operate for extended periods of time, while relying on compact batteries. These constraints, in turn, impose carefully designed hardware and software architectures for hosting the execution of complex biomedical applications. In this thesis, I develop and explore novel solutions at the architectural and technological level of the integrated circuit design domain, to enhance the energy efficiency and reliability of current WBSNs. Firstly, following a top-down approach driven by the characteristics of biomedical algorithms, I perform an architectural exploration of a heterogeneous and reconfigurable computing platform devoted to bio-signal analysis. By interfacing a shared Coarse-Grained Reconfigurable Array (CGRA) accelerator, this domain-specific platform can achieve higher performance and energy savings, beyond the capabilities offered by a baseline multi-processor system. More precisely, I propose three CGRA architectures, each contributing differently to the maximization of the application parallelization. The proposed Single, Multi and Interleaved-Datapath CGRA designs allow the developed platform to achieve substantial energy savings of up to 37%, when executing complex biomedical applications, with respect to a multi-core-only platform. Secondly, I investigate how the modeling of technology reliability issues in logic and memory components can be exploited to adequately adjust the frequency and supply voltage of a circuit, with the aim of optimizing its computing performance and energy efficiency. To this end, I propose a novel framework for workload-dependent Bias Temperature Instability (BTI) impact analysis on biomedical application results quality. Remarkably, the framework is able to determine the range of safe circuit operating frequencies without introducing worst-case guard bands. Experiments highlight the possibility to safely raise the frequency up to 101% above the maximum obtained with the classical static timing analysis. Finally, through the study of several well-known biomedical algorithms, I propose an approach allowing energy savings by dynamically and unequally protecting an under-powered data memory in a new way compared to regular error protection schemes. This solution relies on the Dynamic eRror compEnsation And Masking (DREAM) technique that reduces by approximately 21% the energy consumed by traditional error correction codes

    Predicting power scalability in a reconfigurable platform

    Get PDF
    This thesis focuses on the evolution of digital hardware systems. A reconfigurable platform is proposed and analysed based on thin-body, fully-depleted silicon-on-insulator Schottky-barrier transistors with metal gates and silicide source/drain (TBFDSBSOI). These offer the potential for simplified processing that will allow them to reach ultimate nanoscale gate dimensions. Technology CAD was used to show that the threshold voltage in TBFDSBSOI devices will be controllable by gate potentials that scale down with the channel dimensions while remaining within appropriate gate reliability limits. SPICE simulations determined that the magnitude of the threshold shift predicted by TCAD software would be sufficient to control the logic configuration of a simple, regular array of these TBFDSBSOI transistors as well as to constrain its overall subthreshold power growth. Using these devices, a reconfigurable platform is proposed based on a regular 6-input, 6-output NOR LUT block in which the logic and configuration functions of the array are mapped onto separate gates of the double-gate device. A new analytic model of the relationship between power (P), area (A) and performance (T) has been developed based on a simple VLSI complexity metric of the form ATσ = constant. As σ defines the performance “return” gained as a result of an increase in area, it also represents a bound on the architectural options available in power-scalable digital systems. This analytic model was used to determine that simple computing functions mapped to the reconfigurable platform will exhibit continuous power-area-performance scaling behavior. A number of simple arithmetic circuits were mapped to the array and their delay and subthreshold leakage analysed over a representative range of supply and threshold voltages, thus determining a worse-case range for the device/circuit-level parameters of the model. Finally, an architectural simulation was built in VHDL-AMS. The frequency scaling described by σ, combined with the device/circuit-level parameters predicts the overall power and performance scaling of parallel architectures mapped to the array
    corecore