118 research outputs found

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    Full text link
    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has constituted power as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, there is a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing, and it reviews its motivation, terminology and principles, as well it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.Comment: Under Review at ACM Computing Survey

    Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications

    Full text link
    The challenging deployment of compute-intensive applications from domains such Artificial Intelligence (AI) and Digital Signal Processing (DSP), forces the community of computing systems to explore new design approaches. Approximate Computing appears as an emerging solution, allowing to tune the quality of results in the design of a system in order to improve the energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the art approximation techniques from all layers of the traditional computing stack. In Part II of our survey, we classify and present the technical details of application-specific and architectural approximation techniques, which both target the design of resource-efficient processors/accelerators & systems. Moreover, we present a detailed analysis of the application spectrum of Approximate Computing and discuss open challenges and future directions.Comment: Under Review at ACM Computing Survey

    Design and management of image processing pipelines within CPS : Acquired experience towards the end of the FitOptiVis ECSEL Project

    Get PDF
    Cyber-Physical Systems (CPSs) are dynamic and reactive systems interacting with processes, environment and, sometimes, humans. They are often distributed with sensors and actuators, characterized for being smart, adaptive, predictive and react in real-time. Indeed, image- and video-processing pipelines are a prime source for environmental information for systems allowing them to take better decisions according to what they see. Therefore, in FitOptiVis, we are developing novel methods and tools to integrate complex image- and video-processing pipelines. FitOptiVis aims to deliver a reference architecture for describing and optimizing quality and resource management for imaging and video pipelines in CPSs both at design- and run-time. The architecture is concretized in low-power, high-performance, smart components, and in methods and tools for combined design-time and run-time multi-objective optimization and adaptation within system and environment constraints.Peer reviewe

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    A Modular Platform for Adaptive Heterogeneous Many-Core Architectures

    Get PDF
    Multi-/many-core heterogeneous architectures are shaping current and upcoming generations of compute-centric platforms which are widely used starting from mobile and wearable devices to high-performance cloud computing servers. Heterogeneous many-core architectures sought to achieve an order of magnitude higher energy efficiency as well as computing performance scaling by replacing homogeneous and power-hungry general-purpose processors with multiple heterogeneous compute units supporting multiple core types and domain-specific accelerators. Drifting from homogeneous architectures to complex heterogeneous systems is heavily adopted by chip designers and the silicon industry for more than a decade. Recent silicon chips are based on a heterogeneous SoC which combines a scalable number of heterogeneous processing units from different types (e.g. CPU, GPU, custom accelerator). This shifting in computing paradigm is associated with several system-level design challenges related to the integration and communication between a highly scalable number of heterogeneous compute units as well as SoC peripherals and storage units. Moreover, the increasing design complexities make the production of heterogeneous SoC chips a monopoly for only big market players due to the increasing development and design costs. Accordingly, recent initiatives towards agile hardware development open-source tools and microarchitecture aim to democratize silicon chip production for academic and commercial usage. Agile hardware development aims to reduce development costs by providing an ecosystem for open-source hardware microarchitectures and hardware design processes. Therefore, heterogeneous many-core development and customization will be relatively less complex and less time-consuming than conventional design process methods. In order to provide a modular and agile many-core development approach, this dissertation proposes a development platform for heterogeneous and self-adaptive many-core architectures consisting of a scalable number of heterogeneous tiles that maintain design regularity features while supporting heterogeneity. The proposed platform hides the integration complexities by supporting modular tile architectures for general-purpose processing cores supporting multi-instruction set architectures (multi-ISAs) and custom hardware accelerators. By leveraging field-programmable-gate-arrays (FPGAs), the self-adaptive feature of the many-core platform can be achieved by using dynamic and partial reconfiguration (DPR) techniques. This dissertation realizes the proposed modular and adaptive heterogeneous many-core platform through three main contributions. The first contribution proposes and realizes a many-core architecture for heterogeneous ISAs. It provides a modular and reusable tilebased architecture for several heterogeneous ISAs based on open-source RISC-V ISA. The modular tile-based architecture features a configurable number of processing cores with different RISC-V ISAs and different memory hierarchies. To increase the level of heterogeneity to support the integration of custom hardware accelerators, a novel hybrid memory/accelerator tile architecture is developed and realized as the second contribution. The hybrid tile is a modular and reusable tile that can be configured at run-time to operate as a scratchpad shared memory between compute tiles or as an accelerator tile hosting a local hardware accelerator logic. The hybrid tile is designed and implemented to be seamlessly integrated into the proposed tile-based platform. The third contribution deals with the self-adaptation features by providing a reconfiguration management approach to internally control the DPR process through processing cores (RISC-V based). The internal reconfiguration process relies on a novel DPR controller targeting FPGA design flow for RISC-V-based SoC to change the types and functionalities of compute tiles at run-time

    Technologies and Applications for Big Data Value

    Get PDF
    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part “Technologies and Methods” contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part “Processes and Applications” details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences, first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI. Second, practitioners and industry experts engaged in data-driven systems, software design and deployment projects who are interested in employing these advanced methods to address real-world problems

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    Get PDF
    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications

    Exploiting data locality in cache-coherent NUMA systems

    Get PDF
    The end of Dennard scaling has caused a stagnation of the clock frequency in computers.To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in the systems, interconnecting many nodes, including multiple chips in the nodes and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors, it is much slower and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips. Still keeping a shared address space, where all processors can access the whole memory, solutions have come by integrating more memories: by using newer technologies like high-bandwidth memories (HBM) and non-volatile memories (NVM), by giving groups cores (like sockets, for example) faster access to some subset of the DRAM, or by combining many of these solutions. This has caused some heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours. Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes in the memory done from one CPU must be visible by the other CPUs and transparent for the programmer. These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems. The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high performance NUMA systems, and the performance counters collected in one machine are used to predict the best configurations in the other machine. The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers with a low hardware cost but giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results of the performance, the data transfer reduction and the energy costs. Finally, the third and last contribution consists in scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications in parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and provide a scheduling for the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated in real ccNUMA systems with multiple NUMA regions.La fi de la llei de Dennard ha provocat un estancament de la freqüència de rellotge dels computadors. Amb l'objectiu de superar aquest fet, durant les darreres dues dècades els fabricants han integrat més quantitat d'unitats de còmput als sistemes mitjançant la interconnexió de nodes diferents, la inclusió de múltiples xips als nodes i l'increment de nuclis de processador a cada xip. La rapidesa de la memòria principal no ha evolucionat amb el mateix factor que els processadors; és molt més lenta i hi ha la necessitat de proporcionar més ample de banda als processadors, especialment amb l'increment del nombre de nuclis i xips. Tot mantenint un adreçament compartit en el qual tots els processadors poden accedir a la memòria sencera, les solucions han estat al voltant de la integració de més memòries: amb tecnologies modernes com HBM (high-bandwidth memories) i NVM (non-volatile memories), fent que grups de nuclis (com sòcols sencers) tinguin accés més ràpid a una part de la DRAM o amb la combinació de solucions. Això ha provocat una heterogeneïtat en la velocitat d'accés a la memòria principal, en funció del nucli que sol·licita l'accés a una adreça en particular i la seva localització física, fet que provoca uns comportaments no uniformes en l'accés a la memòria (non-uniform memory access, NUMA). A més, sovint tenen memòries cau coherents (cache-coherent NUMA, ccNUMA), que implica que qualsevol canvi fet a la memòria des d'un nucli d'un processador ha de ser visible la resta de manera transparent. Aquests comportaments redueixen el rendiment de les aplicacions i suposen un repte. Per abordar el problema, a la tesi s'hi proposen solucions, a nivell de programari i maquinari, que milloren la localitat de dades als sistemes NUMA i, en conseqüència, el rendiment de les aplicacions en aquests sistemes. La primera contribució mostra que, quan es tenen en compte alhora la precàrrega d'adreces de memòria amb maquinari (hardware prefetching) i les decisions d'ubicació dels fils d'execució i les dades als sistemes NUMA, es poden trobar millors configuracions que quan es condieren per separat. Una combinació dels resultats de rendiment i dels comptadors disponibles al sistema s'utilitza per construir un model de rendiment per fer la predicció, tant per avançat com també en temps d'execució, de la millor configuració per aplicacions que no es troben al model. L'avaluació es du a terme a dos sistemes NUMA d'alt rendiment, i els comptadors mesurats en un sistema s'usen per predir les millors configuracions a l'altre sistema. La segona contribució es basa en la idea que el prefetching pot tenir un efecte considerable als sistemes NUMA i proposa un esquema de precàrrega a nivell de maquinari que té en compte els efectes NUMA. L'esquema és genèric i es pot aplicar als algorismes de precàrrega existents amb un cost de maquinari molt baix però amb molt bons resultats. S'avalua amb un simulador arquitectural acurat a nivell de cicle i proporciona resultats detallats del rendiment, la reducció de les comunicacions de dades i els costos energètics. La tercera i darrera contribució consisteix en algorismes de planificació per models de programació basats en tasques. Aquests simplifiquen la programabilitat de les aplicacions paral·leles i proveeixen informació molt útil al sistema en temps d'execució (runtime system) que en controla el funcionament. Amb aquesta informació es construeix un graf de dependències entre tasques (task dependency graph, TDG), un graf dirigit i acíclic que modela l'aplicació i en el qual els nodes són fragments de codi seqüencial (o tasques) i els arcs són les dependències de dades entre les tasques. Els algorismes de planificació proposats fan servir tècniques de particionat de grafs i proporcionen una planificació de les tasques del TDG que minimitza la comunicació de dades entre les diferents regions NUMA del sistema. Els resultats han estat avaluats en sistemes ccNUMA reals amb múltiples regions NUMA.El final de la ley de Dennard ha provocado un estancamiento de la frecuencia de reloj de los computadores. Con el objetivo de superar este problema, durante las últimas dos décadas los fabricantes han integrado más unidades de cómputo en los sistemas mediante la interconexión de nodos diferentes, la inclusión de múltiples chips en los nodos y el incremento de núcleos de procesador en cada chip. La rapidez de la memoria principal no ha evolucionado con el mismo factor que los procesadores; es mucho más lenta y hay la necesidad de proporcionar más ancho de banda a los procesadores, especialmente con el incremento del número de núcleos y chips. Aun manteniendo un sistema de direccionamiento compartido en el que todos los procesadores pueden acceder al conjunto de la memoria, las soluciones han oscilado alrededor de la integración de más memorias: usando tecnologías modernas como las memorias de alto ancho de banda (highbandwidth memories, HBM) y memorias no volátiles (non-volatile memories, NVM), haciendo que grupos de núcleos (como zócalos completos) tengan acceso más veloz a un subconjunto de la DRAM, o con la combinación de soluciones. Esto ha provocado una heterogeneidad en la velocidad de acceso a la memoria principal, en función del núcleo que solicita el acceso a una dirección de memoria en particular y la ubicación física de esta dirección, lo que provoca unos comportamientos no uniformes en el acceso a la memoria (non-uniform memory access, NUMA). Además, muchos de estos sistemas tienen memorias caché coherentes (cache-coherent NUMA, ccNUMA), lo que implica que cualquier cambio hecho en la memoria desde un núcleo de un procesador debe ser visible por el resto de procesadores de forma transparente para los programadores. Estos comportamientos NUMA reducen el rendimiento de las aplicaciones y pueden suponer un reto para los programadores. Para abordar dicho problema, en esta tesis se proponen soluciones, a nivel de software y hardware, que mejoran la localidad de datos en los sistemas NUMA y, en consecuencia, el rendimiento de las aplicaciones en estos sistemas informáticos. La primera contribución muestra que, cuando se tienen en cuenta a la vez la precarga de direcciones de memoria mediante hardware (o hardware prefetching ) y las decisiones de la ubicación de los hilos de ejecución y los datos en los sistemas NUMA, se pueden hallar mejores configuraciones que cuando se consideran ambos aspectos por separado. Con una combinación de los resultados de rendimiento y de los contadores disponibles en el sistema se construye un modelo de rendimiento, tanto por avanzado como en en tiempo de ejecución, de la mejor configuración para aplicaciones que no están incluidas en el modelo. La evaluación se realiza en dos sistemas NUMA de alto rendimiento, y los contadores medidos en uno de los sistemas se usan para predecir las mejores configuraciones en el otro sistema. La segunda contribución se basa en la idea de que el prefetching puede tener un efecto considerable en los sistemas NUMA y propone un esquema de precarga a nivel hardware que tiene en cuenta los efectos NUMA. Este esquema es genérico y se puede aplicar a diferentes algoritmos de precarga existentes con un coste de hardware muy bajo pero que proporciona muy buenos resultados. Dichos resultados se obtienen y evalúan mediante un simulador arquitectural preciso a nivel de ciclo y proporciona resultados detallados del rendimiento, la reducción de las comunicaciones de datos y los costes energéticos. Finalmente, la tercera y última contribución consiste en algoritmos de planificación para modelos de programación basados en tareas. Estos modelos simplifican la programabilidad de las aplicaciones paralelas y proveen información muy útil al sistema en tiempo de ejecución (runtime system) que controla su funcionamiento. Esta información se utiliza para construir un grafo de dependencias entre tareas (task dependency graph, TDG), un grafo dirigido y acíclico que modela la aplicación y en el ue los nodos son fragmentos de código secuencial, conocidos como tareas, y los arcos son las dependencias de datos entre las distintas tareas. Los algoritmos de planificación que se proponen usan técnicas e particionado de grafos y proporcionan una planificación de las tareas del TDG que minimiza la comunicación de datos entre las distintas regiones NUMA del sistema. Los resultados se han evaluado en sistemas ccNUMA reales con múltiples regiones NUMA.Postprint (published version

    Analyzable dataflow executions with adaptive redundancy

    Get PDF
    Increasing performance requirements in the embedded systems domain have encouraged a drift from singlecore to multicore processors, and thus multicore processors are widely used in embedded systems today. Cars are an example for complex embedded systems in which the use of multicore processors is continuously increasing. A major reason for this is to consolidate different software components on one chip and thus reduce the number of electronic control units. However, the de facto standard in the automotive industry, AUTOSAR (AUTomotive Open System ARchitecture), was originally designed for singlecore processors. Although basic support for multicore processors was added, more complex architectures are currently not compatible with the software stack. Regarding the software components running on the ECUS of modern cars, requirements are diverse. On the one hand, there are safety-critical tasks, like the airbag control, anti-lock braking system, electronic stability control and emergency brake assist, and on the other hand, tasks which do not have any safety-related requirements at all, for example tasks controlling the infotainment system. Trends like autonomous driving lead to even more demanding tasks in the system since such tasks are both safety-critical and data-intensive. As embedded applications, like those in the automotive domain, become more complex, new approaches are necessary. Data-intensive tasks are usually tackled with large-scale computing frameworks. In this thesis, some major concepts of such frameworks are transferred to the high-performance embedded systems domain. For this purpose, the thesis describes a runtime environment (RTE) that is suitable for different kinds of multi- and manycore hardware architectures. The RTE follows a dataflow execution model based on directed acyclic graphs (DAGs). Graphs are divided into sections which are scheduled separately. For each section, the RTE uses a DAG scheduling heuristic to compute multiple schedules covering different redundancy configurations. This allows the RTE to dynamically change the redundancy of parts of the graph at runtime despite the use of fixed schedules. Alternatively, the RTE also provides an online scheduler. To specify suitable graphs, the RTE also provides a programming model which shares similarities with common large-scale computing frameworks, for example Apache Spark. Using this programming model, three common distributed algorithms, namely Cannon's algorithm, the Cooley-Tukey algorithm and bitonic sort, were implemented. With these three programs, the performance of the RTE was evaluated for a variety of configurations on two different hardware architectures. The results show that the proposed RTE is able to reach the performance of established parallel computation frameworks and that for suitable graphs with reasonable sectionings the negative influence on the runtime is either small or non-existent.Aufgrund steigender Anforderungen an die Leistungsfähigkeit von eingebetteten Systemen finden Mehrkernprozessoren mittlerweile auch in eingebetteten Systemen Verwendung. Autos sind ein Beispiel für eingebettete Systeme, in denen die Verbreitung von Mehrkernprozessoren kontinuierlich zunimmt. Ein Hauptgrund ist, dass es dadurch möglich wird, mehrere Applikationen, für die ursprünglich mehrere Electronic Control Units (ECUs) notwendig waren, auf ein und demselben Chip auszuführen und dadurch die Anzahl der ECUs im Gesamtsystem zu verringern. Der De-facto-Standard AUTOSAR (AUTomotive Open System ARchitecture) wurde jedoch ursprünglich nur im Hinblick auf Einkernprozessoren entworfen und, obwohl der Softwarestack um grundlegende Unterstützung für Mehrkernprozessoren erweitert wurde, sind komplexere Architekturen nicht damit kompatibel. Die Anforderungen der Softwarekomponenten von modernen Autos sind vielfältig. Einerseits gibt es hochgradig sicherheitskritische Tasks, die beispielsweise die Airbags, das Antiblockiersystem, die Fahrdynamikregelung oder den Notbremsassistenten steuern und andererseits Tasks, die keinerlei sicherheitskritische Anforderungen aufweisen, wie zum Beispiel Tasks zur Steuerung des Infotainment-Systems. Neue Trends wie autonomes Fahren führen zu weiteren anspruchsvollen Tasks, die sowohl hohe Leistungs- als auch Sicherheitsanforderungen aufweisen. Da die Komplexität eingebetteter Anwendungen, beispielsweise im Automobilbereich, stetig zunimmt, sind neue Ansätze erforderlich. Für komplexe, datenintensive Aufgaben werden in der Regel Cluster-Computing-Frameworks eingesetzt. In dieser Arbeit werden Konzepte solcher Frameworks auf den Bereich der eingebetteten Systeme übertragen. Dazu beschreibt die Arbeit eine Laufzeitumgebung (RTE) für eingebettete Mehrkernarchitekturen. Die RTE folgt einem Datenfluss-Ausführungsmodell, das auf gerichteten azyklischen Graphen basiert. Graphen können in Abschnitte eingeteilt werden, für welche separat mehrere unterschiedlich redundante Schedules mit Hilfe einer Scheduling-Heuristik berechnet werden. Dieser Ansatz erlaubt es, die Redundanz von Teilen der Anwendung zur Laufzeit zu verändern. Alternativ unterstützt die RTE auch Scheduling zur Laufzeit. Zur Erzeugung von Graphen stellt die RTE ein Programmiermodell bereit, welches sich an etablierten Frameworks, insbesondere Apache Spark, orientiert. Damit wurden drei Beispielanwendungen implementiert, die auf gängigen Algorithmen basieren. Konkret handelt es sich um Cannon's Algorithmus, den Cooley-Tukey-Algorithmus und bitonisches Sortieren. Um die Leistungsfähigkeit der RTE zu ermitteln, wurden diese drei Anwendungen mehrfach mit verschiedenen Konfigurationen auf zwei Hardware-Architekturen ausgeführt. Die Ergebnisse zeigen, dass die RTE in ihrer Leistungsfähigkeit mit etablierten Systemen vergleichbar ist und die Laufzeit bei einer sinnvollen Graphaufteilung im besten Fall nur geringfügig beeinflusst wird
    corecore