An evaluation of the TRIPS computer system
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycle counts, with commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by a factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at the cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies.
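The direct instruction-to-instruction communication mentioned above is the essence of EDGE-style target-form encoding: an instruction names the operand slots of its consumers instead of writing a shared register, and it fires as soon as all its operands arrive. The following minimal Python sketch illustrates that firing discipline on an invented five-instruction block; the encoding and instruction names are illustrative assumptions, not the actual TRIPS ISA.

```python
# Minimal sketch of block-atomic, target-form execution (illustrative only;
# this is not the TRIPS ISA). Each instruction names the operand slots of
# its consumers instead of writing a register, and an instruction fires as
# soon as all of its operands have arrived.
from collections import defaultdict

# Hypothetical five-instruction block computing (a + b) * (a - b).
# id -> (opcode, [(consumer_id, operand_slot), ...])
block = {
    0: ("read_a", [(2, 0), (3, 0)]),  # fan a out to both consumers
    1: ("read_b", [(2, 1), (3, 1)]),
    2: ("add",    [(4, 0)]),
    3: ("sub",    [(4, 1)]),
    4: ("mul",    []),                # empty target list: block output
}

def execute(block, a, b):
    ops = {"add": lambda x, y: x + y,
           "sub": lambda x, y: x - y,
           "mul": lambda x, y: x * y}
    operands = defaultdict(dict)      # consumer id -> {slot: value}
    ready = [(0, a), (1, b)]          # register reads inject the live-ins
    result = None
    while ready:
        iid, value = ready.pop()
        _, targets = block[iid]
        if not targets:               # no consumers: this is the block output
            result = value
        for cid, slot in targets:     # direct producer-to-consumer delivery
            operands[cid][slot] = value
            if len(operands[cid]) == 2:   # every consumer in this toy is binary
                op, _ = block[cid]
                ready.append((cid, ops[op](operands[cid][0], operands[cid][1])))
    return result

print(execute(block, 7, 3))  # (7 + 3) * (7 - 3) = 40
```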
JavaFlow: a Java DataFlow Machine
The JavaFlow, a Java DataFlow Machine, is a machine design concept implementing a Java Virtual Machine, aimed at addressing technology roadmap issues along with the ability to effectively utilize and manage very large numbers of processing cores. Specific design challenges addressed include: design complexity, through a common set of repeatable structures; low power, by idling unused circuits and allowing sections of the chip to be powered off; clock propagation and wire limits, by using locality to bring data to processing elements and a Globally Asynchronous Locally Synchronous (GALS) design; and reliability, by allowing portions of the design to be bypassed in case of failures. A dataflow architecture is used with multiple heterogeneous networks to connect processing elements capable of executing a single Java bytecode instruction. Whole methods are cached in this dataflow fabric, and the networks plus distributed intelligence are used for their management and execution. A mesh network is used for the dataflow transfers; two ordered networks are used for management and control flow mapping; and multiple high-speed rings are used to access the storage subsystem and a controlling General Purpose Processor (GPP). Analysis of benchmarks demonstrates the potential for this design concept. The design process was initiated by analyzing SPEC JVM benchmarks, which identified a small number of methods contributing a significant percentage of the overall bytecode operations. Additional analysis established static instruction mixes to prioritize the types of processing elements used in the dataflow fabric. The overall objective of the machine is to provide multi-threading performance for Java methods deployed to this dataflow fabric. With advances in technology, it is envisioned that from 1,000 to 10,000 cores/instructions could be deployed and managed using this structure. A dataflow fabric of this size would allow all the key methods from the SPEC benchmarks to be resident. A baseline configuration is defined with a compressed dataflow structure and then compared to multiple configurations of instruction assignments and clock relationships. Using a series of methods from the SPEC benchmark running independently, the IPC (instructions per cycle) performance of the sparsely populated heterogeneous structure is 40% of the baseline. The average ratio of instructions to required nodes is 3.5. Innovative solutions to the loading and management of Java methods, along with the translation from control flow to dataflow structure, are demonstrated.
High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware
With the advances in very large scale integration (VLSI) technology, hundreds of billions of transistors can be packed into a single chip. With this increased hardware budget, how to take advantage of the available hardware resources becomes an important research area. Some researchers have shifted from the control-flow von Neumann architecture back to dataflow architecture in order to explore scalable architectures leading to multi-core systems with several hundreds of processing elements. In this dissertation, I address how the performance of modern processing systems can be improved while attempting to reduce hardware complexity and energy consumption. My research tackles both central processing unit (CPU) performance and memory subsystem performance. More specifically, I describe my research related to the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory management functions can be off-loaded from processing pipelines to further improve system performance and eliminate cache pollution caused by runtime management functions.
Implementation of a general purpose dataflow multiprocessor
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1988. By Gregory Michael Papadopoulos. Includes bibliographical references (leaves 151-155).
Analyzable dataflow executions with adaptive redundancy
Increasing performance requirements in the embedded systems domain have encouraged a shift from singlecore to multicore processors, and multicore processors are therefore widely used in embedded systems today.
Cars are an example of complex embedded systems in which the use of multicore processors is continuously increasing.
A major reason for this is to consolidate different software components on one chip and thus reduce the number of electronic control units (ECUs).
However, the de facto standard in the automotive industry, AUTOSAR (AUTomotive Open System ARchitecture), was originally designed for singlecore processors.
Although basic support for multicore processors was added, more complex architectures are currently not compatible with the software stack.
The requirements of the software components running on the ECUs of modern cars are diverse.
On the one hand, there are safety-critical tasks, like the airbag control, anti-lock braking system, electronic stability control and emergency brake assist, and on the other hand, tasks which do not have any safety-related requirements at all, for example tasks controlling the infotainment system.
Trends like autonomous driving lead to even more demanding tasks in the system since such tasks are both safety-critical and data-intensive.
As embedded applications, like those in the automotive domain, become more complex, new approaches are necessary.
Data-intensive tasks are usually tackled with large-scale computing frameworks.
In this thesis, some major concepts of such frameworks are transferred to the high-performance embedded systems domain.
For this purpose, the thesis describes a runtime environment (RTE) that is suitable for different kinds of multi- and manycore hardware architectures.
The RTE follows a dataflow execution model based on directed acyclic graphs (DAGs).
Graphs are divided into sections which are scheduled separately.
For each section, the RTE uses a DAG scheduling heuristic to compute multiple schedules covering different redundancy configurations.
This allows the RTE to dynamically change the redundancy of parts of the graph at runtime despite the use of fixed schedules.
Alternatively, the RTE also provides an online scheduler.
To specify suitable graphs, the RTE also provides a programming model which shares similarities with common large-scale computing frameworks, for example Apache Spark.
Using this programming model, three common distributed algorithms, namely Cannon's algorithm, the Cooley-Tukey algorithm and bitonic sort, were implemented.
With these three programs, the performance of the RTE was evaluated for a variety of configurations on two different hardware architectures.
The results show that the proposed RTE is able to reach the performance of established parallel computation frameworks and that, for suitable graphs with reasonable sectionings, the negative influence on the runtime is either small or non-existent.
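As a rough illustration of the abstract's central idea, the sketch below shows how a DAG section could hold one precomputed schedule per redundancy level and switch between them at runtime. It is a minimal Python sketch under assumed semantics (a replica count stands in for a full schedule), not the thesis' actual RTE API.

```python
# Minimal sketch (assumed semantics, not the thesis' RTE API) of switching
# a graph section between precomputed redundancy configurations at runtime.
# A replica count stands in for a full schedule; a real schedule would also
# fix core assignments and task order.

def run_task(task, replicas):
    """Execute `task` `replicas` times and majority-vote on the result."""
    results = [task() for _ in range(replicas)]
    return max(set(results), key=results.count)

class Section:
    def __init__(self, tasks):
        self.tasks = tasks
        # One precomputed "schedule" per redundancy level, so the dispatcher
        # can change redundancy without recomputing anything at runtime.
        self.schedules = {1: 1, 2: 2, 3: 3}

    def execute(self, redundancy):
        replicas = self.schedules[redundancy]
        return [run_task(t, replicas) for t in self.tasks]

section = Section([lambda: 2 + 2, lambda: 6 * 7])
print(section.execute(redundancy=1))  # no redundancy:     [4, 42]
print(section.execute(redundancy=3))  # triple redundancy: [4, 42]
```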
An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime
This doctoral thesis focuses on how application threads that are based on the dataflow execution model can be managed at the Network-on-Chip (NoC) level. The roots of the dataflow execution model date back to the early 1970s. Applications adhering to this program execution model follow a simple producer-consumer communication scheme for synchronising parallel thread-related activities. In a dataflow execution environment, a thread can run if and only if all its required inputs are available. Applications running on a large and complex computing environment can significantly benefit from the adoption of the dataflow model.
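The firing rule stated above (a thread runs only once all its inputs have arrived) can be made concrete with a small sketch. The Python below is an assumed, simplified rendering of producer-consumer thread synchronisation, not the thesis' runtime: each thread carries a count of pending inputs, and a producer's write decrements it until the thread becomes runnable.

```python
# Simplified, assumed rendering of the dataflow firing rule for threads
# (not the thesis' runtime): each thread tracks how many inputs are still
# missing, and a producer's write makes it runnable when the count hits zero.
from collections import deque

class DFThread:
    def __init__(self, name, n_inputs, body):
        self.name, self.body = name, body
        self.pending = n_inputs       # inputs still missing
        self.inputs = {}

ready = deque()                       # queue of runnable threads

def write_input(thread, slot, value):
    """Producer-side write of one input to a consumer thread."""
    thread.inputs[slot] = value
    thread.pending -= 1
    if thread.pending == 0:           # firing rule: all inputs available
        ready.append(thread)

def run():
    while ready:
        t = ready.popleft()
        t.body(t.inputs)

# The consumer fires only after both producers have written.
consumer = DFThread("sum", 2, lambda ins: print(ins[0] + ins[1]))
write_input(consumer, 0, 40)          # not runnable yet
write_input(consumer, 1, 2)           # now runnable
run()                                 # prints 42
```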
In the first part of the thesis, the work is focused on the thread distribution mechanism. It is shown how a scalable hash-based thread distribution mechanism can be implemented at the router level with low overheads. To enhance the support further, a tool to monitor the status of dataflow threads and a simple functional model are also incorporated into the design. Next, a software-defined NoC is proposed to manage the distribution of dataflow threads by exploiting its reconfigurability.
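A hash-based distribution rule of this kind can be computed locally at any router, which is what makes it attractive at the NoC level. The following Python sketch shows one assumed form of such a rule; the mesh size, hash function, and ID encoding are illustrative, not taken from the thesis.

```python
# One assumed form of a hash-based distribution rule (mesh size, hash, and
# ID encoding are illustrative, not taken from the thesis). Because the
# destination is a pure function of the thread ID, every router can compute
# it locally, with no central allocator.
import hashlib

CORES = [(x, y) for x in range(4) for y in range(4)]  # 4x4 mesh

def home_core(thread_id: int) -> tuple:
    digest = hashlib.sha256(thread_id.to_bytes(8, "little")).digest()
    return CORES[int.from_bytes(digest[:4], "little") % len(CORES)]

for tid in (17, 42, 99):
    print(f"thread {tid} -> core {home_core(tid)}")
```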
The second part of this work is focused more on the NoC microarchitecture level. The traditional 2D-mesh topology is combined with a standard ring to understand how such a hybrid network topology can outperform the traditional topology. Finally, a mixed-integer linear programming (MILP) based analytical model is proposed to verify whether the mapping of application threads onto the free cores is optimal. The proposed mathematical model can be used as a yardstick to verify the solution quality of the newly developed mapping policy. It is not trivial to provide a complete low-level framework for dataflow thread execution with better resource and power management; this work can, however, be considered a primary framework on which improvements can be built.
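The MILP model itself is not reproduced here; as a brute-force stand-in, the sketch below checks a mapping's optimality by enumerating all assignments of threads to free cores and comparing total communication cost on a mesh. The traffic volumes and the Manhattan-distance cost model are illustrative assumptions.

```python
# Brute-force stand-in for the MILP model (illustration only): enumerate all
# assignments of threads to free cores and check whether a given mapping
# minimizes the total communication cost on a mesh. Traffic volumes and the
# cost model are illustrative assumptions.
from itertools import permutations

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cost(mapping, traffic):
    """mapping: {thread: core}; traffic: {(src, dst): volume}."""
    return sum(v * manhattan(mapping[s], mapping[d])
               for (s, d), v in traffic.items())

def is_optimal(mapping, threads, free_cores, traffic):
    best = min(cost(dict(zip(threads, p)), traffic)
               for p in permutations(free_cores, len(threads)))
    return cost(mapping, traffic) == best

threads = ["t0", "t1", "t2"]
free_cores = [(0, 0), (0, 1), (1, 0), (1, 1)]
traffic = {("t0", "t1"): 5, ("t1", "t2"): 1}
mapping = {"t0": (0, 0), "t1": (0, 1), "t2": (1, 1)}
print(is_optimal(mapping, threads, free_cores, traffic))  # True
```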
SCALABLE TECHNIQUES FOR SCHEDULING AND MAPPING DSP APPLICATIONS ONTO EMBEDDED MULTIPROCESSOR PLATFORMS
A variety of multiprocessor architectures has proliferated, even in off-the-shelf computing platforms. To make use of these platforms, traditional implementation frameworks focus on implementing Digital Signal Processing (DSP) applications using special platform features to achieve high performance. However, due to the fast evolution of the underlying architectures, solution redevelopment is error-prone and the re-usability of existing solutions and libraries is limited. In this thesis, we facilitate an efficient migration of DSP systems to multiprocessor platforms while systematically leveraging previous investment in optimized library kernels using dataflow design frameworks. We make these library elements, which are typically tailored to specialized architectures, more amenable to extensive analysis and optimization through an efficient and systematic process. In this thesis we provide techniques to enable such migration through four basic contributions:
1. We propose and develop a framework to explore efficient utilization of Single Instruction Multiple Data (SIMD) cores and accelerators available in heterogeneous multiprocessor platforms consisting of General Purpose Processors (GPPs) and Graphics Processing Units (GPUs). We also propose new scheduling techniques by applying extensive block processing in conjunction with appropriate task mapping and task ordering methods that match efficiently with the underlying architecture. The approach gives the developer the ability to prototype a GPU-accelerated application and explore its design space efficiently and effectively.
2. We introduce the concept of Partial Expansion Graphs (PEGs) as an implementation model and associated class of scheduling strategies (a simplified sketch of partial expansion follows this list). PEGs are designed to help realize DSP systems in terms of forms and granularities of parallelism that are well matched to the given applications and targeted platforms. PEGs also facilitate derivation of both static and dynamic scheduling techniques, depending on the amount of variability in task execution times and other operating conditions. We show how to implement efficient PEG-based scheduling methods using real-time operating systems, and to re-use pre-optimized libraries of DSP components within such implementations.
3. We develop new algorithms for scheduling and mapping systems implemented using PEGs. Collectively, these algorithms operate in three steps. First, the amount of data parallelism in the application graph is tuned systematically over many iterations to profit from the available cores in the target platform. Then a mapping algorithm that uses graph analysis is developed to distribute data and task parallel instances over different cores while trying to balance the load of all processing units to make use of pipeline parallelism. Finally, we use a novel technique for performance evaluation by implementing the scheduler and a customizable solution on the programmable platform. This allows accurate fitness functions to be measured and used to drive runtime adaptation of schedules.
4. In addition to providing scheduling techniques for the mentioned applications and platforms, we also show how to integrate the resulting solution in the underlying environment. This is achieved by leveraging existing libraries and applying the GPP-GPU scheduling framework to augment a popular existing Software Defined Radio (SDR) development environment -- GNU Radio -- with a dataflow foundation and a stand-alone GPU-accelerated library. We also show how to realize the PEG model on real-time operating system libraries, such as the Texas Instruments DSP/BIOS. A code generator that accepts a manual system designer solution as well as automatically configured solutions is provided to complete the design flow, starting from the application model and ending with a running system.
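The core intuition behind partial expansion, referenced in contribution 2, is to replicate a data-parallel actor into a bounded number of instances tuned to the platform rather than fully unrolling the graph. The Python sketch below is a deliberately simplified rendering of that intuition; the actor, token stream, and instance count are illustrative, and the PEG model itself also covers buffering and static/dynamic scheduling.

```python
# Deliberately simplified rendering of partial expansion (the PEG model also
# covers buffering and static/dynamic scheduling): a data-parallel actor is
# replicated into a bounded number of parallel instances, tuned to the core
# count rather than fully unrolled.
from concurrent.futures import ThreadPoolExecutor

def partially_expand(actor, tokens, instances):
    """Run `instances` parallel copies of `actor` over the token stream."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        return list(pool.map(actor, tokens))

# Example: a toy kernel expanded to 4 instances over a stream of 8 blocks.
blocks = [list(range(i, i + 8)) for i in range(0, 64, 8)]
print(partially_expand(lambda b: sum(b), blocks, instances=4))
```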
Refactoring for introducing and tuning parallelism for heterogeneous multicore machines in Erlang
This research has been generously supported by the European Union Framework 7 ParaPhrase project (IST-288570); EU Horizon 2020 projects RePhrase (H2020-ICT-2014-1), agreement number 644235, and Teamplay (H2020-ICT 2017-1), agreement number 779882; EPSRC Discovery, EP/P020631/1; EU COST Action IC1202: Timing Analysis On Code-Level (TACLe); and a travel grant from EU HiPEAC.

This paper presents semi-automatic software refactorings to introduce and tune structured parallelism in sequential Erlang code, as well as to generate code for running computations on GPUs and possibly other accelerators. Our refactorings are based on the Lapedo framework for programming heterogeneous multi-core systems in Erlang. Lapedo is based on the PaRTE refactoring tool and also contains (1) a set of hybrid skeletons that target both CPU and GPU processors, (2) novel refactorings for introducing and tuning parallelism, and (3) a tool to generate the GPU offloading and scheduling code in Erlang, which is used as a component of the hybrid skeletons. We demonstrate, on four realistic use-case applications, that we are able to refactor sequential code and produce heterogeneous parallel versions that can achieve significant and scalable speedups of up to 220 over the original sequential Erlang program on a 24-core machine with a GPU.
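Hybrid skeletons of the kind Lapedo provides split a task farm's input between CPU workers and a GPU offload path, with the split ratio as the tuning knob. The sketch below expresses that idea in Python for consistency with the other sketches here (Lapedo itself is Erlang); the function names, worker count, and split ratio are illustrative assumptions, not the Lapedo API.

```python
# The hybrid-skeleton idea in a Python sketch (Lapedo itself is Erlang; the
# function names and split ratio are illustrative assumptions, not the
# Lapedo API): a farm divides its input between CPU workers and a single
# batched GPU offload call, with the ratio as the tuning knob.
from concurrent.futures import ThreadPoolExecutor

def hybrid_farm(cpu_fn, gpu_fn, items, gpu_ratio=0.5, cpu_workers=4):
    """Process `items`, sending a tunable fraction down the GPU path."""
    split = int(len(items) * gpu_ratio)
    gpu_part, cpu_part = items[:split], items[split:]
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        cpu_out = list(pool.map(cpu_fn, cpu_part))
    gpu_out = gpu_fn(gpu_part)        # stands in for one batched offload
    return gpu_out + cpu_out

# Example: square a list, offloading 30% of it to the "GPU" path.
print(hybrid_farm(lambda x: x * x, lambda xs: [x * x for x in xs],
                  list(range(10)), gpu_ratio=0.3))
```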