SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4X to 23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6X to 13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X (0.2X to 11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
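The three-phase decomposition named in the abstract above (Model-Evaluation, Matrix-Solve, Iteration Control) can be illustrated with a minimal software sketch of one Newton-Raphson DC solve. This is not the thesis's FPGA design: the two-node diode circuit, the component values, the function names, and the dense Gaussian elimination standing in for a sparse solver are all assumptions chosen for illustration.

```python
import math


def evaluate_diode(v, i_sat=1e-14, v_t=0.02585):
    """Model-Evaluation: diode current and small-signal conductance at voltage v."""
    e = math.exp(v / v_t)
    return i_sat * (e - 1.0), (i_sat / v_t) * e


def solve_dense(a, b):
    """Matrix-Solve stand-in: Gaussian elimination with partial pivoting.

    A real simulator would use a sparse LU factorization here.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (b[row] - s) / a[row][row]
    return x


def dc_operating_point(v_src=5.0, r1=1e3, r2=1e3, tol=1e-9, max_iter=500):
    """Iteration Control: repeat evaluate + solve until node voltages settle.

    Hypothetical circuit: source -> R1 -> node 0 -> R2 -> node 1 -> diode -> ground.
    """
    v = [0.0, 0.0]
    for iteration in range(1, max_iter + 1):
        i_d, g_d = evaluate_diode(v[1])
        i_eq = i_d - g_d * v[1]  # current source of the linearized diode model
        g = [[1 / r1 + 1 / r2, -1 / r2],
             [-1 / r2, 1 / r2 + g_d]]
        rhs = [v_src / r1, -i_eq]
        v_new = solve_dense(g, rhs)
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new, iteration
        v = v_new
    raise RuntimeError("Newton iteration did not converge")
```

Calling `dc_operating_point()` returns the node voltages and the iteration count. In this sketch the device evaluations are independent of one another, which is the kind of data parallelism the abstract attributes to the Model-Evaluation phase.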
Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis
Dynamically scheduled high-level synthesis (HLS) achieves higher throughput
than static HLS for codes with unpredictable memory accesses and control flow.
However, excessive dataflow scheduling results in circuits that use more
resources and have a slower critical path, even when only a part of the circuit
exhibits dynamic behavior. Recent work has shown that marking parts of a
dataflow circuit for static scheduling can save resources and improve
performance (hybrid scheduling), but the dynamic part of the circuit still
bottlenecks the critical path. We propose instead to selectively introduce
dynamic scheduling into static HLS. This paper presents an algorithm for
identifying code regions amenable to dynamic scheduling and shows a methodology
for introducing dynamically scheduled basic blocks, loops, and memory
operations into static HLS. Our algorithm is informed by modulo-scheduling and
can be integrated into any modulo-scheduled HLS tool. On a set of ten
benchmarks, we show that our approach achieves, on average, up to 3.7×
and 3× speedup against dynamic and hybrid scheduling, respectively, with
an area overhead of 1.3× and frequency degradation of 0.74× when
compared to static HLS.
Comment: To appear in the 33rd International Conference on Field-Programmable
Logic and Applications (2023)
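Since the abstract above says the algorithm is informed by modulo scheduling, a short sketch of the standard lower bound on the initiation interval (II) may help show why unpredictable memory accesses hurt static HLS. The function names, operation counts, and latencies below are hypothetical and are not taken from the paper; the sketch only computes the textbook resource and recurrence bounds.

```python
import math


def res_mii(op_counts, unit_counts):
    """Resource bound: each unit type must fit its operations into II slots."""
    return max(math.ceil(op_counts[u] / unit_counts[u]) for u in op_counts)


def rec_mii(recurrences):
    """Recurrence bound over (cycle_latency, iteration_distance) pairs."""
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval a modulo scheduler can hope to reach."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Hypothetical loop body `hist[idx[i]] += w[i]`: three memory ops, one add.
OPS = {"mem": 3, "add": 1}
UNITS = {"mem": 2, "add": 1}

# Load -> add -> store chain with an assumed total latency of 4 cycles.
# If the compiler must assume consecutive iterations may alias, distance = 1.
II_CONSERVATIVE = min_ii(OPS, UNITS, [(4, 1)])
# If aliasing iterations were known to be at least 4 apart, distance = 4.
II_KNOWN_DISTANCE = min_ii(OPS, UNITS, [(4, 4)])
```

Under these assumed numbers the conservative static schedule is limited by the recurrence, while the version with a known dependence distance is limited only by memory ports. Closing that gap at runtime is the kind of opportunity the abstract associates with dynamic scheduling.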
Reconfigurable Video Coding on multicore : an overview of its main objectives
The current monolithic and lengthy scheme behind the standardization and the design of new video coding standards is becoming inappropriate to satisfy the dynamism and changing needs of the video coding community. Such a scheme and specification formalism does not allow the clear commonalities between the different codecs to be shown, either at the level of the specification or at the level of the implementation. This problem is one of the main reasons for the typically long interval between the time a new idea is validated and the time it is implemented in consumer products as part of a worldwide standard. The analysis of this problem originated a new standard initiative within the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) committee, namely Reconfigurable Video Coding (RVC). The main idea is to develop a video coding standard that overcomes many shortcomings of the current standardization and specification process by updating and progressively incrementing a modular library of components. As the name implies, flexibility and reconfigurability are new attractive features of the RVC standard. Besides allowing for the definition of new codec algorithms, such features, as well as the dataflow-based specification formalism, open the way to define video coding standards that expressly target implementations on platforms with multiple cores. This article provides an overview of the main objectives of the new RVC standard, with an emphasis on the features that enable efficient implementation on platforms with multiple cores. A brief introduction to the methodologies that efficiently map RVC codec specifications to multicore platforms is accompanied by an example of the possible breakthroughs that are expected to occur in the design and deployment of multimedia services on multicore platforms.
A High-Frequency Load-Store Queue with Speculative Allocations for High-Level Synthesis
Dynamically scheduled high-level synthesis (HLS) enables the use of
load-store queues (LSQs) which can disambiguate data hazards at circuit
runtime, increasing throughput in codes with unpredictable memory accesses.
However, the increased throughput comes at the price of lower clock frequency
and higher resource usage compared to statically scheduled circuits without
LSQs. The lower frequency often nullifies any throughput improvements over
static scheduling, while the resource usage becomes prohibitively expensive
with large queue sizes. This paper presents a method for achieving dynamically
scheduled memory operations in HLS without significant clock period and
resource usage increase. We present a novel LSQ based on shift-registers
enabled by the opportunity to specialize queue sizes to a target code in HLS.
We show a method to speculatively allocate addresses to our LSQ, significantly
increasing pipeline parallelism in codes that could not benefit from an LSQ
before. In stark contrast to traditional load value speculation, we do not
require pipeline replays and have no overhead on misspeculation. On a set of
benchmarks with data hazards, our approach achieves an average speedup of
11× against static HLS and 5× against dynamic HLS that uses a
state-of-the-art LSQ from previous work. Our LSQ also uses several times fewer
resources, scaling to queues with hundreds of entries, and supports both
on-chip and off-chip memory.
Comment: To appear in the International Conference on Field Programmable
Technology (FPT'23), Yokohama, Japan, 11-14 December 202
Analyzable dataflow executions with adaptive redundancy
Increasing performance requirements in the embedded systems domain have encouraged a drift from singlecore to multicore processors, and thus multicore processors are widely used in embedded systems today.
Cars are an example of complex embedded systems in which the use of multicore processors is continuously increasing.
A major reason for this is to consolidate different software components on one chip and thus reduce the number of electronic control units (ECUs).
However, the de facto standard in the automotive industry, AUTOSAR (AUTomotive Open System ARchitecture), was originally designed for singlecore processors.
Although basic support for multicore processors was added, more complex architectures are currently not compatible with the software stack.
The requirements of the software components running on the ECUs of modern cars are diverse.
On the one hand, there are safety-critical tasks, like the airbag control, anti-lock braking system, electronic stability control and emergency brake assist, and on the other hand, tasks which do not have any safety-related requirements at all, for example tasks controlling the infotainment system.
Trends like autonomous driving lead to even more demanding tasks in the system since such tasks are both safety-critical and data-intensive.
As embedded applications, like those in the automotive domain, become more complex, new approaches are necessary.
Data-intensive tasks are usually tackled with large-scale computing frameworks.
In this thesis, some major concepts of such frameworks are transferred to the high-performance embedded systems domain.
For this purpose, the thesis describes a runtime environment (RTE) that is suitable for different kinds of multi- and manycore hardware architectures.
The RTE follows a dataflow execution model based on directed acyclic graphs (DAGs).
Graphs are divided into sections which are scheduled separately.
For each section, the RTE uses a DAG scheduling heuristic to compute multiple schedules covering different redundancy configurations.
This allows the RTE to dynamically change the redundancy of parts of the graph at runtime despite the use of fixed schedules.
Alternatively, the RTE also provides an online scheduler.
To specify suitable graphs, the RTE also provides a programming model which shares similarities with common large-scale computing frameworks, for example Apache Spark.
Using this programming model, three common distributed algorithms, namely Cannon's algorithm, the Cooley-Tukey algorithm and bitonic sort, were implemented.
With these three programs, the performance of the RTE was evaluated for a variety of configurations on two different hardware architectures.
The results show that the proposed RTE is able to reach the performance of established parallel computation frameworks and that for suitable graphs with reasonable sectionings the negative influence on the runtime is either small or non-existent.
Due to rising performance requirements for embedded systems, multicore processors are now also used in embedded systems.
Cars are an example of embedded systems in which the use of multicore processors is continuously increasing.
A main reason is that this makes it possible to run several applications that originally required several electronic control units (ECUs) on one and the same chip, thereby reducing the number of ECUs in the overall system.
However, the de facto standard AUTOSAR (AUTomotive Open System ARchitecture) was originally designed with only single-core processors in mind and, although the software stack has been extended with basic support for multicore processors, more complex architectures are not compatible with it.
The requirements of the software components of modern cars are diverse.
On the one hand, there are highly safety-critical tasks that control, for example, the airbags, the anti-lock braking system, the electronic stability control, or the emergency brake assist; on the other hand, there are tasks with no safety-critical requirements at all, such as tasks controlling the infotainment system.
New trends such as autonomous driving lead to further demanding tasks that have both high performance and high safety requirements.
As the complexity of embedded applications, for example in the automotive domain, steadily increases, new approaches are required.
Complex, data-intensive tasks are usually handled with cluster computing frameworks.
In this thesis, concepts of such frameworks are transferred to the embedded systems domain.
To this end, the thesis describes a runtime environment (RTE) for embedded multicore architectures.
The RTE follows a dataflow execution model based on directed acyclic graphs.
Graphs can be divided into sections, for which several schedules with different degrees of redundancy are computed separately using a scheduling heuristic.
This approach makes it possible to change the redundancy of parts of the application at runtime.
Alternatively, the RTE also supports scheduling at runtime.
To create graphs, the RTE provides a programming model that is modeled on established frameworks, in particular Apache Spark.
With it, three example applications based on common algorithms were implemented.
Specifically, these are Cannon's algorithm, the Cooley-Tukey algorithm, and bitonic sort.
To determine the performance of the RTE, these three applications were executed multiple times with different configurations on two hardware architectures.
The results show that the RTE is comparable in performance to established systems and that, with a sensible graph partitioning, the runtime is in the best case only slightly affected.
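The idea of precomputing one fixed schedule per redundancy configuration, as described in the abstract above, can be sketched with a simple greedy list scheduler. The thesis does not specify its heuristic here, so the scheduling policy, the function name `list_schedule`, and the example graph are assumptions for illustration only.

```python
def list_schedule(durations, deps, num_cores, redundancy=1):
    """Greedy list scheduling of a DAG section.

    Every task runs `redundancy` replicas on distinct cores, and a successor
    may start only after all replicas of its predecessors have finished.
    Returns (schedule, makespan), where schedule maps each task to a list of
    (core, start, end) tuples.
    """
    if redundancy > num_cores:
        raise ValueError("need at least one core per replica")
    indegree = {t: len(deps.get(t, ())) for t in durations}
    successors = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            successors[p].append(task)
    core_free = [0] * num_cores
    finish, schedule = {}, {}
    ready = sorted(t for t in durations if indegree[t] == 0)
    while ready:
        task = ready.pop(0)
        earliest = max((finish[p] for p in deps.get(task, ())), default=0)
        # Use the cores that become free first, one per replica.
        cores = sorted(range(num_cores), key=lambda c: (core_free[c], c))
        placements = []
        for core in cores[:redundancy]:
            start = max(earliest, core_free[core])
            end = start + durations[task]
            core_free[core] = end
            placements.append((core, start, end))
        schedule[task] = placements
        finish[task] = max(end for _, _, end in placements)
        for s in successors[task]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
        ready.sort()
    return schedule, max(finish.values(), default=0)


# Hypothetical diamond-shaped section: a -> (b, c) -> d.
DURATIONS = {"a": 2, "b": 3, "c": 1, "d": 2}
DEPS = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}

# One fixed schedule per redundancy level, computed ahead of time.
SCHEDULES = {r: list_schedule(DURATIONS, DEPS, num_cores=2, redundancy=r)
             for r in (1, 2)}
```

With the schedules stored per redundancy level, a runtime could switch a section between them without rescheduling. In this example the duplicated schedule is longer because both cores are occupied by replicas.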
Static Analysis and Transformation of Dataflow Multimedia Applications
An approach for merging statically schedulable subregions in dataflow models is presented. The approach combines abstract interpretation, loop analysis, and static scheduling of cyclo-static dataflow networks. The approach has been implemented in a Java-based tool that performs automatic classification of dataflow actors, generation of static schedules using constraint programming, and automatic merging of the fine-grained actors in the subnetwork into a single, larger-grained actor. The approach is applied to an MPEG-4 SP video decoder implemented in the dataflow actors language CAL
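A static schedule for a dataflow subnetwork first requires knowing how often each actor fires per schedule period. The sketch below solves the balance equations for the simpler synchronous dataflow case (fixed rates per edge) rather than the cyclo-static case used in the paper, and it does not use constraint programming. The function name and the example network are hypothetical.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(actors, edges):
    """Smallest positive firing counts q with q[src] * prod == q[dst] * cons.

    `edges` holds (src, dst, tokens_produced, tokens_consumed) tuples.
    Raises ValueError if the rates are inconsistent or the graph is not
    connected, in which case no bounded periodic schedule exists.
    """
    neighbours = {a: [] for a in actors}
    for src, dst, prod, cons in edges:
        neighbours[src].append((dst, Fraction(prod, cons)))
        neighbours[dst].append((src, Fraction(cons, prod)))
    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, ratio in neighbours[a]:
            expected = rate[a] * ratio
            if b not in rate:
                rate[b] = expected
                stack.append(b)
            elif rate[b] != expected:
                raise ValueError("inconsistent rates: no bounded static schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integer solution.
    scale = 1
    for f in rate.values():
        scale = scale * f.denominator // gcd(scale, f.denominator)
    counts = {a: int(rate[a] * scale) for a in actors}
    common = 0
    for c in counts.values():
        common = gcd(common, c)
    return {a: c // common for a, c in counts.items()}


# Hypothetical three-actor chain: A produces 2, B consumes 3 and produces 1,
# C consumes 2.
CHAIN = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
```

For this chain the result says A fires three times, B twice, and C once per period, so every buffer returns to its initial fill level. A subnetwork with such a solution is a candidate for being merged into one larger-grained actor with a fixed internal firing order.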
Dynamic Task Execution on Shared and Distributed Memory Architectures
Multicore architectures with high core counts have come to dominate the world of high performance computing, from shared memory machines to the largest distributed memory clusters. The multicore route to increased performance has a simpler design and better power efficiency than the traditional approach of increasing processor frequencies. However, standard programming techniques are not well adapted to this change in computer architecture design.
In this work, we study the use of dynamic runtime environments executing data driven applications as a solution to programming multicore architectures. The goals of our runtime environments are productivity, scalability and performance. We demonstrate productivity by defining a simple programming interface to express code. Our runtime environments are experimentally shown to be scalable and give competitive performance on large multicore and distributed memory machines.
This work is driven by linear algebra algorithms, where state-of-the-art libraries (e.g., LAPACK and ScaLAPACK) using a fork-join or block-synchronous execution style do not use the available resources in the most efficient manner. Research work in linear algebra has reformulated these algorithms as tasks acting on tiles of data, with data dependency relationships between the tasks. This results in a task-based DAG for the reformulated algorithms, which can be executed via asynchronous data-driven execution paths analogous to dataflow execution.
We study an API and runtime environment for shared memory architectures that efficiently executes serially presented tile-based algorithms. This runtime is used to enable linear algebra applications and is shown to deliver performance competitive with state-of-the-art commercial and research libraries.
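To illustrate how a serially presented tile algorithm becomes a task DAG, the sketch below lists the tasks of a tiled Cholesky factorization in program order and derives dependencies from the tiles each task reads and writes. The function names and the dependency-tracking scheme are assumptions for illustration and do not reproduce the API of the runtime described above.

```python
def tiled_cholesky_tasks(nt):
    """Yield (name, read_tiles, written_tiles) in serial program order.

    Tiles are (row, col) indices of an nt x nt lower-triangular tile matrix.
    Written tiles are read-modify-write.
    """
    for k in range(nt):
        yield (f"POTRF({k})", [], [(k, k)])
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], [(i, k)])
        for i in range(k + 1, nt):
            yield (f"SYRK({i},{k})", [(i, k)], [(i, i)])
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)])


def build_dag(tasks):
    """Map each task to the set of earlier tasks it must wait for."""
    last_writer = {}   # tile -> most recent task that wrote it
    readers = {}       # tile -> tasks that read it since the last write
    deps = {}
    for name, reads, writes in tasks:
        wait_for = set()
        for tile in reads:                      # read-after-write
            if tile in last_writer:
                wait_for.add(last_writer[tile])
        for tile in writes:
            if tile in last_writer:             # write-after-write
                wait_for.add(last_writer[tile])
            wait_for.update(readers.get(tile, ()))  # write-after-read
        for tile in reads:
            readers.setdefault(tile, []).append(name)
        for tile in writes:
            last_writer[tile] = name
            readers[tile] = []
        deps[name] = wait_for
    return deps


DAG_3X3 = build_dag(tiled_cholesky_tasks(3))
```

In the resulting graph for a 3 x 3 tile matrix, `TRSM(1,0)` and `TRSM(2,0)` both wait only on `POTRF(0)`, so a data-driven runtime could run them concurrently instead of waiting at a fork-join barrier.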
We develop a runtime environment for distributed memory multicore architectures extended from our shared memory implementation. The runtime takes serially presented algorithms designed for the shared memory environment, and schedules and executes them on distributed memory architectures in a scalable and high performance manner. We design a distributed data coherency protocol and a distributed task scheduling mechanism which avoid global coordination. Experimental results with linear algebra applications show the scalability and performance of our runtime environment