57 research outputs found

    Efficient design space exploration of embedded microprocessors

    Timing model derivation: static analysis of hardware description languages

    Safety-critical hard real-time systems are subject to strict timing constraints. In order to derive guarantees on the timing behavior, the worst-case execution time (WCET) of each task comprising the system has to be known. The aiT tool has been developed for computing safe upper bounds on the WCET of a task. Its computation is mainly based on abstract interpretation of timing models of the processor and its periphery. These models are currently hand-crafted by human experts, which is a time-consuming and error-prone process. Modern processors are automatically synthesized from formal hardware specifications, and besides the processor's functional behavior, these descriptions also include timing aspects. This thesis describes a methodology to derive sound timing models from hardware specifications. To ease the process of timing model derivation, the methodology is embedded into a sound framework. A key part of this framework is static analysis of hardware specifications. The thesis presents an analysis framework that is built on the theory of abstract interpretation, allowing classical program analyses to be applied to hardware description languages. Its suitability to automate parts of the derivation methodology is shown by several analyses. Practical experiments demonstrate the applicability of the approach to deriving timing models. The soundness of the analyses and of their results is also proved.
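    The analyses in this thesis are instances of abstract interpretation. As a rough illustration of the underlying idea only (a hypothetical sketch, not the thesis's actual analyses on hardware description languages), the following C fragment evaluates a tiny computation over an interval domain, where each abstract value soundly over-approximates the set of concrete values a variable or signal could take.

```c
/* Minimal illustration of abstract interpretation over an interval domain.
 * Hypothetical sketch of the general technique, not the thesis's analyses. */
#include <stdio.h>

typedef struct { int lo, hi; } Interval;   /* abstract value: [lo, hi] */

/* Abstract transfer function for addition: a sound over-approximation. */
static Interval add(Interval a, Interval b) {
    Interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Join of two abstract values, e.g. at a control-flow merge point. */
static Interval join(Interval a, Interval b) {
    Interval r = { a.lo < b.lo ? a.lo : b.lo,
                   a.hi > b.hi ? a.hi : b.hi };
    return r;
}

int main(void) {
    Interval x = { 0, 3 };                  /* input known to lie in [0, 3] */
    Interval y = { 1, 2 };                  /* input known to lie in [1, 2] */
    Interval then_v = add(x, y);            /* one branch computes x + y    */
    Interval else_v = { 10, 10 };           /* the other assigns constant 10 */
    Interval merged = join(then_v, else_v); /* sound result after the merge */
    printf("[%d, %d]\n", merged.lo, merged.hi);  /* prints [1, 10] */
    return 0;
}
```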

    Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism

    The shift of the microprocessor industry towards multicore architectures has placed a huge burden on programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease this burden by parallelizing applications "under the covers" while maintaining sequential semantics externally. This thesis develops a novel approach for thinking about parallelism by casting the problem of parallelization in terms of instruction criticality. Using this approach, parallelism in a program region is readily identified when certain conditions about fetch-criticality are satisfied by the region. The thesis formalizes this approach by developing a criticality-driven model of task-based parallelization. The model can accurately predict the parallelism that would be exposed by potential task choices by capturing a wide set of sources of parallelism as well as costs of parallelization. The criticality-driven model enables the development of two key components for Implicit Parallelization: a task selection policy and a bottleneck analysis tool. The task selection policy can partition a single-threaded program into tasks that will profitably execute concurrently on a multicore architecture in spite of the costs associated with enforcing data dependences and with task-related actions. The bottleneck analysis tool gives feedback to programmers about data dependences that limit parallelism. In particular, there are several "accidental dependences" that can be easily removed with large improvements in parallelism; an illustration follows below. These tools combine into a systematic methodology for performance tuning in Implicit Parallelization. Finally, armed with the criticality-driven model, the thesis revisits several architectural design decisions and finds several encouraging ways forward to increase the scope of Implicit Parallelization.
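    The bottleneck analysis tool itself is not detailed in this abstract, but the kind of "accidental dependence" it targets can be illustrated with a hypothetical C loop: a scalar temporary reused across iterations creates a loop-carried dependence that serializes the loop, and privatizing the temporary removes it.

```c
/* Hypothetical illustration of an "accidental" loop-carried dependence.
 * The scalar t is reused across iterations, so every iteration appears to
 * depend on the previous one even though the computed values are independent. */
void scale_serial(const float *in, float *out, int n, float k) {
    float t = 0.0f;
    for (int i = 0; i < n; i++) {
        t = in[i] * k;       /* write to the shared temporary ...        */
        out[i] = t + 1.0f;   /* ... forces a sequential iteration order  */
    }
}

/* Privatizing the temporary removes the dependence: iterations become
 * independent and can be distributed across cores or tasks. */
void scale_parallel(const float *in, float *out, int n, float k) {
    for (int i = 0; i < n; i++) {
        float t = in[i] * k;   /* t is private to each iteration */
        out[i] = t + 1.0f;
    }
}
```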

    Porting and tuning of the Mont-Blanc benchmarks to the multicore ARM 64bit architecture

    This project is about porting and tuning the Mont-Blanc benchmarks to the multicore ARM 64-bit architecture. The Mont-Blanc benchmarks are part of the Mont-Blanc European project and have been developed internally at BSC (Barcelona Supercomputing Center). The project will explore the possibilities that an ARM architecture can offer in an HPC (High Performance Computing) setup; this includes learning how to tune and adapt a parallelized program and how to analyze its execution behavior. As part of the project, we will analyze the performance of each benchmark using instrumentation tools such as Extrae and Paraver. Each benchmark will be adapted, tuned, and executed mainly on the three new Mont-Blanc mini-clusters, Thunder (custom ARMv8), Merlin (custom ARMv8), and Jetson TX (ARMv8 Cortex-A57), using the OmpSs programming model. The evolution of the obtained performance will be shown, followed by a brief analysis of the results after each optimization.
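    As a rough illustration of the OmpSs programming model used for these ports (a hypothetical sketch, not code from the Mont-Blanc benchmark sources; the function name vec_scale and block size BS are invented), a loop can be decomposed into tasks whose in/out clauses let the runtime build the task dependence graph and schedule ready blocks across the available ARM cores.

```c
/* Hypothetical OmpSs-style task decomposition of a blocked vector operation. */
#include <stddef.h>

#define BS 1024  /* block size: one task per block */

void vec_scale(float *a, float *b, size_t n, float k) {
    for (size_t i = 0; i < n; i += BS) {
        size_t bs = (n - i < BS) ? (n - i) : BS;
        /* The in/out clauses declare the data each task reads and writes,
         * so the OmpSs runtime can order dependent tasks and run
         * independent ones concurrently. */
        #pragma omp task in(a[i;bs]) out(b[i;bs]) firstprivate(i, bs, k)
        for (size_t j = i; j < i + bs; j++)
            b[j] = a[j] * k;
    }
    #pragma omp taskwait  /* wait for all outstanding tasks */
}
```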

    Efficient Real-Time Architectures and FPGA Implementations of Histogram-Based Median Filters for High Definition Videos

    Digital filtering plays an important role in many signal processing applications. Filtering is performed to recover the original signal from its corrupted version. A median filter is a non-linear digital filter that replaces a sample in a given window by the median value of the samples in the window. For images corrupted by impulse noise, median filters provide very high filtered-image quality. Several modifications of the median filter have been proposed and implemented to achieve higher image quality than conventional median filters provide. When these filters are implemented on hardware platforms such as FPGAs, the performance parameters, namely the area, power, and operating frequency, should be taken into consideration in addition to the quality of the filtered image. Therefore, efficient implementation of median filters on FPGAs for image and video processing algorithms has been a topic of much interest. Existing hardware-based median filters for high definition video formats do not always satisfy the real-time throughput requirements or are inefficient with respect to hardware performance parameters such as area and frequency. This is because most existing techniques use sorting-based median calculation, which results in low hardware performance. In this thesis, architectures that use histogram-based median computation, a non-sorting operation, are designed with a view to efficient hardware implementation. This is carried out in two parts: we design and implement efficient architectures that satisfy the real-time throughput requirements of full high definition (FHD) videos in the first part and of ultra high definition (UHD) videos in the second. In the first part, an efficient real-time histogram-based median filter that uses the concept of bit-plane slicing and the adaptive switching median filter (ASMF) is designed and implemented on FPGAs. We term this architecture the hybrid architecture for median filtering (HAMF). The proposed HAMF computes an approximate median, since it uses only the most significant B bits of the pixel values for median calculation. As a result, the algorithmic-level implementation of the proposed HAMF results in a slight degradation of filtered-image quality compared to that provided by ASMF. The proposed HAMF provides a significant improvement over ASMF in terms of area and operating frequency when implemented on FPGAs of different generations. Analysis of the different parameters, such as the number of bit-planes used in computing the median and the number of pipelining stages, is carried out to study the trade-off between the quality of the filtered image and hardware performance. Although the FPGA implementation of the proposed HAMF provides a very high operating frequency, the quality of the images filtered by its algorithmic-level implementation decreases with increasing window size and noise density. This filter may be suitable for applications that require FHD filtering under cost constraints, but not for applications where output image quality is as important as hardware performance. Hence, in the second part, we design an efficient real-time architecture for the hierarchical histogram-based median filter (HHMF). The proposed architecture uses a fully synchronous pipeline and a synchronous accumulate-and-compare unit, and it is scalable. The FPGA implementation of the proposed HHMF architecture can perform real-time filtering of 4K and 8K UHD videos. The quality of the image filtered by HHMF is not compromised as in the case of HAMF, since HHMF uses all the bit-planes and computes the actual median. Although the FPGA implementation of HHMF results in higher area utilization, it is more economical than a GPU-based HHMF implementation and provides better throughput.
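    The core of histogram-based (non-sorting) median computation can be sketched in C for 8-bit pixels: the window's pixel values are counted into a 256-bin histogram, and an accumulate-and-compare step walks the bins until half of the window population is reached. This is a hypothetical software sketch of the general principle, not the proposed HAMF or HHMF hardware architectures.

```c
/* Histogram-based median of an 8-bit pixel window (software sketch of the
 * non-sorting principle behind the proposed architectures). */
#include <stdint.h>

uint8_t window_median(const uint8_t *win, int n) {
    int hist[256] = {0};

    /* Build the histogram of the window's pixel values. */
    for (int i = 0; i < n; i++)
        hist[win[i]]++;

    /* Accumulate-and-compare: walk the bins until half of the window
     * population has been counted; that bin is the median value. */
    int acc = 0, half = (n + 1) / 2;
    for (int v = 0; v < 256; v++) {
        acc += hist[v];
        if (acc >= half)
            return (uint8_t)v;
    }
    return 255;  /* unreachable for n > 0 */
}
```

    Restricting the computation to the most significant B bits of each pixel, as HAMF does, shrinks the histogram to 2^B bins at the cost of an approximate median, which is the quality/area trade-off discussed above.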

    A Dynamically Scheduled HLS Flow in MLIR

    In High-Level Synthesis (HLS), we consider abstractions that span from software to hardware and target heterogeneous architectures. Managing the complexity this introduces is therefore key to implementing good, maintainable, and extensible HLS compilers. Traditionally, HLS flows have been built on top of software compilation infrastructure such as LLVM, with the hardware aspects of the flow existing peripherally to the core of the compiler. Through this work, we aim to show that MLIR, a compiler infrastructure with a focus on domain-specific intermediate representations (IRs), is a better infrastructure for HLS compilers. Using MLIR, we define HLS and hardware abstractions as first-class citizens of the compiler, simplifying analysis, transformations, and optimization. To demonstrate this, we present a C-to-RTL, dynamically scheduled HLS flow. We find that our flow generates circuits comparable to those of an equivalent LLVM-based HLS compiler, and notably, we achieve this while lacking key optimization passes typically found in HLS compilers and while using an experimental front-end. This suggests that significant improvements in the generated RTL are low-hanging fruit, requiring only engineering effort to attain. We believe that our flow is more modular and more extensible than comparable open-source HLS compilers and is thus a good candidate as a basis for future research. Apart from the core HLS flow, we provide MLIR-based tooling for C-to-RTL cosimulation and visual debugging, with the ultimate goal of building an MLIR-based HLS infrastructure that will drive innovation in the field.
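    Dynamically scheduled HLS is typically motivated by code whose control flow and memory behavior are data-dependent, where a statically scheduled pipeline must assume the worst case. A hypothetical C kernel of that flavor (illustrative only, not taken from this work's benchmarks; the name sparse_update is invented) looks like this.

```c
/* Hypothetical kernel with data-dependent control flow and an indirect,
 * potentially aliasing memory access pattern. A statically scheduled HLS
 * pipeline must assume worst-case latencies and dependences here, whereas a
 * dynamically scheduled circuit can let independent iterations proceed. */
float sparse_update(float *hist, const int *idx, const float *w, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        if (w[i] > 0.0f) {            /* data-dependent branch              */
            hist[idx[i]] += w[i];     /* indirect write: possible           */
            acc += hist[idx[i]];      /* read-after-write through memory    */
        }
    }
    return acc;
}
```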

    Unconventional Applications of Compiler Analysis

    Previously, compiler transformations have primarily focused on minimizing program execution time. This thesis explores some examples of applying compiler technology outside of its original scope. Specifically, we apply compiler analysis to the field of software maintenance and evolution by examining the use of global data throughout the lifetimes of many open-source projects. We also investigate the effects of compiler optimizations on the power consumption of small battery-powered devices. Finally, in an area closer to traditional compiler research, we examine automatic program parallelization in the form of thread-level speculation.

    A New Approach to Learning in Neuromorphic Hardware

    This thesis presents a novel, highly flexible approach to plasticity and learning in brain-inspired computing systems. A classical digital processor was combined with local analog processing to achieve flexibility and efficiency. In particular, this allows for the implementation of modulated spike-timing dependent plasticity. The approach was formalized into an abstract hybrid hardware model, which was used to simulate a reward-based learning task in order to estimate the effect of hardware constraints. To investigate the feasibility of the proposed architecture, a synthesizable plasticity processor was designed and tested using the CoreMark general-purpose benchmark (best score: 1.89 per MHz). The processor was also produced as part of a 65 nm prototype chip, requiring 0.14 mm² of die area and reaching a maximum clock frequency of 769 MHz. In a preparatory step, a non-programmable plasticity implementation was developed that is now part of the operational BrainScaleS wafer-scale system. This design was later extended with the plasticity processor to implement the proposed hybrid architecture. Simulations show a speed improvement of 42% over the non-programmable variant. In preparation for production, the area requirement of the digital part is estimated at 6.2% of the total area.
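    Modulated spike-timing dependent plasticity is not spelled out in this abstract. As a rough, textbook-style sketch of the general idea (a hypothetical formulation, not the plasticity processor's actual rule), each synapse accumulates an STDP eligibility trace from pre/post spike-timing correlations, and a later reward signal scales that trace into an actual weight update.

```c
/* Hypothetical reward-modulated STDP update for one synapse. */
#include <math.h>

typedef struct {
    float w;            /* synaptic weight                          */
    float eligibility;  /* accumulated STDP trace (causal - anti)   */
} Synapse;

/* Record one pre/post spike pair: positive contribution if the presynaptic
 * spike precedes the postsynaptic one (dt = t_post - t_pre > 0). */
void stdp_event(Synapse *s, float dt, float a_plus, float a_minus, float tau) {
    if (dt > 0.0f)
        s->eligibility += a_plus * expf(-dt / tau);
    else
        s->eligibility -= a_minus * expf(dt / tau);
}

/* When a reward arrives, convert the eligibility trace into a weight change
 * scaled by the learning rate and the (signed) reward. */
void apply_reward(Synapse *s, float reward, float lr) {
    s->w += lr * reward * s->eligibility;
    s->eligibility = 0.0f;  /* reset the trace after consumption */
}
```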

    Towards an embedded board-level tester: study of a configurable test processor

    The demand for electronic systems with more features, higher performance, and lower power consumption increases continuously. This is a real challenge for design and test engineers, because they have to deal with electronic systems of ever-increasing complexity while keeping production and test costs low and meeting critical time-to-market deadlines. For a test engineer working at the board level, this means that manufacturing defects must be detected as soon as possible and at low cost. However, classical test techniques are not sufficient for testing modern printed circuit boards, and in the worst case they cannot be used at all. This is mainly due to modern packaging technologies, high device density, and the high operating frequencies of modern printed circuit boards, which lead to very long test times, low fault coverage, and high test costs. This dissertation addresses these issues and proposes an FPGA-based test approach for printed circuit boards. The concept is based on a configurable test processor that is temporarily implemented in the on-board FPGA and provides the corresponding mechanisms to communicate with external test equipment and with co-processors implemented in the FPGA. This embedded test approach provides the flexibility to implement test functions either in the external test equipment or in the FPGA. In this manner, tests are executed at speed, increasing the fault coverage, test times are reduced, and the test system can be adapted automatically to the properties of the FPGA and of the devices located on the board. An essential part of the FPGA-based test approach is the development of a test processor. This dissertation discusses the required properties of the processor and shows that adaptation to the specific test scenario plays a very important role in the optimization. For this purpose, the test processor is equipped with configuration parameters at the instruction-set-architecture and microarchitecture levels. Additionally, an automatic generation process for the test system and for the computation of some of the processor's configuration parameters is proposed. The automatic generation process uses as input a model known as the device under test model (DUT-M). In order to evaluate the entire FPGA-based test approach and the viability of a processor for testing printed circuit boards, the developed test system is used to test interconnections to two different devices: a static random-access memory (SRAM) and a liquid crystal display (LCD). Experiments were conducted to determine the resource utilization of the processor and of the FPGA-based test system, and to measure test time when different test functions are implemented in the external test equipment or in the FPGA. It has been shown that the introduced approach is suitable for testing printed circuit boards and that the test processor represents a realistic alternative for testing at the board level.
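    The abstract does not give the actual test programs. As a hypothetical sketch of the kind of interconnect test such a test processor might execute against the on-board SRAM (the function names and the probed base address are invented), a walking-ones pattern on the data bus and an address-uniqueness pass over power-of-two offsets can expose stuck-at, short, and open faults on the board-level wiring.

```c
/* Hypothetical board-level interconnect tests for a memory-mapped SRAM.
 * Illustrates the kind of test program such a processor could run; it is
 * not the test system described in the dissertation. */
#include <stdint.h>

/* Walking-ones on the data bus: detects shorted or stuck data lines. */
static int test_data_lines(volatile uint32_t *sram) {
    for (int bit = 0; bit < 32; bit++) {
        uint32_t pattern = 1u << bit;
        sram[0] = pattern;
        if (sram[0] != pattern)
            return -1;  /* data-line fault */
    }
    return 0;
}

/* Address-uniqueness test on power-of-two offsets: detects shorted or stuck
 * address lines by checking that distinct addresses remain distinct. */
static int test_address_lines(volatile uint32_t *sram, uint32_t num_words) {
    for (uint32_t off = 1; off < num_words; off <<= 1)
        sram[off] = off;             /* tag each probed address          */
    sram[0] = 0xA5A5A5A5u;           /* distinct value at the base word  */
    for (uint32_t off = 1; off < num_words; off <<= 1)
        if (sram[off] != off)
            return -1;               /* address-line fault (aliasing)    */
    return 0;
}
```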