
26 research outputs found

Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

Author: Paim Guilherme Pereira
Publication date: 01/01/2021

Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative that improves performance and energy efficiency for many applications by trading their intrinsic error tolerance for algorithm and circuit efficiency. Exact computing also imposes worst-case timing on the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing the pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: how do these timing errors propagate from the underlying hardware all the way up to the behavior of the entire algorithm, where they may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of investigating both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate level by crossing them dynamically, linking the hardware result with the algorithm level, and vice versa, as the application runs.
Existing frameworks investigate AxC and TS techniques at the circuit level (i.e., at the output of the accelerator), agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), which limits the achievable optimization. Unlike the state of the art, the proposed framework offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level that still fulfill the required "good enough" quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating within an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieves savings of up to 45% in energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD yields savings of up to 16.5% in energy/operation with an increase of around 6% in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10-Y aging) in the maximum clock frequency achieved with TS hardware design is entirely offset by a processing overhead of 8.06% to 46.96% when an unreliable algorithm is chosen as the block matching algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also presents approximate DTT (Discrete Tchebichef Transform) hardware proposals that explore transform matrix approximation, truncation, and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency by up to 64%, reduces the circuit area by up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped to FPGA shows an increase of up to 58.9% in the maximum frequency and savings of about 28.7% and 32.2% in slices and dynamic power, respectively, compared with stat
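To make the SAD case study concrete, here is a minimal Python sketch of a sum of absolute differences whose accumulation goes through an approximate adder. The abstract does not say which approximate adders the thesis uses; the lower-part OR adder below is a well-known design swapped in purely for illustration, and the names `loa_add`, `sad`, and `approx_bits` are hypothetical. Inputs are assumed to be non-negative integers.

```python
from functools import partial


def loa_add(a: int, b: int, approx_bits: int = 2) -> int:
    """Lower-part OR adder (illustrative; not necessarily the thesis's design).

    The low `approx_bits` bits are OR-ed instead of added. The upper bits are
    added exactly, with a carry-in taken from the AND of the top approximated bits.
    """
    if approx_bits == 0:
        return a + b  # no approximated bits: plain exact addition
    mask = (1 << approx_bits) - 1
    low = (a | b) & mask
    carry = (a >> (approx_bits - 1)) & (b >> (approx_bits - 1)) & 1
    high = (a >> approx_bits) + (b >> approx_bits) + carry
    return (high << approx_bits) | low


def sad(block_a, block_b, add=lambda x, y: x + y):
    """Sum of absolute differences, accumulated with a pluggable adder."""
    total = 0
    for pa, pb in zip(block_a, block_b):
        total = add(total, abs(pa - pb))
    return total


# Hypothetical pixel values, chosen only to exercise the functions.
current = [10, 20, 30, 40]
candidate = [12, 18, 33, 40]
exact = sad(current, candidate)
approx = sad(current, candidate, add=partial(loa_add, approx_bits=2))
print(exact, approx)
```

Setting `approx_bits=0` recovers the exact adder, so the same `sad` function can serve as the reference when measuring the error an approximation introduces.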

Lume 5.8

Optimizing Dataflow Programs for Hardware Synthesis

Author: Ab Rahman Ab Al Hadi Bin
Publication venue: Lausanne, EPFL
Publication date: 14/01/2014

Infoscience - École polytechnique fédérale de Lausanne

Applications of MATLAB in Science and Engineering

Publication venue: 'IntechOpen'
Publication date: 20/04/2021

The book consists of 24 chapters illustrating a wide range of areas where MATLAB tools are applied. These areas include mathematics, physics, chemistry and chemical engineering, mechanical engineering, biological (molecular biology) and medical sciences, communication and control systems, digital signal, image and video processing, system modeling and simulation. Many interesting problems have been included throughout the book, and its contents will be beneficial for students and professionals in wide areas of interest

Directory of Open Access Books (DOAB)

Low power architectures for streaming applications

Author: He Y.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2013

Repository TU/e

Pure OAI Repository

PIRANHA: an engine for a methodology of detecting covert communication via image-based steganography

Author: Pilson Christopher Shaun
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2005

In current cutting-edge steganalysis research, model-building and machine learning have been used to detect steganography. However, these models are computationally and cognitively cumbersome, and each is targeted specifically at one and only one type of steganography. The model built and used in this thesis has shown capability in detecting a class or family of steganography, while also demonstrating that it is viable to construct a minimalist model for steganalysis. The notion of detecting steganographic primitives or families is one that has not been discussed in the literature, and would serve well as a first-pass steganographic detection methodology. The model built here serves this end well, and it must be kept in mind that the model presented is posited to work as a front-end broad-pass filter for some of the more computationally advanced and directed steganalytic algorithms currently in use. This thesis attempts to convey a view of steganography and steganalysis in a manner more utilitarian and immediately useful to everyday scenarios. This is vastly different from a good many publications that treat the topic as one relegated only to cloak-and-dagger information passing. The subsequent view of steganography as primarily a communications tool usable by petty information brokers and the like directs the text and helps ensure that the notion of steganography as a digital dead-drop box is abandoned in favor of a more grounded approach. As such, the model presented underperforms specialized models that have been presented in current literature, but also makes use of a large image sample space (747 images) as well as images that are contextually diverse and representative of those seen in wide use. In future applications by either law-enforcement or corporate officials, it is hoped that the model presented in this thesis can aid in rapid and targeted responses without causing undue strain upon an eventual human operator.
As such, a design constraint that was utilized for this research favored a False Negative as opposed to a False Positive - this methodology helps to ensure that, in the event of an alert, it is worthwhile to apply a more directed attack against the flagged image

Digital Repository @ Iowa State University (ISU)

Execution platform modeling for system-level architecture performance analysis

Author: Živković V.D.
Publication date: 23/09/2008

Today's embedded systems are designed for more complex and more computationally intensive applications than they were a decade ago. Most if not all embedded systems designed today are a sort of parallel computing system, only called differently: a platform. A platform is essentially a heterogeneous system consisting of communicating processing units of different types and mostly distributed memory units. A platform can be anything from multiprocessors comprising task-dedicated processors and a dedicated communication network, to a (semi-)programmable multiprocessor that can run parallel processes by means of both interleaving and overlapping. However, the specification, exploration and design of application multiprocessor system platforms from user requirements is still a painstaking process that takes too long and is too costly. Our answer to the above-mentioned issues is the Archer approach. It comprises application representations (Symbolic Programs, SP), a platform-based library of the architecture components and their configurations (all-in-hardware, all-in-software, hybrid multiprocessor, with dedicated network, shared-bus, highway, burst-bus, or hybrid network), and a mapping methodology (managing the aforementioned representations while transforming application SPs to Archer architecture components), all of which we have been developing and experimenting with.

Leiden University Scholary Publications

Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

Authors: Porrmann Mario; Romoth Johannes; Rückert Ulrich
Publication date: 01/01/2017

Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000 – 2015 (Technical Report). 2017.
Since their introduction, FPGAs have appeared in more and more fields of application. Their key advantage is the combination of software-like flexibility with the performance otherwise common to hardware. Nevertheless, every application field imposes special requirements on the computational architecture used. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units such as CPUs.

Publications at Bielefeld University

Ordonnancement hybride des applications flots de données sur des systèmes embarqués multi-coeurs

Author: Dkhil Amira
Publication date: 14/04/2015

Embedded systems are increasingly present in industry as well as in everyday life. A large share of these systems run data-intensive applications: they use many digital filters, in which the operations on data are repetitive and control is limited. Dataflow graphs, thanks to their inherent functional determinism, are widely used to model so-called "data-driven" embedded systems. Static periodic scheduling of dataflow graphs has been studied extensively, especially for two particular models: SDF and CSDF. This thesis focuses on the periodic scheduling of CSDF graphs. The problem is to identify infinite periodic sequences of actor firings that lead to complete executions with bounded buffers. The goal is to approach this problem from different angles: throughput maximization, latency minimization, and buffer capacity minimization. Most existing work proposes solutions for throughput optimization and neglects latency optimization, in some cases even proposing schedules that harm latency in order to preserve periodicity properties. This thesis proposes a hybrid schedule, named Self-Timed Periodic (STP), which preserves the properties of a periodic schedule while considerably improving its latency.
One of the most important aspects of parallel computing is its close relation to the underlying hardware and programming models. In this PhD thesis, we take dataflow as the basic model of computation, as it fits the streaming application domain.
Cyclo-Static Dataflow (CSDF) is particularly interesting because this variant is one of the most expressive dataflow models while still being analyzable at design time. Describing the system at higher levels of abstraction is not sufficient, e.g. dataflow models have no direct means to optimize communication channels, which are generally based on shared buffers. Therefore, we need to link the dataflow MoCs used for performance analysis of the programs, the real-time task models used for timing analysis, and the low-level model used to derive communication times. This thesis proposes a design flow that meets these challenges, while enabling features such as temporal isolation and taking into account other challenges such as predictability and ease of validation. To this end, we propose a new scheduling policy called Self-Timed Periodic (STP), which is an execution model combining Self-Timed Scheduling (STS) with periodic scheduling. In STP scheduling, actors are no longer strictly periodic but self-timed and assigned to periodic levels: the period of each actor under periodic scheduling is replaced by its worst-case execution time. STP thus retains some of the performance and flexibility of self-timed scheduling, in which execution times of actors need only be estimates, and at the same time makes use of the fact that with a periodic schedule we can derive a tight estimation of the required performance metrics.
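As a small illustration of what "analyzable at design time" means for dataflow graphs, the sketch below solves the balance equations of a plain SDF graph to obtain its repetition vector, i.e. how many times each actor fires in one periodic iteration. This is a simplification under stated assumptions: it handles SDF rather than CSDF, it is not the STP policy proposed in the thesis, and the function name, edge format, and example graph are hypothetical.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts q with q[src]*prod == q[dst]*cons.

    `edges` holds (src, dst, tokens_produced, tokens_consumed) tuples.
    Raises ValueError if the rates are inconsistent or the graph is not connected.
    """
    rate = {actors[0]: Fraction(1)}  # fix one actor, derive the rest relative to it
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no bounded periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


# Hypothetical three-actor pipeline: A produces 2 tokens per firing and B
# consumes 3; B produces 1 token per firing and C consumes 2.
q = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(q)
```

A graph whose balance equations have no solution cannot run periodically with bounded buffers, which is why this check is typically done before any schedule is constructed.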

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

Authors: Derin Onur; Sami Mariagiovanna
Publication date: 19/10/2015

Technology scaling accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip has been the driving force behind delivering higher performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running in network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote memory access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system level, in particular by exploiting redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on the lifetime reliability.
We propose two recovery schemes based on a checkpoint-and-rollback and a rollforward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that this can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library or a hardware core to be added to the basic architecture.
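The checkpoint-and-rollback idea named above can be sketched generically in Python. This is only an illustration of the general technique, not the recovery scheme of the thesis: the fault model (a fault injected once at a given item index), the checkpoint interval, and all names (`run_with_rollback`, `checkpoint_every`, `fail_at`) are assumptions made for the example.

```python
import copy


def run_with_rollback(items, step, state, checkpoint_every=2, fail_at=()):
    """Process `items` with `step`, restoring the last checkpoint on a fault.

    `fail_at` lists item indices at which a one-shot fault is injected
    (an assumption of this sketch). Returns (final_state, rollback_count).
    """
    checkpoint = (0, copy.deepcopy(state))  # (next item index, saved state)
    pending_faults = set(fail_at)
    rollbacks = 0
    i = 0
    while i < len(items):
        if i in pending_faults:
            pending_faults.discard(i)  # the injected fault occurs only once
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            rollbacks += 1
            continue  # re-execute everything since the checkpoint
        state = step(state, items[i])
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


# Hypothetical workload: a running sum over a short token stream, with a
# fault injected just before the item at index 3 is processed.
final_state, rollbacks = run_with_rollback(
    [1, 2, 3, 4, 5], lambda s, x: s + x, 0, checkpoint_every=2, fail_at={3}
)
print(final_state, rollbacks)
```

The checkpoint interval sets the usual trade-off: frequent checkpoints cost more during fault-free execution but shorten the work that must be repeated after a fault.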

RERO DOC Digital Library

Effective network grid synthesis and optimization for high performance very large scale integration system design

Author: Yang Yun
Publication date: 01/02/2008

System: new; Ministry of Education report number: 甲2642号; Degree type: Doctor of Engineering; Date conferred: 2008/3/15; Waseda University diploma number: 新480

Waseda University Repository