55 research outputs found

High-level synthesis for reduction of WCET in real-time systems

Author: Kristensen Andreas Toftegaard
Pezzarossa Luca
Sparsø Jens
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2017

Power and Energy Aware Heterogeneous Computing Platform

Author: Nouri Sajjad
Publication venue: Tampere University of Technology
Publication date: 01/01/2018

During the last decade, wireless technologies have experienced significant development, most notably in the evolution of mobile cellular radio from GSM to UMTS/HSPA and then to Long-Term Evolution (LTE), which increased the capacity and speed of wireless data networks. Given the real-time constraints of the new wireless standards and their demands for parallel processing, reconfigurable architectures, and multicore platforms in particular, are among the most successful platforms because they provide high computational parallelism and throughput. In addition, with the move toward the Internet of Things (IoT), the number of wireless sensors and IP-based high-throughput network routers is growing at a rapid pace. Despite all the progress in IoT, a single-chip platform that supports multiple communication standards and a large processing bandwidth is still missing because of power and energy consumption. The strong demand for embedded systems to perform different sets of operations with higher computational performance has led to the use of heterogeneous multicore architectures, in which accelerators act as coprocessors for computationally intensive data-parallel tasks. Currently, highly heterogeneous systems are the most power- and area-efficient solution for complex signal processing. Additionally, the importance of IoT has significantly increased the need for heterogeneous and reconfigurable platforms. On the other hand, following the breakdown of Dennard scaling and because of the enormous heat dissipation, the performance of a single chip has been obstructed by the utilization wall, since not all cores can be clocked at their maximum operating frequency without risking a thermal meltdown caused by high instantaneous power dissipation. In this context, the large fraction of the chip that is switched off (Dark) or operated at a very low frequency (Dim) is called Dark Silicon. 
The Dark Silicon issue constrains the performance of computers, especially as the upcoming IoT scenario will demand very high performance with high energy efficiency. Among the solutions suggested to combat the Dark Silicon problem, the use of application-specific accelerators, and Coarse-Grained Reconfigurable Arrays (CGRAs) in particular, is the main motivation of this thesis work. This thesis deals with the design and implementation of Software Defined Radio (SDR) as well as High Efficiency Video Coding (HEVC) application-specific accelerators for computationally intensive kernels and data-parallel tasks. One of the most important data transmission schemes in SDR, owing to its ability to provide high data rates, is Orthogonal Frequency Division Multiplexing (OFDM). This research work focuses on the evaluation of the Heterogeneous Accelerator-Rich Platform (HARP) by implementing OFDM receiver blocks as proof-of-concept designs. The HARP template allows the designer to instantiate a heterogeneous reconfigurable platform with a very large amount of custom-tailored computational resources while delivering high performance in terms of many high-level metrics. The availability of this platform lays an excellent foundation for investigating techniques and methods to replace the Dark or Dim part of the chip with high-performance silicon that dissipates very low power and energy. Furthermore, this research work also addresses the power and energy issues of embedded computing systems by tailoring HARP for self-aware and energy-aware computing models. In this context, the instantaneous power dissipation, and therefore the heat dissipation, of HARP is mitigated on FPGA/ASIC by using Dynamic Voltage and Frequency Scaling (DVFS) to minimize the dark/dim part of the chip. 
The upgraded HARP for self-aware and energy-aware computing can be used as an energy-efficient general-purpose transceiver platform that is cognitive to many radio standards and can provide high throughput while consuming as little energy as possible. The evaluation of HARP has shown promising results, which make it a suitable platform for avoiding Dark Silicon in embedded computing platforms and for the diverse needs of IoT communications. In this thesis, the author designed the blocks of an OFDM receiver by crafting template-based CGRA devices and then attached them to HARP’s Network-on-Chip (NoC) nodes. The performance of the application-specific accelerators generated from template-based CGRAs, the performance of the entire platform after integrating the CGRA nodes on HARP, and the NoC traffic are recorded in terms of several high-level performance metrics. When evaluated on an FPGA prototype, HARP delivers a performance of 0.012 GOPS/mW. Because of the scalability and regularity of HARP, the author considered this value an architectural constant. In addition to showing the gain and benefits of maximizing the number of reconfigurable processing resources on a platform in comparison with the scaled performance of several state-of-the-art platforms, HARP’s architectural constant provides an application-independent figure of merit. HARP is further evaluated by implementing various sizes of the Discrete Cosine Transform (DCT) and Discrete Sine Transform (DST) dedicated to the HEVC standard, which showed its ability to sustain the Full HD 1080p format at 30 fps on FPGA. The author also integrated a self-aware computing model in HARP to mitigate the power dissipation of an OFDM receiver. In the FPGA implementation, the total power dissipation of the platform was reduced by 16.8% by employing the Feedback Control System (FCS) technique with Dynamic Frequency Scaling (DFS). 
Furthermore, by moving to ASIC technology and scaling both frequency and voltage simultaneously, a significant dynamic power reduction (up to 82.98%) was achieved, which showed the DFS/DVFS techniques to be a step forward in mitigating the Dark Silicon issue.
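
The difference between frequency-only scaling (DFS) and joint voltage and frequency scaling (DVFS) mentioned in this abstract can be illustrated with the textbook first-order CMOS model, in which dynamic power is proportional to V² · f. The sketch below is not the power model used in the thesis; the function name and the scaling factors are hypothetical and chosen only for illustration.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to nominal under the first-order model P ~ V^2 * f.

    Assumes switched capacitance and activity factor stay constant; static
    (leakage) power is ignored.
    """
    return voltage_scale ** 2 * frequency_scale


# Hypothetical operating points, not measurements from the thesis.
dfs_only = relative_dynamic_power(1.0, 0.5)  # halve frequency, keep voltage
dvfs = relative_dynamic_power(0.5, 0.5)      # halve both voltage and frequency

print(f"DFS only: {(1 - dfs_only) * 100:.1f}% dynamic power reduction")
print(f"DVFS:     {(1 - dvfs) * 100:.1f}% dynamic power reduction")
```

Under this simplified model, lowering the voltage together with the frequency pays off quadratically, which is consistent in direction with the larger reduction the abstract reports for the ASIC case.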

Trepo - Institutional Repository of Tampere University

Architecture of a processor for computing the discrete cosine transform for lossless-to-lossy image compression systems

Author: V. V. Kliuchenia
В. В. Ключеня
Publication venue: Belarusian State University of Informatics and Radioelectronics
Publication date: 26/08/2021

The hardware implementations of fixed-point discrete cosine transform (DCT) blocks, known as IntDCT [1] and BinDCT [2], raise several issues that need to be resolved. One of the main issues is the choice between implementing the transform on an FPGA or on a digital signal processor (DSP). Each implementation has its own pros and cons. One of the most important advantages of the DSP implementation is the availability of special DSP instructions, in particular the ability to multiply two numbers in one clock cycle. Therefore, with the advent of DSPs, the limitation on the number of multiplications in algorithms was removed. On the other hand, when implementing a block on an FPGA, we need not limit ourselves in the bit width of the data (within reasonable limits), and we can process all incoming data in parallel and implement specialized computing cores for various tasks. In fact, designing multimedia systems on FPGAs resembles designing similar systems from small- and medium-scale integration logic. Such an implementation has the same limitations: a relatively small amount of available memory, the need to design basic building blocks (multipliers, dividers), and so on. It is the unequal cost of addition and multiplication on FPGAs that prompted the search for DCT algorithms with the smallest number of multipliers. However, even this is not enough, since the structure of a multiplier is many times more complex than that of an adder, which made it necessary to look for ways to perform the transform without any multiplications at all. 
This article shows how, on the basis of the integer forward and inverse DCT and distributed arithmetic, to create a new universal architecture for a decorrelating transform on FPGAs without multiplication operations, intended for image transform coding systems that operate on the lossless-to-lossy (L2L) principle, and how to obtain the best experimental results in terms of hardware resources compared with similar compression systems.
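
Distributed arithmetic, which this abstract names as the basis of its multiplier-free design, replaces each multiplication by a fixed coefficient with table lookups, shifts, and additions. The following is a minimal software sketch of the general technique, not the architecture from the article; the function names, the coefficient values, and the bit width are assumptions made for illustration.

```python
from typing import List


def build_da_lut(coeffs: List[int]) -> List[int]:
    """Precompute every partial sum of the fixed coefficients.

    Entry `addr` holds the sum of coeffs[k] for each bit k set in `addr`.
    """
    lut = []
    for addr in range(1 << len(coeffs)):
        lut.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return lut


def da_dot(coeffs: List[int], xs: List[int], bits: int) -> int:
    """Inner product sum(coeffs[k] * xs[k]) using only lookups, shifts, adds.

    Each x must fit in `bits`-bit two's complement.
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


# Hypothetical coefficients and 8-bit samples, for illustration only.
coeffs = [64, 83, 64, 36]
samples = [10, -3, 7, -128]
print(da_dot(coeffs, samples, bits=8))
```

In hardware the table would typically be a small ROM and the loop a bit-serial shift-accumulate stage, so the cost grows with the input word length rather than with the number of multipliers.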

Доклады БГУИР

Design and Implementation of IDCT/IDST-Specific Accelerators for HEVC Standard on Heterogeneous Accelerator-Rich Platform

Author: Pourabed Mohammad Ali
Publication date: 08/05/2019

High Efficiency Video Coding (HEVC) is important for image processing, reducing bandwidth, and increasing video quality. There are different methods that can be used to implement HEVC. This thesis focuses on the design and implementation of application-specific accelerators for the IDCT/IDST algorithms of the HEVC standard. These algorithms are inherently parallel tasks, which makes them suitable for execution on heterogeneous multicore platforms. This is done using accelerators, which are required for power-efficient processing. In this study, Coarse-Grained Reconfigurable Arrays (CGRAs) are used as the template for an accelerator. The CGRA plays a major role in a Heterogeneous Accelerator-Rich Platform (HARP), as it is capable of accelerating non-parallel loops with lower loop counts. This thesis covers various IDCT and IDST algorithms with different designs and templates, arriving at a unique final architecture. The intended final output is a 4-point IDST together with a 4/8-point IDCT. Another feature of the study is the use of different dimensions for the CGRA template in order to obtain different types of accelerator. The CGRAs are combined in successive arrangement with Reduced Instruction Set Computers (RISC) over the Network-on-Chip (NoC). The aim is to study the performance of the accelerators used for the IDCT and the IDST. This is evaluated through the data movement across the NoC, together with the accelerator performance in clock cycles, in order to calculate the efficiency of the system. The results show that a 4-point IDST and IDCT can be computed in 56 clock cycles. In addition, the 8-point IDCT can be implemented in 64 cycles. One important factor considered during the study is power and energy consumption. The dynamic power dissipation for the routing of data reached a value of 4.03 mW. 
Whereas, the energy consumption was 1.76 μJ for the 4 points system (IDCT and IDST) and 3.06 μJ for the 8 points (IDCT). Processing Elements (PEs) are used for implementing the transform algorithm and units were operated at 200 MHz. Finally, these results show that 1080P image at 30 frames per second can be attained by using FPGA
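
As a software reference for the 4-point transforms this abstract mentions, the sketch below applies a forward and an inverse 1-D transform using the 4-point integer DCT and DST matrices as they are commonly tabulated for HEVC. The matrix values are an assumption to be checked against the standard, the floating-point division stands in for the standard's integer shift-and-round stages, and none of this reflects how the thesis maps the computation onto CGRA processing elements.

```python
# 4-point integer core transform matrices as commonly tabulated for HEVC
# (assumption: verify against the standard before relying on them).
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]
DST4 = [
    [29, 55, 74, 84],
    [74, 74, 0, -74],
    [84, -29, -74, 55],
    [55, -84, 74, -29],
]


def forward(matrix, samples):
    """Forward 1-D transform: coeffs[k] = sum_n matrix[k][n] * samples[n]."""
    return [sum(m * s for m, s in zip(row, samples)) for row in matrix]


def inverse(matrix, coeffs):
    """Inverse 1-D transform using the transposed matrix.

    The rows have squared norm close to 2**14, so dividing by 2**14 undoes
    the forward gain approximately. The standard uses integer shifts with
    rounding instead of this floating-point division.
    """
    n = len(matrix)
    return [
        sum(matrix[k][i] * coeffs[k] for k in range(n)) / 2 ** 14
        for i in range(n)
    ]


# Hypothetical residual samples, for illustration only.
residual = [10, -4, 7, 3]
for name, matrix in (("IDCT", DCT4), ("IDST", DST4)):
    restored = inverse(matrix, forward(matrix, residual))
    print(name, [round(v) for v in restored])
```

Because the integer matrices are only approximately orthogonal, the round trip recovers the input after rounding rather than exactly.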

Trepo - Institutional Repository of Tampere University

Learning-based system-level power modeling of hardware IPs

Author: Lee Dongwook
Publication venue
Publication date: 18/12/2017

Accurate power models for hardware components at high levels of abstraction are a critical component to enable system-level power analysis and optimization. Virtual platform prototypes are widely utilized to support early system-level design space exploration. There is, however, a lack of accurate and fast power models of hardware components at such high-levels of abstraction. In this dissertation, we present novel learning‑based approaches for extending fast functional simulation models of white-, gray-, and black-box custom hardware intellectual property components (IPs) with accurate power estimates. Depending on the observability, we extend high-level functional models with the capability to capture data-dependent resource, block, or I/O activity without a significant loss in simulation speed. We further leverage state-of-the-art machine learning techniques to synthesize abstract power models that can predict cycle-, block-, and invocation-level power from low-level hardware implementations, where we introduce novel structural decomposition techniques to reduce model complexities and increase estimation accuracy. Our white-box approach integrates with existing high-level synthesis (HLS) tools to automatically extract resource mapping information, which is used to trace data-dependent resource-level activity and drive a cycle-accurate online power-performance model during functional simulation. Our gray-box approach supports power estimation at coarser basic block granularity. It uses only limited information about block inputs and outputs to extract light-weight block-level activity from a functional simulation and drive a basic block-level power model that utilizes a control flow decomposition to improve accuracy and speed. It is faster than cycle-level models, while providing a finer granularity than invocation-level models, which allows to further navigate accuracy and speed trade-offs. 
We finally propose a novel approach for extending behavioral models of black-box hardware IPs with an invocation-level power estimate. Our black-box model uses only input and output history to track data-dependent pipeline behavior, where we introduce a specialized ensemble learning approach composed of individually selected cycle-by-cycle models with reduced complexity and increased accuracy. The proposed approaches are fully automated by integrating with existing commercial HLS tools for custom hardware synthesized by HLS. Results of applying our approaches to various industrial-strength design examples show that our power models can predict cycle-, basic block-, and invocation-level power consumption to within 10%, 9%, and 3% of a commercial gate-level power estimation tool, respectively, all while running at several orders of magnitude faster speeds of 1-10 Mcycles/sec. Electrical and Computer Engineerin

Texas ScholarWorks

An FPGA Based Hardware Accelerator for Remote Surveillance Cameras

Author: Kane Alexander John Petre
Publication venue: Victoria University of Wellington Library
Publication date: 01/01/2013

The Blackeye II camera, produced by Kinopta, is used for remote security, conservation and traffic flow surveillance. The camera uses an image sensor to acquire photographs which undergo image processing and JPEG encoding on a microprocessor. Although the microprocessor performs other tasks, it is the processing and encoding of images that limit the frame rate of the camera to 2 frames per second (fps). Clients have requested an increase to 12.5 fps while adding more image processing to each photograph. The current microprocessor-based system is unable to achieve this. Custom digital logic systems perform well on processes that naturally form a pipeline, such as the Blackeye II image processing system. This project develops a digital logic system based on an FPGA to receive images from the image sensor, perform the required image processing operations, encode the images in JPEG format and send them on to the microprocessor. The objective is to implement a proof of concept device based upon the Blackeye II’s existing hardware and an FPGA development board. It will implement the proposed pipeline including one example of an image processing operation. A JPEG encoder is designed to process the 752 × 480 greyscale photographs from the image processor in real time. The JPEG encoder consists of four stages: discrete cosine transform (DCT), quantisation, zig-zag buffer and Huffman encoder. The DCT design is based upon the work of Woods et al. [1], which is improved on. An analysis of the relationship between precision and accuracy in the DCT and quantisation stages is used to minimise the system’s resource requirements. The JPEG encoder is successfully tested in simulation. Input and output stages are added to the design. The input stage receives data from the image sensor and removes breaks in the data stream. 
The output stage must concatenate the data from the JPEG encoder and transmit it to the microprocessor via the microprocessor’s ISI (image sensor interface) peripheral. An image sharpening filter is developed and inserted into the pipeline between the input and JPEG encoder. Because remote surveillance cameras are battery powered, the minimisation of power consumption is a key concern. To minimise power consumption a mechanism is introduced to track those modules in the pipeline that are in use at any time. Any not in use are paused by gating the module’s clock source. Once the system is complete and tested in simulation it is loaded into hardware. The FPGA development board is attached to the image sensor board and microprocessor board of the Blackeye II camera by a purpose-built breakout board. Plugging the microprocessor board into a PC provides a live stream of images proving the successful operation of the FPGA system. The project objectives were exceeded by increasing the frame rate of the Blackeye II to 20 fps, which will not decrease with additional image processing operations. The project was viewed as a success by Kinopta, who have committed to its further development
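
The zig-zag stage of the JPEG encoder described in this abstract reorders each 8 × 8 block of quantised coefficients so that low-frequency values come first. The thesis implements this as a hardware buffer; the sketch below only illustrates the conventional JPEG scan order in software, and the function names are hypothetical.

```python
def zigzag_order(n: int = 8):
    """Return (row, col) pairs of an n x n block in zig-zag scan order."""
    def key(rc):
        r, c = rc
        d = r + c  # index of the anti-diagonal
        # Odd diagonals run top-right to bottom-left (row ascending),
        # even diagonals run the other way (row descending).
        return (d, r if d % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block (list of rows) into zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Print the first few positions of the 8 x 8 scan.
print(zigzag_order(8)[:6])
```

In a hardware pipeline the same ordering would normally be a fixed address table, since it never changes between blocks.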

Victoria University of Wellington

ResearchArchive at Victoria University of Wellington

Dynamically Reconfigurable Architectures and Systems for Time-varying Image Constraints (DRASTIC) for Image and Video Compression

Author: Jiang Yuebing
Publication venue: UNM Digital Repository
Publication date: 12/07/2014

In the current era of booming information, image and video consumption is ubiquitous. The associated image and video coding operations require significant computing resources, both on small-scale computing systems and over larger network systems. In different scenarios, power, bitrate, and image quality can impose significant time-varying constraints. For example, mobile devices (e.g., phones, tablets, laptops, UAVs) come with significant constraints on energy and power. Similarly, computer networks provide time-varying bandwidth that can depend on signal strength (e.g., wireless networks) or network traffic conditions. Alternatively, users can impose different constraints on image quality based on their interests. Traditional image and video coding systems have focused on rate-distortion optimization. More recently, distortion measures (e.g., PSNR) are being replaced by more sophisticated image quality metrics. However, these systems are based on fixed hardware configurations that provide limited options over power consumption. The use of dynamic partial reconfiguration with Field Programmable Gate Arrays (FPGAs) provides an opportunity to effectively control dynamic power consumption by jointly considering software-hardware configurations. This dissertation extends traditional rate-distortion optimization to rate-quality-power/energy optimization and demonstrates a wide variety of applications in both image and video compression. In each application, a family of Pareto-optimal configurations is developed that allows fine control in the rate-quality-power/energy optimization space. The term Dynamically Reconfigurable Architectures and Systems for Time-varying Image Constraints (DRASTIC) is used to describe the derived systems. 
DRASTIC covers both software-only as well as software-hardware configurations to achieve fine optimization over a set of general modes that include: (i) maximum image quality, (ii) minimum dynamic power/energy, (iii) minimum bitrate, and (iv) typical mode over a set of opposing constraints to guarantee satisfactory performance. In joint software-hardware configurations, DRASTIC provides an effective approach for dynamic power optimization. For software configurations, DRASTIC provides an effective method for energy consumption optimization by controlling processing times. The dissertation provides several applications. First, stochastic methods are given for computing quantization tables that are optimal in the rate-quality space and demonstrated on standard JPEG compression. Second, a DRASTIC implementation of the DCT is used to demonstrate the effectiveness of the approach on motion JPEG. Third, a reconfigurable deblocking filter system is investigated for use in the current H.264/AVC systems. Fourth, the dissertation develops DRASTIC for all 35 intra-prediction modes as well as intra-encoding for the emerging High Efficiency Video Coding standard (HEVC)
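
The notion of a family of Pareto-optimal configurations in the rate-quality-power space, used throughout this abstract, can be made concrete with a small dominance filter. This is a generic sketch rather than the dissertation's method; the `Config` type, its field names, and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    bitrate: float  # lower is better
    quality: float  # higher is better
    power: float    # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = a.bitrate <= b.bitrate and a.quality >= b.quality and a.power <= b.power
    better = a.bitrate < b.bitrate or a.quality > b.quality or a.power < b.power
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical numbers, for illustration only.
candidates = [
    Config("A", bitrate=1.0, quality=38.0, power=120.0),
    Config("B", bitrate=1.2, quality=36.0, power=150.0),  # dominated by A
    Config("C", bitrate=0.8, quality=34.0, power=90.0),
    Config("D", bitrate=1.5, quality=41.0, power=200.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

A mode such as minimum power or maximum quality then amounts to picking one point from the surviving front that also satisfies the current constraints.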

Many-core approach to 2D-DCT calculation using an FPGA

Author: Mália Wilson Alexandre Borges
Publication venue: Instituto Superior de Engenharia de Lisboa
Publication date: 01/12/2014

Final Master's project submitted for the degree of Master in Electronics and Telecommunications Engineering. Abstract: Nowadays the need for computing capacity is growing exponentially, requiring embedded systems to evolve constantly and offer new solutions. Due to technological limitations, the single core was unavoidably overtaken by multi-core alternatives. Although platforms such as the Field-Programmable Gate Array (FPGA) provide great opportunities, mathematical algorithms are still often implemented as dedicated single-core solutions. This thesis introduces an embedded many-core architecture for computing the two-dimensional Discrete Cosine Transform (2D-DCT), with the goal of offering a viable alternative to current implementations. During this work it was necessary to develop a Network-on-a-Chip (NoC), which creates the communication infrastructure responsible for connecting the dedicated cores. By analysing the 2D-DCT, it was possible to implement a module that is flexible enough to exploit the parallelism of the algorithm. Each dedicated core is capable of calculating individual DCT coefficients, meaning that the many-core architecture can be scaled to obtain different configurations that vary in performance and resource consumption.
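
The statement that each dedicated core computes individual DCT coefficients relies on the fact that every output (u, v) of the 2D-DCT depends only on the input block, not on other outputs. The sketch below shows this with the standard orthonormal 2D DCT-II formula in floating point; it is not the thesis's fixed-point hardware module, and the function names are hypothetical.

```python
import math


def dct2_coefficient(block, u: int, v: int) -> float:
    """Compute one coefficient (u, v) of the orthonormal 2D DCT-II of an N x N block."""
    n = len(block)
    cu = math.sqrt(0.5) if u == 0 else 1.0
    cv = math.sqrt(0.5) if v == 0 else 1.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (
                block[i][j]
                * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                * math.cos((2 * j + 1) * v * math.pi / (2 * n))
            )
    return (2.0 / n) * cu * cv * total


def dct2(block):
    """Assemble the full transform.

    Each (u, v) is independent of the others, so the calls could be
    distributed across cores.
    """
    n = len(block)
    return [[dct2_coefficient(block, u, v) for v in range(n)] for u in range(n)]


# A flat 8 x 8 block has all of its energy in the DC coefficient.
flat = [[1.0] * 8 for _ in range(8)]
print(round(dct2_coefficient(flat, 0, 0), 6))
```

Assigning more or fewer coefficients to each core is one way to picture the trade-off between performance and resource use that the abstract describes.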

Repositório Científico do Instituto Politécnico de Lisboa