11 research outputs found

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    Full text link
    [EN] The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668.Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011S1541706

    Complexity Analysis Of Next-Generation VVC Encoding and Decoding

    Full text link
    While the next generation video compression standard, Versatile Video Coding (VVC), provides a superior compression efficiency, its computational complexity dramatically increases. This paper thoroughly analyzes this complexity for both encoder and decoder of VVC Test Model 6, by quantifying the complexity break-down for each coding tool and measuring the complexity and memory requirements for VVC encoding/decoding. These extensive analyses are performed for six video sequences of 720p, 1080p, and 2160p, under Low-Delay (LD), Random-Access (RA), and All-Intra (AI) conditions (a total of 320 encoding/decoding). Results indicate that the VVC encoder and decoder are 5x and 1.5x more complex compared to HEVC in LD, and 31x and 1.8x in AI, respectively. Detailed analysis of coding tools reveals that in LD on average, motion estimation tools with 53%, transformation and quantization with 22%, and entropy coding with 7% dominate the encoding complexity. In decoding, loop filters with 30%, motion compensation with 20%, and entropy decoding with 16%, are the most complex modules. Moreover, the required memory bandwidth for VVC encoding/decoding are measured through memory profiling, which are 30x and 3x of HEVC. The reported results and insights are a guide for future research and implementations of energy-efficient VVC encoder/decoder.Comment: IEEE ICIP 202

    Towards Computational Efficiency of Next Generation Multimedia Systems

    Get PDF
    To address throughput demands of complex applications (like Multimedia), a next-generation system designer needs to co-design and co-optimize the hardware and software layers. Hardware/software knobs must be tuned in synergy to increase the throughput efficiency. This thesis provides such algorithmic and architectural solutions, while considering the new technology challenges (power-cap and memory aging). The goal is to maximize the throughput efficiency, under timing- and hardware-constraints

    Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices

    Full text link
    A recent trend in DNN development is to extend the reach of deep learning applications to platforms that are more resource and energy constrained, e.g., mobile devices. These endeavors aim to reduce the DNN model size and improve the hardware processing efficiency, and have resulted in DNNs that are much more compact in their structures and/or have high data sparsity. These compact or sparse models are different from the traditional large ones in that there is much more variation in their layer shapes and sizes, and often require specialized hardware to exploit sparsity for performance improvement. Thus, many DNN accelerators designed for large DNNs do not perform well on these models. In this work, we present Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs. To deal with the widely varying layer shapes and sizes, it introduces a highly flexible on-chip network, called hierarchical mesh, that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources. Furthermore, Eyeriss v2 can process sparse data directly in the compressed domain for both weights and activations, and therefore is able to improve both processing speed and energy efficiency with sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than the original Eyeriss running MobileNet. We also present an analysis methodology called Eyexam that provides a systematic way of understanding the performance limits for DNN processors as a function of specific characteristics of the DNN model and accelerator design; it applies these characteristics as sequential steps to increasingly tighten the bound on the performance limits.Comment: accepted for publication in IEEE Journal on Emerging and Selected Topics in Circuits and Systems. This extended version on arXiv also includes Eyexam in the appendi

    Gem5-X: A Gem5-Based System Level Simulation Framework to Optimize Many-Core Platforms

    Get PDF
    The rapid expansion of online-based services requires novel energy and performance efficient architectures to meet power and latency constraints. Fast architectural exploration has become a key enabler in the proposal of architectural innovation. In this paper, we present gem5-X, a gem5-based system level simulation framework, and a methodology to optimize many-core systems for performance and power. As real-life case studies of many-core server workloads, we use real-time video transcoding and image classification using convolutional neural networks (CNNs). Gem5-X allows us to identify bottlenecks and evaluate the potential benefits of architectural extensions such as in-cache computing and 3D stacked High Bandwidth Memory. For real-time video transcoding, we achieve 15% speed-up using in-order cores with in-cache computing when compared to a baseline in-order system and 76% energy savings when compared to an Out-of-Order system. When using HBM, we further accelerate real-time transcoding and CNNs by up to 7% and 8% respectively

    Energy-Efficient Digital Signal Processing Hardware Design.

    Full text link
    As CMOS technology has developed considerably in the last few decades, many SoCs have been implemented across different application areas due to reduced area and power consumption. Digital signal processing (DSP) algorithms are frequently employed in these systems to achieve more accurate operation or faster computation. However, CMOS technology scaling started to slow down recently and relatively large systems consume too much power to rely only on the scaling effect while system power budget such as battery capacity improves slowly. In addition, there exist increasing needs for miniaturized computing systems including sensor nodes that can accomplish similar operations with significantly smaller power budget. Voltage scaling is one of the most promising power saving techniques due to quadratic switching power reduction effect, making it necessary feature for even high-end processors. However, in order to achieve maximum possible energy efficiency, systems should operate in near or sub-threshold regimes where leakage takes significant portion of power. In this dissertation, a few key energy-aware design approaches are described. Considering prominent leakage and larger PVT variability in low operating voltages, multi-level energy saving techniques to be described are applied to key building blocks in DSP applications: architecture study, algorithm-architecture co-optimization, and robust yet low-power memory design. Finally, described approaches are applied to design examples including a visual navigation accelerator, ultra-low power biomedical SoC and face detection/recognition processor, resulting in 2~100 times power savings than state-of-the-art.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/110496/1/djeon_1.pd

    Image Processing Using FPGAs

    Get PDF
    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs

    Projeto e gerenciamento de arquitetura de memória energeticamente eficiente para codificadores de vídeo HEVC

    No full text
    This Thesis presents the design of an energy-efficient hybrid scratchpad video memory architecture (called Hy-SVM) for parallel High-Efficiency Video Coding. Video coding stands out as a high complex part in the video processing applications. HEVC standard brought innovations that increase the memory requirements, mainly due to: (a) the novel coding structures, which aggravates the computational complexity by providing a wider range of possibilities to be analyzed; and (b) the high-level parallelism features provided by the Tiles partitioning, which provides performance acceleration, but, at the same time, strongly adds hard challenges to the memory infrastructure. The main bottleneck in terms of external memory transmission and on-chip storage is the reference frames data: which consists of already coded (and reconstructed) entire frames that must be stored and intensively accessed during the encoding process of future frames. Due to the large volume of data required to represent the reference frames, they are typically stored in the external memory (especially when highdefinition videos are targeted). The proposed Hy-SVM architecture is inserted in a video coding system, which is based on multiple Tiles partitioning to enable parallel HEVC encoding: each Tile is assigned to a specific processing unit. The key ideas of Hy-SVM include: applicationspecific design and management; combined multiple levels of private and shared memories that jointly exploit intra-Tile and inter-Tiles data reuse; scratchpad memories (SPMs) as energyefficient on-chip data storage; combined SRAM and STT-RAM hybrid memory (HyM) design We propose a design methodology for Hy-SVM that leverages application-specific properties to properly define the HyMs parameters. In order to provide run-time adaptation (for both offand on-chip parts), Hy-SVM integrates a memory management layer composed of: (1) overlap prediction, which has the goal of identifying the redundant memory access behavior by analyzing monitored past frames encoding to increase inter-Tiles data reuse exploitation; (2) memory pressure management, which aims on balancing the Tiles-accumulated memory pressure targeting on improving external memory communication channel usage; and (3) lifetime-aware data management scheme that alleviates STT-RAM SPMs of high bit-toggling write accesses to increase the their cells lifetime, as well as to reduce overhead issues related to poor write characteristics of STT-RAM. Application-specific knowledge was exploited by inheriting HEVC properties and performing run-time monitoring of memory accesses. Such information is used to properly design the on-chip video memories, as well as being utilized as input parameters of the run-time memory management layer. Based on the run-time decisions from the proposed Hy-SVM management strategies, Hy-SVM integrates distributed memory access management units (MAMUs) to control the access dynamics of private and shared SPMs. Additionally, adaptive power management units (APMUs) are able to strongly reduce on-chip energy consumption due to an accurate overlap prediction The experimental results demonstrate Hy-SVM overall energy savings over related works under various HEVC encoding scenarios. Compared to traditional data reuse schemes, like Level-C, the combined intra-Tile and inter-Tiles data reuse provides 69%-79% of energy reduction. Regarding related HEVC video memory architectures, the savings varied from 2.8% (worst case) to 67% (best case). From the external memory perspective, Hy-SVM can improve data reuse (by also exploiting inter-Tiles data redundancy), resulting on 11%-71%% of reduced off-chip energy consumption. Additionally, our APMUs contribute by reducing on-chip energy consumption of Hy-SVM by 56%-95%, for the evaluated HEVC scenarios. Thus, compared to related works, Hy-SVM presents the lowest on-chip energy consumption. The memory pressure management scheme can reduce the variations in the memory bandwidth by 37%-83% when compared to the traditional raster scan processing for 4- and 16-core parallelized HEVC encoder. The lifetime-aware data management significantly extends the STT-RAM lifetime, achieving 0.83 of normalized lifetime (near to the optimal case). Moreover, the overhead of implementing our management units insignificantly affects the performance and energyefficiency of Hy-SVM.Esta tese de doutorado apresenta o projeto de uma arquitetura de memória híbrida energeticamente eficiente baseada em memórias do tipo scratchpad (Hy-SVM) para a codificação paralela de vídeos segundo o padrão HEVC. A codificação de vídeo se destaca como uma parte extremamente complexa nas aplicações de processamento de vídeo. O padrão HEVC traz inovações que complicam fortemente os requerimentos de memória de tais aplicações, principalmente devido a: (a) novas estruturas de codificação, as quais agravam a complexidade computacional por proporcionarem muitas modos possíveis de codificação que devem ser analisados; além do (b) suporte de alto nível à paralelização da codificação por meio do particionamento das unidades de codificação em múltiplos Tiles, o qual provê a aceleração da performance dos codificadores, porém, ao mesmo tempo, adiciona grandes desafios à infraestrutura de memória. O principal gargalo em termos de comunicação com a memória externa e de armazenamento interno (dentro do chip do codificador) é dados pelas informações dos quadros de referência: que consiste em uma série de quadros completos já codificados (e reconstruídos) que devem ser mantidos em memória e acessados de forma intensa durante o processamento dos quadros futuros. Devido ao grande volume de dados que são necessários para representar os quadros de referência, estes são tipicamente armazenados na memória externa dos codificadores (principalmente quando vídeos de alta e ultra alta resolução são processados) A arquitetura proposta Hy-SVM está inserida em um sistema de codificação baseado no particionamento dos quadros do vídeo de entrada em múltiplos Tiles, de forma a habilitar a codificação paralela das informações segundo o padrão HEVC: neste cenário, cada Tile é assinalado para uma específica unidade de processamento do codificador HEVC, o qual executa o processamento dos diferentes Tiles em paralelo. A ideias chave da arquitetura Hy- SVM incluem: projeto e gerenciamento de memórias para a aplicação específica de codificação de vídeo; uso de múltiplos níveis de memórias privadas e compartilhadas, com o objetivo de explorar o reuso de dados intra-Tile e inter-Tiles de forma combinada; uso de memórias do tipo scratchpad (SPMs) para o armazenamento interno da informações de forma eficiente em termos de consumo de energia; projeto de memórias híbridas utilizando as tecnologias SRAM e STTRAM como base. Uma metodologia de projeto é proposta para a arquitetura Hy-SVM, a qual aproveita propriedades específicas da aplicação para, de forma adequada, definir os parâmetros de projeto das memórias híbridas. De forma a prover adaptação em tempo de execução (para ambas as memórias on-chip e off-chip), a arquitetura Hy-SVM integra uma camada de gerenciamento composta pelas seguintes estratégias (1) predição do overlap (sobreposição de acessos), o qual busca identificar o comportamento dos acessos redundantes entre diferentes unidades de processamento do codificador HEVC a partir da análise dos acessos à memória das codificações dos quadros passados do vídeo, com o objetivo de aumentar o potencial de exploração do reuso de dados inter-Tiles; (2) gerenciamento dos acessos à memória externa, responsável por balancear a vazão de dados com a memória acumulada entre as múltiplas unidades de processamento do codificador HEVC paralelo, com o objetivo de melhorar o uso do barramento de comunicação com a memória externa; e (3) gerenciamento de dados das SPMs implementadas a partir de células de memória STT-RAM, o qual alivia estas células de acessos de escrita com alta atividade de chaveamento dos bits armazenados, com o objetivo de aumentar o tempo de vide destas células, bem como reduzir as penalidades relativas à ineficiência dos acessos de escrita nas memórias STT-RAM. O conhecimento específico da aplicação foi utilizado nas estratégias de gerenciamento em tempo de execução das seguintes formas: explorando parâmetros da codificação HEVC e realizando monitorando em tempo real dos acessos à memória realizados pelo codificador Estas informações são utilizadas tanto pelas técnicas de gerenciamento, quanto pelas metodologias de projeto das memórias. Baseadas nas decisões tomadas pela camada de gerenciamento, a arquitetura Hy-SVM integra unidades de gerenciamento de acessos à memória (memory access management units – MAMUs) para controlar as dinâmicas de acesso das memórias SPM privadas e compartilhadas. Além disso, unidades adaptativas de gerenciamento de potência (adaptive power management units – APMUs) são capazes de reduzir o consumo de energia interno do chip do codificador a partir das estimativas precisas de formação dos overlaps. Os resultados obtidos por meio dos experimentos realizados demonstram economias de consumo energético da arquitetura Hy-SVM, quando comparada a trabalhos relacionados, sob diversos cenários de teste. Quando comparada a estratégias de reuso de dados tradicionais para codificadores de vídeo, como o esquema Level-C, a exploração do reuso de dados combinado nos níveis intra-Tile e inter-Tiles provê 69%-79% de redução de energia. Considerando as arquiteturas de memória de vídeo com foco no padrão HEVC, os ganhos variaram desde 2,8% (pior caso) até 67% (melhor caso) Da perspectiva do consumo de energia relacionado à comunicação com a memória externa, a arquitetura Hy-SVM é capaz de melhorar o reuso de dados (por explorar também o reuso de dados inter-Tiles), resultando em um consumo de energia on-chip 11%-17% menor. Além disso, as APMUs contribuem para reduzir o consumo de energia on-chip da arquitetura Hy-SVM em 56%-95%, para os cenários de teste analisados. Desta forma, comparada aos trabalhos relacionados, a arquitetura Hy-SVM apresenta o menor consumo energético on-chip. O gerenciamento da vazão da comunicação com a memória externa é capaz de reduzir as variações de largura de banda em 37%-83%, quando comparado à ordem tradicional de processamento, para cenários de teste com 4 e 16 Tiles sendo processados em paralelo pelo codificador HEVC. O gerenciamento de dados pôde, de forma significativa, estender o tempo de vida das células de memória STT-RAM, alcançando 0,83 de tempo de vida normalizado (métrica adotada para comparação, ficando muito próximo do caso ideal). Além disso, as sobrecargas causadas pela implementação das unidades de gerenciamento não afetam de foram significativa a performance e a eficiência energética da arquitetura Hy- SVM propostas por este trabalho
    corecore