1,136 research outputs found

    DyPS: Dynamic Processor Switching for Energy-Aware Video Decoding on Multi-core SoCs

    In addition to General Purpose Processors (GPPs), the multi-core SoCs in modern mobile devices contain specialized Digital Signal Processors (DSPs) designed to provide better performance and lower energy consumption. However, our experimental measurements revealed that, for DSP video decoding, system overhead causes a drastic drop in performance and energy efficiency compared to GPP decoding. This paper describes DyPS, a new approach for energy-aware processor switching (GPP or DSP) according to the video quality. We show the pertinence of our solution in the context of adaptive video decoding and describe an implementation on an embedded Linux operating system using the GStreamer framework. A simple case study showed that DyPS achieves a 30% energy saving while sustaining the decoding performance
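
    The switching idea in this abstract can be illustrated with a minimal sketch. It is not the DyPS implementation: the `Estimate` type, the `choose_processor` function, and all per-frame numbers below are hypothetical, chosen only to show a policy that picks the lowest-energy processor able to meet the frame deadline for a given video quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-frame figures for one processor at one video quality."""
    decode_ms: float  # estimated decode time per frame
    energy_mj: float  # estimated energy per frame


def choose_processor(estimates, frame_budget_ms):
    """Return the lowest-energy processor that meets the frame deadline.

    If no processor meets the deadline, fall back to the fastest one.
    """
    feasible = {name: e for name, e in estimates.items()
                if e.decode_ms <= frame_budget_ms}
    if feasible:
        return min(feasible, key=lambda name: feasible[name].energy_mj)
    return min(estimates, key=lambda name: estimates[name].decode_ms)


# Made-up numbers (assumption): at low quality the DSP's system overhead
# dominates, at high quality the DSP is both faster and cheaper.
LOW_QUALITY = {"GPP": Estimate(8.0, 4.0), "DSP": Estimate(12.0, 6.0)}
HIGH_QUALITY = {"GPP": Estimate(45.0, 30.0), "DSP": Estimate(30.0, 18.0)}

BUDGET_MS = 1000.0 / 25  # 40 ms per frame at an assumed 25 fps

if __name__ == "__main__":
    print(choose_processor(LOW_QUALITY, BUDGET_MS))
    print(choose_processor(HIGH_QUALITY, BUDGET_MS))
```

    In a real system the estimates would come from measurements or a model rather than constants, and switching cost would also have to be accounted for.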

    A QHD-capable parallel H.264 decoder

    Video coding demands higher performance with every new generation and could therefore make use of many-core processors. A complete parallelization of H.264, which is the most advanced video coding standard, was found to be difficult due to the complexity of the standard. In this paper a parallel implementation of a complete H.264 decoder is presented. Our parallelization strategy exploits function-level as well as data-level parallelism. Function-level parallelism is used to pipeline the H.264 decoding stages. Data-level parallelism is exploited within the two most time-consuming stages, the entropy decoding stage and the macroblock decoding stage. The parallelization strategy has been implemented and optimized on three platforms with very different memory architectures, namely an 8-core SMP, a 64-core cc-NUMA, and an 18-core Cell platform. Evaluations have been performed using 4kx2k QHD sequences. On the SMP platform a maximum speedup of 4.5x is achieved. The SMP implementation is reasonably performance portable, as it achieves a speedup of 26.6x on the cc-NUMA system. However, to obtain the highest performance (speedup of 33.4x and throughput of 200 QHD frames per second), several cc-NUMA specific optimizations are necessary, such as optimizing the page placement and statically assigning threads to cores. Finally, on the Cell platform a near-ideal speedup of 16.5x is achieved by completely hiding the communication latency. EC/FP7/248647/EU/ENabling technologies for a programmable many-CORE/ENCOR
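
    The speedups reported in this abstract can be put on a common scale by dividing each by the platform's core count. The short sketch below does that arithmetic. It assumes that every core on each platform takes part in decoding and that each speedup is measured against a sequential run on the same platform; the abstract states neither, so the Cell figure in particular is only indicative.

```python
# (platform label, core count, reported speedup) taken from the abstract.
REPORTED = [
    ("8-core SMP", 8, 4.5),
    ("64-core cc-NUMA, portable SMP code", 64, 26.6),
    ("64-core cc-NUMA, tuned", 64, 33.4),
    ("18-core Cell", 18, 16.5),
]


def parallel_efficiency(speedup, cores):
    """Speedup per core: 1.0 would mean perfectly linear scaling."""
    return speedup / cores


def implied_baseline_fps(throughput_fps, speedup):
    """Sequential frame rate implied by a parallel throughput and its speedup."""
    return throughput_fps / speedup


if __name__ == "__main__":
    for name, cores, speedup in REPORTED:
        print(f"{name}: {parallel_efficiency(speedup, cores):.1%}")
    # 200 QHD frames/s at 33.4x implies roughly 6 frames/s sequentially.
    print(f"implied baseline: {implied_baseline_fps(200, 33.4):.2f} fps")
```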

    Performance and power optimizations in chip multiprocessors for throughput-aware computation

    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever-increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly parallel applications and operating systems capable of scheduling them correctly. At the hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace, a phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises because manufacturers find it difficult to lower operating voltages sufficiently with every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication.
The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach, referred to as the "processor-in-regfile" (PIR) strategy, overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such a case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly hard to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency. The excessive power consumption of current processors has slowed the growth of their operating frequency, giving rise to the era of processors with multiple cores and multiple execution threads.
For example, the IBM POWER7 processor, released in 2010, incorporates eight cores on the same chip, with four execution threads per core. This gives rise to new opportunities and challenges for software and hardware architects. At the software level, applications can benefit from the abundant number of cores and execution threads to increase throughput. But this forces programmers to create highly parallel applications and operating systems capable of scheduling them correctly. At the hardware level, the growing number of cores and execution threads puts pressure on the memory interface, since memory bandwidth grows at a slower pace. In addition to memory bandwidth problems, chip energy consumption rises because of manufacturers' difficulty in sufficiently reducing operating voltages between processor generations. This thesis presents innovations to improve bandwidth and energy consumption in multi-core processors for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a heuristic for scheduling parallel applications aimed at improving power and performance efficiency. In contrast to state-of-the-art LLC designs, our organization avoids data duplication and therefore does not require coherence techniques. The memory address space is statically distributed across the LLC with fine-grained interleaving. The absence of data replication increases the effective capacity of the cache, which translates into better hit rates and higher bandwidth compared with a coherent LLC.
We use double buffering to hide the additional latency needed to access remote data. The proposed vector register file is composed of thousands of registers and is organized as an aggregation of banks. We attach a small special-purpose computation unit ("local computation element" or LCE) to each bank. This approach, which we call "computation in the register file", overcomes the limited number of ports in the register file. Because each LCE is a computation unit with SIMD ("single instruction, multiple data") support and all of them can proceed concurrently, the "computation in the register file" strategy constitutes a highly parallel SIMD device. Finally, we present a heuristic for scheduling parallel applications aimed at reducing chip energy consumption by dynamically placing software-level execution threads across hardware-level execution threads. The heuristic gathers, at runtime, power consumption and performance information from the chip to infer the characteristics of the applications. For example, if the software-level threads share data significantly, the heuristic may decide to place them on fewer cores to favor data exchange among them. In that case, the unused cores can be switched off to save energy. It is increasingly difficult to find "bulletproof" architectural solutions to the scalability limitations of current processors. Consequently, we believe that architects should attack these problems from different flanks simultaneously, with complementary innovations
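
    The statically interleaved, replication-free LLC described in this abstract can be illustrated with a minimal address-mapping sketch. The line size, the bank count, and the `home_bank` function are assumptions made for illustration; the thesis's actual mapping is not given in the abstract.

```python
LINE_BYTES = 64  # assumed cache-line size
NUM_BANKS = 16   # assumed number of LLC banks


def home_bank(address, line_bytes=LINE_BYTES, num_banks=NUM_BANKS):
    """Map a physical address to the single LLC bank that may hold its line.

    Consecutive cache lines go to consecutive banks (fine-grained
    interleaving). Each line has exactly one home bank, so no copies exist
    and no coherence between banks is needed.
    """
    return (address // line_bytes) % num_banks


if __name__ == "__main__":
    for addr in (0, 63, 64, 128, LINE_BYTES * NUM_BANKS):
        print(hex(addr), "-> bank", home_bank(addr))
```

    Because a line may live in a bank far from the requesting core, some accesses pay extra latency, which is the cost the abstract says double buffering is used to hide.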

    Hardware/Software Co-design for Multicore Architectures

    Transferred from Doria

    A Reconfigurable Processor for Heterogeneous Multi-Core Architectures

    A reconfigurable processor is a general-purpose processor coupled with an FPGA-like reconfigurable fabric. By deploying application-specific accelerators, performance for a wide range of applications can be improved with such a system. In this work concepts are designed for the use of reconfigurable processors in multi-tasking scenarios and as part of multi-core systems

    SIMD based multicore processor for image and video processing

    System: new; Report number: Kō No. 3602; Degree type: Doctor of Engineering; Date conferred: 2012/3/15; Waseda University degree number: Shin 595

    Performance and Energy Consumption Characterization and Modeling of Video Decoding on Multi-core Heterogenous SoC and their Applications

    To meet the increasing complexity of mobile multimedia applications, the Systems on Chip (SoCs) in modern mobile devices integrate powerful heterogeneous processing elements, among which General Purpose Processors (GPPs), Digital Signal Processors (DSPs), and hardware accelerators are the most common. Due to the ever-growing gap between battery lifetime and hardware/software complexity, in addition to applications' computing power needs, energy saving has become crucial in the design of such systems. In this context, we propose a study aiming to enhance the understanding of the energy consumption behavior of video decoding on these kinds of systems. Accordingly, an end-to-end methodology for characterizing and modeling the performance and the energy consumption of video decoding on GPP and DSP is proposed. The characterization step is based on an exhaustive experimental methodology for evaluating, at different abstraction levels, the performance and the energy consumption of video decoding. It was carried out on embedded platforms on which a wide range of video decoding configurations were executed. This step highlighted the importance of considering different parameters, which may pertain to different abstraction levels, when evaluating the overall energy efficiency of a given system. The measurements obtained in this step were used to empirically build performance and energy models for video decoding on both GPP and DSP. The proposed models gave very accurate estimates (R² = 97%) of both the performance and the energy consumption of video decoding in terms of a rich set of parameters including the video quality and the processor frequency. Moreover, based on multi-level characterization and sub-model decomposition approaches, we show how the developed models, unlike classic empirical models, are easily and rapidly generalizable to other platforms. Some possible applications using the developed models, in the context of adaptive video decoding, were proposed.
In general, these consist of using the proposed performance model's ability to predict the decoding time of a given video quality when dimensioning and scheduling the processing resources. Due to the increasing demand for High Definition (HD), the characterization methodology was extended to consider HD video decoding on both parallel multi-cores and a hardware video accelerator. This part highlighted the potential of parallel video decoding to increase the energy efficiency of video decoding and pointed out some open issues in this domain. To meet the growing complexity of mobile multimedia applications, the systems on chip in modern mobile devices integrate powerful, heterogeneous computing units. These include general-purpose processors, digital signal processors, and hardware accelerators. Because of the ever-growing gap between battery lifetime and the increasing demand for computing power, energy saving has become a crucial issue in the design of mobile systems. This problem is exacerbated by the increasing complexity of the software and hardware architectures used. In this context, we propose a study aiming to improve the understanding of the energy considerations of video decoding on this kind of system. We thus propose a methodology for characterizing and modeling the performance and energy consumption of video decoding, both on ARM general-purpose processors and on a digital signal processor. The characterization step is based on an experimental methodology for evaluating, exhaustively and at different abstraction levels, the performance and energy consumption of video decoding.
This characterization was carried out on embedded platforms on which a wide range of video decoding configurations were executed. This step underlined the importance of examining different parameters, which may relate to different abstraction levels, when evaluating the overall energy efficiency of a given system. The measurements obtained in this step were used to empirically build performance and energy consumption models for video decoding on both ARM general-purpose processors and a digital signal processor. The proposed models can estimate with high accuracy (R² = 97%) the performance and energy consumption of video decoding as a function of a number of parameters including the video quality and the processor frequency. In addition, based on multi-level characterization and a sub-model decomposition modeling approach, we show how the developed models, unlike classic empirical models, can be easily and rapidly generalized to other platforms. We also propose some possible applications of the developed models in the context of adaptive video decoding. In general, this consists of exploiting the proposed performance model's ability to predict the decoding time of a given video quality in order to better dimension the computing resources, with the aim of reducing their energy consumption
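
    The empirical modeling step in this abstract, fitting a model to measurements and reporting its R², can be sketched as follows. This is a generic least-squares line fit, not the thesis's model: the single predictor (inverse processor frequency), the function names, and the sample measurements are all assumptions made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up measurements (assumption): decode time per frame in ms at four
# processor frequencies, modeled as linear in 1 / frequency.
FREQ_GHZ = [0.3, 0.6, 0.8, 1.0]
DECODE_MS = [35.0, 18.5, 14.2, 11.8]
INV_FREQ = [1.0 / f for f in FREQ_GHZ]

if __name__ == "__main__":
    slope, intercept = fit_line(INV_FREQ, DECODE_MS)
    r2 = r_squared(INV_FREQ, DECODE_MS, slope, intercept)
    print(f"decode_ms ~ {slope:.2f} / f_GHz + {intercept:.2f}, R^2 = {r2:.4f}")
```

    A fitted model of this kind can then predict decode time at an unmeasured frequency, which is the capability the abstract proposes using for dimensioning and scheduling.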

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is the use of DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to providing programmability and designing the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module was added because application performance degrades severely under large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models.
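
    Double buffering, as named in this abstract, can be sketched in a few lines: while one chunk of data is being processed, the transfer of the next chunk is already in flight. The sketch below is only an illustration of the pattern, not Castell's runtime. A single worker thread stands in for the DMA engine, and `fetch`, `compute`, and the chunk layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # assumed number of elements per transfer


def fetch(chunk_id):
    """Stand-in for a DMA transfer of one chunk (hypothetical data source)."""
    start = chunk_id * CHUNK
    return list(range(start, start + CHUNK))


def compute(data):
    """Stand-in for the task's computation on one chunk."""
    return sum(data)


def process_double_buffered(num_chunks):
    """Process chunks while the next chunk's transfer is already in flight."""
    results = []
    if num_chunks <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)  # start the first transfer
        for i in range(num_chunks):
            current = pending.result()  # wait for the chunk we need now
            if i + 1 < num_chunks:
                pending = dma.submit(fetch, i + 1)  # prefetch the next chunk
            results.append(compute(current))  # overlaps with the prefetch
    return results


if __name__ == "__main__":
    print(process_double_buffered(3))
```

    In CPython the overlap here is mostly conceptual, since both stand-ins are pure Python; on hardware with a real DMA engine the transfer proceeds independently of the core doing the computation.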

    A Scalable and Adaptive Network on Chip for Many-Core Architectures

    In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept makes it possible to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs are introduced