
    Processor Enhancements for Media Streaming Applications

    The development of more processing-demanding applications on the Internet (video broadcasting) on the one hand, and the popularity of recent devices at the user level (digital cameras, wireless videophones, ...) on the other, introduce challenges at several levels. Today, such devices present processing capabilities and bandwidth settings that are insufficient to manage scalable QoS requirements in a typical media delivery framework. In this paper, we present an impact study of such a scalable data representation optimized for QoS (Matching Pursuit 3D algorithms) on processor architectures to achieve the best performance and power efficiency. A review of state-of-the-art techniques for processor architecture enhancement leads us to expect promising opportunities from the latest developments in the reconfigurable computing research field. We present here the first design steps of an efficient reconfigurable coprocessor specifically designed to cope with future video delivery and multimedia processing requirements. Architecture perspectives are proposed with respect to low development cost constraints, backward compatibility and easy coprocessor usage, using an original strategy based on a hardware/software codesign methodology

    VLSI architecture design approaches for real-time video processing

    This paper discusses the programmable and dedicated approaches to real-time video processing applications. Various VLSI architectures, including design examples of both approaches, are reviewed. Finally, several practical designs for real-time video processing applications are discussed in terms of their VLSI architectures, to provide significant guidelines to VLSI designers for further real-time video processing design work

    Domain-Specific Computing Architectures and Paradigms

    We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be unlocked if the underlying domain-specific computing hardware delivers orders of magnitude more performance and energy efficiency. This work focuses on that goal and explores three parts of the domain-specific computing acceleration problem, encapsulating specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud. The first part of this work investigates the optimization of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking. The second part explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index-matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers. Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration. PhD, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd
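    The index-matching mechanism described above can be sketched in software. This is a hypothetical illustration of the general idea, not the SoC's actual hardware design: a multiply is issued only where the nonzero indices of a sparse weight stream and a sparse activation stream coincide, found by scanning ahead in whichever stream is behind.

```python
# Hypothetical software sketch of sparse index matching: multiply-accumulate
# only where the nonzero indices of a sparse weight vector and a sparse
# activation vector coincide (a two-pointer merge over sorted index lists).
def sparse_dot(w_idx, w_val, a_idx, a_val):
    acc, i, j = 0, 0, 0
    while i < len(w_idx) and j < len(a_idx):
        if w_idx[i] == a_idx[j]:       # index match: do useful work
            acc += w_val[i] * a_val[j]
            i += 1
            j += 1
        elif w_idx[i] < a_idx[j]:      # skip ahead in the weight stream
            i += 1
        else:                          # skip ahead in the activation stream
            j += 1
    return acc

# Weights nonzero at indices {0, 3, 5}; activations nonzero at {3, 5, 7}:
# only indices 3 and 5 contribute, so the result is 4*10 + 1*100 = 140.
print(sparse_dot([0, 3, 5], [2, 4, 1], [3, 5, 7], [10, 100, 7]))  # 140
```

    Because the merge touches each nonzero index once, the work is proportional to the number of nonzeros rather than the full layer size, which is the source of the efficiency claimed for sparse acceleration.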

    Architectural Exploration Techniques for Application-Specific Designs for Loop Acceleration

    RÉSUMÉ (translated from the French): Nowadays, industry favors flexible architectures in order to reduce the time and cost of designing a system. Application-specific instruction-set processors (ASIPs) provide a great deal of flexibility while reaching high performance. An increasingly successful trend in the system-on-chip design process is to specify the system's behavior in a high-level language such as C, SystemC, etc. The specification is then used during partitioning to determine the software and hardware components of the system. With the maturity of automatic ASIP generators, designers can add a new type of architecture, namely the ASIP, to their toolboxes, knowing that these processors are generated from a specification described in a high-level language. On the hardware side, researchers have long seen the advantage of basing the design process on a high-level language. That research led to the arrival on the market of automatic hardware generators, design-assistance tools such as Catapult C, Forte's Cynthesizer, etc. Thus, with all these C-based tools, designers have a wider choice of design types, but the space of possible designs explodes, which can lengthen rather than shorten design time. Our doctoral thesis falls within this framework: it presents architectural exploration methodologies for application-specific designs for loop acceleration, with the aim, among others, of reducing design time. The thesis began with the exploration of ASIP designs. Processing loops are good candidates for acceleration if they contain good parallelization opportunities and if those opportunities are well exploited.
Hardware is very efficient at exploiting instruction-level parallelism, so a design method was proposed that extracts the parallelism of a loop in order to execute more concurrent operations in specialized instructions. Our method also relies on optimizing data handling in the processor architecture.---------- ABSTRACT Time to market is a very important concern in industry. That is why the industry always looks for new CAD tools that contribute to reducing design time. Application-specific instruction-set processors (ASIPs) provide flexibility and allow reaching good performance if they are well designed. One trend that gains more and more success is C-based design, which uses a high-level language such as C, SystemC, etc. The C-based specification is used during the partitioning phase to determine the software and hardware components of the system. Since automatic processor generators are now mature, designers have a new type of tool they can rely on during architecture design. In the hardware world, high-level synthesis was and still is a hot research topic. The advances in ESL led to commercial high-level synthesis tools such as Catapult C, Forte's Cynthesizer, etc. Designers have more tools in their box, but they also have more solutions to explore, which can have the reverse effect: design time can increase instead of being reduced. Our doctoral research tackles this issue by proposing new methodologies for design-space exploration of application-specific architectures for loop acceleration, in order to reduce design time while reaching targeted performance. Our thesis starts with the exploration of ASIP design. We propose a method that targets loop acceleration with highly coupled specialized instructions executing loop operations.
Loops are good candidates for acceleration when the parallelism they offer is well exploited (if they have any parallelization opportunities). Hardware components such as specialized instructions can leverage parallelization opportunities at a low level. Thus, we propose to extract loop parallelization opportunities and to execute more concurrent operations in specialized instructions. The main contribution of this method is a new approach to specialized-instruction (SI) design based on loop acceleration, where loop optimization and transformation are done directly in the SIs instead of in the software code. Another contribution is the design of tightly coupled specialized instructions associated with loops, based on a 5-pattern representation
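    A hypothetical software analogue of the specialized-instruction idea (a sketch only, not the thesis's actual method): the body of a loop is fused into a single wide operation, so each "instruction issue" performs several concurrent multiplies and adds, the way a specialized datapath would exploit instruction-level parallelism in hardware.

```python
# Hypothetical sketch: a dot-product loop whose body (load, multiply, add)
# is fused into a single "specialized instruction" processing four elements
# per issue, modeling four parallel multipliers feeding an adder tree.
def dot_scalar(a, b):
    acc = 0
    for i in range(len(a)):        # one multiply-add per "instruction"
        acc += a[i] * b[i]
    return acc

def si_mac4(a, b, i, acc):
    # Models one specialized instruction: four parallel multiplies,
    # an adder tree, and one accumulate, all in a single issue.
    return acc + (a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])

def dot_specialized(a, b):
    acc = 0
    for i in range(0, len(a), 4):  # length assumed to be a multiple of 4
        acc = si_mac4(a, b, i, acc)
    return acc

x, y = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
print(dot_scalar(x, y), dot_specialized(x, y))  # 120 120
```

    The specialized version issues a quarter as many "instructions" for the same result, which is the effect the thesis seeks by moving loop optimization into the SIs themselves.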

    Solutions for Optimizing the Data Parallel Prefix Sum Algorithm Using the Compute Unified Device Architecture

    In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA), which provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA. Keywords: CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp
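    The abstract does not include code, but the work-efficient scan pattern that CUDA prefix sum kernels typically parallelize (the up-sweep/down-sweep structure of Blelloch's algorithm) can be sketched sequentially. Here each "parallel step" is a plain loop; on a GPU the inner loop bodies would run as concurrent threads over shared memory.

```python
# Sketch of the work-efficient (Blelloch) exclusive prefix sum that CUDA
# scan kernels parallelize. The two phases do O(n) total additions, which
# is why this variant is preferred over the naive O(n log n) scan.
def exclusive_scan(data):
    n = len(data)                  # assumed to be a power of two
    out = list(data)
    # Up-sweep (reduce) phase: build partial sums in a balanced tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            out[i + 2 * d - 1] += out[i + d - 1]
        d *= 2
    # Down-sweep phase: clear the root, then push prefixes back down,
    # swapping and accumulating at each tree level.
    out[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = out[i + d - 1]
            out[i + d - 1] = out[i + 2 * d - 1]
            out[i + 2 * d - 1] += t
        d //= 2
    return out

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

    In a real CUDA kernel, each tree level is one synchronized step across a thread block, and the padding of shared-memory indices to avoid bank conflicts is one of the optimizations papers like this one benchmark.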

    High performance computing with FPGAs

    Field-programmable gate arrays represent an army of logic units which can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates new challenges in finding the right processing paradigm, one that takes into account the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. In this paper, the use of FPGAs as a multiprocessor on a chip is first compared with their use as a highly functional coprocessor, and the programming tools for hardware/software codesign are discussed. Next, a number of techniques are presented to maximize the parallelism and optimize the data locality in nested loops. These include unimodular transformations, data-locality-improving loop transformations and the use of smart buffers. Finally, the use of these techniques is demonstrated on a number of examples. The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speed up computation kernels significantly with respect to traditional processors
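    As a concrete illustration of the nested-loop transformations mentioned above (a sketch of the general technique, not code from the paper): loop interchange is the simplest unimodular transformation, and it can turn a stride-heavy column traversal into a traversal that matches the memory layout, improving data locality for caches or FPGA smart buffers.

```python
# Illustration of loop interchange, a unimodular loop transformation.
# Both functions compute the same column sums; the second visits the
# matrix row by row, matching the contiguous layout of a row-major
# array, so each inner loop streams through one row (better locality).
def column_sums_poor_locality(matrix):
    rows, cols = len(matrix), len(matrix[0])
    sums = [0] * cols
    for j in range(cols):          # outer loop over columns:
        for i in range(rows):      # every access jumps a full row apart
            sums[j] += matrix[i][j]
    return sums

def column_sums_interchanged(matrix):
    rows, cols = len(matrix), len(matrix[0])
    sums = [0] * cols
    for i in range(rows):          # interchange: rows outermost,
        for j in range(cols):      # inner loop walks one contiguous row
            sums[j] += matrix[i][j]
    return sums

m = [[1, 2], [3, 4], [5, 6]]
print(column_sums_poor_locality(m), column_sums_interchanged(m))  # [9, 12] [9, 12]
```

    The interchange is legal here because the two loops carry no dependence on each other's order; establishing that legality is exactly what the unimodular-transformation framework formalizes.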
