Search CORE

67 research outputs found

Customized Nios II multi-cycle instructions to accelerate block-matching techniques

Author: Botella Juan Guillermo
García Sánchez Carlos
González Diego
Meyer Bäse Anke
Meyer Bäse Uwe
Prieto Matías Manuel
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date: 27/02/2015
Field of study

This study focuses on accelerating the optimization of motion estimation algorithms, which are widely used in video coding standards, by using both the paradigm based on Altera Custom Instructions as well as the efficient combination of SDRAM and On-Chip memory of Nios II processor. Firstly, a complete code profiling is carried out before the optimization in order to detect time leaking affecting the motion compensation algorithms. Then, a multi-cycle Custom Instruction which will be added to the specific embedded design is implemented. The approach deployed is based on optimizing SOC performance by using an efficient combination of On-Chip memory and SDRAM with regards to the reset vector, exception vector, stack, heap, read/write data (.rwdata), read only data (.rodata), and program text (.text) in the design. Furthermore, this approach aims to enhance the said algorithms by incorporating Custom Instructions in the Nios II ISA. Finally, the efficient combination of both methods is then developed to build the final embedded system. The present contribution thus facilitates motion coding for low-cost Soft-Core microprocessors, particularly the RISC architecture of Nios II implemented in FPGA. It enables us to construct an SOC which processes 50×50 @ 180 fps

Docta Complutense

Customized Nios II multi-cycle instructions to accelerate block-matching techniques

Author: Botella
Díaz
Koga
Kuo
Meyer-Baese
Mota
Zhu
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date
Field of study

Crossref

Automated design of domain-specific custom instructions = Geautomatiseerd ontwerp van domeinspecifieke gespecialiseerde instructies

Author: González Alvárez Cecilia Noemí
Publication venue: Universitat Politècnica de Catalunya
Publication date: 16/11/2015
Field of study

Cotutela Universitat Politècnica de Catalunya i Universiteit Gent, Faculteit Ingenieurswetenschappen en Architectuur Vakgroep Elektronica en InformatiesystemenIn the last years, hardware specialization has received renewed attention as chips approach a utilization wall. Specialized accelerators can take advantage of underutilized transistors implementing custom hardware that complements the main processor. However, specialization adds complexity to the design process and limits reutilization. Application-Specific Instruction Processors (ASIPs) balance performance and reusability, extending a general-purpose processor with custom instructions (CIs) specific for an application, implemented in Specialized Functional Units (SFUs). Still, time-to-market is a major issue with application-specific designs because, if CIs are not frequently executed, the acceleration benefits will not compensate for the overall design cost. Domain-specific acceleration increases the applicability of ASIPs, as it targets several applications that run on the same hardware. Also, reconfigurable SFUs and the automation of the CIs design can solve the aforementioned problems. In this dissertation, we explore different automated approaches to the design of CIs that extend a baseline processor for domain-specific acceleration to improve both performance and energy efficiency. First, we develop automated techniques to identify code sequences within a domain to create CI candidates. Due to the disparity among coding styles of different programs, it is difficult to find patterns that are represented by a unique CI across applications. Therefore, we propose an analysis at the basic block level that identifies equivalent CIs within and across different programs. We use the Taylor Expansion Diagram (TED) canonical representation to find not only structurally equivalent CIs, but also functionally similar ones, as opposed to the commonly applied directed acyclic graph (DAG) isomorphism detection. We combine both methods into a new Hybrid DAG/TED technique to identify more patterns across applications that map to the same CI. Then, we select a subset of the CI candidates that fits in the available SFU area. Because of the complexity of the problem, we propose four scoring heuristics to reduce the design space and smooth the potential performance speedup across applications. We include these methods in the FuSInG framework, and we estimate performance with hardware models on a set of media benchmarks. Results show that, when limiting core area devoted to specialization, the SFU customization with the largest speedups includes a mix of application and domain-specific custom instructions. If we target larger CIs to obtain higher speedups, reusability across applications becomes critical; without enough equivalences, CIs cannot be generalized for a domain. We aim to share partially common operations among CIs to accelerate more code, especially across basic blocks, and to reduce the hardware area needed for specialization. Hence, we create a new canonical representation across basic blocks, the Merging Diagram, to facilitate similarity detection and improve code coverage. We also introduce clustering-based partial matching to identify partially-similar domain-specific CIs, which generally leads to better performance than application-specific ones. Yet, at small areas, merging two CIs induces a high additional overhead that might penalize energy-efficiency. Thus, we also detect fragments of CIs and we join them with the existing merged clusters resulting in minimal extra overhead. Also, using speedup as the deciding factor for CI selection may not be optimal for devices with limited power budgets. For that reason, we propose a linear programming-based selection that balances performance and energy consumption. We implement these techniques in the MInGLE framework and evaluate them with media benchmarks. The selected CIs significantly improve the energy-delay product and performance, demonstrating that we can accelerate a domain covering more code while reducing the needed area for the CI implementation.La especialización de hardware ha recibido renovado interés debido al utilization wall, ya que transistores infrautilizados pueden implementar hardware a medida que complemente el procesador principal. Sin embargo, el proceso de diseño se complica y se limita la reutilización. Procesadores de instrucciones para aplicaciones específicas (ASIPs) equilibran rendimiento y reuso, extendiendo un procesador con instruciones especializadas (custom instructions ¿ CIs) para una aplicación, implementadas en unidades funcionales especializadas (SFUs). No obstante, los plazos de comercialización suponen un obstáculo en diseños específicos ya que, si las CIs no se ejecutan con frecuencia, los beneficios de la aceleración no compensan los costes de diseño. La aceleración de un dominio específico incrementa la aplicabilidad de los ASIPs, acelerando diferentes aplicaciones en el mismo hardware, mientras que una SFU reconfigurable y un diseño automatizado pueden resolver los problemas mencionados. En esta tesis, exploramos diferentes alternativas al diseño de CIs que extienden un procesador para acelerar un dominio, mejorando el rendimiento y la eficiencia energética. Proponemos primero técnicas automatizadas para identificar código acelerable en un dominio. Sin embargo, la identificación se ve dificultada por la diversidad de estilos entre diferentes programas. Por tanto, proponemos identificar en el bloque básico CIs equivalentes utilizando la representación canónica Taylor Expansion Diagram (TED). Con TEDs encontramos no sólo código estructuralmente equivalente, sino también con similitud funcional, en contraposición a la detección isomórfica de grafos acíclicos dirigidos (DAG). Combinamos ambos métodos en una nueva técnica híbrida DAG/TED, que identifica en varias aplicaciones más secuencias representadas por la misma CI. Tras esto, seleccionamos un subconjunto de CIs que puede ser contenido en el área de la SFU. Por la complejidad del problema, proponemos cuatro heurísticas de selección para reducir el espacio de búsqueda y homogeneizar el rendimiento de las aplicaciones. Incluimos estas técnicas en la infraestructura FuSInG y estimamos el rendimiento para un conjunto de benchmarks multimedia. Los resultados muestran que, al limitar el área de especialización, la configuración de la SFU con las mayores ganancias incluye una mezcla de CIs específicas tanto para una aplicación como para todo el dominio. Si nos centramos en CIs más grandes para obtener mayores ganancias, la reutilización se vuelve crítica; sin suficientes equivalencias las CIs no pueden ser generalizadas. Nuestro objetivo es que las CIs compartan parcialmente operaciones, especialmente a través de bloques básicos, y reducir el área de especialización. Por ello, creamos una representación canónica de CIs que cubre varios bloques básicos, Merging Diagram, para mejorar el alcance de la aceleración y facilitar la detección de similitud. Además, proponemos una búsqueda de coincidencias parciales basadas en clustering para identificar CIs de dominio específico parcialmente similares, las cuales derivan generalmente mejor rendimiento. Pero en áreas reducidas, la fusión de CIs induce un coste adicional que penalizaría la eficiencia energética. Así, detectamos fragmentos de CIs y los unimos con grupos de CIs previamente fusionadas con un coste extra mínimo. Usar el rendimiento como el factor decisivo en la selección puede no ser óptimo para disposivos con consumo de energía limitado. Por eso, proponemos un mecanismo de selección basado en programación lineal que equilibra rendimiento y consumo energético. Implementamos estas técnicas en la infraestructura MInGLE y las evaluamos con benchmarks multimedia. Las CIs seleccionadas mejoran notablemente la eficiencia energética y el rendimiento, demostrando que podemos acelerar un dominio cubriendo más código a la vez que reducimos el área de implementación.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Automated design of domain-specific custom instructions

Author: González-Álvarez Cecilia
Publication venue: Ghent University. Faculty of Engineering and Architecture
Publication date: 01/01/2015
Field of study

Ghent University Academic Bibliography

IPPro: FPGA based image processing processor

Author: Bardak Burak
Rafferty Karen
Russell Matthew
Siddiqui Fahad Manzoor
Woods Roger
Publication venue
Publication date: 01/10/2014
Field of study

Queen's University Belfast Research Portal

Crossref

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Recommended from our members

Hardware Monitors for Secure Processing in Embedded Operating Systems

Author: Thomas Tedy Mammen
Publication venue: ScholarWorks@UMass Amherst
Publication date: 23/11/2015
Field of study

Embedded processors are being increasingly used in our daily life and have become an important part of many types of infrastructure in the world. As people start depending more on embedded systems for personal and business processing operations, the attacks on these systems have also been on a rise. Existing defense mechanisms targeted for desktop and server processors cannot be used to defend embedded systems as these system exhibit constraints on processing performance and processing power and energy. Thus, embedded systems require low overhead security approaches to ensure that they are protected from attacks. This thesis describes a hardware based approach to monitor the operation of an embedded processor instruction-by-instruction, where deviations from expected program behavior are detected within the time associated with the execution of an instruction. Previous work in this area has focused on monitoring a single task on a CPU while here a novel hardware monitoring system that can monitor multiple active tasks in an operating-system-based platform is presented. This approach doesn’t need any change in application binary code. The hardware monitor is able to track context switches that occur in the operating system and ensure that monitoring is performed continuously, thus ensuring system security. This thesis describes the design of the system as well as results obtained from a prototype implementation of the system on an Altera DE4 FPGA board. It is demonstrated in hardware that applications can be monitored at instruction level without execution slow-down and buffer overflow attacks can be defeated using this system. When an attack occurs, it is detected within a cycle and the attack task is killed before it can harm the system. The system uses an off-chip DRAM for storing the application binary and the operating system kernel. A centralized graph memory is implemented on-chip to support the storage of all monitoring graphs associated with the system. MiBench benchmarks such as qsort, bitcount, stringmatch, basicmath and dijkstra are used to evaluate the system

ScholarWorks@UMass Amherst

Superscalar RISC-V Processor with SIMD Vector Extension

Author: He Jiongrui
Publication venue: 'University of Saskatchewan Library'
Publication date: 22/09/2020
Field of study

With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is convened by the stable and extensible open-sourced RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of applications and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully-tested superscalar processor. Compared to the reference work, the proposed design improves 18.9% on average instruction throughput and 4.92% on average prediction hit rate, with 16.9% higher operating clock frequency synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolution neural network model is evaluated by the standalone superscalar processor and the integration of the vector co-processor. The vector program with software-level optimizations achieves 9.53× improvement on instruction throughput and 10.18× improvement on real-time throughput. Moreover, the integration also provides 2.22× energy efficiency compared with the superscalar processor along

University of Saskatchewan Research Archive

Application of multimode HLS techniques to ASIP synthesis

Author: Engelhardt Nina
Publication venue: HAL CCSD
Publication date: 23/06/2010
Field of study

Automatic generation of ASIPs is still insufficiently resource-efficient compared to human design. The current best eort relies on extending existing processor cores with custom instruction pathways, an approach that has several drawbacks: resources already existing in the core cannot be reused, and data access is restricted to the original register reads and writes, creating a bottleneck in throughput. We present an ASIP synthesis algorithm that turns a semantic description of an instruction set into a synthesizable description of a processor's datapath in an efficient manner, based on a mod- ication of Moreano's datapath merge algorithm. Support for operator mobility across pipeline stages, combinatorial loop prevention and realistic multiplexer resource usage estimation for FPGAs is added. The translation chain is not yet completed, but preliminary results show efficient resource reuse between instruction datapaths

HAL-Rennes 1

Design and synthesis of a high-performance, hyper-programmable DSP on an FPGA

Author: Nichols Stephen
Publication venue: RIT Scholar Works
Publication date: 01/04/2011
Field of study

In the field of high performance digital signal processing, DSPs and FPGAs provide the most flexibility. Due to the extensive customization available on FPGAs, DSP algorithm implementation on an FPGA exhibits an increased development time over programming a processor. Because of this, traditional DSPs typically yield a faster time to market than an FPGA design. However, it is often desirable to have the ASIC-like performance that is attainable through the additional customization and parallel computation available through an FPGA. This can be achieved through the class of processors known as hyper-programmable DSPs. A hyper-programmable DSP is a DSP in which multiple aspects of the architecture are programmable. This thesis contributes such a DSP, targeted for high-performance and realized in hardware using an FPGA. The design consists of both a scalar datapath and a vector datapath capable of parallel operations, both of which are extensively customizable. To aid in the design of the datapaths, graphical tools are introduced as an efficient way to modify the design. A tool was also created to supply a graphical interface to help write instructions for the vector datapath. Additionally, an adaptive assembler was created to convert assembly programs to machine code for any datapath design. The resulting design was synthesized for a Cyclone III FPGA. The synthesis resulted in a design capable of running at 135MHz with 61% of the logic used by processing elements. Benchmarks were run on the design to evaluate its performance. The benchmarks showed similar performance between the proposed design and commercial DSPs for the simple benchmarks but significant improvement for the more complex ones

RIT Scholar Works