15 research outputs found

    FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels

    Get PDF
    Using FPGAs as hardware accelerators that communicate with a central CPU is becoming a common practice in the embedded design world but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphical Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach

    The Development of Hardware Multi-core Test-bed on Field Programmable Gate Array

    Get PDF
    The goal of this project is to develop a flexible multi-core hardware test-bed on field programmable gate array (FPGA) that can be used to effectively validate the theoretical research on multi-core computing, especially for the power/thermal aware computing. Based on a commercial FPGA test platform, i.e. Xilinx Virtex5 XUPV5 LX110T, we develop a homogeneous multi-core test-bed with four software cores, each of which can dynamically adjust its performance using software. We also enhance the operating system support for this test platform with the development of hardware and software primitives that are useful in dealing with inter-process communication, synchronization, and scheduling for processes on multiple cores. An application based on matrix addition and multiplication on multi-core is implemented to validate the applicability of the test bed

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    A novel access pattern-based multi-core memory architecture

    Get PDF
    Increasingly High-Performance Computing (HPC) applications run on heterogeneous multi-core platforms. The basic reason of the growing popularity of these architectures is their low power consumption, and high throughput oriented nature. However, this throughput imposes a requirement on the data to be supplied in a high throughput manner for the multi-core system. This results in the necessity of an efficient management of on-chip and off-chip memory data transfers, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming widely dominant for a range of application domains including the scientific, image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management can exploit. The software based approaches using general purpose caches and on-chip memories are beneficial to some extent. However, the task of efficient data management for the throughput oriented devices could be improved by providing hardware mechanisms that exploit the knowledge of access patterns in memory management and scheduling of accesses for a heterogeneous multi-core architecture. The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. In general, the thesis covers four main aspects of memory system in this research. These aspects can be categorized as: i) Uni-core Memory System for Regular Data Pattern. ii) Multi-core Memory System for Regular Data Pattern. iii) Uni-core Memory System for Irregular Data Pattern. and iv) Multi-core Memory System for Irregular Data Pattern.Les aplicacions de computació d'alt rendiment (HPC) s'executen cada vegada més en plataformes heterogènies de múltiples nuclis. El motiu bàsic de la creixent popularitat d'aquestes arquitectures és el seu baix consum i la seva natura orientada a alt throughput. No obstant, aquest thoughput imposa el requeriment de que les dades es proporcionin al sistema també amb alt throughput. Això resulta en la necessitat de gestionar eficientment les trasferències de memòria (dins i fora del chip), un repte significatiu. Els patrons de transferències de memòria regulars però complexos així com els irregulars són cada vegada més dominants per a diversos dominis d'aplicacions, incloent el científic i el processat d'imagte i senyals. Aquests accessos a dades poden ser organitzats en patrons independents que un gestor de memòria eficient pot explotar. Els mètodes basats en programari emprant memòries cau de propòsit general i memòries al chip són beneficioses fins a cert punt. No obstant, la tasca de gestionar eficientment les transferències de dades per a dispositius orientats a throughput pot ser millorada oferint mecanismes hardware que explotin el coneixement dels patrons d'accés de les aplicacions, així com la planificació dels accessos a una arquitectura de múltiples nuclis. Aquesta tesis està enfocada a explorar una arquitectura de memòria novedosa per a processadors de múltiples nuclis, basada en els patrons d'accés. En general, la recerca de la tesis cobreix quatres aspectes principals del sistema de memòria. Aquests aspectes són: i) sistema de memòria per a un únic nucli amb patrons regulars, ii) sistema de memòria per a múltiples nuclis amb patrons regulars, iii) sistema de memòria per a un únic nucli amb patrons irregulars, iv) sistema de memòria per a múltiples nuclis amb patrons irregulars

    Hardware thread management modeling for precision timed processors

    Get PDF
    Studies recently and currently in progress address timing demands for Cyber Physical Systems (CPS) applications. Certain areas of research seek to modify modern computer architecture to meet the needs of CPS applications. Moreover, specific modifications in current computer architecture have produced newer computer architectures and processors, such as precision timed (PRET) processors. This thesis focuses on identifying, modeling, and simulating thread management methods in hardware used by the current open-source PRET soft processor, the MultiFire

    Run-time reconfigurable, fault-tolerant FPGA systems for space applications

    Get PDF
    Cozzi D. Run-time reconfigurable, fault-tolerant FPGA systems for space applications. Bielefeld: Universität Bielefeld; 2016.The aim of this thesis is to investigate the use of Dynamic Partial Reconfiguration (DPR) on Commercial Off-the-Shelf (COTS) FPGAs in space applications. Reconfigurable systems gained interest in a wide range of application fields, including aerospace, where electronic devices are exposed to a harsh working environment. COTS SRAM-based FPGA devices represent an interesting hardware platform for this kind of systems since they combine low cost with the possibility to utilize state-of-the-art processing power as well as the flexibility of reconfigurable hardware. FPGA architectures have high computational power and thanks to their ability to be reconfigured at run-time, they became interesting candidates for payload processing in space applications. The presented Dynamic Reconfigurable Processing Module (DRPM) has been developed to investigate the use of the DPR approach for satellite payload processing. This scalable platform combines dynamically reconfigurable FPGAs with the required avionic interfaces (e.g., SpaceWire, MIL-STD-1553B, and SpaceFibre). In particular, a novel communication interface has been developed, the Heterogeneous Multi Processor Communication Interface (HMPCI), which allows inter-process communication with small latency and low memory footprint. Current synthesis tools do not support fully the DPR capabilities of FPGAs. Therefore, this thesis introduces INDRA 2.0: an INtegrated Design flow for Reconfigurable Architectures. The key part of INDRA 2.0 is DHHarMa: a Design flow for Homogeneous Hard Macros, which generates homogeneous hard macros for Xilinx FPGAs starting from a high-level description (e.g., VHDL). In particular, the homogeneous DHHarMa router is explained in detail, providing novel terminologies and algorithms, which have enabled the generation of homogeneous routed designs. Results have been shown that Design flow for Homogeneous Hard Macros (DHHarMa) can route homogeneously a communication infrastructure utilizing just between 1% and 31% more resources than the Xilinx router, which cannot provide a homogeneous solution. Furthermore, the permanent faults that can occur on FPGAs have been investigated. This thesis presents OLT(RE)2: an on-line on-demand approach to testing permanent faults induced by radiation in reconfigurable systems used in space missions. The proposed approach relies on a test circuit and custom placer and router. OLT(RE)2 exploits DPR to place the test circuits at run-time. Its goal is to test unprogrammed areas of the FPGA before using them. Experimental results of OLT(RE)2 have shown that is possible to generate, place, and route the test circuits needed to detect on average more than 99 % of the physical wires and on average about 97 % of the programmable interconnection points of a large arbitrary region of the FPGA in a reasonable time. Moreover, the test can be run on the target device without interfering the functional behavior of the system

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p

    Framework for media oriented transport systems

    Get PDF
    Dissertação de mestrado integrado em Engenharia Electrónica Industrial e ComputadoresThe natural evolution of embedded systems resulted in a faster execution of tasks, increased possibility for including additional features, allied to lower power consumption and benefiting from ever-growing rates of integration as far as silicon is concerned. The automotive industry is not an exception with regards to the integration of technology for a vast arrays of applications in systems which vary from entertainment of infotainment to systems related to vehicle safety and stability such as driver assists. The existence of diverse independent systems in modern cars, combined with the necessity of centralizing the user interface, simplifying the operation of the system and minimizing the user’s intervention, help to promote the comfort and reduce the likelihood of distractions taking place while driving. Modern communication oriented network standards, e.g. MOST or FlexRay, enable information compatibility when exchanged between systems communicating over different protocols. Moreover, the coexistence of packet, control and timesensitive information are ensured within timing requirements, providing a reliable QoS (Quality of Service) and by making use of a single physical transmission mean. Synchronized multimedia data (e.g. synchronized video and audio transmission) are example of this kind of (time-sensitive) information. This dissertation proposes a framework for design and development of network distributed applications in the field of automotive infotainment, compliant with the industry standards and using FPGA technology in order to ensure the system requirements satisfaction and promote IP Core re-utilization.A evolução natural dos sistemas embebidos traduziu-se numa maior rapidez na execução de tarefas, a possibilidade de incluir mais funcionalidades, aliado a menores consumos energéticos e beneficiando de crescentes e elevadas taxas de integração ao nível de silício. A indústria automóvel não é excepção no que diz respeito à integração de tecnologia para as mais variadas aplicações, com ou sem tolerância à falha, em sistemas que vão desde entretenimento ou infotainment a sistemas relacionados com a estabilidade e segurança do veículo, como é exemplo as driver assists. Existem de vários sistemas independentes nos modernos veículos automóveis. Estes, combinados com a necessidade de centralização ao nível de interface com o utilizador, tornam imperativa a simplicidade da operação. Para tal, requerem a minimizaccão da intervenção do utilizador, promovendo o conforto e diminuindo a probabilidade de desconcentração durante o exercício de condução. Os mais modernos standards de redes de comunicação como é exemplo o MOST ou o FlexRay, permitem a compatibilidade de informação trocada entre sistemas que comunicam através de distintos protocolos de comunicação. Para além disso, ainda garantem a coexistência de informação de controlo, informação do entretenimento e informação do tipo time-sensitive, onde os requisitos de temporização devem ser assegurados, mantendo uma qualidade de serviço fiàvel e fazendo uso de um único meio físico de transmissão. São exemplos deste tipo de informação, dados síncronos do tipo multimédia (e.g. streaming de àudio e vídeo de forma sincronizada). Pretende-se desenvolver uma framework para desenvolvimento de aplicações de rede distribuídas, do tipo infotainment e que beneficia a aplicação de tecnologias como FPGA, no offloading de computação para este dispositivo, como meio de garantir a satisfação dos requisitos, e promover a reutilização deste tipo de sistemas, mantendo o elevado desempenho na troca de dados e promovendo a portabilidade e a modularidade
    corecore