17 research outputs found

    Feasibility study and porting of the damped least square algorithm on FPGA

    Get PDF
    Modern embedded computing platforms used within Cyber-Physical Systems (CPS) are nowadays leveraging more and more often on heterogeneous computing substrates, such as newest Field Programmable Gate Array (FPGA) devices. Compared to general purpose platforms, which have a fixed datapath, FPGAs provide designers the possibility of customizing part of the computing infrastructure, to better shape the execution on the application needs/features, and offer high efficiency in terms of timing and power performance, while naturally featuring parallelism. In the context of FPGA-based CPSs, this article has a two fold mission. On the one hand, it presents an analysis of the Damped Least Square (DLS) algorithm for a perspective hardware implementation. On the other hand, it describes the implementation of a robotic arm controller based on the DLS to numerically solve Inverse Kinematics problems over a heterogeneous FPGA. Assessments involve a Trossen Robotics WidowX robotic arm controlled by a Digilent ZedBoard provided with a Xilinx Zynq FPGA that computes the Inverse Kinematic

    Implementing a streaming application on a processor array

    Get PDF
    A modern way of processing information is to do it in parallel. This Master Thesis conducts a case study of how to parallelize a streaming application on a highly parallel platform. This involves porting a real-world application, written in a stream processing language and compiled by tools, developed by the Embedded System Design research group at Lund University, onto a platform including an embedded processor array (the Adapteva’s Epiphany), an ARM processor, and programmable logic. The driver application that we used was a video decoder. The host platform was a Parallella board, with a 16-core Epiphany co-processor and a Zynq host processor that had dual ARM cores. Our Master Thesis covers the creation of some library elements to support complex applications on that platform, such as FIFOs between Epiphany cores and the ARM host, some components that handle access to external RAM and a component that draws pixels onto a screen.A modern way of processing information is to do it in parallel. This master’s thesis conducts a case study of how to parallelize an application on a highly parallel platform

    FPGA-Based Processor Acceleration for Image Processing Applications

    Get PDF
    FPGA-based embedded image processing systems offer considerable computing resources but present programming challenges when compared to software systems. The paper describes an approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based programming environment. The approach is demonstrated for a k-means clustering operation and a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using 16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient (fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU respectively

    Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

    Get PDF
    Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000 – 2015 (Technical Report).; 2017.Since their introduction, FPGAs can be seen in more and more different fields of applications. The key advantage is the combination of software-like flexibility with the performance otherwise common to hardware. Nevertheless, every application field introduces special requirements to the used computational architecture. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units like e.g. CPUs

    Acquisition systems and decoding algorithms of peripheral neural signals for prosthetic applications

    Get PDF
    During the years, neuroprosthetic applications have obtained a great deal of attention by the international research, especially in the bioengineering field, thanks to the huge investments on several proposed projects funded by the political institutions which consider the treatment of this particular disease of fundamental importance for the global community. The aim of these projects is to find a possible solution to restore the functionalities lost by a patient subjected to an upper limb amputation trying to develop, according to physiological considerations, a communication link between the brain in which the significant signals are generated and a motor prosthesis device able to perform the desired action. Moreover, the designed system must be able to give back to the brain a sensory feedback about the surrounding world in terms of pressure or temperature acquired by tactile biosensors placed at the surface of the cybernetic hand. It in fact allows to execute involuntarymovements when for example the armcomes in contact with hot objects. The development of such a closed-loop architecture involves the need to address some critical issues which depend on the chosen approach. Several solutions have been proposed by the researches of the field, each one differing with respect to where the neural signals are acquired, either at the central nervous systemor at the peripheral one,most of themfollowing the former even that the latter is always considered by the amputees amore natural way to handle the artificial limb. This research work is based on the use of intrafascicular electrodes directly implanted in the residual peripheral nerves of the stump which represents a good compromise choice in terms of invasiveness and selectivity extracting electroneurographic (ENG) signals from which it is possible to identify the significant activity of a quite limited number of neuronal cells. In the perspective of the hardware implementation of the resulting solution which can work autonomously without any intervention by the amputee in an adaptive way according to the current characteristics of the processed signal and by using batteries as power source allowing portability, it is necessary to fulfill the tight constraints imposed by the application under consideration involved in each of the various phases which compose the considered closed-loop system. Regarding to the recording phase, the implementation must be able to remove the unwanted interferences mainly due to the electro-stimulations of themuscles placed near the electrodes featured by an order of magnitude much greater in comparison to that of the signals of interest amplifying the frequency components belonging to the significant bandwidth, and to convert them with a high resolution in order to obtain good performance at the next processing phases. To this aim, a recording module for peripheral neural signals will be presented, based on the use of a sigma-delta architecture which is composed by two main parts: an analog front-end stage for neural signal acquisition, pre-filtering and sigma-delta modulation and a digital unit for sigma-delta decimation and system configuration. Hardware/software cosimulations exploiting the Xilinx System Generator tool in Matlab Simulink environment and then transistor-level simulations confirmed that the system is capable of recording neural signals in the order of magnitude of tens of μV rejecting the huge low-frequency noise due to electromyographic interferences. The same architecture has been then exploited to implement a prototype of an 8-channel implantable electronic bi-directional interface between the peripheral nervous system and the neuro-controlled hand prosthesis. The solution includes a custom designed Integrated Circuit (0.35μm CMOS technology), responsible of the signal pre-filtering and sigma-delta modulation for each channel and the neural stimuli generation (in the opposite path) based on the directives sent by a digital control systemmapped on a low-cost Xilinx FPGA Spartan-3E 1600 development board which also involves the multi-channel sigma-delta decimation with a high-order band-pass filter as first stage in order to totally remove the unwanted interferences. In this way, the analog chip can be implanted near the electrodes thanks to its limited size avoiding to add a huge noise to theweak neural signals due to longwires connections and to cause heat-related infections, shifting the complexity to the digital part which can be hosted on a separated device in the stump of the amputeewithout using complex laboratory instrumentations. The system has been successfully tested from the electrical point of view and with in-vivo experiments exposing good results in terms of output resolution and noise rejection even in case of critical conditions. The various output channels at the Nyquist sampling frequency coming from the acquisition system must be processed in order to decode the intentions of movements of the amputee, applying the correspondent electro-mechanical stimulation in input to the cybernetic hand in order to perform the desired motor action. Different decoding approaches have been presented in the past, the majority of them were conceived starting from the relative implementation and performance evaluation of their off-line version. At the end of the research, it is necessary to develop these solutions on embedded systems performing an online processing of the peripheral neural signals. However, it is often possible only by using complex hardware platforms clocked at very high operating frequencies which are not be compliant with the low-power requirements needed to allow portability for the prosthetic device. At present, in fact, the important aspect of the real-time implementation of sophisticated signal processing algorithms on embedded systems has been often overlooked, notwithstanding the impact that limited resources of the former may have on the efficiency/effectiveness of any given algorithm. In this research work it has been addressed the optimization of a state-of-the-art algorithmfor PNS signals decoding that is a step forward for its real-time, full implementation onto a floating-point Digital Signal Processor (DSP). Beyond low-level optimizations, different solutions have been proposed at an high level in order to find the best trade-off in terms of effectiveness/efficiency. A latency model, obtained through cycle accurate profiling of the different code sections, has been drawn in order to perform a fair performance assessment. The proposed optimized real-time algorithmachieves up to 96% of correct classification on real PNS signals acquired through tf-LIFE electrodes on animals, and performs as the best off-line algorithmfor spike clustering on a synthetic cortical dataset characterized by a reasonable dissimilarity between the spikemorphologies of different neurons. When the real-time requirements are joined to the fulfilment of area and power minimization for implantable/portable applications, such as for the target neuroprosthetic devices, only custom VLSI implementations can be adopted. In this case, every part of the algorithmshould be carefully tuned. To this aim, the first preprocessing stage of the decoding algorithmbased on the use of aWavelet Denoising solution able to remove also the in-band noise sources has been deeply analysed in order to obtain an optimal hardware implementation. In particular, the usually overlooked part related to threshold estimation has been evaluated in terms of required hardware resources and functionality, exploiting the commercial Xilinx System Generator tool for the design of the architecture and the co-simulation. The analysis has revealed how the widely used Median Absolute Deviation (MAD) could lead o hardware implementations highly inefficient compared to other dispersion estimators demonstrating better scalability, relatively to the specific application. Finally, two different hardware implementations of the reference decoding algorithm have been presented highlighting pros and cons of each one of them. Firstly, a novel approach based on high-level dataflow description and automatic hardware generation is presented and evaluated on the on-line template-matching spike sorting algorithmwhich represents the most complex processing stage. It starts from the identification of the single kernels with the greater computational complexity and using their dataflow description to generate the HDL implementation of a coarse-grained reconfigurable global kernel characterized by theminimumresources in order to reduce the area and the energy dissipation for the fulfilment of the low-power requirements imposed by the application. Results in the best case have revealed a 71%of area saving compared tomore traditional solutions,without any accuracy penalty. With respect to single kernels execution, better latency performance are achievable stillminimizing the number of adopted resources. The performance in terms of latency can also be improved by tuning the implemented parallelismin the light of a defined number of channels and real-time constraints, by using more than one reconfigurable global kernel in order that they can be exploited to perform the same or different kernels at the same time in a parallel way, due to the fact that each one can execute the relative processing only in a sequential way. For this reason, a second FPGA-based prototype has been proposed based on the use of aMulti-Processor System-on-Chip (MPSoC) embedded architecture. This prototype is capable of respecting the real-time constraints posed by the application when clocked at less than 50 MHz, in comparison to 300 MHz of the previous DSP implementation. Considering that the application workload is extremely data dependent and unpredictable due to the sparsity of the neural signals, the architecture has to be dimensioned taking into account critical worst-case operating conditions in order to always ensure the correct functionality. To compensate the resulting overprovisioning of the system architecture, a software-controllable power management based on the use of clock gating techniques has been integrated in order tominimize the dynamic power consumption of the resulting solution. Summarizing, this research work can be considered a sort of proof-of-concept for the proposed techniques considering all the design issues which characterize each stage of the closed-loop system in the perspective of a portable low-power real-time hardware implementation of the neuro-controlled prosthetic device

    Acquisition systems and decoding algorithms of peripheral neural signals for prosthetic applications

    Get PDF
    During the years, neuroprosthetic applications have obtained a great deal of attention by the international research, especially in the bioengineering field, thanks to the huge investments on several proposed projects funded by the political institutions which consider the treatment of this particular disease of fundamental importance for the global community. The aim of these projects is to find a possible solution to restore the functionalities lost by a patient subjected to an upper limb amputation trying to develop, according to physiological considerations, a communication link between the brain in which the significant signals are generated and a motor prosthesis device able to perform the desired action. Moreover, the designed system must be able to give back to the brain a sensory feedback about the surrounding world in terms of pressure or temperature acquired by tactile biosensors placed at the surface of the cybernetic hand. It in fact allows to execute involuntarymovements when for example the armcomes in contact with hot objects. The development of such a closed-loop architecture involves the need to address some critical issues which depend on the chosen approach. Several solutions have been proposed by the researches of the field, each one differing with respect to where the neural signals are acquired, either at the central nervous systemor at the peripheral one,most of themfollowing the former even that the latter is always considered by the amputees amore natural way to handle the artificial limb. This research work is based on the use of intrafascicular electrodes directly implanted in the residual peripheral nerves of the stump which represents a good compromise choice in terms of invasiveness and selectivity extracting electroneurographic (ENG) signals from which it is possible to identify the significant activity of a quite limited number of neuronal cells. In the perspective of the hardware implementation of the resulting solution which can work autonomously without any intervention by the amputee in an adaptive way according to the current characteristics of the processed signal and by using batteries as power source allowing portability, it is necessary to fulfill the tight constraints imposed by the application under consideration involved in each of the various phases which compose the considered closed-loop system. Regarding to the recording phase, the implementation must be able to remove the unwanted interferences mainly due to the electro-stimulations of themuscles placed near the electrodes featured by an order of magnitude much greater in comparison to that of the signals of interest amplifying the frequency components belonging to the significant bandwidth, and to convert them with a high resolution in order to obtain good performance at the next processing phases. To this aim, a recording module for peripheral neural signals will be presented, based on the use of a sigma-delta architecture which is composed by two main parts: an analog front-end stage for neural signal acquisition, pre-filtering and sigma-delta modulation and a digital unit for sigma-delta decimation and system configuration. Hardware/software cosimulations exploiting the Xilinx System Generator tool in Matlab Simulink environment and then transistor-level simulations confirmed that the system is capable of recording neural signals in the order of magnitude of tens of μV rejecting the huge low-frequency noise due to electromyographic interferences. The same architecture has been then exploited to implement a prototype of an 8-channel implantable electronic bi-directional interface between the peripheral nervous system and the neuro-controlled hand prosthesis. The solution includes a custom designed Integrated Circuit (0.35μm CMOS technology), responsible of the signal pre-filtering and sigma-delta modulation for each channel and the neural stimuli generation (in the opposite path) based on the directives sent by a digital control systemmapped on a low-cost Xilinx FPGA Spartan-3E 1600 development board which also involves the multi-channel sigma-delta decimation with a high-order band-pass filter as first stage in order to totally remove the unwanted interferences. In this way, the analog chip can be implanted near the electrodes thanks to its limited size avoiding to add a huge noise to theweak neural signals due to longwires connections and to cause heat-related infections, shifting the complexity to the digital part which can be hosted on a separated device in the stump of the amputeewithout using complex laboratory instrumentations. The system has been successfully tested from the electrical point of view and with in-vivo experiments exposing good results in terms of output resolution and noise rejection even in case of critical conditions. The various output channels at the Nyquist sampling frequency coming from the acquisition system must be processed in order to decode the intentions of movements of the amputee, applying the correspondent electro-mechanical stimulation in input to the cybernetic hand in order to perform the desired motor action. Different decoding approaches have been presented in the past, the majority of them were conceived starting from the relative implementation and performance evaluation of their off-line version. At the end of the research, it is necessary to develop these solutions on embedded systems performing an online processing of the peripheral neural signals. However, it is often possible only by using complex hardware platforms clocked at very high operating frequencies which are not be compliant with the low-power requirements needed to allow portability for the prosthetic device. At present, in fact, the important aspect of the real-time implementation of sophisticated signal processing algorithms on embedded systems has been often overlooked, notwithstanding the impact that limited resources of the former may have on the efficiency/effectiveness of any given algorithm. In this research work it has been addressed the optimization of a state-of-the-art algorithmfor PNS signals decoding that is a step forward for its real-time, full implementation onto a floating-point Digital Signal Processor (DSP). Beyond low-level optimizations, different solutions have been proposed at an high level in order to find the best trade-off in terms of effectiveness/efficiency. A latency model, obtained through cycle accurate profiling of the different code sections, has been drawn in order to perform a fair performance assessment. The proposed optimized real-time algorithmachieves up to 96% of correct classification on real PNS signals acquired through tf-LIFE electrodes on animals, and performs as the best off-line algorithmfor spike clustering on a synthetic cortical dataset characterized by a reasonable dissimilarity between the spikemorphologies of different neurons. When the real-time requirements are joined to the fulfilment of area and power minimization for implantable/portable applications, such as for the target neuroprosthetic devices, only custom VLSI implementations can be adopted. In this case, every part of the algorithmshould be carefully tuned. To this aim, the first preprocessing stage of the decoding algorithmbased on the use of aWavelet Denoising solution able to remove also the in-band noise sources has been deeply analysed in order to obtain an optimal hardware implementation. In particular, the usually overlooked part related to threshold estimation has been evaluated in terms of required hardware resources and functionality, exploiting the commercial Xilinx System Generator tool for the design of the architecture and the co-simulation. The analysis has revealed how the widely used Median Absolute Deviation (MAD) could lead o hardware implementations highly inefficient compared to other dispersion estimators demonstrating better scalability, relatively to the specific application. Finally, two different hardware implementations of the reference decoding algorithm have been presented highlighting pros and cons of each one of them. Firstly, a novel approach based on high-level dataflow description and automatic hardware generation is presented and evaluated on the on-line template-matching spike sorting algorithmwhich represents the most complex processing stage. It starts from the identification of the single kernels with the greater computational complexity and using their dataflow description to generate the HDL implementation of a coarse-grained reconfigurable global kernel characterized by theminimumresources in order to reduce the area and the energy dissipation for the fulfilment of the low-power requirements imposed by the application. Results in the best case have revealed a 71%of area saving compared tomore traditional solutions,without any accuracy penalty. With respect to single kernels execution, better latency performance are achievable stillminimizing the number of adopted resources. The performance in terms of latency can also be improved by tuning the implemented parallelismin the light of a defined number of channels and real-time constraints, by using more than one reconfigurable global kernel in order that they can be exploited to perform the same or different kernels at the same time in a parallel way, due to the fact that each one can execute the relative processing only in a sequential way. For this reason, a second FPGA-based prototype has been proposed based on the use of aMulti-Processor System-on-Chip (MPSoC) embedded architecture. This prototype is capable of respecting the real-time constraints posed by the application when clocked at less than 50 MHz, in comparison to 300 MHz of the previous DSP implementation. Considering that the application workload is extremely data dependent and unpredictable due to the sparsity of the neural signals, the architecture has to be dimensioned taking into account critical worst-case operating conditions in order to always ensure the correct functionality. To compensate the resulting overprovisioning of the system architecture, a software-controllable power management based on the use of clock gating techniques has been integrated in order tominimize the dynamic power consumption of the resulting solution. Summarizing, this research work can be considered a sort of proof-of-concept for the proposed techniques considering all the design issues which characterize each stage of the closed-loop system in the perspective of a portable low-power real-time hardware implementation of the neuro-controlled prosthetic device

    Generación de una librería RVC – CAL para la etapa de determinación de endmembers en el proceso de análisis de imágenes hiperespectrales

    Get PDF
    El análisis de imágenes hiperespectrales permite obtener información con una gran resolución espectral: cientos de bandas repartidas desde el espectro infrarrojo hasta el ultravioleta. El uso de dichas imágenes está teniendo un gran impacto en el campo de la medicina y, en concreto, destaca su utilización en la detección de distintos tipos de cáncer. Dentro de este campo, uno de los principales problemas que existen actualmente es el análisis de dichas imágenes en tiempo real ya que, debido al gran volumen de datos que componen estas imágenes, la capacidad de cómputo requerida es muy elevada. Una de las principales líneas de investigación acerca de la reducción de dicho tiempo de procesado se basa en la idea de repartir su análisis en diversos núcleos trabajando en paralelo. En relación a esta línea de investigación, en el presente trabajo se desarrolla una librería para el lenguaje RVC – CAL – lenguaje que está especialmente pensado para aplicaciones multimedia y que permite realizar la paralelización de una manera intuitiva – donde se recogen las funciones necesarias para implementar dos de las cuatro fases propias del procesado espectral: reducción dimensional y extracción de endmembers. Cabe mencionar que este trabajo se complementa con el realizado por Raquel Lazcano en su Proyecto Fin de Grado, donde se desarrollan las funciones necesarias para completar las otras dos fases necesarias en la cadena de desmezclado. En concreto, este trabajo se encuentra dividido en varias partes. La primera de ellas expone razonadamente los motivos que han llevado a comenzar este Proyecto Fin de Grado y los objetivos que se pretenden conseguir con él. Tras esto, se hace un amplio estudio del estado del arte actual y, en él, se explican tanto las imágenes hiperespectrales como los medios y las plataformas que servirán para realizar la división en núcleos y detectar las distintas problemáticas con las que nos podamos encontrar al realizar dicha división. Una vez expuesta la base teórica, nos centraremos en la explicación del método seguido para componer la cadena de desmezclado y generar la librería; un punto importante en este apartado es la utilización de librerías especializadas en operaciones matriciales complejas, implementadas en C++. Tras explicar el método utilizado, se exponen los resultados obtenidos primero por etapas y, posteriormente, con la cadena de procesado completa, implementada en uno o varios núcleos. Por último, se aportan una serie de conclusiones obtenidas tras analizar los distintos algoritmos en cuanto a bondad de resultados, tiempos de procesado y consumo de recursos y se proponen una serie de posibles líneas de actuación futuras relacionadas con dichos resultados. ABSTRACT. Hyperspectral imaging allows us to collect high resolution spectral information: hundred of bands covering from infrared to ultraviolet spectrum. These images have had strong repercussions in the medical field; in particular, we must highlight its use in cancer detection. In this field, the main problem we have to deal with is the real time analysis, because these images have a great data volume and they require a high computational power. One of the main research lines that deals with this problem is related with the analysis of these images using several cores working at the same time. According to this investigation line, this document describes the development of a RVC – CAL library – this language has been widely used for working with multimedia applications and allows an optimized system parallelization –, which joins all the functions needed to implement two of the four stages of the hyperspectral imaging processing chain: dimensionality reduction and endmember extraction. This research is complemented with the research conducted by Raquel Lazcano in her Diploma Project, where she studies the other two stages of the processing chain. The document is divided in several chapters. The first of them introduces the motivation of the Diploma Project and the main objectives to achieve. After that, we study the state of the art of some technologies related with this work, like hyperspectral images and the software and hardware that we will use to parallelize the system and to analyze its performance. Once we have exposed the theoretical bases, we will explain the followed methodology to compose the processing chain and to generate the library; one of the most important issues in this chapter is the use of some C++ libraries specialized in complex matrix operations. At this point, we will expose the results obtained in the individual stage analysis and then, the results of the full processing chain implemented in one or several cores. Finally, we will extract some conclusions related with algorithm behavior, time processing and system performance. In the same way, we propose some future research lines according to the results obtained in this documen

    High-level synthesis of dataflow programs for heterogeneous platforms:design flow tools and design space exploration

    Get PDF
    The growing complexity of digital signal processing applications implemented in programmable logic and embedded processors make a compelling case the use of high-level methodologies for their design and implementation. Past research has shown that for complex systems, raising the level of abstraction does not necessarily come at a cost in terms of performance or resource requirements. As a matter of fact, high-level synthesis tools supporting such a high abstraction often rival and on occasion improve low-level design. In spite of these successes, high-level synthesis still relies on programs being written with the target and often the synthesis process, in mind. In other words, imperative languages such as C or C++, most used languages for high-level synthesis, are either modified or a constrained subset is used to make parallelism explicit. In addition, a proper behavioral description that permits the unification for hardware and software design is still an elusive goal for heterogeneous platforms. A promising behavioral description capable of expressing both sequential and parallel application is RVC-CAL. RVC-CAL is a dataflow programming language that permits design abstraction, modularity, and portability. The objective of this thesis is to provide a high-level synthesis solution for RVC-CAL dataflow programs and provide an RVC-CAL design flow for heterogeneous platforms. The main contributions of this thesis are: a high-level synthesis infrastructure that supports the full specification of RVC-CAL, an action selection strategy for supporting parallel read and writes of list of tokens in hardware synthesis, a dynamic fine-grain profiling for synthesized dataflow programs, an iterative design space exploration framework that permits the performance estimation, analysis, and optimization of heterogeneous platforms, and finally a clock gating strategy that reduces the dynamic power consumption. Experimental results on all stages of the provided design flow, demonstrate the capabilities of the tools for high-level synthesis, software hardware Co-Design, design space exploration, and power optimization for reconfigurable hardware. Consequently, this work proves the viability of complex systems design and implementation using dataflow programming, not only for system-level simulation but real heterogeneous implementations
    corecore