4 research outputs found

    An exploration of associative processing with the RV-Across simulator

    Advisor: Lucas Francisco Wanner. Dissertation (Master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Many works have pointed to a performance bottleneck between Processor and Memory.
This bottleneck stands out when running applications, such as Machine Learning, that process large quantities of data. For these applications, data movement accounts for a significant fraction of processing time and energy consumption. New multi-core architectures, accelerators, and Graphics Processing Units (GPUs) can improve the performance of these applications through parallel processing. However, these architectures do not eliminate the need to move data, which still travels through the levels of a memory hierarchy to be processed. Our work explores Processing in Memory (PIM), and in particular Associative Processing, as an alternative for accelerating applications by processing their data in parallel inside the memory, allowing for better system performance and energy savings. Associative Processing provides high-performance, energy-efficient parallel computation using a Content-Addressable Memory (CAM). A CAM supports parallel comparison and writing, and by augmenting it with special control registers and Lookup Tables it is possible to perform operations between vectors of data in a small, constant number of cycles per operation. In our work, we analyze the potential of Associative Processing in terms of execution time and energy consumption on different application kernels. To this end we developed RV-Across, a RISC-V-based Associative Processing simulator for testing, validating, and modeling associative operations. The simulator eases the design of associative and near-memory processing architectures by offering interfaces both for building new operations and for high-level experimentation. We created an architectural model with associative processing for the simulator and evaluated it against CPU-only and multi-core alternatives. The simulator includes latency and energy models based on data from the literature to support evaluation and comparison. We apply these models to compare different scenarios, varying the input characteristics and the size of the Associative Processor across applications. Our results highlight the direct relation between data size and the potential performance and energy improvement of associative processing. For 2D convolution, the Associative Processing model obtained a relative gain of 2x in latency, 2x in energy, and 13x in the number of load/store operations. For matrix multiplication, the speed-up grows linearly with the input dimensions, reaching 8x for 200x200-byte matrices and outperforming parallel execution on an 8-core CPU. The advantages of associative processing shown in these results point to a practical alternative for systems that must balance processing and energy expenditure, such as embedded devices. Finally, the simulation and evaluation environment we have built can enable further exploration of this alternative across different usage scenarios and applications. Master's degree in Computer Science (Mestre em Ciência da Computação). Funding: CAPES, code 001.
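
The abstract describes element-wise vector operations carried out through CAM compare and write passes driven by a Lookup Table. The sketch below is a minimal software illustration of that bit-serial pattern for vector addition; the names and structure are hypothetical and do not reflect the actual RV-Across interfaces.

    # Bit-serial associative vector addition over a CAM, driven by a
    # full-adder lookup table. Illustrative only; not the RV-Across API.

    NUM_BITS = 8  # word width of each CAM row

    # Full-adder truth table: (bit_a, bit_b, carry_in) -> (sum_bit, carry_out)
    FULL_ADDER_LUT = {
        (0, 0, 0): (0, 0), (0, 0, 1): (1, 0),
        (0, 1, 0): (1, 0), (0, 1, 1): (0, 1),
        (1, 0, 0): (1, 0), (1, 0, 1): (0, 1),
        (1, 1, 0): (0, 1), (1, 1, 1): (1, 1),
    }

    def associative_add(vec_a, vec_b):
        """Element-wise add of two equal-length vectors of 8-bit values."""
        rows = len(vec_a)
        result = [0] * rows
        carry = [0] * rows
        for bit in range(NUM_BITS):
            carry_in = carry[:]  # snapshot so writes in this bit don't re-match
            for key, (s_bit, c_out) in FULL_ADDER_LUT.items():
                # Compare pass: in a real CAM every row is searched in parallel
                # for the key pattern (a_bit, b_bit, carry_in).
                selected = [r for r in range(rows)
                            if ((vec_a[r] >> bit) & 1,
                                (vec_b[r] >> bit) & 1,
                                carry_in[r]) == key]
                # Write pass: the sum and carry bits are written into every
                # selected row in parallel.
                for r in selected:
                    result[r] |= s_bit << bit
                    carry[r] = c_out
        return result

    print(associative_add([3, 100, 250], [4, 27, 5]))  # -> [7, 127, 255]

Because every row is matched and written in parallel in a real CAM, the cycle count per element-wise operation depends only on the word width and the size of the lookup table, not on the vector length, which is what gives associative processing its advantage on large inputs.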

    Design and synthesis methodology on reconfigurable hardware for real-time image processing architectures

    The design of embedded computer systems is a field of research with enormous growth potential. There is an increasing need for systems capable of identifying or recognizing people, animals, or objects in large images, and, in the case of portable systems, of doing so with the lowest possible power consumption.
However, some of these algorithms are complex and their integration in hardware is not straightforward, since they first require a simplification that adapts them to the characteristics and limitations of the hardware resources of the electronic technology on which they will be implemented. Moreover, the design of electronic computer vision systems increasingly requires engineers qualified both in electronic design and in artificial intelligence, and more specifically in deep learning neural networks. During the last decade we have witnessed the expansion of this type of network, which has begun to replace classical feature extraction techniques in some applications. This is due in part to the flexibility these networks offer for solving computer vision problems in applications as diverse as robotics, medicine, and autonomous transportation systems, in many cases with an efficiency superior to that of classical identification or pattern recognition algorithms. Alongside this expansion, an ecosystem of programming languages and libraries has been developed to facilitate and accelerate the design of such networks and their application to very large image databases. However, turning the designs obtained with these tools into electronic designs is not trivial; it depends on the engineer's experience and, above all, on a long design process. This PhD Thesis studies and proposes digital hardware architectures for computer vision tasks. Specifically, the Thesis focuses on the register-level design both of modules used in feature extraction algorithms and of modules used to build deep learning neural networks for object recognition and identification. For feature extraction, after a literature review of the most widely used algorithms, this Thesis takes as a reference the SIFT algorithm, based on the extraction and labeling of characteristic points of an image, whose first stage is shared with other algorithms of this kind such as SURF. The hardware implementations of this algorithm proposed in the scientific literature tend to omit, because of their complexity, two stages that increase the number of characteristic points and the accuracy of their localization in the image. In this work, the hardware implementation of these stages is proposed as a hypothesis for improving the performance of digital implementations of the algorithm. With respect to neural networks, this Thesis proposes the use of an automatic synthesis software tool capable of generating register-level hardware designs from high-level descriptions of deep learning neural networks. The hypothesis is that this tool can generalize and lower the cost of implementing such networks on embedded devices by reducing design time and easing implementation. The main results obtained during this PhD Thesis are the following: i) the development and register-level implementation of two of the hardware modules used in the first stage of the SIFT algorithm. On the one hand, a real-time image scaling module was developed for use both in the first stage of the SIFT algorithm and in any other algorithm that requires this operation.
On the other hand, a more complex module was created to perform the sub-pixel refinement that increases the accuracy of the localization of the characteristic points of an image. Both modules were designed at the register level in VHDL and synthesized on an FPGA; ii) the creation of a library of hardware modules for building computer vision systems based on deep learning neural networks. These modules are parameterizable, not only so that they can be used in neural networks of different topologies, but also to give the designer the ability to trade off the latency of the digital system against the amount of hardware resources required for a given target technology; iii) the integration of this library of modules into a software tool capable of translating high-level descriptions of neural networks into register-level designs suitable for synthesis on an FPGA or an ASIC. This tool was complemented with another dedicated to optimizing the quantization of the neural network parameters prior to synthesis, an essential step for reducing the size of the network and the enormous amount of memory and processing resources required by this type of algorithm. The work carried out in this Thesis on hardware implementation techniques for complex vision algorithms lays the foundations for further work on so-called edge computing, in which the data captured by a sensor is processed in the sensor itself or on an attached device whose computational capacity is several orders of magnitude lower than that of a central server. Along these lines, the results obtained are equally applicable to the development of new vision devices implemented on embedded hardware or on portable devices that need processing algorithms adapted to their characteristic requirements of small size and low energy consumption. Finally, the contribution in the field of electronic design automation opens the way to further development of CAD tools that reduce the development cycle time of a product, making it more competitive. Escuela Internacional de Doctorado de la Universidad Politécnica de Cartagena. Universidad Politécnica de Cartagena. Programa de Doctorado en Tecnologías de la Información y las Comunicaciones.
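
The sub-pixel refinement module mentioned in result i) addresses the localization step that SIFT-style detectors use to place keypoints between pixel samples. The abstract does not spell out the formulation, so the sketch below shows the standard software version (Lowe's quadratic fit of the Difference-of-Gaussians), not the RTL module developed in the thesis.

    import numpy as np

    def subpixel_offset(dog, x, y, s):
        """Return the (dx, dy, ds) offset that refines keypoint (x, y, s).

        `dog` is a 3-D Difference-of-Gaussians array indexed as
        dog[scale, row, col]. The offset solves H @ offset = -g, where g and H
        are the gradient and Hessian of the DoG estimated with central finite
        differences at the candidate extremum.
        """
        d = dog.astype(np.float64)
        # First derivatives (central differences).
        g = np.array([
            (d[s, y, x + 1] - d[s, y, x - 1]) / 2.0,   # dD/dx
            (d[s, y + 1, x] - d[s, y - 1, x]) / 2.0,   # dD/dy
            (d[s + 1, y, x] - d[s - 1, y, x]) / 2.0,   # dD/ds
        ])
        # Second derivatives (Hessian).
        dxx = d[s, y, x + 1] - 2 * d[s, y, x] + d[s, y, x - 1]
        dyy = d[s, y + 1, x] - 2 * d[s, y, x] + d[s, y - 1, x]
        dss = d[s + 1, y, x] - 2 * d[s, y, x] + d[s - 1, y, x]
        dxy = (d[s, y + 1, x + 1] - d[s, y + 1, x - 1]
               - d[s, y - 1, x + 1] + d[s, y - 1, x - 1]) / 4.0
        dxs = (d[s + 1, y, x + 1] - d[s + 1, y, x - 1]
               - d[s - 1, y, x + 1] + d[s - 1, y, x - 1]) / 4.0
        dys = (d[s + 1, y + 1, x] - d[s + 1, y - 1, x]
               - d[s - 1, y + 1, x] + d[s - 1, y - 1, x]) / 4.0
        hessian = np.array([[dxx, dxy, dxs],
                            [dxy, dyy, dys],
                            [dxs, dys, dss]])
        # A keypoint is usually discarded if any offset component exceeds 0.5,
        # since the true extremum then lies closer to a neighbouring sample.
        return -np.linalg.solve(hessian, g)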
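
Result iii) also mentions a companion tool that optimizes the quantization of network parameters before synthesis. As a rough illustration of what such a pre-synthesis step does, the sketch below applies plain uniform fixed-point quantization to a weight tensor; it is a generic example, not the tool described in the thesis.

    import numpy as np

    def quantize_weights(weights, num_bits=8):
        """Map float weights to signed integers plus a per-tensor scale.

        Returns (q, scale) such that q * scale approximates `weights`, with q
        representable in `num_bits` bits -- the form a fixed-point multiplier
        on an FPGA or ASIC can consume directly.
        """
        qmax = 2 ** (num_bits - 1) - 1
        max_abs = float(np.max(np.abs(weights)))
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_weights(w, num_bits=8)
    print("max abs quantization error:", np.max(np.abs(w - q * scale)))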

    A Classification of Memory-Centric Computing

    Technological and architectural improvements have constantly been required to sustain the demand for faster and cheaper computers. However, CMOS down-scaling is suffering from three technology walls: the leakage wall, the reliability wall, and the cost wall. On top of that, performance increases due to architectural improvements are also gradually saturating because of three well-known architecture walls: the memory wall, the power wall, and the instruction-level parallelism (ILP) wall. Hence, a lot of research is focusing on proposing and developing new technologies and architectures. In this article, we present a comprehensive classification of memory-centric computing architectures based on three metrics: computation location, level of parallelism, and memory technology used. The classification not only provides an overview of existing architectures with their pros and cons, but also unifies the terminology that uniquely identifies these architectures and highlights potential future architectures that can be further explored. Hence, it sets a direction for future research in the field. Computer Engineering
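
As a reading aid, the snippet below encodes the three classification metrics named in the abstract as fields of a record; the example values are illustrative placeholders, not the category names defined in the article.

    from dataclasses import dataclass

    @dataclass
    class MemoryCentricArchitecture:
        """One entry in a three-metric classification of memory-centric computing."""
        name: str
        computation_location: str   # e.g. "in the memory array" vs. "near memory"
        level_of_parallelism: str   # e.g. "bit-level", "word-level", "bank-level"
        memory_technology: str      # e.g. "SRAM", "DRAM", "ReRAM"

    example = MemoryCentricArchitecture(
        name="hypothetical CAM-based associative processor",
        computation_location="in the memory array",
        level_of_parallelism="bit-level",
        memory_technology="SRAM",
    )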

    A Classification of Memory-Centric Computing

    status: published