11 research outputs found

    New Logic-In-Memory Paradigms: An Architectural and Technological Perspective

    Get PDF
    Processing systems are in continuous evolution thanks to the constant technological advancement and architectural progress. Over the years, computing systems have become more and more powerful, providing support for applications, such as Machine Learning, that require high computational power. However, the growing complexity of modern computing units and applications has had a strong impact on power consumption. In addition, the memory plays a key role on the overall power consumption of the system, especially when considering data-intensive applications. These applications, in fact, require a lot of data movement between the memory and the computing unit. The consequence is twofold: Memory accesses are expensive in terms of energy and a lot of time is wasted in accessing the memory, rather than processing, because of the performance gap that exists between memories and processing units. This gap is known as the memory wall or the von Neumann bottleneck and is due to the different rate of progress between complementary metal-oxide semiconductor (CMOS) technology and memories. However, CMOS scaling is also reaching a limit where it would not be possible to make further progress. This work addresses all these problems from an architectural and technological point of view by: (1) Proposing a novel Configurable Logic-in-Memory Architecture that exploits the in-memory computing paradigm to reduce the memory wall problem while also providing high performance thanks to its flexibility and parallelism; (2) exploring a non-CMOS technology as possible candidate technology for the Logic-in-Memory paradigm

    Automated Synthesis of Memristor Crossbar Networks

    Get PDF
    The advancement of semiconductor device technology over the past decades has enabled the design of increasingly complex electrical and computational machines. Electronic design automation (EDA) has played a significant role in the design and implementation of transistor-based machines. However, as transistors move closer toward their physical limits, the speed-up provided by Moore\u27s law will grind to a halt. Once again, we find ourselves on the verge of a paradigm shift in the computational sciences as newer devices pave the way for novel approaches to computing. One of such devices is the memristor -- a resistor with non-volatile memory. Memristors can be used as junctional switches in crossbar circuits, which comprise of intersecting sets of vertical and horizontal nanowires. The major contribution of this dissertation lies in automating the design of such crossbar circuits -- doing a new kind of EDA for a new kind of computational machinery. In general, this dissertation attempts to answer the following questions: a. How can we synthesize crossbars for computing large Boolean formulas, up to 128-bit? b. How can we synthesize more compact crossbars for small Boolean formulas, up to 8-bit? c. For a given loop-free C program doing integer arithmetic, is it possible to synthesize an equivalent crossbar circuit? We have presented novel solutions to each of the above problems. Our new, proposed solutions resolve a number of significant bottlenecks in existing research, via the usage of innovative logic representation and artificial intelligence techniques. For large Boolean formulas (up to 128-bit), we have utilized Reduced Ordered Binary Decision Diagrams (ROBDDs) to automatically synthesize linearly growing crossbar circuits that compute them. This cutting edge approach towards flow-based computing has yielded state-of-the-art results. It is worth noting that this approach is scalable to n-bit Boolean formulas. We have made significant original contributions by leveraging artificial intelligence for automatic synthesis of compact crossbar circuits. This inventive method has been expanded to encompass crossbar networks with 1D1M (1-diode-1-memristor) switches, as well. The resultant circuits satisfy the tight constraints of the Feynman Grand Prize challenge and are able to perform 8-bit binary addition. A leading edge development for end-to-end computation with flow-based crossbars has been implemented, which involves methodical translation of loop-free C programs into crossbar circuits via automated synthesis. The original contributions described in this dissertation reflect the substantial progress we have made in the area of electronic design automation for synthesis of memristor crossbar networks

    Exploring New Computing Paradigms for Data-Intensive Applications

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Uma exploração de processamento associativo com o simulador RV-Across

    Get PDF
    Orientador: Lucas Francisco WannerDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Trabalhos na academia apontam para um gargalo de desempenho entre o processador e a memória. Esse gargalo se destaca na execução de aplicativos, como o Aprendizado de Máquina, que processam uma grande quantidade de dados. Nessas aplicações, a movimentação de dados representa uma parcela significativa tanto em termos de tempo de processamento quanto de consumo de energia. O uso de novas arquiteturas multi-core, aceleradores e Unidades de Processamento Gráfico (Graphics Processing Unit --- GPU) pode melhorar o desempenho desses aplicativos por meio do processamento paralelo. No entanto, a utilização dessas arquiteturas não elimina a necessidade de mover dados, que passam por diferentes níveis de uma hierarquia de memória para serem processados. Nosso trabalho explora o Processamento em Memória (Processing in Memory --- PIM), especificamente o Processamento Associativo, como alternativa para acelerar as aplicações, processando seus dados em paralelo na memória permitindo melhor desempenho do sistema e economia de energia. O Processamento Associativo fornece computação paralela de alto desempenho e com baixo consumo de energia usando uma Memória endereçável por conteúdo (Content-Adressable Memory --- CAM). Através do poder de comparação e escrita em paralelo do CAM, complementado por registradores especiais de controle e tabelas de consulta (Lookup Tables), é possível realizar operações entre vetores de dados utilizando um número pequeno e constante de ciclos por operação. Em nosso trabalho, analisamos o potencial do Processamento Associativo em termos de tempo de execução e consumo de energia em diferentes kernels de aplicações. Para isso, desenvolvemos o RV-Across, um simulador de Processamento Associativo baseado em RISC-V para teste, validação e modelagem de operações associativas. O simulador facilita o projeto de arquiteturas de processamento associativo e próximo à memória, oferecendo interfaces tanto para a construção de novas operações quanto para experimentação de alto nível. Criamos um modelo de arquitetura para o simulador com processamento associativo e o comparamos este modelo com os alternativas baseadas em CPU e multi-core. Para avaliação de desempenho, construímos um modelo de latência e energia fundamentado em dados da literatura. Aplicamos o modelo para comparar diferentes cenários, alterando características das entradas e o tamanho do Processador Associativo nas aplicações. Nossos resultados destacam a relação direta entre o tamanho dos dados e a melhoria potencial de desempenho do processamento associativo. Para a convolução 2D, o modelo de Processamento Associativo obteve um ganho relativo de 2x em latência, 2x em consumo de energia, e 13x no número de operações de load/store. Na multiplicação de matrizes, a aceleração aumenta linearmente com a dimensão das matrizes, atingindo 8x para matrizes de 200x200 bytes e superando a execução paralela em uma CPU de 8 núcleos. As vantagens do Processamento associativo evidenciadas nos resultados revelam uma alternativa para sistemas que necessitam manter um equilíbrio entre processamento e gasto energético, como os dispositivos embarcados. Finalmente, o ambiente de simulação e avaliação que construímos pode habilitar mais exploração dessa alternativa em diferentes aplicações e cenários de usoAbstract: Many works have pointed to a performance bottleneck between Processor and Memory. This bottleneck stands out when running applications, such as Machine Learning, which process large quantities of data. For these applications, the movement of data represents a significant fraction of processing time and energy consumption. The use of new multi-core architectures, accelerators, and Graphic Processing Units (GPU) can improve the performance of these applications through parallel processing. However, utilizing these architectures does not eliminate the need to move data, which are transported through different levels of a memory hierarchy to be processed. Our work explores Processing in Memory (PIM), and in particular Associative Processing, as an alternative to accelerate applications, by processing data in parallel in memory, thus allowing for better system performance and energy savings. Associative Processing provides high-performance and energy-efficient parallel computation using a Content-Addressable Memory (CAM). CAM provides parallel comparison and writing, and by augmenting a CAM with special control registers and Lookup Tables, it is possible to perform computation between vectors of data with a small and constant number of cycles per operation. In our work, we analyze the potential of Associative Processing in terms of execution time and energy consumption in different application kernels. To achieve this goal we developed RV-Across, an Associative Processing Simulator based on RISC-V for testing, validation, and modeling associative operations. The simulator eases the design of associative and near-memory processing architectures by offering interfaces to both building new operations and performing high-level experimentation. We created an architectural model for the simulator with associative processing and evaluated it by comparing it with the CPU-only and multi-core models. The simulator includes latency and energy models based on data from the literature to allow for evaluation and comparison. We apply the models to compare different scenarios, changing the input and size of the Associative Processor in the applications. Our results highlight the direct relation between data length and potential performance and energy improvement of associative processing. For 2D convolution, the Associative Processing model obtained a relative gain of 2x in latency, 2x in energy, and 13x in the number of load/store operations. For matrix multiplication, the speed-up increases linearly with input dimensions, achieving 8x for 200x200 bytes matrices and outperforming parallel execution in an 8-core CPU. The advantages of associative processing shown in the results are indicative of a real alternative for systems that need to maintain a balance between processing and energy expenditure, such as embedded devices. Finally, the simulation and evaluation environment we have built can enable further exploration of this alternative for different usage scenarios and applicationsMestradoCiência da ComputaçãoMestre em Ciência da Computação001CAPE

    Accélérateurs Matériels pour l'Intelligence Artificielle. Etude de Cas : Voitures Autonomes

    No full text
    Since the early days of the DARPA challenge, the design of self-driving cars is catching increasing interest. This interest is growing even more with the recent successes of Machine Learning algorithms in perception tasks. While the accuracy of thesealgorithms is irreplaceable, it is very challenging to harness their potential. Realtime constraints as well as reliability issues heighten the burden of designing efficient platforms.We discuss the different implementations and optimization techniques in this work. We tackle the problem of these accelerators from two perspectives: performance and reliability. We propose two acceleration techniques that optimize time and resource usage. On reliability, we study the resilience of Machine Learning algorithms. We propose a tool that gives insights whether these algorithms are reliable enough forsafety critical systems or not. The Resistive Associative Processor accelerator achieves high performance due to its in-memory design which remedies the memory bottleneck present in most Machine Learning algorithms. As for the constant multiplication approach, we opened the door for a new category of optimizations by designing instance specific accelerators. The obtained results outperforms the most recent techniques in terms of execution time and resource usage. Combined with the reliability study we conducted, safety-critical systems can profit from these accelerators without compromising its security.Depuis les débbuts du défi DARPA, la conception de voitures autonomes suscite unintérêt croissant. Cet intéreêt grandit encore plus avec les récents succès des algorithmesd'apprentissage automatique dans les tâches de perception. Bien que laprécision de ces algorithmes soit irremplaçable, il est très diffcile d'exploiter leurpotentiel. Les contraintes en temps réel ainsi que les problèmes de fiabilité alourdissentle fardeau de la conception de plateformes matériels efficaces. Nous discutonsles différentes implémentations et techniques d'optimisation de ces plateformes dansce travail. Nous abordons le problème de ces accélérateurs sous deux perspectives: performances et fiabilité. Nous proposons deux techniques d'accélération qui optimisentl'utilisation du temps et des ressources. Sur le voletfiabilité, nous étudionsla résilience des algorithmes de Machine Learning face aux fautes matérielles. Nousproposons un outil qui indique si ces algorithmes sont suffisamment fiables pour êtreemployés dans des systèmes critiques avec de fortes critères sécuritaire ou non. Unaccélérateur sur processeur associatif résistif est présenté. Cet accélérateur atteint desperformances élevées en raison de sa conception en mémoire qui remédie au goulotd'étranglement de la mémoire présent dans la plupart des algorithmes d'apprentissageautomatique. Quant à l'approche de multiplication constante, nous avons ouvertla porte à une nouvelle catégorie d'optimisations en concevant des accélérateursspécifiques aux instances. Les résultats obtenus surpassent les techniques les plusrécentes en termes de temps d'exécution et d'utilisation des ressources. Combinés àl'étude de fiabilité que nous avons menée, les systèmes ou la sécurité est de prioritépeuvent profiter de ces accélérateurs sans compromettre cette dernière
    corecore