2 research outputs found

    Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework

    Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration of vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel DRL framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework resolves the problems of prior design approaches, which are either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs under various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with Monte Carlo tree search. Experimental results show that, compared with a conventional mesh, the proposed DRL routerless design achieves a 3.25x increase in throughput, a 1.6x reduction in packet latency, and a 5x reduction in power. Compared with the state-of-the-art routerless NoC, the DRL design achieves a 1.47x increase in throughput, a 1.18x reduction in packet latency, and a 1.14x reduction in average hop count, albeit with slightly higher power overhead. Comment: 13 pages, 15 figures
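    A minimal sketch of the tree-search skeleton the abstract describes, assuming a toy 4x4 grid, rectangular candidate loops, and a coverage-minus-wiring reward as a stand-in for the paper's latency, throughput, and power objectives. Random playouts replace the paper's deep neural network, and every name below (the constants, evaluate, the reward weights) is a hypothetical illustration, not the authors' implementation.

        # Plain UCT-based Monte Carlo tree search over sequential loop placement.
        # In the paper's framework, a policy/value network trained with parallel
        # threads would guide selection and replace the random rollout below.
        import math, random

        GRID, MAX_LOOPS = 4, 6        # hypothetical 4x4 NoC, at most 6 loops

        def candidate_loops():
            # all axis-aligned rectangles (x1, y1, x2, y2) on the grid
            return [(x1, y1, x2, y2)
                    for x1 in range(GRID) for y1 in range(GRID)
                    for x2 in range(x1 + 1, GRID) for y2 in range(y1 + 1, GRID)]

        ACTIONS = candidate_loops()

        def perimeter(loop):
            # nodes on the rectangular loop's boundary
            (x1, y1, x2, y2) = loop
            nodes = set()
            for x in range(x1, x2 + 1):
                nodes.update([(x, y1), (x, y2)])
            for y in range(y1, y2 + 1):
                nodes.update([(x1, y), (x2, y)])
            return nodes

        def evaluate(loops):
            # toy reward: node coverage minus a wiring-cost penalty,
            # standing in for the paper's real NoC objectives
            covered = set().union(*map(perimeter, loops))
            wiring = sum(len(perimeter(l)) for l in loops)
            return len(covered) / (GRID * GRID) - 0.005 * wiring

        class Node:
            def __init__(self, loops):
                self.loops, self.children = loops, {}
                self.visits, self.value = 0, 0.0
            def terminal(self):
                return len(self.loops) == MAX_LOOPS

        def select_action(node, c=1.4):
            # UCT: balance mean value (exploitation) against exploration
            untried = [a for a in ACTIONS if a not in node.children]
            if untried:
                return random.choice(untried), True
            def uct(a):
                child = node.children[a]
                return (child.value / child.visits
                        + c * math.sqrt(math.log(node.visits) / child.visits))
            return max(node.children, key=uct), False

        def rollout(loops):
            # random playout to a full design; a learned policy would go here
            loops = list(loops)
            while len(loops) < MAX_LOOPS:
                loops.append(random.choice(ACTIONS))
            return evaluate(loops)

        def mcts(root, iterations=2000):
            for _ in range(iterations):
                node, path = root, [root]
                while not node.terminal():
                    action, is_new = select_action(node)
                    if is_new:
                        node.children[action] = Node(node.loops + (action,))
                    node = node.children[action]
                    path.append(node)
                    if is_new:
                        break
                reward = rollout(node.loops)
                for n in path:            # backpropagate the playout reward
                    n.visits += 1
                    n.value += reward

        root = Node(())
        mcts(root)
        best = max(root.children.values(), key=lambda n: n.visits)
        print("first loop chosen:", best.loops[0],
              "est. value:", best.value / best.visits)

    The tree statistics here show only the bare UCT mechanics; the paper's contribution lies in replacing the random playouts and uniform expansion with a trained network explored by parallel threads.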

    Porting machine learning algorithms to vector-in-memory architecture

    Advisor: Prof. Dr. Marco Antonio Zanata Alves. Master's dissertation - Universidade Federal do Paraná, Setor de Ciências Exatas, Graduate Program in Informatics. Defense: Curitiba, 25/11/2020. Includes references: p. 60-68. Concentration area: Computer Science.

    Abstract: Machine Learning (ML) emerged around 1960, focusing on the capacity of computers to learn. Since then, it has become a handy tool for analyzing the vast amount of data currently generated in every field of science. For this purpose, several algorithms were created to analyze data samples, recognize patterns in them, and make predictions from them. Simultaneously, data movement inside computer systems has gained attention due to its high impact on execution time and energy consumption. In this context, Near-Data Processing (NDP) architectures emerged as a prominent solution for massive data processing by drastically reducing data movement. Besides the most common approaches to the problem, such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), there are also alternatives such as Application-Specific Integrated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs). These accelerators are all attractive options for executing ML algorithms. Nevertheless, they still face the Memory Wall, as they require off-chip data movement between the memory and the processing devices. Because NDP solutions attach the processing unit to the storage device, they mitigate the problems caused by data movement. This work evaluates whether it is possible to achieve high computational performance for ML algorithms using a general-purpose NDP architecture that operates on vector instructions. It presents an approach to executing inference kernels of the k-Nearest Neighbors (kNN), Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN) algorithms using the Vector-in-Memory Architecture (VIMA), an NDP architecture that allows data reuse and latency reduction. The idea is to port these ML algorithms with Intrinsics-VIMA, a library that emulates the VIMA Instruction Set Architecture (ISA), and to simulate the applications using a trace-driven simulator to evaluate their computational performance and energy consumption. The contributions of this work are: (i) a new Intrinsics library that easily emulates the VIMA ISA; (ii) insights on how to migrate ML algorithms using Intrinsics-VIMA; and (iii) an evaluation of the algorithms in the simulation environment, which indicates speedups of up to 10× for kNN, 11× for MLP, and 3× for convolution when executing near-data compared to a high-performance x86 baseline. Keywords: smart memories, near-data processing, machine learning, vector architecture
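    To make the porting idea concrete, here is a conceptual Python sketch of how the kNN distance kernel maps onto wide vector instructions of the kind VIMA executes near memory. NumPy operations stand in for the library's vector operations (the real Intrinsics-VIMA emulates the VIMA ISA; its actual interface is not reproduced here), and the 2048-element vector width, the helper names, and the data are illustrative assumptions, not the dissertation's code.

        # Conceptual mapping of a kNN distance kernel onto wide vector ops.
        # Each vec_* helper stands for one instruction operating on a full
        # near-memory vector; the width below is an assumption.
        import numpy as np

        VSIZE = 2048                          # assumed vector width (elements)

        def vec_sub(a, b):  return a - b      # one wide subtract near memory
        def vec_mul(a, b):  return a * b      # one wide multiply
        def vec_redsum(a):  return a.sum()    # one wide reduction

        def knn_distances(train, query):
            # pad the feature dimension to a multiple of the vector width,
            # so every operation consumes one full vector
            n, d = train.shape
            pad = (-d) % VSIZE
            train = np.pad(train, ((0, 0), (0, pad)))
            query = np.pad(query, (0, pad))
            dists = np.empty(n, dtype=np.float32)
            for i in range(n):
                acc = 0.0
                for j in range(0, d + pad, VSIZE):   # one wide op per chunk
                    diff = vec_sub(train[i, j:j + VSIZE], query[j:j + VSIZE])
                    acc += vec_redsum(vec_mul(diff, diff))
                dists[i] = acc                       # squared Euclidean distance
            return dists

        def knn_predict(train, labels, query, k=3):
            # majority vote among the k nearest training samples
            idx = np.argsort(knn_distances(train, query))[:k]
            vals, counts = np.unique(labels[idx], return_counts=True)
            return vals[np.argmax(counts)]

        # tiny usage example with random data
        rng = np.random.default_rng(0)
        X = rng.standard_normal((100, 64)).astype(np.float32)
        y = rng.integers(0, 3, 100)
        print(knn_predict(X, y, X[0]))  # usually recovers X[0]'s own label

    Because each inner-loop iteration becomes a handful of wide near-data instructions instead of thousands of scalar loads, the kernel avoids most off-chip traffic, which is the data-reuse and latency argument the abstract makes for VIMA.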