575 research outputs found

    Adaptable Security in Wireless Sensor Networks by Using Reconfigurable ECC Hardware Coprocessors

    Specific features of Wireless Sensor Networks (WSNs), such as open access to nodes and the easy observability of radio communications, lead to severe security challenges. The application of traditional security schemes on sensor nodes is limited by the restricted computation capability, low power availability, and inherently low data rate. To avoid depending on a compromised level of security, a WSN node with a microcontroller and a Field Programmable Gate Array (FPGA) is used throughout this work to implement a state-of-the-art solution based on Elliptic Curve Cryptography (ECC). This paper describes how the reconfiguration capabilities of the system can be used to adapt ECC parameters in order to raise or lower the security level depending on the application scenario or the energy budget. Two setups have been created to compare the software- and hardware-supported approaches. According to the results, the FPGA-based ECC implementation requires three orders of magnitude less energy than a low-power microcontroller implementation, even considering the power consumption overhead introduced by the hardware reconfiguration.

    Low power architectures for streaming applications

    The Chameleon project in retrospective

    In this paper we describe in retrospect the main results of a four-year project called Chameleon. As part of this project we developed a coarse-grained reconfigurable core for DSP algorithms in wireless devices, denoted MONTIUM. After presenting the main achievements of the project, we present the lessons learned from it.

    A hardware mechanism to reduce the energy consumption of the register file of in-order architectures

    This paper introduces an efficient hardware approach that reduces register file energy consumption by putting unused registers into a low-power state. Bypassing the register fields of the fetched instruction to the decode stage allows the registers required by the current instruction to be identified (instruction predecode) and allows the control logic to turn them back on. They are returned to the low-power state after the instruction has used them. This technique achieves an 85% energy reduction with no performance penalty.

    Lessons Learned from Designing the Montium - a Coarse-Grained Reconfigurable Processing Tile

    In this paper we describe in retrospect the main results of a four-year project called Chameleon. As part of this project we developed a coarse-grained reconfigurable core for DSP algorithms in wireless devices, denoted MONTIUM. After presenting the main achievements of the project, we present the lessons learned from it.

    Domain-specific and reconfigurable instruction cells based architectures for low-power SoC

    Efficient Interconnection Network Design for Heterogeneous Architectures

    The onset of big data and deep learning applications, mixed with conventional general-purpose programs, has driven computer architecture to embrace heterogeneity with specialization. With ever more interconnected chip components, future architectures are required to operate under a stricter power budget and to process emerging big data applications efficiently. The interconnection network, as the communication backbone, thus faces the grand challenges of a limited power envelope, data movement, and performance scaling. This dissertation provides interconnect solutions that are specialized to application requirements, aiming at power- and energy-efficient, high-performance computing for heterogeneous architectures. It first examines the challenges of network-on-chip router power-gating techniques that save static power under general-purpose workloads. A voting approach is proposed as an adaptive power-gating policy that considers both local and global traffic status through router voting. In addition, low-latency routing algorithms are designed to guarantee performance in irregular power-gated networks. This holistic solution not only saves power but also avoids performance overhead. The research also introduces emerging computation paradigms to interconnects for big data applications in order to mitigate the pressure of data movement. An approximate network-on-chip is proposed to achieve high-throughput communication by means of lossy compression. Near-data processing is then combined with in-network computing to further improve performance while reducing data movement. The two schemes are general enough to serve as plug-ins for different network topologies and routing algorithms. To tackle the challenging computational requirements of deep learning workloads, the dissertation investigates the opportunities of communication algorithm-architecture co-design to accelerate distributed deep learning. A MultiTree allreduce algorithm is proposed that couples message scheduling with the network topology to achieve faster, contention-free communication. In addition, the interconnect hardware and flow control are specialized to exploit deep learning communication characteristics and fulfill the algorithm's needs, thereby effectively improving performance and scalability. By considering application and algorithm characteristics, this research shows that the interconnection network can be tailored to improve power and energy efficiency and performance, satisfying heterogeneous computation and communication requirements.

    Field-Configurable GPU

    This dissertation aims to develop a dedicated processing architecture for accelerating specific applications, inspired by the structure of GPU-type processing units. The processing unit should be programmable and configurable to the requirements of specific applications, and adapted to the types and quantity of logic resources available in a selected FPGA device. The accelerator is intended to take full advantage of the resources available in a given FPGA device (memory, arithmetic units, logic resources) in order to maximize the performance of selected applications. Target applications in the domains of image processing and machine learning will be considered. Once a base architecture has been selected, specialization for an application (or class of applications) will be based on the configuration of three fundamental components: the organization of the distributed memory system (built from the FPGA's internal RAM blocks), the organization of the arithmetic processing units (which may be heterogeneous), and the width of the data paths. The system is to be designed at the RTL level, in Verilog, and to include an automated process for customizing the accelerator from a set of specifications defined according to the characteristics of the target application. This customization process may be based on the definition of Verilog parameters, or may also rely on dedicated applications, to be developed, that generate Verilog code directly. A basic set of support tools should also be developed, namely for generating the code to be executed by the processor. As final validation, the accelerator is to be integrated and demonstrated in a real-time image processing system.

    Maximizing resource utilization by slicing of superscalar architecture

    Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors make use of several processing resources to achieve this kind of throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to the execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage. This thesis proposes a dynamic scheme that increases the efficiency of the execution stage through a methodology called block slicing. Implementing this concept in a wide, superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required to implement the proposed scheme is designed and assessed in terms of cost and delay. Performance measures of speed-up, throughput, and efficiency have been evaluated and analyzed for the resulting pipeline.