454 research outputs found

    XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes

    Get PDF
    Heavily quantized fixed-point arithmetic is becoming a common approach to deploy Convolutional Neural Networks (CNNs) on limited-memory low-power IoT end-nodes. However, this trend is narrowed by the lack of support for low-bitwidth in the arithmetic units of state-of-the-art embedded Microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation. Also, we propose a custom execution paradigm for SIMD sum-of-dot-product operations, which consists of fusing a dot product with a load operation, with an up to 1.64 × peak MAC/cycle improvement compared to a standard execution scenario. To further push the efficiency, we integrate the RISC-V extended core in a parallel cluster of 8 processors, with near-linear improvement with respect to a single core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6 × and 8 × faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster only supporting 8-bit SIMD instructions. With a peak of 2.22 TOPs/s/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU

    lLTZVisor: a lightweight TrustZone-assisted hypervisor for low-end ARM devices

    Get PDF
    Dissertação de mestrado em Engenharia Eletrónica Industrial e ComputadoresVirtualization is a well-established technology in the server and desktop space and has recently been spreading across different embedded industries. Facing multiple challenges derived by the advent of the Internet of Things (IoT) era, these industries are driven by an upgrowing interest in consolidating and isolating multiple environments with mixed-criticality features, to address the complex IoT application landscape. Even though this is true for majority mid- to high-end embedded applications, low-end systems still present little to no solutions proposed so far. TrustZone technology, designed by ARM to improve security on its processors, was adopted really well in the embedded market. As such, the research community became active in exploring other TrustZone’s capacities for isolation, like an alternative form of system virtualization. The lightweight TrustZone-assisted hypervisor (LTZVisor), that mainly targets the consolidation of mixed-criticality systems on the same hardware platform, is one design example that takes advantage of TrustZone technology for ARM application processors. With the recent introduction of this technology to the new generation of ARM microcontrollers, an opportunity to expand this breakthrough form of virtualization to low-end devices arose. This work proposes the development of the lLTZVisor hypervisor, a refactored LTZVisor version that aims to provide strong isolation on resource-constrained devices, while achieving a low-memory footprint, determinism and high efficiency. The key for this is to implement a minimal, reliable, secure and predictable virtualization layer, supported by the TrustZone technology present on the newest generation of ARM microcontrollers (Cortex-M23/33).Virtualização é uma tecnologia já bem estabelecida no âmbito de servidores e computadores pessoais que recentemente tem vindo a espalhar-se através de várias indústrias de sistemas embebidos. Face aos desafios provenientes do surgimento da era Internet of Things (IoT), estas indústrias são guiadas pelo crescimento do interesse em consolidar e isolar múltiplos sistemas com diferentes níveis de criticidade, para atender ao atual e complexo cenário aplicativo IoT. Apesar de isto se aplicar à maioria de aplicações embebidas de média e alta gama, sistemas de baixa gama apresentam-se ainda com poucas soluções propostas. A tecnologia TrustZone, desenvolvida pela ARM de forma a melhorar a segurança nos seus processadores, foi adoptada muito bem pelo mercado dos sistemas embebidos. Como tal, a comunidade científica começou a explorar outras aplicações da tecnologia TrustZone para isolamento, como uma forma alternativa de virtualização de sistemas. O "lightweight TrustZone-assisted hypervisor (LTZVisor)", que tem sobretudo como fim a consolidação de sistemas de criticidade mista na mesma plataforma de hardware, é um exemplo que tira vantagem da tecnologia TrustZone para os processadores ARM de alta gama. Com a recente introdução desta tecnologia para a nova geração de microcontroladores ARM, surgiu uma oportunidade para expandir esta forma inovadora de virtualização para dispositivos de baixa gama. Este trabalho propõe o desenvolvimento do hipervisor lLTZVisor, uma versão reestruturada do LTZVisor que visa em proporcionar um forte isolamento em dispositivos com recursos restritos, simultâneamente atingindo um baixo footprint de memória, determinismo e alta eficiência. A chave para isto está na implementação de uma camada de virtualização mínima, fiável, segura e previsível, potencializada pela tecnologia TrustZone presente na mais recente geração de microcontroladores ARM (Cortex-M23/33)

    Security for constrained IoT devices

    Get PDF
    Tese de mestrado, Segurança Informática, Universidade de Lisboa, Faculdade de Ciências, 2020In the recent past the Internet of Things has been the target of a great evolution, both in terms of applicability and of use. Society increasingly wants to use and massify the IoT to obtain information and act in the environment, for example, to remotely control an irrigation system. The reduction in the cost of devices and the constant evolution of personal mobile devices has largely contributed to their spread. However, its implementation is carried out in adverse environments and outside the typical information systems. The devices are, as a rule, limited in terms of resources, both computation and memory. The applicability to the IoT of the security techniques already known to conventional systems has therefore to be adapted, because it does not take into account the characteristics of the resources of the devices and require additional load when exchanging messages between these system elements. In addition, the development of applications is difficult because there is not yet developed tools and standards as there are for the traditional HTTPS or TLS when considering conventional systems. In this work, we intend to present a prototype of a low-cost solution (compared to existing equivalent solutions) that uses a secure communication channel based on standard protocols. An application is also developed based on technologies more familiar to programmers, similar to traditional Web development. We took into account the ”Green By Web” project as a case study. We have concluded that it is possible to have a secure communication, using UDP/DTLS over the CoAP protocol. With this approach we optimized the number of exchanged messages between the client and the server to be up to 8 times less and their size to be up to 10%, comparing against applications that use TCP/TLS connections, such as web applications that use HTTPS. This allows the energy spent by the low-cost components to be lower and increases their battery lifetime

    Custom optimization algorithms for efficient hardware implementation

    No full text
    The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.Open Acces

    A Study of Multiprocessor Systems using the Picoblaze 8-bit Microcontroller Implemented on Field Programmable Gate Arrays

    Get PDF
    As Field Programmable Gate Arrays (FPGAs) are becoming more capable of implementing complex logic circuits, designers are increasingly choosing them over traditional microprocessor-based systems for implementing digital controllers and digital signal processing applications. Indeed, as FPGAs are being built using state-of-the-art deep submicron CMOS processes, the increased amount of logic and memory resources allows such FPGA-based implementations to compete in terms of speed, complexity, and power dissipation with most custom-built chips, but at a fraction of the development costs. The modern FPGA is now capable of implementing multiple instances of configurable processors that are completely specified by a high-level descriptor language. Such arrays of soft processor cores have opened up new design possibilities that include complex embedded systems applications that were previously implemented by custom multiprocessor chips. As the FPGA-based multiprocessor system is completely configurable by the user, it can be optimized for speed and power dissipation to fit a given application. The goal of this thesis is to investigate design methods for implementing an array of soft processor cores using the Xilinx FPGA-based 8-bit microcontroller known as PicoBlaze. While development tools exist for the larger 32-bit processor from Xilinx known as MicroBlaze, no such resources are currently available for the PicoBlaze microcontroller. PicoBlaze benefits in applications that requires only less data bits (less than 8 bits). For example, consider the gene sequencing or DNA sequencing in which the processing requires only 2 to 5 bits. In such an application, PicoBlaze can be a simple processor to produce the results. Also, the PicoBlaze unit offers a finer level of granularity and hence consumes fewer resources than the larger 32-bit MicroBlaze processor. Hence, the former will find applications in embedded systems requiring a complex design to be partitioned over several processors but where only an 8-bit datapath is required

    Flexible Computing Systems For AI Acceleration At The Extreme Edge Of The IoT

    Get PDF
    Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information, directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits, driving a large research area (TinyML) to deploy leading Machine Learning (ML) algorithms on micro-controller class of devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats, in the integer domain. However, the current generation of micro-controller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions both at software and hardware levels, exploiting parallelism, heterogeneity and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvements in performance and energy efficiency, compared to current State-of-the-Art (SoA) STM32 micro-controller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain specific instruction set architecture (ISA) extensions to deal with sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32M4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference capabilities of SoA MobileNetV2 models, showing two orders of magnitude performance improvements over current SoA analog/digital solutions

    Efficient software implementation of elliptic curves and bilinear pairings

    Get PDF
    Orientador: Júlio César Lopez HernándezTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: O advento da criptografia assimétrica ou de chave pública possibilitou a aplicação de criptografia em novos cenários, como assinaturas digitais e comércio eletrônico, tornando-a componente vital para o fornecimento de confidencialidade e autenticação em meios de comunicação. Dentre os métodos mais eficientes de criptografia assimétrica, a criptografia de curvas elípticas destaca-se pelos baixos requisitos de armazenamento para chaves e custo computacional para execução. A descoberta relativamente recente da criptografia baseada em emparelhamentos bilineares sobre curvas elípticas permitiu ainda sua flexibilização e a construção de sistemas criptográficos com propriedades inovadoras, como sistemas baseados em identidades e suas variantes. Porém, o custo computacional de criptossistemas baseados em emparelhamentos ainda permanece significativamente maior do que os assimétricos tradicionais, representando um obstáculo para sua adoção, especialmente em dispositivos com recursos limitados. As contribuições deste trabalho objetivam aprimorar o desempenho de criptossistemas baseados em curvas elípticas e emparelhamentos bilineares e consistem em: (i) implementação eficiente de corpos binários em arquiteturas embutidas de 8 bits (microcontroladores presentes em sensores sem fio); (ii) formulação eficiente de aritmética em corpos binários para conjuntos vetoriais de arquiteturas de 64 bits e famílias mais recentes de processadores desktop dotadas de suporte nativo à multiplicação em corpos binários; (iii) técnicas para implementação serial e paralela de curvas elípticas binárias e emparelhamentos bilineares simétricos e assimétricos definidos sobre corpos primos ou binários. Estas contribuições permitiram obter significativos ganhos de desempenho e, conseqüentemente, uma série de recordes de velocidade para o cálculo de diversos algoritmos criptográficos relevantes em arquiteturas modernas que vão de sistemas embarcados de 8 bits a processadores com 8 coresAbstract: The development of asymmetric or public key cryptography made possible new applications of cryptography such as digital signatures and electronic commerce. Cryptography is now a vital component for providing confidentiality and authentication in communication infra-structures. Elliptic Curve Cryptography is among the most efficient public-key methods because of its low storage and computational requirements. The relatively recent advent of Pairing-Based Cryptography allowed the further construction of flexible and innovative cryptographic solutions like Identity-Based Cryptography and variants. However, the computational cost of pairing-based cryptosystems remains significantly higher than traditional public key cryptosystems and thus an important obstacle for adoption, specially in resource-constrained devices. The main contributions of this work aim to improve the performance of curve-based cryptosystems, consisting of: (i) efficient implementation of binary fields in 8-bit microcontrollers embedded in sensor network nodes; (ii) efficient formulation of binary field arithmetic in terms of vector instructions present in 64-bit architectures, and on the recently-introduced native support for binary field multiplication in the latest Intel microarchitecture families; (iii) techniques for serial and parallel implementation of binary elliptic curves and symmetric and asymmetric pairings defined over prime and binary fields. These contributions produced important performance improvements and, consequently, several speed records for computing relevant cryptographic algorithms in modern computer architectures ranging from embedded 8-bit microcontrollers to 8-core processorsDoutoradoCiência da ComputaçãoDoutor em Ciência da Computaçã

    Efficient and Side-Channel Resistant Implementations of Next-Generation Cryptography

    Get PDF
    The rapid development of emerging information technologies, such as quantum computing and the Internet of Things (IoT), will have or have already had a huge impact on the world. These technologies can not only improve industrial productivity but they could also bring more convenience to people’s daily lives. However, these techniques have “side effects” in the world of cryptography – they pose new difficulties and challenges from theory to practice. Specifically, when quantum computing capability (i.e., logical qubits) reaches a certain level, Shor’s algorithm will be able to break almost all public-key cryptosystems currently in use. On the other hand, a great number of devices deployed in IoT environments have very constrained computing and storage resources, so the current widely-used cryptographic algorithms may not run efficiently on those devices. A new generation of cryptography has thus emerged, including Post-Quantum Cryptography (PQC), which remains secure under both classical and quantum attacks, and LightWeight Cryptography (LWC), which is tailored for resource-constrained devices. Research on next-generation cryptography is of importance and utmost urgency, and the US National Institute of Standards and Technology in particular has initiated the standardization process for PQC and LWC in 2016 and in 2018 respectively. Since next-generation cryptography is in a premature state and has developed rapidly in recent years, its theoretical security and practical deployment are not very well explored and are in significant need of evaluation. This thesis aims to look into the engineering aspects of next-generation cryptography, i.e., the problems concerning implementation efficiency (e.g., execution time and memory consumption) and security (e.g., countermeasures against timing attacks and power side-channel attacks). In more detail, we first explore efficient software implementation approaches for lattice-based PQC on constrained devices. Then, we study how to speed up isogeny-based PQC on modern high-performance processors especially by using their powerful vector units. Moreover, we research how to design sophisticated yet low-area instruction set extensions to further accelerate software implementations of LWC and long-integer-arithmetic-based PQC. Finally, to address the threats from potential power side-channel attacks, we present a concept of using special leakage-aware instructions to eliminate overwriting leakage for masked software implementations (of next-generation cryptography)

    Design, Cryptanalysis and Protection of Symmetric Encryption Algorithms

    Get PDF
    This thesis covers results from several areas related to symmetric cryptography, secure and efficient implementation and is divided into four main parts: In Part II, Benchmarking of AEAD, two articles will be presented, showing the results of the FELICS framework for Authenticated encryption algorithms, and multiarchitecture benchmarking of permutations used as construction block of AEAD algorithms. The Sparkle family of Hash and AEAD algorithms will be shown in Part III. Sparkle is currently a finalist of the NIST call for standardization of lightweight hash and AEAD algorithms. In Part IV, Cryptanalysis of ARX ciphers, it is discussed two cryptanalysis techniques based on differential trails, applied to ARX ciphers. The first technique, called Meet-in-the-Filter uses an offline trail record, combined with a fixed trail and a reverse differential search to propose long differential trails that are useful for key recovery. The second technique is an extension of ARX analyzing tools, that can automate the generation of truncated trails from existing non-truncated ones, and compute the exact probability of those truncated trails. In Part V, Masked AES for Microcontrollers, is shown a new method to efficiently compute a side-channel protected AES, based on the masking scheme described by Rivain and Prouff. This method introduces table and execution-order optimizations, as well as practical security proofs
    corecore