XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes
Heavily quantized fixed-point arithmetic is becoming a common approach to deploy Convolutional Neural Networks (CNNs) on limited-memory low-power IoT end-nodes. However, this trend is limited by the lack of support for low-bitwidth operands in the arithmetic units of state-of-the-art embedded Microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation. Also, we propose a custom execution paradigm for SIMD sum-of-dot-product operations, which consists of fusing a dot product with a load operation, with an up to 1.64× peak MAC/cycle improvement compared to a standard execution scenario. To further push the efficiency, we integrate the RISC-V extended core in a parallel cluster of 8 processors, with near-linear improvement with respect to a single core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster only supporting 8-bit SIMD instructions. With a peak of 2.22 TOPs/s/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
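The nibble SIMD sum-of-dot-product described in this abstract can be illustrated in software. The following is a minimal sketch, assuming eight signed 4-bit lanes packed into one 32-bit word; the function names (pack_nibbles, unpack_nibbles, sdotp_nibble) are hypothetical and are not the XpulpNN instruction mnemonics or the authors' code.

```python
def pack_nibbles(values):
    """Pack eight signed 4-bit integers (-8..7) into one 32-bit word."""
    assert len(values) == 8
    word = 0
    for lane, v in enumerate(values):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * lane)  # two's-complement nibble in its lane
    return word


def unpack_nibbles(word):
    """Recover the eight signed 4-bit lanes from a 32-bit word."""
    lanes = []
    for lane in range(8):
        n = (word >> (4 * lane)) & 0xF
        lanes.append(n - 16 if n >= 8 else n)  # sign-extend the nibble
    return lanes


def sdotp_nibble(acc, a_word, b_word):
    """Emulate one sum-of-dot-product: acc + sum of the 8 lane products."""
    a = unpack_nibbles(a_word)
    b = unpack_nibbles(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


# Illustrative operands (assumed values, not taken from the paper).
weights = [1, -2, 3, -4, 5, -6, 7, -8]
activations = [7, 6, 5, 4, 3, 2, 1, 0]
result = sdotp_nibble(0, pack_nibbles(weights), pack_nibbles(activations))
```

A hardware instruction of this kind would perform the eight multiply-accumulates in one operation, which is where a speedup over 8-bit SIMD (four lanes per 32-bit word) would come from.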
Transiently Powered Computers
Demand for compact, easily deployable, energy-efficient computers has driven the development of general-purpose transiently powered computers (TPCs) that lack both batteries and wired power, operating exclusively on energy harvested from their surroundings.
TPCs' dependence solely on transient, harvested power offers several important design-time benefits. For example, omitting batteries saves board space and weight while obviating the need to make devices physically accessible for maintenance. However, transient power may provide an unpredictable supply of energy that makes operation difficult. A predictable energy supply is a key abstraction underlying most electronic designs. TPCs discard this abstraction in favor of opportunistic computation that takes advantage of available resources. A crucial question is how a software-controlled computing device should operate if it depends completely on external entities for power and other resources. The question poses challenges for computation, communication, storage, and other aspects of TPC design.
The main idea of this work is that software techniques can make energy harvesting a practicable form of power supply for electronic devices. Its overarching goal is to facilitate the design and operation of usable TPCs.
This thesis poses a set of challenges that are fundamental to TPCs, then pairs these challenges with approaches that use software techniques to address them. To address the challenge of computing steadily on harvested power, it describes Mementos, an energy-aware state-checkpointing system for TPCs. To address the dependence of opportunistic RF-harvesting TPCs on potentially untrustworthy RFID readers, it describes CCCP, a protocol and system for safely outsourcing data storage to RFID readers that may attempt to tamper with data. Additionally, it describes a simulator that facilitates experimentation with the TPC model, and a prototype computational RFID that implements the TPC model.
To show that TPCs can improve existing electronic devices, this thesis describes applications of TPCs to implantable medical devices (IMDs), a challenging design space in which some battery-constrained devices completely lack protection against radio-based attacks. TPCs can provide security and privacy benefits to IMDs by, for instance, cryptographically authenticating other devices that want to communicate with the IMD before allowing the IMD to use any of its battery power. This thesis describes a simplified IMD that lacks its own radio, saving precious battery energy and therefore size. The simplified IMD instead depends on an RFID-scale TPC for all of its communication functions.
TPCs are a natural area of exploration for future electronic design, given the parallel trends of energy harvesting and miniaturization. This work aims to establish and evaluate basic principles by which TPCs can operate.
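The idea of energy-aware state checkpointing mentioned in this abstract can be sketched as a small simulation. This is an illustrative model only, not the Mementos implementation: the voltage trace, both thresholds, and the function name run_with_checkpoints are assumptions made for the example.

```python
def run_with_checkpoints(total_steps, voltage_trace, threshold=2.4, brownout=1.8):
    """Sum 0..total_steps-1 while surviving simulated power failures.

    threshold and brownout are hypothetical voltages: below threshold the
    program saves its state, below brownout it loses all volatile state.
    """
    nonvolatile = {"i": 0, "acc": 0}  # survives power loss (e.g. flash)
    state = dict(nonvolatile)         # volatile working state (e.g. RAM)
    reboots = 0
    for v in voltage_trace:
        if state["i"] >= total_steps:
            break
        if v < brownout:
            # Power failure: volatile state is lost, resume from checkpoint.
            state = dict(nonvolatile)
            reboots += 1
            continue
        if v < threshold:
            # Energy is running low: checkpoint before doing more work.
            nonvolatile = dict(state)
        state["acc"] += state["i"]
        state["i"] += 1
    return state, reboots


trace = [3.0, 3.0, 2.3, 1.7, 3.0, 3.0, 3.0, 3.0]
final_state, reboots = run_with_checkpoints(5, trace)
```

In this trace the program checkpoints once when the voltage dips, loses power on the next sample, and then resumes from the saved state instead of restarting from zero.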
lLTZVisor: a lightweight TrustZone-assisted hypervisor for low-end ARM devices
Master's dissertation in Industrial Electronics and Computers Engineering. Virtualization is a well-established technology in the server and desktop space
and has recently been spreading across different embedded industries. Facing
multiple challenges arising from the advent of the Internet of Things (IoT) era,
these industries are driven by a growing interest in consolidating and isolating
multiple environments with mixed-criticality features, to address the complex IoT
application landscape. Even though this is true for the majority of mid- to high-end
embedded applications, few if any solutions have been proposed so far for low-end
systems.
TrustZone technology, designed by ARM to improve security on its processors,
has been widely adopted in the embedded market. As a result, the research community
became active in exploring other uses of TrustZone for isolation, such as
an alternative form of system virtualization. The lightweight TrustZone-assisted
hypervisor (LTZVisor), that mainly targets the consolidation of mixed-criticality
systems on the same hardware platform, is one design example that takes advantage
of TrustZone technology for ARM application processors. With the recent
introduction of this technology to the new generation of ARM microcontrollers, an
opportunity to expand this breakthrough form of virtualization to low-end devices
arose.
This work proposes the development of the lLTZVisor hypervisor, a refactored
LTZVisor version that aims to provide strong isolation on resource-constrained
devices, while achieving a low memory footprint, determinism and high efficiency.
The key for this is to implement a minimal, reliable, secure and predictable virtualization
layer, supported by the TrustZone technology present on the newest
generation of ARM microcontrollers (Cortex-M23/33).
Security for constrained IoT devices
Master's thesis, Information Security, Universidade de Lisboa, Faculdade de Ciências, 2020. In the recent past, the Internet of Things (IoT) has evolved greatly, both in applicability and in use. Society increasingly wants to use the IoT at scale to obtain information and act on the environment, for example to remotely control an irrigation system. The falling cost of devices and the constant evolution of personal mobile devices have largely contributed to its spread. However, IoT devices are deployed in adverse environments, outside typical information systems. The devices are, as a rule, limited in resources, both computation and memory. Security techniques known from conventional systems therefore have to be adapted before they can be applied to the IoT, because they do not take the devices' resource constraints into account and add overhead to the messages exchanged between system elements. In addition, application development is difficult because tools and standards comparable to HTTPS or TLS in conventional systems have not yet been developed. In this work, we present a prototype of a low-cost solution (compared to existing equivalent solutions) that uses a secure communication channel based on standard protocols. An application is also developed using technologies familiar to programmers, similar to traditional Web development. We took the "Green By Web" project as a case study. We conclude that secure communication is possible using CoAP over DTLS/UDP. With this approach, we reduced the number of messages exchanged between the client and the server by up to 8 times and optimized their size to be up to 10%, compared with applications that use TCP/TLS connections, such as web applications that use HTTPS.
This lowers the energy spent by the low-cost components and increases their battery lifetime.
Custom optimization algorithms for efficient hardware implementation
The focus is on real-time optimal decision making with application in advanced control
systems. These computationally intensive schemes, which involve the repeated solution of
(convex) optimization problems within a sampling interval, require more efficient computational
methods than currently available for extending their application to highly dynamical
systems and setups with resource-constrained embedded computing platforms.
A range of techniques are proposed to exploit synergies between digital hardware, numerical
analysis and algorithm design. These techniques build on top of parameterisable
hardware code generation tools that generate VHDL code describing custom computing
architectures for interior-point methods and a range of first-order constrained optimization
methods. Since memory limitations are often important in embedded implementations we
develop a custom storage scheme for KKT matrices arising in interior-point methods for
control, which reduces memory requirements significantly and prevents I/O bandwidth
limitations from affecting the performance in our implementations. To take advantage of
the trend towards parallel computing architectures and to exploit the special characteristics
of our custom architectures we propose several high-level parallel optimal control
schemes that can reduce computation time. A novel optimization formulation was devised
for reducing the computational effort in solving certain problems independent of the computing
platform used. In order to be able to solve optimization problems in fixed-point
arithmetic, which is significantly more resource-efficient than floating-point, tailored linear
algebra algorithms were developed for solving the linear systems that form the computational
bottleneck in many optimization methods. These methods come with guarantees
for reliable operation. We also provide finite-precision error analysis for fixed-point implementations
of first-order methods that can be used to minimize the use of resources while
meeting accuracy specifications. The suggested techniques are demonstrated on several
practical examples, including a hardware-in-the-loop setup for optimization-based control
of a large airliner. Open Access
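The trade-off between fixed-point and floating-point arithmetic mentioned in this abstract can be made concrete with a small example. The following is a minimal sketch, assuming a signed 16-bit Q15 format with round-to-nearest and saturation; the format choice and the helper names (to_fixed, to_float, fixed_mul) are assumptions for illustration and are not taken from the thesis.

```python
FRAC_BITS = 15                 # Q15: 1 sign bit, 15 fractional bits (assumed)
SCALE = 1 << FRAC_BITS
INT_MIN, INT_MAX = -(1 << 15), (1 << 15) - 1


def saturate(q):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT_MIN, min(INT_MAX, q))


def to_fixed(x):
    """Quantize a real number in [-1, 1) to Q15."""
    return saturate(round(x * SCALE))


def to_float(q):
    """Convert a Q15 integer back to a float for inspection."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two Q15 values: integer product, round, rescale, saturate."""
    product = a * b + (1 << (FRAC_BITS - 1))  # add half an LSB for rounding
    return saturate(product >> FRAC_BITS)


a = to_fixed(0.5)
b = to_fixed(-0.25)
p = fixed_mul(a, b)
quantization_error = abs(to_float(to_fixed(0.1)) - 0.1)
```

Each quantization step introduces an error of at most half a least-significant bit (2^-16 in Q15), and it is the accumulation of errors of this kind that a finite-precision analysis has to bound.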
A Study of Multiprocessor Systems using the Picoblaze 8-bit Microcontroller Implemented on Field Programmable Gate Arrays
As Field Programmable Gate Arrays (FPGAs) are becoming more capable of implementing complex logic circuits, designers are increasingly choosing them over traditional microprocessor-based systems for implementing digital controllers and digital signal processing applications. Indeed, as FPGAs are being built using state-of-the-art deep submicron CMOS processes, the increased amount of logic and memory resources allows such FPGA-based implementations to compete in terms of speed, complexity, and power dissipation with most custom-built chips, but at a fraction of the development costs. The modern FPGA is now capable of implementing multiple instances of configurable processors that are completely specified by a high-level descriptor language. Such arrays of soft processor cores have opened up new design possibilities that include complex embedded systems applications that were previously implemented by custom multiprocessor chips. As the FPGA-based multiprocessor system is completely configurable by the user, it can be optimized for speed and power dissipation to fit a given application. The goal of this thesis is to investigate design methods for implementing an array of soft processor cores using the Xilinx FPGA-based 8-bit microcontroller known as PicoBlaze. While development tools exist for the larger 32-bit processor from Xilinx known as MicroBlaze, no such resources are currently available for the PicoBlaze microcontroller. PicoBlaze is well suited to applications that require only a few data bits (fewer than 8). For example, consider gene or DNA sequencing, in which the processing requires only 2 to 5 bits. In such an application, PicoBlaze can serve as a simple processor to produce the results. Also, the PicoBlaze unit offers a finer level of granularity and hence consumes fewer resources than the larger 32-bit MicroBlaze processor.
Hence, the former will find applications in embedded systems requiring a complex design to be partitioned over several processors but where only an 8-bit datapath is required.
Flexible Computing Systems For AI Acceleration At The Extreme Edge Of The IoT
Embedding intelligence in extreme edge devices allows distilling raw data acquired
from sensors into actionable information, directly on IoT end-nodes. This computing
paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable
benefits, driving a large research area (TinyML) to deploy leading Machine Learning
(ML) algorithms on micro-controller class of devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats, in the integer domain. However, the current generation of micro-controller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions both at software and hardware levels, exploiting parallelism, heterogeneity and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvements in performance and energy efficiency, compared to current State-of-the-Art (SoA) STM32 micro-controller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain specific instruction set architecture (ISA) extensions to deal with sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32L4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference capabilities of SoA MobileNetV2 models, showing two orders of magnitude performance improvements over current SoA analog/digital solutions.
Efficient software implementation of elliptic curves and bilinear pairings
Advisor: Júlio César Lopez Hernández. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Abstract: The development of asymmetric or public key cryptography made possible new applications of cryptography such as digital signatures and electronic commerce. Cryptography is now a vital component for providing confidentiality and authentication in communication infrastructures. Elliptic Curve Cryptography is among the most efficient public-key methods because of its low storage and computational requirements. The relatively recent advent of Pairing-Based Cryptography allowed the further construction of flexible and innovative cryptographic solutions like Identity-Based Cryptography and variants. However, the computational cost of pairing-based cryptosystems remains significantly higher than that of traditional public key cryptosystems and is thus an important obstacle to adoption, especially in resource-constrained devices. The main contributions of this work aim to improve the performance of curve-based cryptosystems, consisting of: (i) efficient implementation of binary fields in 8-bit microcontrollers embedded in sensor network nodes; (ii) efficient formulation of binary field arithmetic in terms of vector instructions present in 64-bit architectures, and on the recently-introduced native support for binary field multiplication in the latest Intel microarchitecture families; (iii) techniques for serial and parallel implementation of binary elliptic curves and symmetric and asymmetric pairings defined over prime and binary fields.
These contributions produced important performance improvements and, consequently, several speed records for computing relevant cryptographic algorithms in modern computer architectures ranging from embedded 8-bit microcontrollers to 8-core processors. Doctorate in Computer Science.
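Binary field multiplication, the core operation named in this abstract, can be sketched as a carry-less polynomial product followed by reduction modulo an irreducible polynomial. The example below is a plain shift-and-XOR illustration, not the optimized implementation from the thesis; the small field GF(2^8) with modulus x^8 + x^4 + x^3 + x + 1 is chosen only because its results are easy to check, and the function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less multiply: polynomial product over GF(2), bits as coefficients."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition in GF(2) is XOR, so no carries propagate
        a <<= 1
        b >>= 1
    return result


def reduce_poly(x, modulus):
    """Reduce polynomial x modulo the given irreducible polynomial."""
    m_degree = modulus.bit_length() - 1
    while x.bit_length() - 1 >= m_degree:
        x ^= modulus << (x.bit_length() - 1 - m_degree)
    return x


def gf2m_mul(a, b, modulus):
    """Multiply two elements of the binary field defined by modulus."""
    return reduce_poly(clmul(a, b), modulus)


AES_MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1 (illustrative small field)
product = gf2m_mul(0x57, 0x83, AES_MODULUS)
```

Native carry-less multiplication instructions on desktop processors compute the clmul step on wide operands in hardware, which is why a vectorized formulation of the surrounding arithmetic matters for performance.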
Efficient and Side-Channel Resistant Implementations of Next-Generation Cryptography
The rapid development of emerging information technologies, such as quantum computing and the Internet of Things (IoT), will have or have already had a huge impact on the world. These technologies can not only improve industrial productivity but they could also bring more convenience to people’s daily lives. However, these techniques have “side effects” in the world of cryptography – they pose new difficulties and challenges from theory to practice. Specifically, when quantum computing capability (i.e., logical qubits) reaches a certain level, Shor’s algorithm will be able to break almost all public-key cryptosystems currently in use. On the other hand, a great number of devices deployed in IoT environments have very constrained computing and storage resources, so the current widely-used cryptographic algorithms may not run efficiently on those devices. A new generation of cryptography has thus emerged, including Post-Quantum Cryptography (PQC), which remains secure under both classical and quantum attacks, and LightWeight Cryptography (LWC), which is tailored for resource-constrained devices. Research on next-generation cryptography is of importance and utmost urgency, and the US National Institute of Standards and Technology in particular has initiated the standardization process for PQC and LWC in 2016 and in 2018 respectively.
Since next-generation cryptography is in a premature state and has developed rapidly in recent years, its theoretical security and practical deployment are not very well explored and are in significant need of evaluation. This thesis aims to look into the engineering aspects of next-generation cryptography, i.e., the problems concerning implementation efficiency (e.g., execution time and memory consumption) and security (e.g., countermeasures against timing attacks and power side-channel attacks). In more detail, we first explore efficient software implementation approaches for lattice-based PQC on constrained devices. Then, we study how to speed up isogeny-based PQC on modern high-performance processors especially by using their powerful vector units. Moreover, we research how to design sophisticated yet low-area instruction set extensions to further accelerate software implementations of LWC and long-integer-arithmetic-based PQC. Finally, to address the threats from potential power side-channel attacks, we present a concept of using special leakage-aware instructions to eliminate overwriting leakage for masked software implementations (of next-generation cryptography)
Design, Cryptanalysis and Protection of Symmetric Encryption Algorithms
This thesis covers results from several areas related to symmetric cryptography, secure and efficient implementation and is divided into four main parts:
In Part II, Benchmarking of AEAD, two articles are presented, showing the results of the FELICS framework for authenticated encryption algorithms, and multi-architecture benchmarking of permutations used as building blocks of AEAD algorithms.
The Sparkle family of Hash and AEAD algorithms will be shown in Part III. Sparkle is currently a finalist of the NIST call for standardization of lightweight hash and AEAD algorithms.
In Part IV, Cryptanalysis of ARX ciphers, two cryptanalysis techniques based on differential trails, applied to ARX ciphers, are discussed. The first technique, called Meet-in-the-Filter, uses an offline trail record, combined with a fixed trail and a reverse differential search, to propose long differential trails that are useful for key recovery.
The second technique is an extension of ARX analyzing tools, that can automate the generation of truncated trails from existing non-truncated ones, and compute the exact probability of those truncated trails.
In Part V, Masked AES for Microcontrollers, a new method is shown to efficiently compute a side-channel protected AES, based on the masking scheme described by Rivain and Prouff. This method introduces table and execution-order optimizations, as well as practical security proofs.
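The masking idea behind a side-channel protected AES can be illustrated with first-order Boolean masking, where every secret byte is split into two random shares. The sketch below uses a bitwise-AND gadget in the style of Ishai, Sahai and Wagner as a simpler stand-in; it is not the method of the thesis or the Rivain and Prouff scheme itself, and all function names are hypothetical.

```python
import secrets


def mask(x):
    """Split a byte into two shares whose XOR equals x."""
    r = secrets.randbits(8)
    return (x ^ r, r)


def unmask(shares):
    """Recombine the two shares."""
    return shares[0] ^ shares[1]


def masked_xor(a, b):
    """XOR is linear, so it is applied share by share."""
    return (a[0] ^ b[0], a[1] ^ b[1])


def masked_and(a, b):
    """Non-linear AND on shares, refreshed with one fresh random byte."""
    r = secrets.randbits(8)
    c0 = (a[0] & b[0]) ^ r
    # Parenthesization keeps r mixed in before the cross terms are combined.
    c1 = (a[1] & b[1]) ^ ((r ^ (a[0] & b[1])) ^ (a[1] & b[0]))
    return (c0, c1)


x, y = 0xA7, 0x3C
and_result = unmask(masked_and(mask(x), mask(y)))
xor_result = unmask(masked_xor(mask(x), mask(y)))
```

Python itself gives no side-channel guarantees; the example only shows that computing on shares yields the same result as computing on the unmasked values.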