19 research outputs found
Towards Accurate Run-Time Hardware-Assisted Stealthy Malware Detection: A Lightweight, yet Effective Time Series CNN-Based Approach
According to recent security analysis reports, malicious software (a.k.a. malware) is rising at an alarming rate in number, complexity, and harmful intent, compromising the security of modern computer systems. Recently, malware detection based on low-level hardware features (e.g., Hardware Performance Counter (HPC) information) has emerged as an effective alternative to the complexity and performance overheads of traditional software-based detection methods. Hardware-assisted Malware Detection (HMD) techniques rely on standard Machine Learning (ML) classifiers to detect signatures of malicious applications by monitoring built-in HPC registers at run-time. Prior HMD methods, though effective, have limited their study to malicious applications spawned as separate threads during application execution; detecting stealthy malware patterns at run-time therefore remains a critical challenge. Stealthy malware refers to harmful cyber attacks in which malicious code is hidden within benign applications and remains undetected by traditional malware detection approaches. In this paper, we first present a comprehensive review of recent advances in hardware-assisted malware detection studies that use standard ML techniques to detect malware signatures. Next, to address the challenge of stealthy malware detection at the processor's hardware level, we propose StealthMiner, a novel specialized time-series machine-learning approach that accurately detects stealthy malware traces at run-time using branch instructions, the most prominent HPC feature. StealthMiner is based on a lightweight time-series Fully Convolutional Neural Network (FCN) model that automatically identifies potentially contaminated samples in HPC-based time-series data and uses them to accurately recognize the trace of stealthy malware.
Our analysis demonstrates that state-of-the-art ML-based malware detection methods are not effective at detecting stealthy malware samples, since the captured HPC data represent not only the malware but also the benign applications' microarchitectural activity. The experimental results demonstrate that, with the aid of our novel intelligent approach, stealthy malware can be detected at run-time with 94% detection performance on average using only one HPC feature, outperforming state-of-the-art HMD and general time-series classification methods by up to 42% and 36%, respectively.
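The FCN building block behind such a time-series detector can be illustrated with a minimal, self-contained sketch: one 1D convolution plus global average pooling over a series of per-interval branch-instruction counts. The kernel weights, the toy data, and the threshold below are hypothetical stand-ins for illustration, not StealthMiner's learned parameters; a real FCN learns its kernels and stacks several layers with batch normalization and ReLU.

```python
# Minimal sketch of a time-series FCN building block: 1D convolution
# over an HPC sample series, ReLU, then global average pooling.
# All parameters here are hand-picked for illustration only.

def conv1d(series, kernel):
    """Valid-mode 1D convolution (cross-correlation, as in ML usage)."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def global_avg_pool(xs):
    return sum(xs) / len(xs)

# Toy series of per-interval branch-instruction counts (one HPC feature).
branch_counts = [12, 11, 13, 55, 60, 58, 12, 11]

# An edge-detecting kernel highlights abrupt shifts, such as a hidden
# payload starting to execute inside a benign application.
feature_map = relu(conv1d(branch_counts, [-1.0, 0.0, 1.0]))
score = global_avg_pool(feature_map)
print(feature_map)  # large activations where the count jumps
print(score > 0)    # a threshold on the pooled score flags the window
```

The point of the sketch is the shape of the pipeline, not the numbers: the convolution localizes the anomalous region in the trace, and the pooling summarizes it into a single decision score.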
Efficient communication protection of many-core systems against active attackers
Many-core systems-on-chip, together with their established communication infrastructure, Networks-on-Chip (NoC), are growing in complexity, which encourages the integration of third-party components to simplify and accelerate production. However, this also exposes an attack surface for the injection of hardware Trojans. This work addresses active attacks on NoCs and focuses on the integrity and availability of transmitted data. In particular, we consider the modification and/or dropping of data during transmission as active attacks that might be performed by malicious routers. To mitigate the impact of such attacks, we propose two lightweight solutions that respect the performance constraints of NoCs. Assuming the presence of symmetric keys, these approaches combine lightweight authentication codes for integrity protection with network coding for increased efficiency and robustness. The proposed solutions prevent undetected modifications and significantly increase availability through reliable detection of attacks. Their efficiency is investigated in different scenarios using cycle-accurate simulations, and the area overhead is analyzed relative to a state-of-the-art many-core system. The results demonstrate that one authentication scheme with network coding protects the integrity of data down to a low residual error of 1.36% at an attack probability of 0.2, with an area overhead of 2.68%. For faster and more flexible evaluation, an analytical approach is developed and validated against the cycle-accurate simulations. The analytical approach is more than 1000× faster while having a maximum estimation error of 5%. Moreover, the analytical model provides deeper insight into the system's behavior; for example, it reveals which factors influence the performance parameters.
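The combination of authentication codes and network coding can be sketched as follows. A truncated HMAC-SHA256 stands in for the paper's lightweight authentication codes (a real NoC would use a far cheaper MAC), and a single XOR parity flit stands in for the network coding; the key and flit sizes are illustrative assumptions.

```python
import hmac, hashlib

# Hedged sketch: integrity tags plus XOR network coding for NoC flits.
KEY = b"shared-symmetric-key"   # symmetric key assumed pre-shared, as in the paper
TAG_LEN = 4                     # truncated tag length, in bytes

def tag(payload: bytes) -> bytes:
    return hmac.new(KEY, payload, hashlib.sha256).digest()[:TAG_LEN]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Sender: two data flits plus one coded (parity) flit, each with a tag.
f1, f2 = b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"
coded = xor(f1, f2)
packets = [(f, tag(f)) for f in (f1, f2, coded)]

# A malicious router drops f2 in transit.
received = [packets[0], packets[2]]

# Receiver: verify integrity tags, then recover the missing flit.
ok = [p for p, t in received if hmac.compare_digest(tag(p), t)]
recovered_f2 = xor(ok[0], ok[1])   # f1 XOR (f1 XOR f2) = f2
print(recovered_f2 == f2)          # availability preserved despite the drop
```

A modified flit would fail the tag check and be discarded (integrity), while the parity flit lets the receiver reconstruct a dropped flit without retransmission (availability), which is the intuition behind combining the two mechanisms.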
Multi-Tenant Cloud FPGA: A Survey on Security
With the exponentially increasing demand for performance and scalability in cloud applications and systems, data center architectures have evolved to integrate heterogeneous computing fabrics that leverage CPUs, GPUs, and FPGAs. FPGAs differ from traditional processing platforms such as CPUs and GPUs in that they are reconfigurable at run-time, providing increased and customized performance, flexibility, and acceleration. FPGAs can perform large-scale search optimization, acceleration, and signal-processing tasks with advantages in power, latency, and processing speed. Many public cloud giants, including Amazon, Huawei, Microsoft, and Alibaba, have already started integrating FPGA-based cloud acceleration services. While FPGAs in the cloud enable customized acceleration with low power consumption, they also introduce new security challenges that still need to be reviewed. Allowing cloud users to reconfigure the hardware design after deployment could open backdoors for malicious attackers, potentially putting the cloud platform at risk. Because of these security risks, public cloud providers still do not offer multi-tenant FPGA services. This paper analyzes the security concerns of multi-tenant cloud FPGAs, gives a thorough description of the security problems associated with them, and discusses future challenges in this field of study.
Defeating NewHope with a Single Trace
The key encapsulation mechanism NewHope allows two parties to agree on a secret key. The scheme comprises a private and a public key; the public key is used to encipher a random shared secret, while the private key is used to decipher the ciphertext. NewHope is a candidate in the NIST post-quantum project, whose aim is to standardize cryptographic systems that are secure against attacks from both quantum and classical computers. While NewHope relies on the theory of quantum-resistant lattice problems, practical implementations have shown vulnerabilities to side-channel attacks targeting the extraction of the private key. In this paper, we demonstrate a new attack on the shared secret. The target is the C reference implementation as submitted to the NIST contest, executed on a Cortex-M4 processor. Based on power measurements, the complete shared secret can be extracted from a single trace. Further, we analyze the impact of different compiler directives. When the code is compiled with optimizations turned off, the shared secret can be read from an oscilloscope display directly with the naked eye. When optimizations are enabled, the attack requires more sophisticated techniques, but it still succeeds on single power traces.
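Why a single trace can suffice is easy to see in a toy leakage model. Assume each shared-secret bit is processed as 0x00 or 0xFF (NewHope encodes message bits as coefficients 0 or q/2), and that instantaneous power is proportional to the Hamming weight of the handled value plus noise. The leakage and noise parameters below are illustrative, not measured from the Cortex-M4 target.

```python
# Toy single-trace attack: Hamming-weight leakage separates the two
# secret-bit classes so widely that a fixed threshold reads the whole
# secret off one simulated trace. Parameters are illustrative only.
import random

random.seed(1)
secret_bits = [random.getrandbits(1) for _ in range(256)]

def hw(x):                      # Hamming weight = number of set bits
    return bin(x).count("1")

# One simulated power trace: one noisy sample per processed bit.
trace = [hw(0xFF if b else 0x00) + random.gauss(0, 0.5)
         for b in secret_bits]

# Attacker: HW(0x00)=0 vs HW(0xFF)=8 puts the classes 8 units apart,
# so thresholding at the midpoint recovers every bit from one trace.
guess = [1 if s > 4 else 0 for s in trace]
print(guess == secret_bits)    # full shared secret recovered
```

This is essentially the "read it off the oscilloscope" regime the paper describes for unoptimized code; with compiler optimizations on, the leakage points are less cleanly separated and more sophisticated single-trace classification is required.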
High-Speed Hardware Architectures and FPGA Benchmarking of CRYSTALS-Kyber, NTRU, and Saber
Performance in hardware has typically played a significant role in differentiating among leading candidates in cryptographic standardization efforts. The winners of two past NIST cryptographic contests (Rijndael in the case of AES and Keccak in the case of SHA-3) were consistently ranked among the two fastest candidates when implemented using FPGAs and ASICs. Hardware implementations of cryptographic operations can easily outperform software implementations for at least a subset of major performance metrics, such as latency, number of operations per second, power consumption, and energy usage, as well as in security against physical attacks, including side-channel analysis. Using hardware also permits much greater flexibility in trading one subset of these properties for another. This paper presents high-speed hardware architectures for four lattice-based CCA-secure Key Encapsulation Mechanisms (KEMs), representing three NIST PQC finalists: CRYSTALS-Kyber, NTRU (with two distinct variants, NTRU-HPS and NTRU-HRSS), and Saber. We rank these candidates against each other and compare them with all other Round 3 KEMs based on data from previously reported work.
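The arithmetic kernel that dominates hardware cost in all three finalists is polynomial multiplication in a ring Z_q[x]/(x^n ± 1): negacyclic (x^n + 1) for Kyber and Saber, cyclic (x^n - 1) for NTRU. The schoolbook version below is a sketch of that kernel with toy parameters; high-speed architectures replace it with NTT butterflies (Kyber) or Toom-Cook/Karatsuba units, and the real n and q are much larger.

```python
# Schoolbook polynomial multiplication in Z_q[x]/(x^n + 1) or (x^n - 1).
# Toy parameters for illustration; not a KEM implementation.

def polymul(a, b, q, n, negacyclic=True):
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] = (res[k] + ai * bj) % q
            else:
                # Reduction by x^n = -1 (negacyclic) or x^n = +1 (cyclic)
                sign = -1 if negacyclic else 1
                res[k - n] = (res[k - n] + sign * ai * bj) % q
    return res

# Tiny example with n = 4, q = 17: (1 + x) * (1 + x^3) mod (x^4 + 1).
print(polymul([1, 1, 0, 0], [1, 0, 0, 1], 17, 4))
# [0, 1, 0, 1]: the x^4 cross term folds back into x^0 with a minus sign
```

Since this O(n^2) loop nest is where the cycles go, the choice between NTT-friendly parameters (Kyber) and NTT-unfriendly ones (NTRU, Saber) largely determines each candidate's hardware architecture and its FPGA results.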
Explorando Árvores de Decisão Em Um Fluxo de Síntese Para Circuitos Aproximados
Undergraduate thesis (TCC) - Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação. The logic synthesis step in the development of integrated circuits has become more challenging in recent years due to the growing complexity of designed circuits. The nanometric scale of current transistors allows greater integration on the same chip, implying more complex functions with more inputs and more terms to be optimized. Many traditional logic optimization methods reach their optimization limit at a few dozen inputs, or manage to minimize complex functions only at the cost of long execution times. Machine Learning algorithms are becoming more common in several areas of technology, including tools for integrated circuit design, known as Electronic Design Automation. The objectives of this work are to explore these tools to optimize the execution time of logic synthesis and to analyze their effect on area and power results. Accordingly, this work proposes a synthesis flow that takes a circuit's truth table as input and presents synthesis solutions aimed at low energy consumption, adopting Approximate Computing in some cases. The developed logic optimization is based on Decision Trees, allowing the logic minimization to produce exact outputs while also permitting some uncertainty to be introduced into the system, for example through restrictions on the depth of the tree. The approximation in the minimized solutions can lead to circuits with better energy efficiency while maintaining acceptable accuracy for error-tolerant applications. The proposed synthesis flow allows comparison between synthesis using traditional tools and that obtained by the Decision Tree method. The output of the logic minimization is fed into the OpenROAD flow. OpenROAD is an Electronic Design Automation project that integrates several open-source tools and performs standard-cell synthesis mapped to an ASIC technology. The observed results show how promising Approximate Computing approaches can be, reducing the average area, delay, and power for most of the evaluated cases. This average increased when the Decision Tree was applied with a fixed accuracy of 100%, compared to the OpenROAD flow without Decision Trees, but decreased steadily as the chosen circuit accuracy was lowered. With an accuracy of 90%, for example, the average area and power of the studied set decreased by 41.49% and 47.19%, respectively, compared to the results obtained with OpenROAD.
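The depth-restriction idea can be sketched with a tiny pure-Python decision tree fitted to a truth table: full depth reproduces the function exactly, while a depth cap yields a smaller, approximate function whose accuracy is measurable. The naive split order and the 3-input majority target below are illustrative stand-ins, not the thesis's actual synthesis flow.

```python
# Toy depth-limited decision tree over a truth table: accuracy traded
# for tree size, mirroring the area/power trade-off in the thesis.
from itertools import product

def majority(rows):
    return int(sum(y for _, y in rows) * 2 >= len(rows))

def fit(rows, vars_left, depth):
    ys = {y for _, y in rows}
    if len(ys) == 1 or depth == 0 or not vars_left:
        return ("leaf", majority(rows))
    v = vars_left[0]                       # naive fixed split order
    lo = [(x, y) for x, y in rows if x[v] == 0]
    hi = [(x, y) for x, y in rows if x[v] == 1]
    rest = vars_left[1:]
    return ("node", v, fit(lo, rest, depth - 1), fit(hi, rest, depth - 1))

def evaluate(tree, x):
    while tree[0] == "node":
        tree = tree[3] if x[tree[1]] else tree[2]
    return tree[1]

# Target: the 3-input majority function MAJ(a, b, c).
table = [(x, int(sum(x) >= 2)) for x in product((0, 1), repeat=3)]

exact = fit(table, [0, 1, 2], depth=3)   # full depth: exact circuit
approx = fit(table, [0, 1, 2], depth=1)  # capped depth: approximate circuit
acc = sum(evaluate(approx, x) == y for x, y in table) / len(table)
print(all(evaluate(exact, x) == y for x, y in table))  # exact at full depth
print(acc)  # 0.75: the depth-1 tree predicts MAJ from input a alone
```

A real flow would choose splits by information gain, translate each tree into gates, and hand the netlist to OpenROAD; the toy shows only the core accuracy-versus-depth knob.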
MOCAST 2021
The 10th International Conference on Modern Circuit and System Technologies on Electronics and Communications (MOCAST 2021) will take place in Thessaloniki, Greece, from July 5th to July 7th, 2021. The MOCAST technical program includes all aspects of circuit and system technologies, from modeling to design, verification, implementation, and application. This Special Issue presents extended versions of top-ranking papers from the conference. The topics of MOCAST include: analog/RF and mixed-signal circuits; digital circuits and systems design; nonlinear circuits and systems; device and circuit modeling; high-performance embedded systems; systems and applications; sensors and systems; machine learning and AI applications; communication and network systems; power management; imagers, MEMS, medical, and displays; radiation front ends (nuclear and space applications); and education in circuits, systems, and communications.
Energy and Area Efficient Machine Learning Architectures using Spin-Based Neurons
Recently, spintronic devices with low-energy-barrier nanomagnets, such as spin-orbit-torque Magnetic Tunnel Junctions (SOT-MTJs) and embedded magnetoresistive random access memory (MRAM) devices, have been leveraged as natural building blocks providing probabilistic sigmoidal activation functions for Restricted Boltzmann Machines (RBMs). In this dissertation research, we use the Probabilistic Inference Network Simulator (PIN-Sim) to realize a circuit-level implementation of deep belief networks (DBNs) using memristive crossbars as weighted connections and embedded MRAM-based neurons as activation functions. A probabilistic interpolation recoder (PIR) circuit is developed for DBNs with probabilistic spin logic (p-bit)-based neurons to interpolate the probabilistic outputs of the neurons in the last hidden layer, which represent the different output classes. Moreover, the impact of reducing the Magnetic Tunnel Junction's (MTJ's) energy barrier is assessed and optimized for the resulting stochasticity present in the learning system. In p-bit-based DBNs, defects such as variation of the nanomagnet thickness can undermine functionality by decreasing the fluctuation speed of the p-bit realized with a nanomagnet. A method is developed and refined to control the fluctuation frequency of a p-bit device's output by employing a feedback mechanism; this feedback alleviates the process-variation sensitivity of p-bit-based DBNs. This compact, low-complexity method, realized by introducing a self-compensating circuit, can mitigate the influence of process variation in fabrication and practical implementation. Furthermore, this research presents an image recognition technique for the MNIST dataset based on p-bit-based DBNs and TSK rule-based fuzzy systems. The proposed DBN-fuzzy system benefits from the low energy and area consumption of p-bit-based DBNs and the high accuracy of TSK rule-based fuzzy systems.
This system first recognizes the top candidate results through the p-bit-based DBN; the fuzzy system is then employed to obtain the top-1 recognition result from those outputs. Simulation results show that a DBN-fuzzy neural network not only has lower energy and area consumption than larger DBN topologies but also achieves higher accuracy.
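The behavior of a p-bit neuron can be sketched in a few lines: the device output telegraphs between 0 and 1, taking the value 1 with probability sigmoid(input), so averaging many fluctuations recovers the sigmoidal activation that the MRAM device provides in hardware. This is a software stand-in for the device model, not the PIN-Sim circuit-level implementation.

```python
# Sketch of a p-bit style probabilistic neuron: a stochastic binary
# output whose time average reproduces a sigmoidal activation.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_bit(x, rng):
    """One stochastic read of the neuron: 1 with probability sigmoid(x)."""
    return 1 if rng.random() < sigmoid(x) else 0

rng = random.Random(7)
x = 1.2
samples = [p_bit(x, rng) for _ in range(20000)]
estimate = sum(samples) / len(samples)
print(abs(estimate - sigmoid(x)) < 0.02)   # time average ≈ sigmoid(x)
```

This also shows why fluctuation speed matters: the DBN sees the sigmoid only through the time average, so a slow-fluctuating p-bit (e.g., due to nanomagnet thickness variation) needs more samples per inference, which is what the feedback mechanism above compensates for.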
Appropriateness of Imperfect CNFET Based Circuits for Error Resilient Computing Systems
With superior device performance consistently reported at extremely scaled dimensions, low-dimensional materials (LDMs), including Carbon Nanotube Field Effect Transistor (CNFET) based technology, have shown the potential to outperform silicon for future transistors at advanced technology nodes. Studies have also demonstrated that orders-of-magnitude improvements in energy efficiency are possible with LDMs in comparison to silicon at competing technology nodes. However, the current fabrication processes for these materials suffer from process imperfections and still appear inadequate to compete with silicon in mainstream high-volume manufacturing. Among the LDMs, CNFETs are the most widely studied and the closest to high-volume manufacturing. Recent works have shown a significant increase in the complexity of CNFET-based systems, including the demonstration of a 16-bit microprocessor. However, the design of such systems has required significantly wider-than-usual transistors and the avoidance of certain logic combinations. The resulting complexity of several thousand transistors is still far from the requirements of high-performance general-purpose computing systems with billions of transistors. At the current pace of CNFET process development, their introduction into mainstream manufacturing is expected to take several more years. For earlier technology adoption, CNFETs appear well suited to error-resilient computing systems, where errors during computation can be tolerated to a certain degree. Such systems relax the need for precise circuits and a perfect process while leveraging the potential energy benefits of CNFET technology in comparison to conventional Si technology. In this thesis, we explore potential applications of an imperfect CNFET process for error-resilient computing systems, including the impact of process imperfections at the system level and methods to mitigate it.
The most widely adopted fabrication process for CNFETs (separation and placement of solution-based CNTs) still suffers from process imperfections, mainly open CNTs caused by missing CNTs in the trenches connecting the source and drain of a CNFET. A fair evaluation of the performance of CNFET-based circuits should therefore take into consideration the effect of open CNTs, which reduce drive currents. At the circuit level, this leads to failures in meeting 1) the minimum frequency requirement (due to an increase in critical-path delay) and 2) the noise-suppression requirement. We present a methodology to accurately capture the effect of open-CNT imperfection in the state-of-the-art CNFET model for circuit-level performance evaluation (both delay and glitch vulnerability) of CNFET-based circuits using SPICE. A Monte Carlo simulation framework is also provided to investigate the statistical effect of open-CNT imperfection on circuit-level performance. We introduce essential metrics to evaluate glitch vulnerability and provide an effective link between glitch vulnerability and circuit topology.
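The Monte Carlo idea can be illustrated with a toy model: each CNFET has N parallel tubes, each missing (open) independently with probability p; drive current scales with the number of conducting tubes, and gate delay roughly with its inverse. The tube count, open probability, and delay model below are illustrative assumptions, not calibrated to the thesis's SPICE framework.

```python
# Toy Monte Carlo of open-CNT imperfection: distribution of gate-delay
# inflation when tubes are randomly missing. Parameters illustrative.
import random

random.seed(42)
N_TUBES, P_OPEN, TRIALS = 8, 0.1, 10000

def conducting(n, p, rng):
    """Number of tubes that survived fabrication (not open)."""
    return sum(rng.random() >= p for _ in range(n))

delays = []
for _ in range(TRIALS):
    good = conducting(N_TUBES, P_OPEN, random)
    if good == 0:
        continue                   # fully open device: functional failure
    delays.append(N_TUBES / good)  # delay normalized to the ideal device

mean_slowdown = sum(delays) / len(delays)
worst = max(delays)
print(round(mean_slowdown, 2))     # average delay inflation vs. ideal
print(worst >= mean_slowdown)
```

Sweeping P_OPEN in such a loop is the statistical experiment: the tail of the delay distribution, not its mean, is what causes minimum-frequency violations on critical paths.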
The past few years have seen significant growth of interest in approximate computing for a wide range of applications, including signal processing, data mining, machine learning, and image and video processing. In such applications, the result quality is not compromised appreciably, even in the presence of a few errors during computation. The ability to tolerate a few errors relaxes the need for precise circuits: approximate circuits can be designed with fewer nodes, fewer stages, and reduced capacitance at some nodes. Consequently, approximate circuits can reduce critical-path delays and enhance noise suppression in comparison to precise circuits. We present a systematic methodology utilizing Reduced Ordered Binary Decision Diagrams (ROBDDs) for generating approximate circuits, taking a 16-bit parallel-prefix CNFET adder as an example. The approximate adder generated using the proposed algorithm achieves a ~5x reduction in the average number of nodes failing the glitch criteria (along paths to the primary output) and a 43.4% lower Energy-Delay Product (EDP) even at high open-CNT imperfection, in comparison to the ideal case of no open-CNT imperfection, at a mean relative error of 3.3%.
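The accuracy/cost trade-off and the mean-relative-error metric can be illustrated with a classic lower-part-OR approximate adder, in which the k low-order bits are OR-ed instead of added, eliminating their carry chain and shortening the critical path. This is a standard textbook approximation used here only to make the error metric concrete; it is not the thesis's ROBDD-based generation method.

```python
# Lower-part-OR approximate adder and its mean relative error (MRE).
# A stand-in illustration of the approximate-adder quality metric.

def approx_add(a, b, k, width=16):
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)        # carry-free lower part (OR)
    high = ((a >> k) + (b >> k)) << k    # exact upper part
    return (high | low) & ((1 << width) - 1)

# Mean relative error over an exhaustive 8-bit operand sweep, k = 3.
errs = []
for a in range(256):
    for b in range(256):
        exact = a + b
        if exact:
            errs.append(abs(approx_add(a, b, 3) - exact) / exact)
mre = sum(errs) / len(errs)
print(mre < 0.05)   # small MRE despite the truncated carry chain
```

The exhaustive sweep is the same style of quality evaluation the thesis reports (mean relative error of 3.3% for its 16-bit adder); the circuit savings come from the removed carry logic in the lower part.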
The recent boom in deep learning has been made possible by advances in VLSI technology that yield hardware systems able to support deep learning algorithms. These hardware systems aim to satisfy the high energy-efficiency requirements of such algorithms, adopting neuromorphic computing architectures that consume significantly less energy than traditional Von Neumann architectures. Deep Neural Networks (DNNs), belonging to the deep learning domain, find use in a wide range of applications such as image classification and speech recognition. Recent hardware systems have demonstrated the implementation of complex neural networks at significantly lower power. However, the complexity of applications and the depth of DNNs are expected to increase drastically in the future, imposing demanding requirements on the scalability and energy efficiency of hardware technology. CNFET technology can be an excellent alternative to meet the aggressive energy-efficiency requirements of future DNNs. However, degradation in circuit-level performance due to open-CNT imperfection can cause timing failures that distort the shape of the non-linear activation function, leading to a significant degradation in classification accuracy. We present a framework to obtain a sigmoid activation function that accounts for the effect of open-CNT imperfection. A digital neuron is explored to generate the sigmoid activation function, which deviates from the ideal case under an imperfect process and a reduced clock period (increased clock frequency). The inherent error resilience of DNNs, on the other hand, can be exploited to mitigate the impact of the imperfect process and maintain the shape of the activation function. We use pruning of synaptic weights, which, combined with the proposed approximate neuron, significantly reduces the chance of timing failures and helps maintain the activation-function shape even at high process imperfection and higher clock frequencies.
We also provide a framework to obtain the classification accuracy of Deep Belief Networks (a class of DNNs based on unsupervised learning) using activation functions obtained from SPICE simulations. By using both approximate neurons and pruning of synaptic weights, we achieve excellent system accuracy (an accuracy drop of less than 0.5%) with a 25% improvement in speed and a significant EDP advantage (56.7% lower), even at high process imperfection, in comparison to a base configuration with a precise neuron, no pruning, and an ideal process, at no area penalty.
In conclusion, this thesis provides directions for the potential applicability of CNFET-based technology to error-resilient computing systems. To this end, we present methodologies for assessing and designing CNFET-based circuits in the presence of process imperfections. We build a DBN framework for digit recognition using activation functions from SPICE simulations that incorporate process imperfections, and we demonstrate the effectiveness of approximate neurons and synaptic-weight pruning in mitigating the impact of high process imperfection on system accuracy.