
    Hardware Architectures for Post-Quantum Cryptography

    The rapid development of quantum computers poses severe threats to many commonly used cryptographic algorithms that are embedded in hardware devices to ensure the security and privacy of data and communication. In the search for solutions that can resist attacks from quantum computers, a new research field called Post-Quantum Cryptography (PQC) has emerged; it studies cryptosystems that run on classical computers and are conjectured to be secure against attacks using large-scale quantum computers. To secure data during storage and communication, as well as in many future applications, this dissertation focuses on the design, implementation, and evaluation of efficient PQC schemes in hardware. Four PQC algorithms, each from a different family, are studied. The first hardware architecture presented targets the code-based scheme Classic McEliece. This research is the first to build a hardware architecture for the Classic McEliece cryptosystem, and it demonstrates that complex code-based PQC algorithms can run efficiently in hardware. Furthermore, the dissertation shows that a hardware implementation of this scheme can be easily tuned to different configurations by supporting flexible choices of security parameters as well as configurable hardware performance parameters. The successful hardware prototype of Classic McEliece increased confidence in the scheme and helped it gain recognition as one of seven finalists in the third round of the NIST PQC standardization process. While Classic McEliece serves as a ready-to-use candidate for many high-end applications, PQC solutions are also needed for low-end embedded devices. Embedded devices play an important role in our daily life. 
Despite their typically constrained resources, these devices require strong security measures to protect them against cyber attacks. To secure this type of device, the second study presented in this dissertation focuses on the hash-based digital signature scheme XMSS. This research is the first to explore and present a practical hardware-based XMSS solution for low-end embedded devices. The XMSS hardware design adopts a heterogeneous software-hardware co-design approach, which combines the flexibility of the soft core with the acceleration from the hard core. The practicability and efficiency of the XMSS software-hardware co-design is further demonstrated by a hardware prototype on an open-source RISC-V based System-on-a-Chip (SoC) platform. The third research direction covered in this dissertation focuses on lattice-based cryptography, which represents one of the most promising and popular alternatives to today's widely adopted public key solutions. Prior research has presented hardware designs targeting the computing blocks that are necessary for the implementation of lattice-based systems. However, a recurrent issue in most existing designs is that they are not fully scalable or parameterized and are hence limited to specific cryptographic primitives and security parameter sets. The research presented in this dissertation is the first to develop hardware accelerators that are fully parameterized to support different lattice-based schemes and parameters. Further, these accelerators are used to realize the first software-hardware co-design of provably secure instances of qTESLA, a lattice-based digital signature scheme. This dissertation demonstrates that even demanding, provably secure schemes can be realized efficiently with proper use of software-hardware co-design. 
The final study presented in this dissertation focuses on the isogeny-based scheme SIKE, which recently advanced to the final round of the PQC standardization process. This research shows that hardware accelerators can be designed to offload compute-intensive elliptic curve and isogeny computations to hardware in a versatile fashion. These hardware accelerators are fully parameterized to support different security parameter sets of SIKE as well as flexible hardware configurations targeting different user applications. This research is the first to present versatile hardware accelerators for SIKE that can be mapped efficiently to both FPGA and ASIC platforms. Based on these accelerators, an efficient software-hardware co-design is constructed for speeding up SIKE. Finally, this dissertation demonstrates that, despite its expensive arithmetic, the isogeny-based SIKE scheme can run efficiently by exploiting specialized hardware. Together, these four research directions demonstrate the practicability of building efficient hardware architectures for complex PQC algorithms. The exploration of efficient PQC solutions for different hardware platforms will eventually help migrate high-end servers and low-end embedded devices towards the post-quantum era.

    High performance communication on reconfigurable clusters

    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. 
For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm, which shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that our design outperforms CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range with a commodity cloud.

    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency. We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations extensively used in scientific computing programs such as the FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems. The access latency to the DRAM core has become increasingly long relative to CPU speed, causing memory accesses to be an execution bottleneck. In order to reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. 
Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications. Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM technology remains the most cost-effective choice for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and can be applied directly in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.

    Solution of partial differential equations on vector and parallel computers

    The present status of numerical methods for partial differential equations on vector and parallel computers is reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations, as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.

    Pruned Bit-Reversal Permutations: Mathematical Characterization, Fast Algorithms and Architectures

    A mathematical characterization of serially-pruned permutations (SPPs) employed in variable-length permuters and their associated fast pruning algorithms and architectures are proposed. Permuters are used in many signal processing systems for shuffling data and in communication systems as an adjunct to coding for error correction. Typically only a small set of discrete permuter lengths is supported. Serial pruning is a simple technique to alter the length of a permutation to support a wider range of lengths, but results in a serial processing bottleneck. In this paper, parallelizing SPPs is formulated in terms of recursively computing sums involving integer floor and related functions using integer operations, in a fashion analogous to evaluating Dedekind sums. A mathematical treatment for bit-reversal permutations (BRPs) is presented, and closed-form expressions for BRP statistics are derived. It is shown that BRP sequences have weak correlation properties. A new statistic called permutation inliers that characterizes the pruning gap of pruned interleavers is proposed. Using this statistic, a recursive algorithm that computes the minimum inliers count of a pruned BR interleaver (PBRI) in logarithmic time complexity is presented. This algorithm enables parallelizing a serial PBRI algorithm by any desired parallelism factor by computing the pruning gap in a lookahead rather than a serial fashion, resulting in significant reduction in interleaving latency and memory overhead. Extensions to 2-D block and stream interleavers, as well as applications to pruned fast Fourier transforms and LTE turbo interleavers, are also presented. Moreover, hardware-efficient architectures for the proposed algorithms are developed. Simulation results demonstrate 3 to 4 orders of magnitude improvement in interleaving time compared to existing approaches. Comment: 31 pages
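
To make the serial bottleneck described above concrete, the following is a minimal sketch of a serially pruned bit-reversal permutation. It shows only the baseline serial technique, not the paper's lookahead algorithm; the function names `bit_reverse` and `pruned_bit_reversal` are illustrative assumptions, not taken from the paper.

```python
def bit_reverse(i: int, n_bits: int) -> int:
    """Reverse the n_bits-bit binary representation of i."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def pruned_bit_reversal(length: int) -> list[int]:
    """Serially pruned bit-reversal permutation of range(length).

    Walks the full power-of-two bit-reversal sequence in order and
    drops every address that falls outside [0, length).
    """
    if length <= 0:
        return []
    # Smallest bit width whose power of two covers `length` addresses.
    n_bits = max(1, (length - 1).bit_length())
    out = []
    for i in range(1 << n_bits):
        j = bit_reverse(i, n_bits)
        if j < length:  # pruning step: skip out-of-range addresses
            out.append(j)
    return out
```

Because the position of each kept address depends on how many earlier addresses were pruned, the loop is inherently sequential; that dependency is what a lookahead computation of the pruning gap would remove.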

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 211-222). Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also to concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. 
Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup). By Michael I. Gordon. Ph.D.
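
The idea of data-parallelizing a node that reads overlapping sliding windows can be illustrated with a small sketch. This is not StreamIt code and not the compiler's actual transformation; it is a hand-written Python illustration in which each worker receives its share of the input plus the `window - 1` overlapping items it needs. The names `sliding_sum` and `parallel_sliding_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sliding_sum(data, window):
    """Sequential reference: one output per window position."""
    return [sum(data[i:i + window]) for i in range(len(data) - window + 1)]


def parallel_sliding_sum(data, window, workers):
    """Split the outputs into contiguous chunks, one task per chunk.

    Each task gets the inputs for its outputs plus (window - 1)
    trailing items shared with the next chunk, so no task needs
    state carried over from a neighbour.
    """
    n_out = len(data) - window + 1
    if n_out <= 0:
        return []
    chunk = -(-n_out // workers)  # ceiling division
    slices = []
    for start in range(0, n_out, chunk):
        end = min(start + chunk, n_out)
        slices.append(data[start:end + window - 1])  # overlap of window - 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in submission order.
        parts = list(pool.map(lambda s: sliding_sum(s, window), slices))
    return [y for part in parts for y in part]
```

The duplicated `window - 1` items per chunk are the only extra communication in this sketch, which mirrors in spirit the abstract's point about turning serializing state into a small, parameterized amount of inter-core traffic.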

    Architectural Improvements Towards an Efficient 16-18 Bit 100-200 MSPS ADC

    As data conversion systems continue to improve in speed and resolution, increasing demands are placed on the performance of high-speed analog-to-digital conversion systems. This work surveys these demands and proposes an architecture suited to the target specifications of 100-200 MS/s with 16-18 bits of resolution. The main architecture is based on parallel structures in order to achieve a high sampling rate and high resolution at the same time. To solve problems related to time-interleaved architectures, an advanced randomization method is introduced. It combines randomization and spectral shaping of mismatches. With a simple low-pass filter, the method can improve the SFDR as well as the SINAD compared to conventional randomization algorithms. The main advantage of this technique over previous ones is that the algorithm only needs the ADCs to be ordered by their timing mismatches; the absolute accuracy of the mismatch identification method does not matter and, therefore, the requirements on timing mismatch identification are very low. In addition, this correction system uses very simple algorithms that can correct not only timing mismatches but also gain and offset mismatches.

    Conformação de pulso de formas de onda OFDM para a interface aérea 5G (Pulse shaping of OFDM waveforms for the 5G air interface)

    Advisor: Luís Geraldo Pedroso Meloni. Dissertation (master's) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Orthogonal frequency division multiplexing (OFDM) waveforms have been used successfully in the 3GPP Long Term Evolution (LTE) air interface to overcome channel selectivity and to provide good spectrum efficiency and high transmission data rates. The forthcoming 5G communication system aims to support more services than its predecessor, such as enhanced mobile broadband, machine-type communications and low latency communications, and considers many other application scenarios such as fragmented spectrum use. This diversity of services with different requirements cannot be supported by conventional OFDM, since OFDM configures the entire bandwidth with parameters that serve one particular service. 
Also, substantial intercarrier interference (ICI) can occur when conventional OFDM is used with asynchronous multiuser multiplexing, due to the high out-of-band (OOB) emissions of the subcarriers and the violation of the signal orthogonality constraint. Therefore, to meet the requirements of future 5G wireless applications, the development of an innovative air interface with new capabilities becomes necessary, in particular a new waveform that is more spectrally agile than OFDM, capable of supporting multiple configurations, suppressing inter-user interference effectively, and integrating straightforwardly with the upper layers. This work focuses on two pulse shaping techniques to reduce the OOB emission and improve the in-band and OOB performance of OFDM-based waveforms. Pulse shaping can enable the use of multiple parameterizations within the waveform and abandon the strict paradigms of orthogonality and synchronism with relatively low performance degradation caused by intersymbol interference (ISI) and ICI. The first part addresses a pulse shaping method based on per-subcarrier filtering to reduce both OOB emission in the transmitter and adjacent channel interference (ACI) in the receiver. It can be implemented using window functions, and some window formats are presented in this part. The first uses the existing cyclic prefix (CP) of OFDM symbols to smooth abrupt transitions of the signal, thus reducing the large sinc spectral sidelobes caused by the rectangular filters. This guarantees backwards compatibility in systems using conventional cyclic prefixed OFDM (CP-OFDM). The second window format extends the CP length to retain the ability of the waveform to combat channel delay spread. The effects of ISI and ICI on performance are studied in terms of the signal to interference ratio (SIR) and bit error rate (BER) using LTE waveforms in a multi-user fragmented spectrum scenario. 
The second part of this work addresses the design and analysis of filters for flexible spectral containment in subband-based filtering transceivers. This filter, called here semi-equiripple, exhibits better stopband attenuation to reduce inter-subband interference than equiripple and windowed truncated sinc filters, and also has good impulse response characteristics to reduce ISI. The design is based on the Parks-McClellan algorithm to obtain different stopband decay rates and meet arbitrary spectrum emission mask (SEM) specifications with low in-band distortion. Therefore, it can be useful to achieve low OOB emission and configure subbands with independent parameters, since the asynchronous interference is contained by the filters. Three ISI distortions in the filter are studied: symbol spreading related to the filter causality, symbol echoes due to in-band ripples, and ISI amplification due to outlier samples in the tails of its impulse response. The performance of the filter is assessed in terms of the power spectrum density (PSD) and compliance with tight SEMs, modulation error rate (MER), and operation in a multi-service asynchronous scheme using a single waveform. The SIR and the effect of filtering on the modulation accuracy are evaluated using OFDM ISDB-T and LTE waveforms. Flexible hardware structures are also proposed for actual implementations. The results show that these pulse shaping methods enable the waveform to exploit the available spectrum fragments and support multiple services without significant performance penalty, which can allow a more flexible air interface. Master's degree. Telecomunicações e Telemática (Telecommunications and Telematics). Mestre em Engenharia Elétrica (Master in Electrical Engineering). CAPE
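
As a rough illustration of CP-based edge windowing, the sketch below tapers the start of the cyclic prefix and an added cyclic suffix with a raised-cosine ramp. This is an assumption-laden example, not the dissertation's window design: the raised-cosine shape, the cyclic suffix, and the names `raised_cosine_ramp` and `window_cp_symbol` are all chosen here for illustration.

```python
import math


def raised_cosine_ramp(length):
    """Rising half of a raised-cosine edge, values strictly in (0, 1)."""
    return [0.5 * (1.0 - math.cos(math.pi * (k + 0.5) / length))
            for k in range(length)]


def window_cp_symbol(symbol, cp_len, ramp_len):
    """Add a cyclic prefix and a short cyclic suffix, then taper the edges.

    The ramp-up consumes the first ramp_len samples of the CP; the
    ramp-down is applied to a cyclic suffix of ramp_len samples that is
    meant to overlap the ramp-up of the next symbol.
    """
    if not 0 < ramp_len <= cp_len <= len(symbol):
        raise ValueError("need 0 < ramp_len <= cp_len <= len(symbol)")
    extended = symbol[-cp_len:] + symbol + symbol[:ramp_len]
    ramp = raised_cosine_ramp(ramp_len)
    shaped = list(extended)
    for k in range(ramp_len):
        shaped[k] *= ramp[k]                      # smooth rising edge
        shaped[-1 - k] *= ramp[k]                 # mirrored falling edge
    return shaped
```

With this ramp shape, the falling edge of one symbol and the rising edge of the next sum to one at every overlapped sample, so overlap-adding consecutive symbols keeps a constant envelope across the transition while removing the abrupt jump that produces large spectral sidelobes.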