71 research outputs found
Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration
Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is exploiting the highly parallel GPU architecture and porting the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially by the more powerful floating-point instructions, has not been comprehensively investigated in fact. In this paper, we fully exploit the floating-point computing power of GPUs, by various designs, including the floating-point-based Montgomery multiplication/exponentiation algorithm and Chinese Remainder Theorem (CRT) implementation in GPU. And for practical usage of the proposed algorithm, a new method is performed to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further promoting the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption precedes the existing fastest integer-based result by 23%
Neural network computing using on-chip accelerators
The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources.
In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications
CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure
Fully homomorphic encryption (FHE) is in the spotlight as a definitive
solution for privacy, but the high computational overhead of FHE poses a
challenge to its practical adoption. Although prior studies have attempted to
design ASIC accelerators to mitigate the overhead, their designs require
excessive amounts of chip resources (e.g., areas) to contain and process
massive data for FHE operations.
We propose CiFHER, a chiplet-based FHE accelerator with a resizable
structure, to tackle the challenge with a cost-effective multi-chip module
(MCM) design. First, we devise a flexible architecture of a chiplet core whose
configuration can be adjusted to conform to the global organization of chiplets
and design constraints. The distinctive feature of our core is a recomposable
functional unit providing varying computational throughput for number-theoretic
transform (NTT), the most dominant function in FHE. Then, we establish
generalized data mapping methodologies to minimize the network overhead when
organizing the chips into the MCM package in a tiled manner, which becomes a
significant bottleneck due to the technology constraints of MCMs. Also, we
analyze the effectiveness of various algorithms, including a novel limb
duplication algorithm, on the MCM architecture. A detailed evaluation shows
that a CiFHER package composed of 4 to 64 compact chiplets provides performance
comparable to state-of-the-art monolithic ASIC FHE accelerators with
significantly lower package-wide power consumption while reducing the area of a
single core to as small as 4.28mm.Comment: 15 pages, 9 figure
Parallel architectures and runtime systems co-design for task-based programming models
The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design.
This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cómputo modernos ha provocado la necesidad de una visión holística en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programación y las aplicaciones. Hoy en día el diseño de los computadores consiste en diferentes capas de abstracción con una interfaz bien definida entre ellas. Las limitaciones de esta aproximación junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayoría de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducción del tamaño del canal del transistor, lo cual permite chips más rápidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnología actual está alcanzando limitaciones físicas donde no será posible reducir el tamaño de los transistores motivando así un cambio de paradigma en la construcción de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecución del modelo de programación sean capaces de intercambiar información para alcanzar una meta común: La mejora del rendimiento y la reducción del consumo energético. Haciendo que la arquitectura sea consciente de la información disponible en el modelo de programación, como puede ser el grafo de dependencias entre tareas en los modelos de programación dataflow, es posible reducir el consumo energético explotando el camino critico del grafo. Además, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecución para realizar una mejor planificación de las comunicaciones y creando nuevas oportunidades de solapamiento entre cómputo y comunicación que no eran posibles anteriormente. Esta tesis aporta una evaluación de todas estas propuestas, así como una metodología para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version
Recent Advances in Embedded Computing, Intelligence and Applications
The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to the fostering of the offloading processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems
Recommended from our members
A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units
Multiple precision (MP) arithmetic is a core building block of a wide variety of algorithms in computational mathematics and computer science. In mathematics MP is used in computational number theory, geometric computation, experimental mathematics, and in some random matrix problems. In computer science, MP arithmetic is primarily used in cryptographic algorithms: securing communications, digital signatures, and code breaking. In most of these application areas, the factor that limits performance is the MP arithmetic. The focus of our research is to build and analyze highly optimized libraries that allow the MP operations to be offloaded from the CPU to the GPU. Our goal is to achieve an order of magnitude improvement over the CPU in three key metrics: operations per second per socket, operations per watt, and operation per second per dollar. What we find is that the SIMD design and balance of compute, cache, and bandwidth resources on the GPU is quite different from the CPU, so libraries such as GMP cannot simply be ported to the GPU. New approaches and algorithms are required to achieve high performance and high utilization of GPU resources. Further, we find that low-level ISA differences between GPU generations means that an approach that works well on one generation might not run well on the next.
Here we report on our progress towards MP arithmetic libraries on the GPU in four areas: (1) large integer addition, subtraction, and multiplication; (2) high performance modular multiplication and modular exponentiation (the key operations for cryptographic algorithms) across generations of GPUs; (3) high precision floating point addition, subtraction, multiplication, division, and square root; (4) parallel short division, which we prove is asymptotically optimal on EREW and CREW PRAMs
- …