
8 research outputs found

Doctor of Philosophy

Author: Ardestani Ali Shafiee
Publication venue: University of Utah
Publication date: 01/01/2018

Deep Neural Networks (DNNs) are the state-of-the-art solution in a growing number of tasks including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend the vast majority of their runtime performing matrix-by-vector multiplications (MVM). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform DNN digital accelerators if they leverage pipelining and smart encodings, and can distribute a computation in time and space, within crossbars and across crossbars. In the ISAAC design, roughly half the chip area/power can be attributed to the analog-to-digital conversion (ADC), i.e., it remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC is able to outperform the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x.
Finally, we show that in-situ computing, unfortunately, cannot be easily adapted to handle training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that in turn will have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
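
The in-situ matrix-by-vector multiplication described in this abstract can be illustrated with a minimal software sketch. It assumes an ideal crossbar (no noise, no wire resistance) in which each column current is the sum of input voltage times cell conductance, followed by an idealized ADC. The function names `crossbar_mvm` and `adc`, and the chosen resolution, are hypothetical and are not taken from ISAAC or Newton.

```python
def crossbar_mvm(conductances, voltages):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G[i][j].

    conductances: list of rows, one row per input line (the stored matrix).
    voltages: one input value per row (the input vector).
    """
    cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(cols)
    ]


def adc(current, full_scale, bits):
    """Idealized ADC: clamp to [0, full_scale] and map to an unsigned code."""
    levels = (1 << bits) - 1
    clamped = min(max(current, 0.0), full_scale)
    return round(clamped / full_scale * levels)


# The matrix stays in place; only the input vector is applied.
G = [[1, 2],
     [3, 4]]
currents = crossbar_mvm(G, [1, 1])
codes = [adc(i, full_scale=6, bits=3) for i in currents]
```

Every column needs one conversion per multiplication in this sketch, which gives a rough sense of why the abstract singles out the ADC as the dominant cost.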

The University of Utah: J. Willard Marriott Digital Library

Separation logic for high-level synthesis

Author: Winterstein Felix
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/05/2016

High-level synthesis (HLS) promises a significant shortening of the digital hardware design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures remain difficult to implement well, yet such constructs are widely used in software. Automated optimisations that leverage the memory bandwidth of dedicated hardware implementations by distributing the application data over separate on-chip memories and parallelising the implementation are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis that disambiguates pointer-based memory accesses. This thesis takes a step towards closing this gap. We explore recent advances in separation logic, a rigorous mathematical framework that enables formal reasoning about the memory access of heap-manipulating programs. We develop a static analysis that automatically splits heap-allocated data structures into provably disjoint regions. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable loop parallelisation and physical memory partitioning by off-the-shelf HLS tools. We then extend the scope of our technique to pointer-based memory-intensive implementations that require access to an off-chip memory. The extended HLS design aid generates parallel on-chip multi-cache architectures. It uses the disjointness property of memory accesses to support non-overlapping memory regions by private caches. It also identifies regions which are shared after parallelisation and which are supported by parallel caches with a coherency mechanism and synchronisation, resulting in automatically specialised memory systems. We show up to 15x acceleration from heap partitioning, parallelisation and the insertion of the custom cache system in demonstrably practical applications.

Spiral - Imperial College Digital Repository

On the exploration and optimization of caches under parametric variation

Author: Αντωνιάδης Χαράλαμπος Γ.
Publication date: 01/01/2014

University of Thessaly Institutional Repository

SCMFS Performance Enhancement and Implementation on Mobile Platform

Author: Cao Qian

This thesis presents a method for enhancing the performance of the Storage Class Memory File System (SCMFS) and an implementation of SCMFS on the Android platform. It focuses on analyzing the factors that influence the performance of memory file systems and the differences between implementing SCMFS on the Android and Linux kernels. SCMFS allocates memory pages as file blocks and employs virtual memory addresses as file block addresses. SCMFS utilizes the processor's memory management unit and TLB (Translation Lookaside Buffer) during file accesses. The TLB is an expensive resource and has a limited number of entries to cache virtual-to-physical address translations. A TLB miss results in an expensive page walk through the memory page table. Thus TLB misses play an important role in determining SCMFS performance. In this thesis, SCMFS is designed to support both 4KB and 2MB page sizes in order to reduce TLB misses and to avoid significant internal fragmentation. By comparing SCMFS with YAFFS2 and EXT4 using popular benchmarks, both the advantages and disadvantages of the SCMFS huge-page version and small-page version are revealed. In the second part of this thesis, an implementation of SCMFS on the Android platform is presented. At the time of this research project, the Android kernel had not yet been merged into the Linux kernel. Two main changes are made to the SCMFS kernel code, in memory zoning and in the inode functions, to make it compatible with the Android kernel. AndroSH, a file system benchmark for SCMFS on Android, is developed based on shell script. Evaluations are made from three perspectives to compare SCMFS with YAFFS2 and EXT4: I/O throughput, user data access latency, and application execution latency. SCMFS shows a performance advantage because of its small instruction footprint and its pre-allocation mechanism. However, the singly linked list used by SCMFS to store subdirectories is less efficient than the HTree index used by EXT4. Future work can improve the lookup efficiency of SCMFS.
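
The trade-off between 4KB and 2MB pages mentioned in this abstract can be made concrete with a little arithmetic. The sketch below is only illustrative: the 64-entry TLB and the 10 KiB file are hypothetical values chosen for the example, not figures from the thesis, and the helper names are invented here.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


def internal_fragmentation(file_size, page_size):
    """Bytes allocated but unused when a file is stored in whole pages."""
    pages = math.ceil(file_size / page_size)
    return pages * page_size - file_size


TLB_ENTRIES = 64        # hypothetical TLB size, for illustration only
SMALL_FILE = 10 * KIB   # hypothetical file size, for illustration only

for page_size in (4 * KIB, 2 * MIB):
    reach = tlb_reach(TLB_ENTRIES, page_size)
    waste = internal_fragmentation(SMALL_FILE, page_size)
    print(f"page={page_size} B  reach={reach} B  wasted={waste} B")
```

Under these assumed numbers, the larger page multiplies the TLB reach by 512 but wastes almost a whole 2MB page on a small file, which is the tension that supporting both page sizes is meant to address.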

Texas A&M Repository

Abusing Hardware Race Conditions for High Throughput Energy Efficient Computation

Author: Madhavan Advait
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

We propose a novel computing approach, called “Race Logic”, which utilizes a new data representation to accelerate a broad class of optimization problems, such as those solved by dynamic programming algorithms. The core idea of Race Logic is to deliberately engineer race conditions in a circuit to perform useful computation. In Race Logic, information, instead of being represented as logic levels (as is done in conventional logic), is represented as a timing delay. Computations can then be performed by observing the relative propagation times of signals injected into a configurable circuit (i.e. the outcome of races through the circuit). In this dissertation I will introduce Race-Based computation and discuss multiple VLSI implementations. We begin by considering a synchronous approach, which uses simple clocked delay elements. Though this synchronous implementation outperforms highly optimized conventional implementations of the well-studied DNA sequence alignment problem, its third-order energy scaling with problem size and the limited dynamic range of its timing delays are its major pitfalls. Next, in the search for energy efficiency, we study asynchronous designs in order to understand the performance trade-offs and applicability of this new architecture. Finally, I will present the results of a prototype asynchronous Race Logic chip and demonstrate that Race-Based computations can align up to 10 million 50-symbol-long DNA sequences per second, about 2-3 orders of magnitude faster than state-of-the-art general-purpose computing systems.
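
The idea of computing with races can be sketched in software as an event-driven simulation: a signal is injected at one corner of the sequence-alignment edit graph, every edge adds a delay, and the time at which the first signal reaches the opposite corner is the alignment score. This is a simulation under stated assumptions, not the dissertation's circuit. The function name and the delay values (1 for an insertion or deletion, 1 for a mismatch, 0 for a match) are hypothetical, and a zero-delay edge is a software convenience rather than something a physical delay element provides.

```python
import heapq


def race_alignment_score(a, b, indel=1, mismatch=1, match=0):
    """Earliest arrival time of a signal racing across the edit graph of a and b."""
    target = (len(a), len(b))
    arrived = {}              # node -> time the first signal reached it
    events = [(0, (0, 0))]    # (time, node); the signal is injected at the origin
    while events:
        t, (i, j) = heapq.heappop(events)
        if (i, j) in arrived:
            continue          # a faster signal already won the race to this node
        arrived[(i, j)] = t
        if (i, j) == target:
            return t
        if i < len(a):        # skip a symbol of a
            heapq.heappush(events, (t + indel, (i + 1, j)))
        if j < len(b):        # skip a symbol of b
            heapq.heappush(events, (t + indel, (i, j + 1)))
        if i < len(a) and j < len(b):   # align a[i] with b[j]
            step = match if a[i] == b[j] else mismatch
            heapq.heappush(events, (t + step, (i + 1, j + 1)))
    raise AssertionError("unreachable: the target node is always reached")


score = race_alignment_score("ACGT", "AGGT")
```

With these assumed delays the arrival time equals the edit distance between the two strings. Keeping only the first arrival at each node plays the role of taking a minimum, so the shortest path is found without any explicit comparison of candidate scores.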

Ezid

eScholarship - University of California

HIGH PERFORMANCE CLOCK DISTRIBUTION FOR HIGH-SPEED VLSI SYSTEMS

Author: Xu Zhang
Publication date: 12/05/2008

Tohoku University堀口進課

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Institutional Repositories DataBase (IRDB)

Towards Practical Access Control and Usage Control on the Cloud using Trusted Hardware

Author: Djoko Takougue Judicael Briand
Publication date: 05/06/2020

Cloud-based platforms have become the principal way to store, share, and synchronize files online. For individuals and organizations alike, cloud storage not only provides resource scalability and on-demand access at a low cost, but also eliminates the necessity of provisioning and maintaining complex hardware installations. Unfortunately, because cloud-based platforms are frequent victims of data breaches and unauthorized disclosures, data protection requires both access control and usage control to manage user authorization and regulate future data use. Encryption can ensure data security against unauthorized parties, but complicates file sharing, which now requires distributing keys to authorized users and a mechanism that prevents revoked users from accessing or modifying sensitive content. Further, as user data is stored and processed on remote machines, usage control in a distributed setting requires incorporating the local environmental context at policy evaluation, as well as tamper-proof and non-bypassable enforcement. Existing cryptographic solutions either require server-side coordination, offer limited flexibility in data sharing, or incur significant re-encryption overheads on user revocation. This combination of issues makes them ill-suited to large-scale distributed environments where there are a large number of users, dynamic changes in user membership and access privileges, and resources shared across organizational domains. Thus, developing a robust security and privacy solution for the cloud requires: fine-grained access control to associate the largest set of users and resources with variable granularity, scalable administration costs when managing policies and access rights, and cross-domain policy enforcement. To address the above challenges, this dissertation proposes a practical security solution that relies solely on commodity trusted hardware to ensure confidentiality and integrity throughout the data lifecycle.
The aim is to maintain complete user ownership against external hackers and malicious service providers, without losing the scalability or availability benefits of cloud storage. Furthermore, we develop a principled approach that is: (i) portable across storage platforms without requiring any server-side support or modifications, (ii) flexible in allowing users to selectively share their data using fine-grained access control, and (iii) performant by imposing modest overheads on standard user workloads. Essentially, our system must be client-side and provide end-to-end data protection and secure sharing, without significant degradation in performance or user experience. We introduce NeXUS, a privacy-preserving filesystem that enables cryptographic protection and secure file sharing on existing network-based storage services. NeXUS protects the confidentiality and integrity of file content, as well as file and directory names, while mitigating rollback attacks on the filesystem hierarchy. We also introduce Joplin, a secure access control and usage control system that provides practical attribute-based sharing with decentralized policy administration, including efficient revocation, multi-domain policies, secure user delegation, and mandatory audit logging. Both systems leverage trusted hardware to prevent the leakage of sensitive material such as encryption keys and access control policies; they are completely client-side, easy to install and use, and can be readily deployed across remote storage platforms without requiring any server-side changes or trusted intermediary. We developed prototypes for NeXUS and Joplin, and evaluated their respective overheads in isolation and within a real-world environment. Results show that both prototypes introduce modest overheads on interactive workloads, and achieve portability across storage platforms, including Dropbox and AFS.
Together, NeXUS and Joplin demonstrate that a client-side solution employing trusted hardware such as Intel SGX can effectively protect remotely stored data on existing file sharing services.

D-Scholarship@Pitt


Side channel attack resistant elliptic curves cryptosystem on multi-cores for power efficiency

Author: Yoo Jaewon
Publication venue: Oregon State University

The advent of multi-cores allows programs to be executed much faster than before. Cryptographic algorithms use long-bit words, so parallelizing these operations on multi-cores can achieve a significant performance improvement. However, not all long-bit word operations in cryptosystems are suitable for parallel execution on multi-cores. In particular, long-bit words used in Elliptic Curve Cryptography (ECC) do not divide evenly by the system word size. This causes some of the cores to be idle, which lets attackers guess how many operations occurred and thus what field size is being used. Multiplication is the most important part of public key cryptosystems. Long-bit word multiplication operations are needed for encryption and decryption. J. Fan et al. proposed using Montgomery multiplication on multi-cores using GF(2²⁵⁶) [25, 26], which is suitable for computer systems with a 16-bit or 32-bit word size. Fan's Montgomery multiplication is suitable for most RSA. However, in ECC, some GFs will cause idle cores. For example, suppose GF(2¹³¹) is used (which is one of the word sizes recommended by NIST) on a quad-core with a 32-bit word size, which requires ⌈131/32⌉ = 5 iterations, with the last iteration requiring just a 3-bit operation. This causes three of the cores to be idle during this time, resulting in needless power consumption. The most general and easiest way to make side channel attacks difficult is to insert dummy instructions to cover the idle processors. However, dummy instructions result in extra workloads that lead to performance degradation and increased power consumption. In this thesis, we present a multiplier adjuster technique to improve the execution time and the power consumption for the last unbalanced iteration. By appropriately applying dummy instructions between point-addition and point-doubling operations, a balanced point operation can be achieved in ECC.
The performance and power-efficiency of the proposed method on multi-cores are analyzed for each GF used in ECC.
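
The GF(2¹³¹) example in this abstract reduces to a short calculation. The sketch below assumes that word-sized pieces of the operand are handed to the cores in rounds of one word per core, which is an interpretation made for illustration rather than a description of the thesis's scheduler. The function name `last_round_profile` is hypothetical.

```python
import math


def last_round_profile(field_bits, word_bits, cores):
    """Return (iterations, bits in the last word, idle cores in the last round)."""
    iterations = math.ceil(field_bits / word_bits)
    last_bits = field_bits - (iterations - 1) * word_bits
    # Assumption: words are dealt to cores in rounds of `cores` words each,
    # so the final round keeps only the leftover words busy.
    busy_in_last_round = iterations % cores or cores
    return iterations, last_bits, cores - busy_in_last_round


# A 131-bit field on a quad-core with 32-bit words
print(last_round_profile(131, 32, 4))   # (5, 3, 3)
# A 256-bit field divides evenly, so no core idles
print(last_round_profile(256, 32, 4))   # (8, 32, 0)
```

The first call reproduces the figures quoted in the abstract: five iterations, a 3-bit final word, and three idle cores.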

ScholarsArchive@OSU