On the Average Complexity of Moore's State Minimization Algorithm
We prove that, for any finite alphabet and for the uniform distribution over deterministic and accessible automata with n states, the average complexity of Moore's state minimization algorithm is in O(n log n). Moreover, this bound is tight in the case of unary automata.
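Moore's algorithm refines a partition of the states until no block splits further. The following is a minimal sketch of that refinement in Python; the 4-state DFA used here is a hypothetical example for illustration, not one taken from the paper.

```python
# Sketch of Moore's state minimization by iterated partition refinement.
# The example DFA below is a made-up illustration, not from the paper.

def moore_minimize(states, alphabet, delta, accepting):
    """Return a dict mapping each state to its equivalence-class id."""
    # Initial partition: accepting vs. non-accepting states.
    block = {q: (q in accepting) for q in states}
    while True:
        # Signature of a state: its own block plus the blocks of its successors.
        sig = {q: (block[q],) + tuple(block[delta[q][a]] for a in alphabet)
               for q in states}
        new_ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_block = {q: new_ids[sig[q]] for q in states}
        # Refinement only ever splits blocks, so an unchanged block count
        # means a fixed point has been reached.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# Example: a 4-state DFA over {a, b}; states 1 and 2 are equivalent.
states = [0, 1, 2, 3]
delta = {0: {'a': 1, 'b': 2}, 1: {'a': 3, 'b': 3},
         2: {'a': 3, 'b': 3}, 3: {'a': 3, 'b': 3}}
classes = moore_minimize(states, ['a', 'b'], delta, accepting={3})
```

The naive refinement loop shown here runs in O(n^2) in the worst case; the paper's result concerns the average behavior of Moore's algorithm, which it bounds by O(n log n).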
General Purpose Computation on Graphics Processing Units Using OpenCL
Computational Science has emerged as a third pillar of science alongside theory and experiment, where parallel scientific computing is enabled by various shared- and distributed-memory architectures such as supercomputers, grid- and cluster-based systems, and multi-core and multiprocessor systems. In recent years, the use of GPUs (Graphics Processing Units) for general-purpose computing, commonly known as GPGPU, has made them an exciting addition to high-performance computing (HPC) systems with respect to the price-to-performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multiprocessors, so the degree of parallelism is promising. Moreover, the development of new and easy-to-use interfacing tools and programming languages such as OpenCL and CUDA has made GPUs suitable for computation-demanding applications such as micromagnetic simulations. In micromagnetic simulations, the study of magnetic behavior at very small time and space scales demands a huge computation time; the calculation of the magnetostatic field, with complexity O(N log N) using the FFT algorithm for discrete convolution, is the main contribution to the overall simulation time, and it is computed many times at each time-step interval. The study and observation of magnetization behavior at sub-nanosecond time scales is crucial to a number of areas such as magnetic sensors, non-volatile storage devices, and magnetic nanowires. Since micromagnetic codes are in general suitable for parallel programming, as they can easily be divided into independent parts that run in parallel, the current trend for micromagnetic codes is to shift the computationally intensive parts to GPUs. My PhD work mainly focuses on the development of a highly parallel magnetostatic field solver for micromagnetic simulators on GPUs.
I am using OpenCL for the GPU implementation, since it is an open, cross-platform standard for parallel programming of heterogeneous systems. The magnetostatic field calculation is dominated by multidimensional FFT (Fast Fourier Transform) computations. I have therefore developed a specialized OpenCL-based 3D-FFT library for the magnetostatic field calculation, which makes it possible to fully exploit the zero-padded input data, without transposition, and the symmetries inherent in the field calculation. It also provides a common interface for different vendors' GPUs. In order to fully utilize the GPUs' parallel architecture, the code needs to handle many hardware-specific technicalities such as coalesced memory access, data-transfer overhead between GPU and CPU, GPU global memory utilization, arithmetic computation, and batch execution. In a second step, to further increase the level of parallelism and performance, I have developed a parallel magnetostatic field solver on multiple GPUs. Utilizing multiple GPUs avoids many of the limitations of a single GPU (e.g., on-chip memory resources) by exploiting the combined resources of multiple on-board GPUs. The GPU implementation has shown an impressive speedup over an equivalent OpenMP-based parallel implementation on the CPU, which means that micromagnetic simulations that require weeks of computation on a CPU can now be performed in hours or even minutes on GPUs. In parallel, I have also worked on ordered queue management on GPUs. Ordered queue management is used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application depends on the sorting algorithm used for its priority queues. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms.
In this work I present an analysis of different sorting algorithms with respect to sorting time, sorting rate, and speedup on different GPU and CPU architectures, and provide a new sorting technique on the GPU.
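The core O(N log N) primitive described above, discrete convolution via zero-padded FFTs, can be sketched in a few lines of NumPy. The thesis implementation uses OpenCL on the GPU; the kernel below is a random stand-in for illustration, not a physical demagnetization tensor.

```python
import numpy as np

# Sketch of FFT-based discrete convolution, the O(N log N) kernel that
# dominates the magnetostatic field computation. Done here in 1D NumPy;
# the thesis uses an OpenCL 3D-FFT on the GPU, and the "kernel" below is
# a random stand-in rather than a physical interaction tensor.

rng = np.random.default_rng(0)
N = 64
m = rng.standard_normal(N)   # magnetization samples (stand-in data)
k = rng.standard_normal(N)   # interaction kernel (stand-in data)

# Zero-pad both sequences to 2N so the cyclic convolution computed by the
# FFT equals the linear convolution, then multiply in the frequency domain.
M = np.fft.rfft(m, 2 * N)
K = np.fft.rfft(k, 2 * N)
h_fft = np.fft.irfft(M * K, 2 * N)[:N]    # field via FFT, O(N log N)
h_direct = np.convolve(m, k)[:N]          # field via direct sum, O(N^2)

assert np.allclose(h_fft, h_direct)
```

The zero padding is exactly the structure the thesis exploits: half of each padded input is known to be zero, so a specialized FFT can skip work that a general library would perform.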
Neural Network Exploration Using Optimal Experiment Design
We consider the question "How should one act when the only goal is to learn as much as possible?" Building on the theoretical results of Fedorov [1972] and MacKay [1992], we apply techniques from Optimal Experiment Design (OED) to guide the query/action selection of a neural network learner. We demonstrate that these techniques allow the learner to minimize its generalization error by exploring its domain efficiently and completely. We conclude that, while not a panacea, OED-based query/action selection has much to offer, especially in domains where its high computational costs can be tolerated.
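As a rough illustration of the OED idea, an A-optimality-style rule queries the candidate input where the model's predictive variance is largest. The sketch below uses a linear learner rather than the paper's neural network, and the ridge term and all variable names are assumptions made for the example.

```python
import numpy as np

# Hedged sketch of OED-style query selection for a *linear* learner (the
# paper applies the analogous idea to neural networks). We query the
# candidate x with the largest predictive variance x^T (X^T X + lam I)^{-1} x,
# i.e. the input about which the current model is least certain.

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3))        # inputs queried so far (stand-in)
lam = 1e-3                              # ridge term (assumed for stability)
cov = np.linalg.inv(X.T @ X + lam * np.eye(3))   # parameter covariance shape

candidates = rng.standard_normal((100, 3))
# Quadratic form x_i^T cov x_i for every candidate, in one einsum.
variances = np.einsum('ij,jk,ik->i', candidates, cov, candidates)
query = candidates[np.argmax(variances)]  # most informative next query
```

After the learner observes the label for `query`, the row is appended to `X`, the covariance shrinks in that direction, and the selection repeats; this greedy variance-reduction loop is the basic OED exploration strategy the abstract refers to.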
Fully Distributed And Mixed Symmetric Diagonal Dominant Solvers For Large Scale Optimization
Over the past twenty years, we have witnessed an unprecedented growth in data, inaugurating the so-called Big Data epoch. Throughout these years, the exponential growth in the power of computer chips forecast by Moore's Law has allowed us to keep pace with this growing stream of data. However, due to the physical limitations on the size of transistors, we have already reached the computational limits of the traditional microprocessor architecture. Therefore, we need either conceptually new computers or distributed models of computation that allow processors to solve Big Data problems in a collaborative manner.
The purpose of this thesis is to show that decentralized optimization is capable of addressing our
growing computational demands by exploiting the power of coordinated data processing. In particular,
we propose an exact distributed Newton method for two important challenges in large-scale
optimization: Network Flow and Empirical Risk Minimization.
The key observation behind our method is the symmetric diagonal dominant structure of the Hessians of the dual functions corresponding to the aforementioned problems. Consequently, one can calculate the Newton direction by solving symmetric diagonal dominant (SDD) systems in a decentralized fashion.
We first propose a fully distributed SDD solver based on a recursive approximation of the SDD matrix inverse by a collection of specifically structured distributed matrices. To improve the precision of the algorithm, we then apply Richardson preconditioners, arriving at an efficient algorithm capable of approximating the solution of an SDD system to arbitrary precision.
Our second fully distributed SDD solver significantly improves the computational performance of
the first algorithm by utilizing Chebyshev polynomials for an approximation of the SDD matrix
inverse. The particular choice of Chebyshev polynomials is motivated by their extremal properties
and their recursive relation.
We then explore mixed strategies for solving SDD systems by slightly relaxing the decentralization
requirements. Roughly speaking, by allowing one computer to aggregate some particular information
from all others, one can gain quite surprising computational benefits. The key idea is to
construct a spectral sparsifier of the underlying graph of computers by using local communication
between them.
Finally, we apply these solvers to calculate the Newton direction for the dual functions of Network Flow and Empirical Risk Minimization. On the theoretical side, we establish a quadratic convergence rate for our algorithms, surpassing all existing techniques. On the empirical side, we verify their superior performance in an extensive set of numerical simulations.
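As a toy, centralized illustration of the SDD-solving primitive the thesis builds on, the sketch below runs Jacobi-preconditioned Richardson iteration on a small SDD system (a path-graph Laplacian with a diagonal shift). It is not the distributed algorithm itself; the example matrix and iteration count are assumptions made for the sketch.

```python
import numpy as np

# Toy, centralized sketch of solving an SDD system Ax = b with
# Jacobi-preconditioned Richardson iteration, the basic primitive behind
# the thesis's distributed solvers. The chain-graph matrix below is an
# illustrative example, not one from the thesis.

n = 20
# SDD matrix: Laplacian of a path graph plus a small diagonal shift,
# which keeps the matrix positive definite and strictly diagonally dominant.
A = 2.1 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

Dinv = 1.0 / np.diag(A)           # diagonal (Jacobi) preconditioner
x = np.zeros(n)
for _ in range(2000):             # Richardson: x <- x + D^{-1}(b - Ax)
    x = x + Dinv * (b - A @ x)

assert np.allclose(A @ x, b, atol=1e-6)
```

Each update needs only a matrix-vector product and a diagonal scaling, both of which decompose over the rows of A; that row-wise locality is what makes the scheme amenable to the decentralized setting the thesis studies, where Chebyshev recurrences are then used to accelerate exactly this kind of fixed-point iteration.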
Memristors -- from In-memory computing, Deep Learning Acceleration, Spiking Neural Networks, to the Future of Neuromorphic and Bio-inspired Computing
Machine learning, particularly in the form of deep learning, has driven most
of the recent fundamental developments in artificial intelligence. Deep
learning is based on computational models that are, to a certain extent,
bio-inspired, as they rely on networks of connected simple computing units
operating in parallel. Deep learning has been successfully applied in areas
such as object/pattern recognition, speech and natural language processing,
self-driving vehicles, intelligent self-diagnostics tools, autonomous robots,
knowledgeable personal assistants, and monitoring. These successes have been
mostly supported by three factors: availability of vast amounts of data,
continuous growth in computing power, and algorithmic innovations. The
approaching demise of Moore's law, and the consequent expected modest
improvements in computing power that can be achieved by scaling, raise the
question of whether the described progress will be slowed or halted due to
hardware limitations. This paper reviews the case for a novel beyond-CMOS
hardware technology, memristors, as a potential solution for the implementation
of power-efficient in-memory computing, deep learning accelerators, and spiking
neural networks. Central themes are the reliance on non-von-Neumann computing
architectures and the need for developing tailored learning and inference
algorithms. To argue that lessons from biology can be useful in providing directions for further progress in artificial intelligence, we briefly discuss an example based on reservoir computing. We conclude the review by speculating on the big-picture view of future neuromorphic and brain-inspired computing systems.
Simulated Annealing with min-cut and greedy perturbations
Custom integrated circuit design requires an ever-increasing number of elements to be placed on a physical die. The process of searching for an optimal solution is NP-hard, so heuristics are required to achieve satisfactory results under time constraints.
Simulated Annealing is an algorithm which uses randomly generated perturbations to adjust a single solution. The effect of a generated perturbation is examined by a cost function which evaluates the solution. If the perturbation decreases the cost, it is accepted. If it increases the cost, it is accepted probabilistically. This approach allows the algorithm to escape local minima and find satisfactory solutions. One problem faced by Simulated Annealing is that it can take a very large number of iterations to reach a desired result. Greedy perturbations use knowledge of the system to generate solutions which may be satisfactory after fewer iterations than non-greedy ones; however, previous work has indicated that the exclusive use of greedy perturbations tends to trap the solution in local minima.
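The acceptance rule described above (the Metropolis criterion) can be sketched as follows; the cost function and perturbation used here are toy stand-ins for illustration, not a circuit-placement cost.

```python
import math
import random

# Minimal sketch of the Simulated Annealing acceptance rule: improving
# moves are always taken, worsening moves are taken with probability
# exp(-delta/T). The toy objective below stands in for a placement cost.

def accept(delta_cost, temperature, rng):
    """Metropolis criterion: always accept improvements, sometimes worsenings."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

# Toy anneal: minimize f(x) = x^2 over the integers with +/-1 perturbations.
rng = random.Random(0)
x, t = 50, 10.0
for _ in range(5000):
    cand = x + rng.choice((-1, 1))          # randomly generated perturbation
    if accept(cand * cand - x * x, t, rng):  # cost delta of the perturbation
        x = cand
    t = max(t * 0.999, 1e-3)                 # geometric cooling schedule
```

The early high temperature is what lets the search climb out of local minima; as `t` cools, uphill moves become vanishingly likely and the solution settles, which is why a purely greedy move generator (never accepting uphill moves) risks the local-minima trap noted above.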
Min-cut is a procedure in which a graph is split into two pieces with the least interconnection possible between them. Applying it to a placement problem helps to recognize components which belong to the same functional unit and thus enhances the results of Simulated Annealing. The feasibility of this approach has been assessed.
Hardware parallelization can be used to increase the performance of algorithms by decreasing runtime. The possibility of increased performance motivated exploring whether greedy perturbations can be modeled in hardware. The use of greedy perturbations while avoiding local minima was also explored.
A Dual Dielectric Approach for Performance Aware Reduction of Gate Leakage in Combinational Circuits
Design of systems in the low-end nanometer domain has introduced new dimensions in power consumption and dissipation in CMOS devices. With continued and aggressive scaling, using low-thickness SiO2 for the transistor gates, gate leakage due to gate-oxide direct tunneling current has emerged as the major component of leakage in CMOS circuits. Therefore, providing a solution to the issue of gate-oxide leakage has become one of the key concerns in achieving low-power and high-performance CMOS VLSI circuits. In this thesis, a new approach is proposed, involving dual dielectrics of dual thicknesses (DKDT), for reducing both ON- and OFF-state gate leakage. It is claimed that the simultaneous utilization of SiON and SiO2, each with multiple thicknesses, is a better approach for gate leakage reduction than the conventional use of a single gate dielectric (SiO2), possibly with multiple thicknesses. An algorithm is developed for DKDT assignment that minimizes the overall leakage of a circuit without compromising performance. Extensive experiments on ISCAS'85 benchmarks using 45nm technology showed that the proposed approach can reduce leakage by as much as 98% (89.5% on average) without degrading performance.
On the benefits of resource disaggregation for virtual data centre provisioning in optical data centres
Virtual Data Centre (VDC) allocation requires the provisioning of both computing and network resources. Their joint provisioning allows for an optimal utilization of the physical Data Centre (DC) infrastructure resources. However, traditional DCs can suffer from computing resource underutilization due to the rigid capacity configurations of the server units, resulting in high computing resource fragmentation across the DC servers. To overcome these limitations, the disaggregated DC paradigm has recently been introduced. Thanks to resource disaggregation, it is possible to allocate the exact amount of resources needed to provision a VDC instance. In this paper, we focus on the static planning of a shared, optically interconnected, disaggregated DC infrastructure to support a known set of VDC instances to be deployed on top of it. To this end, we provide optimal and sub-optimal techniques to determine the capacity (in terms of both computing and network resources) required to support the expected set of VDC demands. Next, we quantitatively evaluate the benefits yielded by the disaggregated DC paradigm against traditional DC architectures, considering various VDC profiles and Data Centre Network (DCN) topologies.