16 research outputs found

    Shared memory with hidden latency on a family of mesh-like networks

    Aspects of practical implementations of PRAM algorithms

    The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple, architecture-independent model and a good programming environment. Theoreticians in the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results: 1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio. 2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks. 3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation: balanced tree computations, Fast Fourier Transforms and matrix multiplications.

    On the implementation of P-RAM algorithms on feasible SIMD computers

    The P-RAM model of computation has proved to be a very useful theoretical model for exploiting and extracting inherent parallelism in problems and thus for designing parallel algorithms. Therefore, it becomes very important to examine whether results obtained for such a model can be translated onto machines considered to be more realistic in the face of current technological constraints. In this thesis, we show how the implementation of many techniques and algorithms designed for the P-RAM can be achieved on the feasible SIMD class of computers. The first investigation concerns classes of problems solvable on the P-RAM model using the recursive techniques of compression, tree contraction and 'divide and conquer'. For such problems, specific methods are emphasised to achieve efficient implementations on some SIMD architectures. Problems such as list ranking, polynomial and expression evaluation are shown to have efficient solutions on the 2-dimensional mesh-connected computer. The balanced binary tree technique is widely employed to solve many problems in the P-RAM model. By proposing an implicit embedding of the binary tree of size n on a (√n × √n) mesh-connected computer (contrary to using the usual H-tree approach, which requires a mesh of size ≈ (2√n × 2√n)), we show that many of the problems solvable using this technique can be efficiently implemented on this architecture. Two efficient O(√n) algorithms for solving the bracket matching problem are presented. Consequently, the problems of expression evaluation (where the expression is given in an array form), evaluating algebraic expressions with a carrier of constant bounded size and parsing expressions of both bracket and input driven languages are all shown to have efficient solutions on the 2-dimensional mesh-connected computer. Dealing with non-tree structured computations, we show that the Eulerian tour problem for a given graph with m edges and maximum vertex degree d can be solved in O(d√n) parallel time on the 2-dimensional mesh-connected computer. A way to increase the processor utilisation on the 2-dimensional mesh-connected computer is also presented. The method suggested consists of pipelining sets of iteratively solvable problems, each of which at each step of its execution uses only a fraction of the available PEs.

    Memory Subsystem Design for Explicit Multithreading Architectures

    Explicit multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. An important enabler for XMT is sufficient memory bandwidth to support parallelism. For targeted deep-submicron VLSI processes, chip designers will be faced with the widely acknowledged issues of rising interconnect RC delays and shortening clock periods. Comprehensive memory design for an XMT architecture has never before been rigorously studied. This thesis relies on an examination of the implications of the XMT programming model for memory subsystem design to motivate a potential framework for on-chip memory interconnection. Many system-level issues are considered, and analytical electrical interconnect modeling is used to demonstrate the physical viability of new structures in future processes. It is estimated that a chip built in a 2008 process with 1024 hardware execution contexts may be capable of a sustained on-chip memory transaction throughput of 1430 GB/s.

    Oblivious Parallel RAM and Applications

    We initiate the study of cryptography for parallel RAM (PRAM) programs. The PRAM model captures modern multi-core architectures and cluster computing models, where several processors execute in parallel and make accesses to shared memory, and provides the “best of both” circuit and RAM models, supporting both cheap random access and parallelism. We propose and attain the notion of Oblivious PRAM. We present a compiler taking any PRAM into one whose distribution of memory accesses is statistically independent of the data (with negligible error), while only incurring a polylogarithmic slowdown (in both total and parallel complexity). We discuss applications of such a compiler, building upon recent advances relying on Oblivious (sequential) RAM (Goldreich Ostrovsky JACM’12). In particular, we demonstrate the construction of a garbled PRAM compiler based on an OPRAM compiler and secure identity-based encryption

    Adaptation of multiway-merge sorting algorithm to MIMD architectures with an experimental study

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2002. Thesis (Master's) -- Bilkent University, 2002. Includes bibliographical references, leaves 73-78. Sorting is perhaps one of the most widely studied problems of computing. Numerous asymptotically optimal sequential algorithms have been discovered. Asymptotically optimal algorithms have been presented for varying parallel models as well. Parallel sorting algorithms have already been proposed for a variety of multiple instruction, multiple data streams (MIMD) architectures. In this thesis, we adapt the multiway-merge sorting algorithm, which was originally designed for product networks, to MIMD architectures. It has good load balancing properties, modest communication needs and good performance. The multiway-merge sort algorithm requires only two all-to-all personalized communications (AAPC) and two one-to-one communications, independent of the input size. In addition to evenly distributing the load, the algorithm requires local memory of size only 2N/P for each processor in the worst case, where N is the number of items to be sorted and P is the number of processors. We have implemented the algorithm on the PC cluster established at the Computer Engineering Department of Bilkent University. To compare the results, we have implemented a sample sort algorithm (PSRS, Parallel Sorting by Regular Sampling) by X. Liu et al. and a parallel quicksort algorithm (HyperQuickSort) on the same cluster. In the experimental studies we have used three different benchmarks, namely Uniform, Gaussian, and Zero distributed inputs. Although the multiway-merge algorithm did not achieve better results than the other two, which are theoretically cost optimal algorithms, there are some cases in which it outperforms them, such as Zero distributed input. The results of the experiments are reported in detail. The multiway-merge sort algorithm is not necessarily the best parallel sorting algorithm, but it is expected to achieve acceptable performance on a wide spectrum of MIMD architectures. Cantürk, Levent. M.S.

    Efficient Data-Oblivious Computation

    The rapid increase in the amount of data stored by cloud servers has resulted in growing privacy concerns for users. First, although keeping data encrypted at all times is an attractive approach to privacy, encryption may preclude mining and learning useful patterns from data. Second, companies are unable to distribute proprietary programs to other parties without risking the loss of their private code when those programs are reverse engineered. A challenge underlying both those problems is that how data is accessed, even when that data is encrypted, can leak secret information. Oblivious RAM is a well studied cryptographic primitive that can be used to solve the underlying challenge of hiding data-access patterns. In this dissertation, we improve Oblivious RAMs and oblivious algorithms asymptotically. We then show how to apply our novel oblivious algorithms to build systems that enable privacy-preserving computation on encrypted data and program obfuscation. Specifically, the first part of this dissertation shows two efficient Oblivious RAM algorithms: 1) The first algorithm achieves sub-logarithmic bandwidth blowup while only incurring an inexpensive XOR computation for performing Private Information Retrieval operations, and 2) The second algorithm is the first perfectly-secure Oblivious Parallel RAM with $O(\log^3 N)$ bandwidth blowup, $O((\log m + \log \log N)\log N)$ depth blowup, and $O(1)$ space blowup when the PRAM has $m$ CPUs and stores $N$ blocks of data. The second part of this dissertation describes two systems, HOP and GraphSC, that address the problem of computing on private data and the distribution of proprietary programs. HOP is a system that achieves simulation-secure obfuscation of RAM programs assuming secure hardware. It is the first prototype implementation of a provably secure virtual black-box (VBB) obfuscation scheme in any model under any assumptions. GraphSC is a system that allows cloud servers to run a class of data-mining and machine-learning algorithms over users’ data without learning anything about that data. GraphSC brings efficient, parallel secure computation to programmers by allowing them to express computation tasks using the GraphLab abstraction. It is backed by the first non-trivial parallel oblivious algorithms that outperform generic Oblivious RAMs.

    Asymmetric Load Balancing on a Heterogeneous Cluster of PCs

    In recent years, high performance computing with commodity clusters of personal computers has become an active area of research. Many organizations build them because they need the computational speedup provided by parallel processing but cannot afford to purchase a supercomputer. With commercial supercomputers and homogeneous clusters of PCs, applications that can be statically load balanced are balanced by assigning equal tasks to each processor. With heterogeneous clusters, the system designers have the option of quickly adding newer hardware that is more powerful than the existing hardware. When this is done, the assignment of equal tasks to each processor results in suboptimal performance. This research addresses techniques by which the size of the tasks assigned to processors is matched to the processors themselves, so that the more powerful processors do more work and the less powerful processors do less. We find that when the range of processing power is narrow, some benefit can be achieved with asymmetric load balancing. When the range of processing power is broad, dramatic improvements in performance are realized: our experiments have shown up to 92% improvement when asymmetrically load balancing a modified version of the NAS Parallel Benchmarks' LU application.
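The idea of sizing tasks to match processor power can be illustrated with a minimal sketch. It is not the method used in the work above: the function names `asymmetric_partition` and `makespan`, and the notion of a single relative "speed" rating per processor, are assumptions made only for this example. The sketch splits a fixed number of equal-cost work units in proportion to the speeds and compares the resulting finish time with an equal split.

```python
def asymmetric_partition(total_tasks, speeds):
    """Split total_tasks equal-cost work units in proportion to speeds.

    speeds holds hypothetical relative speed ratings, one per processor
    (an assumption of this sketch, not a measured quantity).
    """
    if total_tasks < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need total_tasks >= 0 and positive speeds")
    total_speed = sum(speeds)
    # Ideal (possibly fractional) share for each processor.
    ideal = [total_tasks * s / total_speed for s in speeds]
    shares = [int(x) for x in ideal]
    # Hand the units lost to rounding down to the processors with the
    # largest fractional remainders, so the shares sum to total_tasks.
    leftover = total_tasks - sum(shares)
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def makespan(shares, speeds):
    """Finish time of the slowest processor, in units of work per unit speed."""
    return max(n / s for n, s in zip(shares, speeds))


if __name__ == "__main__":
    speeds = [1.0, 1.0, 2.0, 4.0]          # hypothetical heterogeneous cluster
    equal = [200, 200, 200, 200]           # symmetric assignment of 800 units
    weighted = asymmetric_partition(800, speeds)
    print("equal split finish time:   ", makespan(equal, speeds))
    print("weighted split finish time:", makespan(weighted, speeds))
```

In this toy model the equal split is limited by the slowest processors, while the weighted split lets all processors finish at about the same time. Real applications also have communication costs and non-uniform task costs, which this sketch ignores.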

    Towards COP27: The Water-Food-Energy Nexus in a Changing Climate in the Middle East and North Africa

    Due to its low adaptability to climate change, the MENA region has become a "hot spot". Water scarcity, extreme heat, drought, and crop failure will worsen as the region becomes more urbanized and industrialized. Both water and food scarcity are made worse by civil wars, terrorism, and political and social unrest. It is unclear how climate change will affect the MENA water–food–energy nexus. All of these concerns need to be empirically evaluated and quantified for a full climate change assessment in the region. Policymakers in the MENA region need to be aware of this interconnection between population growth, rapid urbanization, food safety, climate change, and the global goal of lowering greenhouse gas emissions (as planned in COP27). Researchers from a wide range of disciplines have come together in this SI to investigate the connections between water, food, energy, and climate in the region. By assessing the impacts of climate change on hydrological processes, natural disasters, water supply, energy production and demand, and environmental impacts in the region, this SI will aid in implementation of sustainable solutions to these challenges across multiple spatial scales

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.