19 research outputs found

    Interconnection Networks Embeddings and Efficient Parallel Computations.

    To obtain greater performance, many processors are allowed to cooperate in solving a single problem. These processors communicate via an interconnection network or a bus. The most essential function of the underlying interconnection network is the efficient exchange of messages between processes on different processors. Parallel machines based on the hypercube topology have earned great respect in parallel computation because of the hypercube's many attractive properties. Many variants of the hypercube have been introduced by researchers, mainly to enhance communication. The twisted hypercube is one of the most attractive of these variants: it preserves the important features of the hypercube and reduces its diameter by a factor of two. This dissertation investigates relations and transformations between various interconnection networks and the twisted hypercube and explores its efficiency in parallel computation. The capability of the twisted hypercube to simulate complete binary trees, complete quad trees, and rings is demonstrated and compared with that of the hypercube. Finally, the fault tolerance of the twisted hypercube is investigated. We present optimal algorithms to simulate rings in a faulty twisted hypercube environment and compare them with those for the hypercube.
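
    As a point of reference for the simulation results above, here is a minimal sketch (assuming the standard n-dimensional hypercube, not the twisted variant studied in the dissertation): nodes are n-bit labels, neighbours differ in exactly one bit, and a ring is embedded by visiting the nodes in reflected Gray-code order, so that consecutive ring nodes are hypercube neighbours.

# Minimal sketch of the standard n-cube, the baseline the twisted hypercube is
# compared against; the twisted variant rewires some of these edges to roughly
# halve the diameter.

def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Neighbors of `node` in the n-cube: labels that differ in exactly one bit."""
    return [node ^ (1 << i) for i in range(n)]

def gray_code_ring(n: int) -> list[int]:
    """A ring (Hamiltonian cycle) embedded in the n-cube via reflected Gray code:
    consecutive labels differ in one bit, so consecutive ring nodes are adjacent."""
    return [i ^ (i >> 1) for i in range(1 << n)]

if __name__ == "__main__":
    ring = gray_code_ring(3)
    # Every consecutive pair (including the wrap-around) differs in exactly one bit.
    assert all(bin(a ^ b).count("1") == 1 for a, b in zip(ring, ring[1:] + ring[:1]))
    print(ring)  # [0, 1, 3, 2, 6, 7, 5, 4]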

    Properties and algorithms of the (n, k)-arrangement graphs

    The (n, k)-arrangement interconnection topology was first introduced in 1992. The (n, k )-arrangement graph is a class of generalized star graphs. Compared with the well known n-star, the (n, k )-arrangement graph is more flexible in degree and diameter. However, there are few algorithms designed for the (n, k)-arrangement graph up to present. In this thesis, we will focus on finding graph theoretical properties of the (n, k)- arrangement graph and developing parallel algorithms that run on this network. The topological properties of the arrangement graph are first studied. They include the cyclic properties. We then study the problems of communication: broadcasting and routing. Embedding problems are also studied later on. These are very useful to develop efficient algorithms on this network. We then study the (n, k )-arrangement network from the algorithmic point of view. Specifically, we will investigate both fundamental and application algorithms such as prefix sums computation, sorting, merging and basic geometry computation: finding convex hull on the (n, k )-arrangement graph. A literature review of the state-of-the-art in relation to the (n, k)-arrangement network is also provided, as well as some open problems in this area

    Aspects of k-k-Routing in Meshes and OTIS Networks

    Efficient data transport in parallel computers built on sparse interconnection networks is crucial for their performance. A basic transport problem in such a computer is the k-k routing problem. In this thesis, aspects of the k-k routing problem on r-dimensional meshes and OTIS-G networks are discussed. The first oblivious routing algorithms for these networks are presented that solve the k-k routing problem in asymptotically optimal running time with constant buffer size. Furthermore, other aspects of the k-k routing problem for OTIS-G networks are analysed: lower bounds based on the diameter and bisection width of OTIS-G networks are given, and an algorithm is presented that solves the k-k sorting problem on the OTIS-Mesh in a running time close to the bisection and diameter bounds. Finally, based on OTIS-G networks, a new class of networks, called Extended OTIS-G networks, is introduced, whose diameters are smaller than those of OTIS-G networks.
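
    For orientation, a small sketch of the usual OTIS construction (not of the routing algorithms themselves): given an n-node factor graph G, the OTIS-G network has processors addressed as (group, position) pairs; electronic edges connect processors of the same group whose positions are adjacent in G, and a single optical edge joins (g, p) with (p, g) whenever g differs from p.

# Sketch of the usual OTIS(G) construction over an n-node factor graph G,
# given as an adjacency list {v: list of neighbors}. Processors are pairs
# (group, position); "electronic" edges follow G inside a group, and an
# "optical" edge joins (g, p) with (p, g) for g != p.

def otis_neighbors(node, G):
    g, p = node
    nbrs = [(g, q) for q in G[p]]   # intra-group (electronic) edges
    if g != p:
        nbrs.append((p, g))         # inter-group (optical) edge
    return nbrs

if __name__ == "__main__":
    # Factor graph G: a 4-node cycle, so OTIS(G) has 16 processors here.
    G = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    print(otis_neighbors((2, 0), G))   # [(2, 1), (2, 3), (0, 2)]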

    Efficient Interconnection Schemes for VLSI and Parallel Computation

    This thesis is primarily concerned with two problems of interconnecting components in VLSI technologies. In the first case, the goal is to construct efficient interconnection networks for general-purpose parallel computers. The second problem is a more specialized problem in the design of VLSI chips, namely multilayer channel routing. In addition, a final part of this thesis provides lower bounds on the area required for VLSI implementations of finite-state machines. This thesis shows that networks based on Leiserson's fat-tree architecture are nearly as good as any network built in a comparable amount of physical space. It shows that these universal networks can efficiently simulate competing networks by means of an appropriate correspondence between network components and efficient algorithms for routing messages on the universal network. In particular, a universal network of area A can simulate competing networks with O(lg^3 A) slowdown (in bit-times), using a very simple randomized routing algorithm and simple network components. Alternatively, a packet routing scheme of Leighton, Maggs, and Rao can be used in conjunction with more sophisticated switching components to achieve O(lg^2 A) slowdown. Several other important aspects of universality are also discussed. It is shown that universal networks can be constructed in area linear in the number of processors, so that there is no need to restrict the density of processors in competing networks. Results are also presented for comparisons between networks of different sizes or with processors of different sizes (as determined by the amount of attached memory). Of particular interest is the fact that a universal network built from sufficiently small processors can simulate (with the slowdown already quoted) any competing network of comparable size, regardless of the size of processors in the competing network. In addition, many of the results given do not require the usual assumption of unit wire delay. Finally, though most of the discussion is set in the two-dimensional world, the results are shown to apply in three dimensions by way of a simple demonstration of general results on graph layout in three dimensions.

    The second main problem considered in this thesis is channel routing when many layers of interconnect are available, a scenario that is becoming more and more meaningful as chip fabrication technologies advance. This thesis describes MulCh, a system for multilayer channel routing that extends the Chameleon system developed at U. C. Berkeley. Like Chameleon, MulCh divides a multilayer problem into essentially independent subproblems of at most three layers, but unlike Chameleon, MulCh also considers partitions consisting of a single layer instead of only partitions of two or three layers. Experimental results show that MulCh often performs better than Chameleon in terms of channel width, total net length, and number of vias. In addition to a description of MulCh as implemented, this thesis provides improved algorithms for subtasks performed by MulCh, thereby indicating potential improvements in the speed and performance of multilayer channel routing. In particular, a linear-time algorithm is given for determining the minimum width required for a single-layer channel routing problem, and an algorithm is given for maintaining the density of a collection of nets in logarithmic time per net insertion.
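
    To make the density notion concrete, here is a small sketch of the classical channel-density measure (the standard lower bound on single-layer channel width; it is not the thesis's own linear-time minimum-width algorithm or its logarithmic-time dynamic density structure): each net spans the columns between its leftmost and rightmost terminals, and the density is the largest number of spans crossing any column.

# Sketch of channel density: each net occupies the horizontal span between its
# leftmost and rightmost terminal columns, and the density is the maximum
# number of spans crossing any single column, a classical lower bound on the
# width needed to route the channel.

def channel_density(nets):
    """nets: list of (left, right) terminal columns for each net."""
    events = []
    for left, right in nets:
        events.append((left, +1))        # span opens at its left end
        events.append((right + 1, -1))   # span closes just past its right end
    density = current = 0
    for _, delta in sorted(events):
        current += delta
        density = max(density, current)
    return density

if __name__ == "__main__":
    # Three nets with spans [0,4], [2,6], [5,8]; at most two spans overlap.
    print(channel_density([(0, 4), (2, 6), (5, 8)]))  # 2
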
The last part of this thesis shows that straightforward techniques for implementing finite-state machines are optimal in the worst case. Specifically, for any s and k, there is a deterministic finite-state machine with s states and k symbols such that any layout algorithm requires Ω(ks lg s) area to lay out its realization. For nondeterministic machines, there is an analogous lower bound of Ω(ks^2) area.

    Time Lower Bounds for Parallel Network Computations

    Directed Acyclic Graphs (DAGs) are a natural way to describe computations, expressing precedence constraints among operations. Beyond representing the execution of an algorithm, a DAG can also represent the execution of a parallel network. This latter kind of DAG has a regular structure, consisting of the repetition of the original network over time; these common representations suggest a uniform approach to studying the execution of algorithms and the emulation of networks. Both in parallel computing and in computational complexity, DAGs have been extensively employed to study algorithmic features such as lower bounds on the execution/emulation time of algorithms/networks, the minimum amount of memory needed to compute an algorithm, or the minimum I/O complexity of an algorithm given a certain number of fast memory cells. The techniques developed differ considerably in their assumptions; one of the most fundamental differences is that some of them allow recomputation of intermediate results, while others disallow it and require intermediate results to be stored for later use. In today's computations the trade-off between recomputing and storing data matters both in parallel and in local computation: in the former, computing the same data at several points of the network can increase the bandwidth and reduce the latency with which data are accessed, while in the latter, recomputation can avoid the latency of reloading data from memory, possibly loading fewer data or reusing data already present in memory. So far no universal technique exists that yields a tight lower bound for every execution of an algorithm or emulation of a network on every network, and the known results derive from several different theorems. Moreover, there are many cases for which no tight result exists at all, including emulations of extensively studied networks such as multidimensional arrays. The first part of this thesis starts from this state of the art: we propose a survey of several known lower-bound techniques involving DAGs, followed by original theorems that clarify or solve open problems. In particular, in the survey we consider lower-bound techniques for the execution of algorithms and the emulation of networks on parallel networks, showing their principles and their limits. In the discussion we show relationships among the theorems, proving that none of them is better than the others in general terms: there are counterexamples in which each theorem gives better bounds than the others. We also exhibit examples where no bound among the considered techniques is tight. Moreover, we generalize some theorems originally suited to network emulations, adapting them to the execution of general DAGs on parallel networks and showing examples of their application. We also consider theorems for determining minimum I/O complexity, presenting similarities and differences with the emulation theorems. One of the main results of the thesis is a new general technique that provides lower bounds which are almost tight (up to a logarithmic factor) for a class of network emulations including multidimensional arrays, improving on previously known results that leave a polynomial gap between the lower bound and the actual emulation time. Our theorem considers emulations with recomputation, giving results valid in the most general setting.
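
    As an illustration of the "repetition over time" structure mentioned above, a small sketch (the representation and names are our own): T steps of a network yield a DAG with a node (v, t) for every processor v and step t, and an edge from (u, t) to (v, t + 1) whenever v can read u's value in one step, including u = v for values a processor keeps locally.

# Sketch of the time-expanded DAG of a network: the network is given as an
# adjacency list, and executing it for `steps` steps yields a DAG whose nodes
# are (processor, step) pairs. An edge (u, t) -> (v, t+1) exists when v can
# read u's value at the next step (u = v is included: a processor keeps its
# own value). This is the regular "repetition over time" structure that the
# lower-bound techniques operate on.

def time_expanded_dag(network, steps):
    """network: {v: list of neighbors}; returns {(v, t): [(u, t+1), ...]}."""
    dag = {}
    for t in range(steps):
        for v, nbrs in network.items():
            dag[(v, t)] = [(v, t + 1)] + [(u, t + 1) for u in nbrs]
    for v in network:
        dag[(v, steps)] = []   # last layer: sinks with no outgoing edges
    return dag

if __name__ == "__main__":
    k3 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # 3 fully connected processors
    dag = time_expanded_dag(k3, steps=2)
    print(dag[(0, 0)])   # [(0, 1), (1, 1), (2, 1)]
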
Finally, we consider the role of recomputation in performance, trying to understand when it gives a real advantage over storing intermediate results in memory. In particular, we introduce the problem on simple networks, exhibiting a class of networks in which recomputation cannot improve I/O performance, and ending with butterfly DAGs, where recomputation can save a number of I/O accesses at least as large as the fast memory available during the computation. The approach used highlights the difficulty of exploiting recomputation in the execution of algorithms whose DAG representation exhibits a high bisection bandwidth.
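
For the butterfly case just discussed, a sketch of the d-dimensional butterfly DAG under its usual definition (our own helper, not code from the thesis): nodes are (level, row) pairs with 0 <= level <= d and d-bit rows, and node (l, r) has edges to (l + 1, r) and to (l + 1, r with bit l flipped).

# Sketch of the d-dimensional butterfly DAG under its common definition.
# DAGs of this shape are where recomputation is shown to save on the order of
# the fast-memory size in I/O accesses, per the discussion above.

def butterfly_dag(d):
    dag = {}
    for level in range(d + 1):
        for row in range(1 << d):
            if level < d:
                dag[(level, row)] = [(level + 1, row),
                                     (level + 1, row ^ (1 << level))]
            else:
                dag[(level, row)] = []   # last level: sinks
    return dag

if __name__ == "__main__":
    dag = butterfly_dag(2)
    print(len(dag))       # (d + 1) * 2**d = 12 nodes for d = 2
    print(dag[(0, 0)])    # [(1, 0), (1, 1)]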

    DISTRIBUTED, PARALLEL AND DYNAMIC DISTANCE STRUCTURES

    Many fundamental computational tasks can be modeled by distances on a graph. This has inspired the study of various structures that preserve approximate distances, but trade off this approximation factor with size, running time, or the number of hops on the approximate shortest paths. Our focus is on three important objects involving the preservation of graph distances: hopsets, in which our goal is to ensure that small-hop paths also provide approximate shortest paths; distance oracles, in which we build a small data structure that supports efficient distance queries; and spanners, in which we find a sparse subgraph that approximately preserves all distances. We study efficient constructions and applications of these structures in various models of computation that capture different aspects of computational systems. Specifically, we propose new algorithms for constructing hopsets and distance oracles in two modern distributed models: the Massively Parallel Computation (MPC) model and the Congested Clique model. These models have received significant attention recently due to their close connection to present-day big-data platforms. In a different direction, we consider a centralized dynamic model in which the input changes over time. We propose new dynamic algorithms for constructing hopsets and distance oracles that lead to state-of-the-art approximate single-source, multi-source, and all-pairs shortest path algorithms with respect to update time. Finally, we study the problem of finding optimal spanners in a different distributed model, the LOCAL model. Unlike our other results, for this problem our goal is to find the best solution for a specific input graph rather than giving a general guarantee that holds for all inputs. One contribution of this work is to emphasize the significance of the tools and the techniques used for these distance problems rather than heavily focusing on a specific model. In other words, we show that our techniques are broad enough that they can be extended to different models.
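
    As a point of reference for what a spanner guarantees, here is the classical sequential greedy construction (Althöfer et al.), shown only to make the notion concrete; it is not one of the distributed or dynamic algorithms developed in this work. Edges are examined in order of weight, and an edge is kept only if the spanner built so far does not already connect its endpoints within stretch 2k - 1, which yields a (2k - 1)-spanner.

import heapq

# Classical greedy (2k-1)-spanner, shown only to make the notion of a spanner
# concrete; the thesis studies spanners, hopsets and distance oracles in
# distributed and dynamic models, not this sequential routine.

def dijkstra(adj, src):
    """Shortest-path distances from src in the current spanner."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def greedy_spanner(vertices, edges, k):
    """edges: list of (w, u, v). Returns the edges of a (2k-1)-spanner."""
    adj = {v: [] for v in vertices}
    spanner = []
    for w, u, v in sorted(edges):
        # Keep (u, v) only if u and v are not yet within stretch (2k-1) * w.
        if dijkstra(adj, u).get(v, float("inf")) > (2 * k - 1) * w:
            adj[u].append((v, w))
            adj[v].append((u, w))
            spanner.append((w, u, v))
    return spanner

if __name__ == "__main__":
    V = [0, 1, 2, 3]
    E = [(1, 0, 1), (1, 1, 2), (1, 2, 3), (1, 3, 0), (1.5, 0, 2)]
    # Keeps the path 0-1-2-3; edge (3, 0) and the chord (0, 2) are already
    # within stretch 3 of the kept edges, so they are dropped.
    print(greedy_spanner(V, E, k=2))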

    Simulation of the UKQCD computer


    Parallel Processing: Dynamic Load Balancing in a Sorting Algorithm

    Some sorting techniques attempt to balance the load through an initial sampling of the data to be sorted and a distribution of the data according to pivots. Others redistribute partially sorted lists so that each processor stores an approximately equal number of keys and all processors take part in the merge process during execution. This thesis presents a new method that balances the load dynamically, based on a different approach: distributing the work by means of an estimator that predicts the pending workload. The proposed method is a variant of parallel sorting by merging, that is, a comparison-based technique. The sorts within the blocks are performed with Bubble Sort with a sentinel. In this case, the work to be done, in terms of comparisons and exchanges, depends on the degree of disorder of the data. The evolution of the amount of work in each iteration of the algorithm was studied for different types of input sequences (n items with values from 1 to n without repetition, and random data with a normal distribution), observing that the work decreases with each iteration. This was used to obtain an estimate of the expected remaining work from a given iteration onward, and to rely on that estimate to correct the load distribution. With this idea, the method…
    Reviewed by: http://sedici.unlp.edu.ar/handle/10915/9500
    Facultad de Ciencias Exactas
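
    A minimal sketch of the per-block sort described above (our reading of "Bubble Sort with a sentinel": the pass loop stops as soon as a complete pass performs no exchange), returning comparison and swap counts of the kind the proposed estimator uses to predict the remaining work:

# Sketch of Bubble Sort with a sentinel (early exit) for the per-block sorts:
# the outer loop stops as soon as a full pass performs no swap. The returned
# counts of comparisons and swaps are the kind of per-iteration work measure
# used to estimate remaining work and rebalance the load.

def bubble_sort_with_sentinel(block):
    comparisons = swaps = 0
    n = len(block)
    for i in range(n - 1):
        swapped = False                 # the "sentinel": did this pass swap?
        for j in range(n - 1 - i):
            comparisons += 1
            if block[j] > block[j + 1]:
                block[j], block[j + 1] = block[j + 1], block[j]
                swaps += 1
                swapped = True
        if not swapped:                 # no swap: the block is already sorted
            break
    return comparisons, swaps

if __name__ == "__main__":
    data = [5, 1, 4, 2, 3]
    print(bubble_sort_with_sentinel(data), data)  # work counts, sorted block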