101 research outputs found

    Efficient Execution of Sequential Instructions Streams by Physical Machines

    Get PDF
    Any computational model which relies on a physical system is likely to be subject to the fact that information density and speed have intrinsic, ultimate limits. The RAM model, and in particular the underlying assumption that memory accesses can be carried out in time independent from memory size itself, is not physically implementable. This work has developed in the field of limiting technology machines, in which it is somewhat provocatively assumed that technology has achieved the physical limits. The ultimate goal for this is to tackle the problem of the intrinsic latencies of physical systems by encouraging scalable organizations for processors and memories. An algorithmic study is presented, which depicts the implementation of high concurrency programs for SP and SPE, sequential machine models able to compute direct-flow programs in optimal time. Then, a novel pieplined, hierarchical memory organization is presented, with optimal latency and bandwidth for a physical system. In order to both take full advantage of the memory capabilities and exploit the available instruction level parallelism of the code to be executed, a novel processor model is developed. Particular care is put in devising an efficient information flow within the processor itself. Both designs are extremely scalable, as they are based on fixed capacity and fixed size nodes, which are connected as a multidimensional array. Performance analysis on the resulting machine design has led to the discovery that latencies internal to the processor can be the dominating source of complexity in instruction flow execution, which adds to the effects of processor-memory interaction. A characterization of instruction flows is then developed, which is based on the topology induced by instruction dependences

    Two dimensional search algorithms for linear programming

    Get PDF
    Linear programming is one of the most important classes of optimization problems. These mathematical models have been used by academics and practitioners to solve numerous real world applications. Quickly solving linear programs impacts decision makers from both the public and private sectors. Substantial research has been performed to solve this class of problems faster, and the vast majority of the solution techniques can be categorized as one dimensional search algorithms. That is, these methods successively move from one solution to another solution by solving a one dimensional subspace linear program at each iteration. This dissertation proposes novel algorithms that move between solutions by repeatedly solving a two dimensional subspace linear program. Computational experiments demonstrate the potential of these newly developed algorithms and show an average improvement of nearly 25% in solution time when compared to the corresponding one dimensional search version. This dissertation\u27s research creates the core concept of these two dimensional search algorithms, which is a fast technique to determine an optimal basis and an optimal solution to linear programs with only two variables. This method, called the slope algorithm, compares the slope formed by the objective function with the slope formed by each constraint to determine a pair of constraints that intersect at an optimal basis and an optimal solution. The slope algorithm is implemented within a simplex framework to perform two dimensional searches. This results in the double pivot simplex method. Differently than the well-known simplex method, the double pivot simplex method simultaneously pivots up to two basic variables with two nonbasic variables at each iteration. The theoretical computational complexity of the double pivot simplex method is identical to the simplex method. Computational results show that this new algorithm reduces the number of pivots to solve benchmark instances by approximately 40% when compared to the classical implementation of the simplex method, and 20% when compared to the primal simplex implementation of CPLEX, a high performance mathematical programming solver. Solution times of some random linear programs are also improved by nearly 25% on average. This dissertation also presents a novel technique, called the ratio algorithm, to find an optimal basis and an optimal solution to linear programs with only two constraints. When the ratio algorithm is implemented within a simplex framework to perform two dimensional searches, it results in the double pivot dual simplex method. In this case, the double pivot dual simplex method behaves similarly to the dual simplex method, but two variables are exchanged at every step. Two dimensional searches are also implemented within an interior point framework. This dissertation creates a set of four two dimensional search interior point algorithms derived from primal and dual affine scaling and logarithmic barrier search directions. Each iteration of these techniques quickly solves a two dimensional subspace linear program formed by the intersection of two search directions and the feasible region of the linear program. Search directions are derived by orthogonally partitioning the objective function vector, which allows these novel methods to improve the objective function value at each step by at least as much as the corresponding one dimensional search version. Computational experiments performed on benchmark linear programs demonstrate that these two dimensional search interior point algorithms improve the average solution time by approximately 12% and the average number of iterations by 15%. In conclusion, this dissertation provides a change of paradigm in linear programming optimization algorithms. Implementing two dimensional searches within both a simplex and interior point framework typically reduces the computational time and number of iterations to solve linear programs. Furthermore, this dissertation sets the stage for future research topics in multidimensional search algorithms to solve not only linear programs but also other critical classes of optimization methods. Consequently, this dissertation\u27s research can become one of the first steps to change how commercial and open source mathematical programming software will solve optimization problems

    The Monge array--an abstraction and its applications

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1991.Includes bibliographical references (p. 211-219).by James KIimbrough Park.Ph.D

    Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

    Get PDF
    Book of Abstracts of CSC14 edited by Bora UçarInternational audienceThe Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France on 21st to 23rd July, 2014. This two and a half day event marked the sixth in a series that started ten years ago in San Francisco, USA. The CSC14 Workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. All three invited talks were focused on two interesting fields of research specifically: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multi-pole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program ''Investissements d'Avenir'' ANR-11-IDEX-0007 operated by the French National Research Agency), and by SIAM

    Parallel Algorithms for Geometric Graph Problems

    Full text link
    We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a (1+ϵ)(1+\epsilon)-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, n1+oϵ(1)n^{1+o_\epsilon(1)}. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for (1+ϵ)(1+\epsilon)-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a (1+ϵ)(1+\epsilon)-approximation algorithm with nδn^{\delta} space in the streaming-with-sorting model with 1/δO(1)1/\delta^{O(1)} passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem

    Fast Parallel Algorithms on a Class of Graph Structures With Applications in Relational Databases and Computer Networks.

    Get PDF
    The quest for efficient parallel algorithms for graph related problems necessitates not only fast computational schemes but also requires insights into their inherent structures that lend themselves to elegant problem solving methods. Towards this objective efficient parallel algorithms on a class of hypergraphs called acyclic hypergraphs and directed hypergraphs are developed in this thesis. Acyclic hypergraphs are precisely chordal graphs and their subclasses, and they have applications in relational databases and computer networks. In this thesis, first, we present efficient parallel algorithms for the following problems on graphs. (1) determining whether a graph is strongly chordal, ptolemaic, or a block graph. If the graph is strongly chordal, determine the strongly perfect vertex elimination ordering. (2) determining the minimal set of edges needed to make an arbitrary graph strongly chordal, ptolemaic, or a block graph. (3) determining the minimum cardinality dominating set, connected dominating set, total dominating set, and the domatic number of a strongly chordal graph. Secondly, we show that the query implication problem (Q\sb1\ \to\ Q\sb2) on two queries, which is to determine whether the data retrieved by query Q\sb1 is always a subset of the data retrieved by query Q\sb2, is not even in NP and in fact complete in \Pi\sb2\sp{p}. We present several \u27fine-grain\u27 analyses of the query implication problem and show that the query implication can be solved in polynomial time given chordal queries. Thirdly, we develop efficient parallel algorithms for manipulating directed hypergraphs H such as finding a directed path in H, closure of H, and minimum equivalent hypergraph of H. We show that finding a directed path in a directed hypergraph is inherently sequential. For directed hypergraphs with fixed degree and diameter we present NC algorithms for manipulations. Directed hypergraphs are representation schemes for functional dependencies in relational databases. Finally, we also present an efficient parallel algorithm for multi-dimensional range search. We show that a set of points in a rectangular parallelepiped can be obtained in O(logn) time with only 2.log\sp2 n - 10.logn + 14 processors on a EREW-PRAM. A nontrivial implementation technique on the hypercube parallel architecture is also presented. Our method can be easily generalized to the case of d-dimensional range search

    Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

    Get PDF
    This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author

    Toward optimised skeletons for heterogeneous parallel architecture with performance cost model

    Get PDF
    High performance architectures are increasingly heterogeneous with shared and distributed memory components, and accelerators like GPUs. Programming such architectures is complicated and performance portability is a major issue as the architectures evolve. This thesis explores the potential for algorithmic skeletons integrating a dynamically parametrised static cost model, to deliver portable performance for mostly regular data parallel programs on heterogeneous archi- tectures. The rst contribution of this thesis is to address the challenges of program- ming heterogeneous architectures by providing two skeleton-based programming libraries: i.e. HWSkel for heterogeneous multicore clusters and GPU-HWSkel that enables GPUs to be exploited as general purpose multi-processor devices. Both libraries provide heterogeneous data parallel algorithmic skeletons including hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll. The second contribution is the development of cost models for workload dis- tribution. First, we construct an architectural cost model (CM1) to optimise overall processing time for HWSkel heterogeneous skeletons on a heterogeneous system composed of networks of arbitrary numbers of nodes, each with an ar- bitrary number of cores sharing arbitrary amounts of memory. The cost model characterises the components of the architecture by the number of cores, clock speed, and crucially the size of the L2 cache. Second, we extend the HWSkel cost model (CM1) to account for GPU performance. The extended cost model (CM2) is used in the GPU-HWSkel library to automatically nd a good distribution for both a single heterogeneous multicore/GPU node, and clusters of heteroge- neous multicore/GPU nodes. Experiments are carried out on three heterogeneous multicore clusters, four heterogeneous multicore/GPU clusters, and three single heterogeneous multicore/GPU nodes. The results of experimental evaluations for four data parallel benchmarks, i.e. sumEuler, Image matching, Fibonacci, and Matrix Multiplication, show that our combined heterogeneous skeletons and cost models can make good use of resources in heterogeneous systems. Moreover using cores together with a GPU in the same host can deliver good performance either on a single node or on multiple node architectures

    Time Lower Bounds for Parallel Network Computations

    Get PDF
    Direct Acyclic Graphs (DAGs) are a suitable way to describe computations, expressing precedence constraints among operations. Beyond the representation of the execution of an algorithm, a DAG can effectively represent the execution of a parallel network. This last kind of DAG has a regular structure, consisting in the repetition over time of the original network; these common representations suggest a possible uniform approach in the study of execution of algorithms and emulation of networks. Both in parallel computing and computational complexity, DAGs have been extensively employed in the study of algorithmic features, as lower bounds for the execution/emulation time of algorithms/networks, the minimum quantity of memory needed for computing an algorithm or the minimum I/O complexity of an algorithm given a certain amount of fast memory cells. Developed techniques are quite different in their assumptions; one of the more fundamental differences is that some of them allow recomputation of intermediate results, while others disallow it, requiring the storage in memory of intermediate results for their further usages. In nowadays computations the trade-off between data recomputation and data storing is important both in parallel and in local elaborations, since in the former we can increase the bandwidth and reduce the latency with whom data can be accessed (by computing the same data in several points of the network), while in the latter we can avoid to pay the latency of the access in memory to reload data, by recomputing them possibly loading fewer data or using data already present in memory. So far it does not exist an universal technique able to foresee the strict lower bound for each execution of algorithm or emulation of network in each network and the known results derive from several theorems. On the contrary there are a lot of cases for which it neither exists a tight result; among these there are also emulations of extensively studied networks, such as multidimensional arrays. The first part of our thesis starts from this state-of-the-art: we propose a survey of several known lower bound techniques involving DAGs, followed by original theorems which clarify or solve open problems. In particular, in our survey we consider lower bound techniques for execution of algorithms and emulation of networks in parallel networks, showing their principles and their limits. In the discussion we show relationships among theorems, proving that no one of them is better of the others in general terms: there are counter-examples in which each theorem gives better bounds than others. We also exhibit examples where no bound among the considered techniques is tight. Moreover we generalize some theorems originally suited for network emulations, adapting them to execution of general DAGs in parallel networks, showing examples of their application. We also consider theorems for determining minimum I/O complexity, presenting similarities and differences with emulation theorems. One of the main results of the thesis is a new general technique which provides lower bounds almost tight (except for a logarithmic factor) in a class of network emulations including multidimensional arrays. We improve previously better known results which have a polynomial gap between lower bound and actual emulation time. Our theorem considers emulations with recomputation, giving results valid in the most general context. Finally we consider the role of recomputation in performance, trying to understand when it gives a real advantage respect to storing intermediate results in memory. In particular we introduce the problem in simple networks, showing a class of them in which recomputation can not improve I/O performance, ending in butterfly DAGs where recomputation can save a number of I/O accesses at least as big as the fast memory available during the computation. The approach used highlights the difficulty of exploit recomputation in executions of algorithms when their DAG representation exhibits an high bisection bandwidth
    corecore