81 research outputs found

    Low-Complexity Switch Scheduling Algorithms: Delay Optimality in Heavy Traffic

    Full text link
    Motivated by applications in data center networks, in this paper we study the problem of scheduling in an input-queued switch. While throughput-maximizing algorithms for a switch are well understood, delay analysis was developed only recently. It was recently shown that the well-known MaxWeight algorithm achieves optimal scaling of mean queue lengths in steady state in the heavy-traffic regime, and is within a factor of less than 2 of a universal lower bound. However, MaxWeight is not used in practice because of its high time complexity. In this paper, we study several low-complexity algorithms and show that their heavy-traffic performance is identical to that of MaxWeight. We first present a negative result: picking a random schedule does not achieve optimal heavy-traffic scaling of queue lengths, even under uniform traffic. We then show that picking the best of two matchings, or modifying a random matching even a little using the so-called flip operation, leads to MaxWeight-like heavy-traffic performance under uniform traffic. We then focus on the case of non-uniform traffic and show that a large class of low-time-complexity algorithms have the same heavy-traffic performance as MaxWeight, as long as a MaxWeight matching is picked often enough. We also briefly discuss the performance of these algorithms in the large-scale heavy-traffic regime, where the size of the switch increases simultaneously with the load. Finally, we use simulations to compare the performance of the various algorithms.
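
    To make the two lowest-complexity ideas concrete, here is a minimal sketch (plain Python; the names, the queue-matrix layout queues[i][j], and the two-swap interpretation of the flip operation are illustrative assumptions, not the paper's exact definitions) of picking the better of two random matchings and of locally improving a matching. Each operation costs only O(n) work on an n x n switch, which is what makes such policies attractive compared with computing a full maximum-weight matching.

import random

def random_matching(n):
    """A uniformly random matching: input i is connected to output perm[i]."""
    perm = list(range(n))
    random.shuffle(perm)
    return perm

def weight(queues, matching):
    """Weight of a matching = total length of the queues it serves."""
    return sum(queues[i][j] for i, j in enumerate(matching))

def best_of_two(queues):
    """Pick the heavier of two independent random matchings."""
    n = len(queues)
    m1, m2 = random_matching(n), random_matching(n)
    return m1 if weight(queues, m1) >= weight(queues, m2) else m2

def flip_once(queues, matching):
    """Swap the outputs of two random inputs if doing so increases the weight."""
    i, k = random.sample(range(len(matching)), 2)
    swapped = list(matching)
    swapped[i], swapped[k] = matching[k], matching[i]
    return swapped if weight(queues, swapped) > weight(queues, matching) else matching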

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    Get PDF
    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as the system scale, chip density, and inherent complexity of modern supercomputers expand. Even if we put aside the difficulty of expressing algorithmic parallelism and of efficiently executing applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline in the Mean Time To Failure. A generic, low-overhead, resilient extension therefore becomes a desired capability for any programming paradigm. This dissertation addresses these two critical issues: designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault-tolerant capabilities to build a generic framework providing both soft- and hard-error resilience to the task-based programming paradigm. To bridge the gap between hardware peak performance and application performance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload and to control the data flow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs, and Intel Xeon Phi coprocessors. Performance portability is guaranteed, and this programming model is adapted to a wide range of accelerators, supporting both shared- and distributed-memory environments. To address the resilience challenges on large-scale systems, fault-tolerant mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored. The first recovers the application by re-executing the minimum number of tasks, the second logs intermediary data between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. For hard errors, we propose two generic approaches, which augment the data-logging mechanism for soft errors. The first utilizes a non-volatile storage device to save logged data, while the second saves local logged data on a remote node to protect against node failure. Experimental results have confirmed that our soft- and hard-error fault-tolerant mechanisms exhibit the expected correctness and efficiency
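
    To illustrate the data-logging idea for soft errors, the following is a hedged sketch in plain Python (not the dissertation's runtime; all names are hypothetical): every task output is logged as it is produced, so a task corrupted by a soft error can be re-executed from the logged outputs of its predecessors without replaying the rest of the graph.

class Task:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, deps

def run_with_logging(tasks):
    """Execute tasks (assumed topologically ordered) and log every output."""
    log = {}
    for t in tasks:
        log[t.name] = t.fn(*[log[d] for d in t.deps])
    return log

def recover(tasks, log, failed):
    """Soft-error recovery: re-execute only the failed task from logged inputs."""
    t = next(t for t in tasks if t.name == failed)
    log[t.name] = t.fn(*[log[d] for d in t.deps])
    return log

tasks = [Task("a", lambda: 2),
         Task("b", lambda x: x + 1, deps=("a",)),
         Task("c", lambda x, y: x * y, deps=("a", "b"))]
log = run_with_logging(tasks)          # {'a': 2, 'b': 3, 'c': 6}
log = recover(tasks, log, failed="c")  # recompute only 'c' from the logged 'a' and 'b'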

    New Sequential and Scalable Parallel Algorithms for Incomplete Factor Preconditioning

    Get PDF
    The solution of large, sparse, linear systems of equations Ax = b is an important kernel, and the dominant term with regard to execution time, in many applications in scientific computing. The large size of the systems of equations being solved currently (millions of unknowns and equations) requires iterative solvers on parallel computers. Preconditioning, which is the process of translating a linear system into a related system that is easier to solve, is widely used to reduce solution time and is sometimes required to ensure convergence. Level-based preconditioning (ILU(ℓ)) has long been used in serial contexts and is widely recognized as robust and effective for a wide range of problems. However, the method has long been regarded as an inherently sequential technique. Parallelism, it has been thought, can be achieved primarily at the expense of increased iterations. We dispute these claims. The first half of this dissertation takes an in-depth look at structurally based ILU(ℓ) symbolic factorization. There are two definitions of fill level, “sum” and “max,” that have been proposed. Hitherto, these definitions have been cast in terms of matrix terminology. We develop a sequence of lemmas and theorems that provide graph-theoretic characterizations of both definitions; these characterizations are based on the static graph of a matrix, G(A). Our Incomplete Fill Path Theorem characterizes fill levels per the sum definition; this is the definition that is used in most library implementations of the “classic” ILU(ℓ) factorization algorithm. Our theorem leads to several new graph-search algorithms that compute factors identical, or nearly identical, to those computed by the “classic” algorithm. Our analysis shows that the new algorithms have lower run-time complexity than the previously existing algorithms for certain classes of matrices that are commonly encountered in scientific applications. The second half of this dissertation presents a Parallel ILU algorithmic framework (PILU). This framework enables scalable parallel ILU preconditioning by combining concepts from domain decomposition and graph ordering. The framework can accommodate ILU(ℓ) factorization as well as threshold-based ILUT methods. A model implementation of the framework, the Euclid library, was developed as part of this dissertation. This library was used to obtain experimental results for Poisson's equation, the Convection-Diffusion equation, and a nonlinear Radiative Transfer problem. The experiments, which were conducted on a variety of platforms with up to 400 CPUs, demonstrate that our approach is highly scalable for arbitrary ILU(ℓ) fill levels
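
    For readers unfamiliar with level-based preconditioning, the sketch below computes fill levels under the "sum" definition on a dense pattern, purely for clarity (the dissertation's point is precisely that this symbolic step can instead be done with graph searches on G(A)); the function name and the dense layout are illustrative assumptions only.

import numpy as np

def symbolic_fill_levels(A, ell):
    """Level-of-fill under the "sum" definition, computed densely for clarity:
    original nonzeros get level 0; an entry created while eliminating column k
    gets level lev(i, k) + lev(k, j) + 1; the ILU(ell) pattern keeps levels <= ell."""
    n = A.shape[0]
    INF = np.iinfo(np.int64).max // 4
    lev = np.where(A != 0, 0, INF).astype(np.int64)
    np.fill_diagonal(lev, 0)
    for k in range(n):
        for i in range(k + 1, n):
            if lev[i, k] > ell:           # entry (i, k) is dropped, so it creates no fill
                continue
            candidate = lev[i, k] + lev[k, k + 1:] + 1
            lev[i, k + 1:] = np.minimum(lev[i, k + 1:], candidate)
    return lev <= ell                     # boolean sparsity pattern of the factors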

    Parallel architectures and runtime systems co-design for task-based programming models

    Get PDF
    The increasing parallelism of modern computing systems has heightened the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of programming models and applications. Nowadays, system design consists of several layers stacked on top of each other, from the architecture up to the application software. Although this design allows a separation of concerns, where layers can be changed independently thanks to well-defined interfaces between them, it hampers future systems design as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power-efficient chips to be built. However, technology is reaching physical limitations where the transistor size can no longer be reduced, which demands a change of paradigm in systems design. This thesis proposes to break this layered design and advocates for a system in which the architecture and the programming-model runtime system exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information such as the Task Dependency Graph (TDG) of a dataflow task-based programming model, power consumption can be reduced by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to build such a graph, reducing runtime overheads and making the execution of fine-grained tasks possible, which increases the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system to enable more efficient communication scheduling and to create new opportunities for computation-communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided, together with a methodology to simulate and characterize application behavior.
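
    As a hedged illustration of what a runtime or the hardware could do with TDG knowledge, the sketch below computes the length of the critical path of a task dependency graph; tasks off this path have slack and are natural candidates for slower, more power-efficient execution. The data layout (a durations dictionary and an edge list) is an assumption for illustration, not the thesis' actual interface.

from collections import defaultdict

def critical_path_length(durations, edges):
    """Length of the longest (critical) path through a task dependency graph.
    durations: {task: execution time}; edges: iterable of (producer, consumer)."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    order, ready = [], [t for t in durations if indeg[t] == 0]
    while ready:                          # Kahn's algorithm: topological order
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    finish = {t: durations[t] for t in durations}
    for u in order:                       # longest-path relaxation in topological order
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + durations[v])
    return max(finish.values())

# critical_path_length({"a": 2, "b": 3, "c": 1}, [("a", "b"), ("a", "c")]) == 5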

    Managing Overheads in Asynchronous Many-Task Runtime Systems

    Get PDF
    Asynchronous Many-Task (AMT) runtime systems are based on the idea of dividing an algorithm into small units of work, known as tasks. The runtime system is then responsible for scheduling and executing these tasks efficiently, taking into account the resources provided to it and the data dependencies between the tasks. One of the primary challenges faced by AMTs is managing such fine-grained parallelism and the overheads associated with creating, scheduling, and executing tasks. This work develops methodologies for assessing and managing the overheads associated with fine-grained task execution in HPX, our exemplar Asynchronous Many-Task runtime system. Known optimization techniques, viz. active message coalescing, task inlining, and parallel loop iteration chunking, are applied to HPX. Active message coalescing, where messages bound for the same destination are aggregated into a single message, is presented as a solution to minimize the overheads associated with fine-grained communications. Methodologies and metrics for analyzing fine-grained communication overheads are developed. The metrics identified and implemented in this research aid in evaluating network efficiency by giving us an intrinsic view of the underlying network overhead that would be difficult to measure using conventional methods. Task inlining, a method that allows runtime systems to manage the overheads introduced by a large number of tasks by merging tasks into one thread of execution, is presented as a technique for minimizing fine-grained task overheads. A runtime policy that dynamically decides whether to inline a task is developed and evaluated on different processor architectures. A methodology to derive a largely machine-independent constant that allows controlling task granularity is developed. Finally, the machine-independent constant derived in the context of task inlining is applied to the chunking of parallel loop iterations, confirming its applicability to reducing overheads by finding the optimal chunk size for the combined loop iterations
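
    A toy illustration of the inlining and chunking decisions described above, written with Python's concurrent.futures rather than HPX; the threshold constant and function names are placeholders, not the machine-independent constant or policy derived in this work.

import concurrent.futures as cf

# Placeholder grain-size threshold (seconds); only illustrative.
INLINE_THRESHOLD_S = 50e-6

def maybe_async(pool, fn, est_cost_s, *args):
    """Run fn inline when it is too small to amortize task-creation overhead,
    otherwise submit it to the executor as a separate task."""
    if est_cost_s < INLINE_THRESHOLD_S:
        fut = cf.Future()
        fut.set_result(fn(*args))          # inline execution: no scheduling overhead
        return fut
    return pool.submit(fn, *args)          # asynchronous task

def chunk_iterations(n_iterations, est_iter_cost_s):
    """Group parallel-loop iterations so each chunk sits above the same grain threshold."""
    per_chunk = max(1, int(INLINE_THRESHOLD_S / max(est_iter_cost_s, 1e-12)))
    return [range(start, min(start + per_chunk, n_iterations))
            for start in range(0, n_iterations, per_chunk)]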

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    Get PDF
    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered
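
    To make early binding and overlap concrete, here is a hedged sketch using mpi4py (not MPI/Pro): persistent requests are created once before the loop (early binding), started each iteration, and completed only after independent computation has been overlapped with the transfer. It assumes mpi4py is available and exactly two ranks (run with mpiexec -n 2).

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                               # assumes exactly two ranks

sendbuf = np.full(1_000_000, float(rank))
recvbuf = np.empty_like(sendbuf)
work = np.random.rand(1_000_000)              # data for the overlapped computation

# Early binding: persistent requests are created once, outside the loop.
send_req = comm.Send_init(sendbuf, dest=peer, tag=0)
recv_req = comm.Recv_init(recvbuf, source=peer, tag=0)

for step in range(10):
    MPI.Prequest.Startall([recv_req, send_req])   # start the pre-bound transfers
    local = float(np.dot(work, work))             # computation overlapped with communication
    MPI.Request.Waitall([recv_req, send_req])     # complete before touching the buffers
    sendbuf[:] = 0.5 * recvbuf + local / sendbuf.size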

    Quantum computing on cloud-based processors

    Get PDF
    Thesis (MSc), Stellenbosch University, 2022. The noisy intermediate-scale quantum (NISQ) era refers to the current technological epoch permeated with quantum processors that are big enough (50-100 qubits) to no longer be trivially simulatable with digital computers, but not yet capable of full fault-tolerant computation. Such processors provide great testbeds for understanding the practical issues and resources needed to realize quantum tasks, such as quantum algorithms, on this hardware. Many pressing issues arise in this context as a direct consequence of the limitations of these processors (limited number of qubits, low qubit connectivity, and limited coherence times). Hence, for near-term quantum algorithms, there is an overriding imperative to adopt an approach that takes these limitations into account and attempts to mitigate or circumvent some of them. In this thesis, we examine realizing Grover’s quantum search algorithm for four qubits on IBM Q superconducting quantum processors, and potentially scaling up to more qubits. We also investigate non-canonical forms of the quantum search algorithm that trade accuracy for speed in a way that is more suitable for near-term processors. Our contribution to this topic is a slight improvement in the accuracy of the solution to a graph problem, solved with a quantum search algorithm implemented on IBM Q quantum processors by Satoh et al. in IEEE Transactions on Quantum Engineering (2020). We also explore the realization of a measurement-based quantum search algorithm for three qubits. Unfortunately, the number of qubits and two-qubit gates required by such an algorithm puts it beyond the reach of current quantum processors. Based on a recently published work with Professor Mark Tame, we also report a proof-of-concept demonstration of a quantum order-finding algorithm for factoring the integer 21. Our demonstration builds upon a previous demonstration by Martín-López et al. in Nature Photonics 6, 773 (2012). We go beyond this work by implementing the algorithm on IBM Q quantum processors using a configuration of approximate Toffoli gates with residual phase shifts, which preserves its functional correctness and allows us to achieve a complete factoring of N = 21 using a quantum circuit with relatively few two-qubit gates. Lastly, we realize a small-scale three-qubit quantum processor based on a spontaneous parametric down-conversion source built to generate a polarization-entangled Bell state. The state is enlarged by using the path degree of freedom of one of the photons to make a 3-qubit GHZ state. The generated state is versatile enough to carry out quantum correlation measurements such as Bell’s inequalities and entanglement witnesses. The entire experimental setup is motorized and automated, allowing remote control of the measurements of each of the qubits, and we design and build a mobile graphical user interface to provide an intuitive and visual way to interact with the experiment.
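
    As a minimal illustration of the kind of circuit involved (not the four-qubit circuits studied in the thesis), the sketch below builds a two-qubit Grover search with Qiskit, assuming Qiskit is installed; the oracle marks the state |11>, and wrapping the oracle's CZ in X gates would mark a different basis state.

from qiskit import QuantumCircuit

def grover_two_qubits():
    """One Grover iteration on two qubits; for n = 2 a single iteration
    already amplifies the marked state |11> to probability close to 1."""
    qc = QuantumCircuit(2, 2)
    qc.h([0, 1])            # uniform superposition over the four basis states
    qc.cz(0, 1)             # oracle: phase-flip |11>
    qc.h([0, 1])            # diffusion operator: inversion about the mean
    qc.x([0, 1])
    qc.cz(0, 1)
    qc.x([0, 1])
    qc.h([0, 1])
    qc.measure([0, 1], [0, 1])
    return qc

print(grover_two_qubits().draw())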

    Effective Resource and Workload Management in Data Centers

    Get PDF
    The increasing demand for storage, computation, and business continuity has driven the growth of data centers. Managing data centers efficiently is a difficult task because of the wide variety of data center applications, their ever-changing intensities, and the fact that application performance targets may differ widely. Server virtualization has been a game-changing technology for IT, providing the possibility to support multiple virtual machines (VMs) simultaneously. This dissertation focuses on how virtualization technologies can be utilized to develop new tools for maintaining high resource utilization, for achieving high application performance, and for reducing the cost of data center management. For multi-tiered applications, bursty workload traffic can significantly deteriorate performance. This dissertation proposes an admission control algorithm, AWAIT, for handling overload conditions in multi-tier web services. AWAIT places requests of accepted sessions on hold and refuses to admit new sessions when the system experiences a sudden workload surge. To meet the service-level objective, AWAIT serves the requests in the blocking queue with high priority. The size of the queue is determined dynamically according to the workload burstiness. Many admission control policies are triggered by instantaneous measurements of system resource usage, e.g., CPU utilization. This dissertation first demonstrates that directly measuring virtual machine resource utilizations with standard tools cannot always lead to accurate estimates. A directed factor graph (DFG) model is defined to capture the dependencies among multiple types of resources across physical and virtual layers. Virtualized data centers enable sharing of resources among hosted applications to achieve high resource utilization. However, it is difficult to satisfy application SLOs on a shared infrastructure, as application workload patterns change over time. AppRM, an automated management system, not only allocates the right amount of resources to applications to meet their performance targets but also adjusts to dynamic workloads using an adaptive model. Server consolidation is one of the key applications of server virtualization. This dissertation proposes a VM consolidation mechanism, first by extending the fair load balancing scheme for multi-dimensional vector scheduling, and then by using a queueing network model to capture the service contentions for a particular virtual machine placement
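
    The sketch below captures the flavor of surge-time admission control described above; the class name, thresholds, and burstiness-based resizing rule are illustrative stand-ins rather than the dissertation's AWAIT algorithm.

from collections import deque

class SurgeAdmissionController:
    """Toy admission controller: hold requests of already-accepted sessions
    during a surge, refuse new sessions, and size the holding queue from
    measured burstiness."""

    def __init__(self, capacity, base_queue_limit):
        self.capacity = capacity          # requests served per scheduling tick
        self.queue_limit = base_queue_limit
        self.queue = deque()              # held requests from accepted sessions
        self.sessions = set()

    def on_request(self, session_id):
        if session_id in self.sessions:   # accepted session: hold rather than drop
            if len(self.queue) < self.queue_limit:
                self.queue.append(session_id)
                return "held"
            return "dropped"
        if not self.queue:                # no surge in progress: admit the new session
            self.sessions.add(session_id)
            return "admitted"
        return "rejected"                 # surge: refuse to admit new sessions

    def tick(self, burstiness):
        # Resize the holding queue with the measured workload burstiness,
        # then serve held requests with high priority, up to capacity.
        self.queue_limit = max(1, int(self.capacity * burstiness))
        n = min(self.capacity, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]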