2,510 research outputs found
Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
On-line planning and scheduling: an application to controlling modular printers
We present a case study of artificial intelligence techniques applied to the control of production printing equipment. Like many other real-world applications, this complex domain requires high-speed autonomous decision-making and robust continual operation. To our knowledge, this work represents the first successful industrial application of embedded domain-independent temporal planning. Our system handles execution failures and multi-objective preferences. At its heart is an on-line algorithm that combines techniques from state-space planning and partial-order scheduling. We suggest that this general architecture may prove useful in other applications as more intelligent systems operate in continual, on-line settings. Our system has been used to drive several commercial prototypes and has enabled a new product architecture for our industrial partner. When compared with state-of-the-art off-line planners, our system is hundreds of times faster and often finds better plans. Our experience demonstrates that domain-independent AI planning based on heuristic search can flexibly handle time, resources, replanning, and multiple objectives in a high-speed practical application without requiring hand-coded control knowledge
Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs
Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation.
In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise.
In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages.
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits.
Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trialand-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine.
Finally, we explore potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC.La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela.
En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo.
En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada
proceso MPI opera localmente en el paso de mensajes.
En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa.
En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC
Scalable Schedule-Aware Bundle Routing
This thesis introduces approaches providing scalable delay-/disruption-tolerant routing capabilities in scheduled space topologies. The solution is developed for the requirements derived from use cases built according to predictions for future space topology, like the future Mars communications architecture report from the interagency operations advisory group. A novel routing algorithm is depicted to provide optimized networking performance that discards the scalability issues inherent to state-of-the-art approaches. This thesis also proposes a new recommendation to render volume management concerns generic and easily exchangeable, including a new simple management technique increasing volume awareness accuracy while being adaptable to more particular use cases. Additionally, this thesis introduces a more robust and scalable approach for internetworking between subnetworks to increase the throughput, reduce delays, and ease configuration thanks to its high flexibility.:1 Introduction
1.1 Motivation
1.2 Problem statement
1.3 Objectives
1.4 Outline
2 Requirements
2.1 Use cases
2.2 Requirements
2.2.1 Requirement analysis
2.2.2 Requirements relative to the routing algorithm
2.2.3 Requirements relative to the volume management
2.2.4 Requirements relative to interregional routing
3 Fundamentals
3.1 Delay-/disruption-tolerant networking
3.1.1 Architecture
3.1.2 Opportunistic and deterministic DTNs
3.1.3 DTN routing
3.1.4 Contact plans
3.1.5 Volume management
3.1.6 Regions
3.2 Contact graph routing
3.2.1 A non-replication routing scheme
3.2.2 Route construction
3.2.3 Route selection
3.2.4 Enhancements and main features
3.3 Graph theory and DTN routing
3.3.1 Mapping with DTN objects
3.3.2 Shortest path algorithm
3.3.3 Edge and vertex contraction
3.4 Algorithmic determinism and predictability
4 Preliminary analysis
4.1 Node and contact graphs
4.2 Scenario
4.3 Route construction in ION-CGR
4.4 Alternative route search
4.4.1 Yen’s algorithm scalability
4.4.2 Blocking issues with Yen
4.4.3 Limiting contact approaches
4.5 CGR-multicast and shortest-path tree search
4.6 Volume management
4.6.1 Volume obstruction
4.6.2 Contact sink
4.6.3 Ghost queue
4.6.4 Data rate variations
4.7 Hierarchical interregional routing
4.8 Other potential issues
5 State-of-the-art and related work
5.1 Taxonomy
5.2 Opportunistic and probabilistic approaches
5.2.1 Flooding approaches
5.2.2 PROPHET
5.2.3 MaxProp
5.2.4 Issues
5.3 Deterministic approaches
5.3.1 Movement-aware routing over interplanetary networks
5.3.2 Delay-tolerant link state routing
5.3.3 DTN routing for quasi-deterministic networks
5.3.4 Issues
5.4 CGR variants and enhancements
5.4.1 CGR alternative routing table computation
5.4.2 CGR-multicast
5.4.3 CGR extensions
5.4.4 RUCoP and CGR-hop
5.4.5 Issues
5.5 Interregional routing
5.5.1 Border gateway protocol
5.5.2 Hierarchical interregional routing
5.5.3 Issues
5.6 Further approaches
5.6.1 Machine learning approaches
5.6.2 Tropical geometry
6 Scalable schedule-aware bundle routing
6.1 Overview
6.2 Shortest-path tree routing for space networks
6.2.1 Structure
6.2.2 Tree construction
6.2.3 Tree management
6.2.4 Tree caching
6.3 Contact segmentation
6.3.1 Volume management interface
6.3.2 Simple volume manager
6.3.3 Enhanced volume manager
6.4 Contact passageways
6.4.1 Regional border definition
6.4.2 Virtual nodes
6.4.3 Pathfinding and administration
7 Evaluation
7.1 Methodology
7.1.1 Simulation tools
7.1.2 Simulator extensions
7.1.3 Algorithms and scenarios
7.2 Offline analysis
7.3 Eliminatory processing pressures
7.4 Networking performance
7.4.1 Intraregional unicast routing tests
7.4.2 Intraregional multicast tests
7.4.3 Interregional routing tests
7.4.4 Behavior with congestion
7.5 Requirement fulfillment
8 Summary and Outlook
8.1 Conclusion
8.2 Future works
8.2.1 Next development steps
8.2.2 Contact graph routin
Scheduling of Overload-Tolerant Computation and Multi-Mode Communication in Real-Time Systems
Real-time tasks require sufficient resources to meet deadline constraints. A component should provision sufficient resources for its workloads consisting of tasks to meet their deadlines. Supply and demand bound functions can be used to analyze the schedulability of workloads. The demand-bound function determines the maximum required computational units for a given workload and the supply-bound function determines the minimum possible resources supplied to the workload. A component will experience an overload if it receives fewer resources than required. An overload will be transient if it occurs for a bounded amount of time. Most work concentrates on designing components that avoid overloads by over-provisioning resources even though some computational units such as control system components can tolerate transient overloads. Overload-tolerant components can utilize resources more efficiently if over-provisioning of resources can be avoided.
First, this dissertation presents the design of an efficient periodic resource model for scheduling computation of components that can tolerate transient overloads under the Earliest Deadline First (EDF) scheduling policy. We propose a periodic resource model for overload-tolerant components to address three problems: (1) characterize overloads and determine metrics of interest (i.e., delay), (2) derive a model to compute a periodic resource supply for a given workload and a worst-case tolerable delay, and (3) find a periodic resource supply for given control system specifications with a worst-case delay. The derived periodic resource supply can be used to derive an overload-tolerant component interface. Overload-tolerant real-time components can connect with each other in a distributed manner and thus require communication scheduling for reliable and guaranteed transmissions. Moreover, applications may require multi-mode communication for efficient data transmission.
Second, this dissertation discusses communication schedules for multi-mode distributed components. Since distributed multi-mode applications are prone to suffer from delays incurred during mode changes, good communication schedules have low average mode-change delays. A key problem in designing multi-mode communication in real-time systems is the generation of schedules to move away the complexity of schedule design from the developer. We propose a mechanism to generate multi-mode communication schedules using optimization constraints associated with timing requirements. We illustrate a workflow from specifications to the generation of communication schedules through a real-time video monitoring case-study. Experimental analysis for the case-study demonstrates that schedules generated using the proposed method reduce the average mode-change delay compared to a randomized algorithm and the well-known EDF scheduling policy.
Finally, this thesis discusses the synthesis of schedules for computation and communication to achieve not only performance but also separation of concerns for reducing complexity and increasing safety. To integrate overload-tolerant components using real-time communication, we derive specifications of component interfaces using the characterization of overloads and the proposed periodic resource model. The generation of communication schedules uses the specifications of interfaces which include timing requirements of possible transient overloads. A walk-through case-study explains the steps necessary to generate communication schedules using component interfaces. The interfaces provide safety through isolation of transient overload-tolerant components and the generated communication schedules provide high performance as a result of their low average mode-change delay
Master/worker parallel discrete event simulation
The execution of parallel discrete event simulation across metacomputing infrastructures is examined. A master/worker architecture for parallel discrete event simulation is proposed providing robust executions under a dynamic set of services with system-level support for fault tolerance, semi-automated client-directed load balancing, portability across heterogeneous machines, and the ability to run codes on idle or time-sharing clients without significant interaction by users. Research questions and challenges associated with issues and limitations with the work distribution paradigm, targeted computational domain, performance metrics, and the intended class of applications to be used in this context are analyzed and discussed. A portable web services approach to master/worker parallel discrete event simulation is proposed and evaluated with subsequent optimizations to increase the efficiency of large-scale simulation execution through distributed master service design and intrinsic overhead reduction. New techniques for addressing challenges associated with optimistic parallel discrete event simulation across metacomputing such as rollbacks and message unsending with an inherently different computation paradigm utilizing master services and time windows are proposed and examined. Results indicate that a master/worker approach utilizing loosely coupled resources is a viable means for high throughput parallel discrete event simulation by enhancing existing computational capacity or providing alternate execution capability for less time-critical codes.Ph.D.Committee Chair: Fujimoto, Richard; Committee Member: Bader, David; Committee Member: Perumalla, Kalyan; Committee Member: Riley, George; Committee Member: Vuduc, Richar
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis
Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions
- …