
    Automating Topology Aware Mapping for Supercomputers

    Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap, and communication on the network is becoming the bottleneck for scaling parallel applications. Network contention, specifically, is becoming an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention. Most parallel applications have a certain communication topology. Mapping the tasks of a parallel application, based on their communication graph, to the physical processors of a machine can lead to performance improvements. The research problem under consideration is mapping an application's communication graph onto the interconnect topology of a machine while trying to localize communication. The farther messages travel on the network, the greater the chance of resource sharing between them, which can create contention on the networks commonly used today. Evaluative studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Realizing this fact, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placing communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention. Performance improvements through topology-aware placement for applications such as NAMD and OpenAtom motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for automatic mapping of parallel applications to relieve application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms. The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology-aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with the capability to choose the best mapping for a problem with a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques, which will be imperative for machines of the future.
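    The hop-bytes metric mentioned above admits a compact definition: for every edge of the communication graph, weight the bytes exchanged by the number of network hops between the processors the two tasks are mapped to, and sum over all edges. The following is a minimal sketch assuming a 3D-torus interconnect (as on Blue Gene-class machines); the function and variable names are illustrative, not the dissertation's code.

        def torus_hops(a, b, dims):
            """Manhattan distance on a 3D torus, accounting for wraparound links."""
            return sum(min(abs(x - y), d - abs(x - y))
                       for x, y, d in zip(a, b, dims))

        def hop_bytes(comm_graph, mapping, dims):
            """comm_graph: {(task_u, task_v): bytes sent}; mapping: task -> (x, y, z)."""
            return sum(nbytes * torus_hops(mapping[u], mapping[v], dims)
                       for (u, v), nbytes in comm_graph.items())

        # Example: two tasks exchanging 1 MB, placed 2 hops apart,
        # contribute 2 MB of hop-bytes.
        hb = hop_bytes({(0, 1): 1 << 20}, {0: (0, 0, 0), 1: (2, 0, 0)}, dims=(8, 8, 8))

    Unlike maximum dilation, which tracks only the single worst edge, hop-bytes grows with every byte that occupies extra links, so it correlates more directly with the aggregate traffic, and hence contention, that a mapping induces.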

    Toward Distributed At-scale Hybrid Network Test with Emulation and Simulation Symbiosis

    In the past decade or so, significant advances have been made in the field of Future Internet Architecture (FIA) design. Undoubtedly, the size of the Future Internet will increase tremendously, and so will the complexity of its users' behaviors. This means most Future Internet applications and services can only achieve and demonstrate their full potential at large scale. The development of network testbeds that can validate key design decisions and expose operational issues at scale is essential to FIA research. In conjunction with the development and advancement of FIA, cyber-infrastructure testbeds have also achieved remarkable progress. For meaningful network studies, it is indispensable to utilize cyber-infrastructure testbeds appropriately in order to obtain accurate experimental results. However, existing network experimentation is intrinsically limited: current testbeds do not offer scalability, flexibility, and realism at the same time. This dissertation aims to construct a hybrid system for conducting at-scale network studies and experiments by exploiting the distributed computing capability of current testbeds. First, this work presents a synchronization scheme for parallel discrete-event simulation that offers transparent scalability and performance on various high-end computing platforms. The parallel simulator we implement is configured so that it can self-adapt for performance while running on supercomputers with disparate architectures, and it can handle models of different sizes, modeling details, and complexity levels. Second, this work addresses the issue of researching network design and implementation realistically at scale, through the use of distributed cyber-infrastructure testbeds. An existing symbiotic approach is applied to integrate emulation with simulation so that they can overcome the limitations of a physical setup. The symbiotic method is used to improve the capabilities of a specific emulator, Mininet: applications run directly on the virtual machines and software switches, with network connectivity represented by detailed simulation at scale. We also propose a method for using the symbiotic approach to coordinate separate Mininet instances, each representing a different set of the overlapping network flows. This approach significantly improves the scalability of network experiments.
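    For the parallel discrete-event simulation synchronization mentioned above, a common conservative scheme computes, each round, a lower bound on the timestamp of any future incoming message and lets every logical process safely execute events below that bound. The sketch below illustrates only this general idea; the class and function names are assumptions, not the dissertation's simulator API.

        import heapq

        class LogicalProcess:
            def __init__(self, lookahead):
                self.lookahead = lookahead   # minimum delay on any message this LP sends
                self.events = []             # min-heap of (timestamp, action)

            def next_time(self):
                return self.events[0][0] if self.events else float("inf")

        def simulate_round(lps):
            # Lower bound on timestamp (LBTS): no LP can receive a message earlier
            # than some peer's next event time plus that peer's lookahead.
            lbts = min(lp.next_time() + lp.lookahead for lp in lps)
            for lp in lps:
                # Events strictly below the bound are safe to process in parallel.
                while lp.events and lp.events[0][0] < lbts:
                    timestamp, action = heapq.heappop(lp.events)
                    action(timestamp)
            return lbts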

    netloc: Towards a Comprehensive View of the HPC System Topology

    The increasing complexity of High Performance Computing (HPC) server architectures and networks has made topology- and affinity-awareness a critical component of HPC application optimization. Although there is a portable mechanism for accessing the server-internal topology, there is no equally portable mechanism for accessing the network topology of modern HPC systems. The Network Locality (netloc) project provides mechanisms for portably discovering and abstractly representing the network topology of modern HPC systems. Additionally, netloc can merge the network topology with the server-internal topologies, resulting in a comprehensive map of the HPC system topology. Using a modular infrastructure, netloc supports a variety of network types and discovery techniques. By representing the network topology as a graph, netloc supports any network topology configuration. The netloc architecture hides the topology discovery mechanism from application developers, allowing them to focus on traversing and analyzing the resulting map of the HPC system topology.
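    The comprehensive map described above can be pictured as an annotated graph. The generic sketch below conveys the idea using networkx; it is not netloc's actual interface, and all node names are placeholders.

        import networkx as nx

        fabric = nx.Graph()
        fabric.add_edge("switch0", "switch1")    # inter-switch link
        fabric.add_edge("switch0", "host0")      # host uplinks
        fabric.add_edge("switch1", "host1")

        # Merging the server-internal topology: annotate each host node with
        # its intra-node hierarchy so one structure covers the whole system.
        fabric.nodes["host0"]["internal"] = {"numa0": ["core0", "core1"],
                                             "numa1": ["core2", "core3"]}

        # Analysis code can then traverse network and node structure together,
        # independent of how the topology was discovered.
        path = nx.shortest_path(fabric, "host0", "host1")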

    Load Balancing Scientific Applications

    The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate physical systems that evolve over time, so their workloads change over time. Furthermore, the cost of imbalanced load increases with scale: most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors must wait for the slowest one at synchronization points. To address load imbalance, many large-scale parallel applications use dynamic load balancing algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance an application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balancing algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparing load balancing algorithms for a specific state of the simulation, enabling the selection of the algorithm that will minimize overall runtime. Dynamic load balancing of parallel applications becomes more critical at scale, while also becoming more expensive. To make load balance correction affordable at scale, we propose a lazy load balancing strategy that evaluates the imbalance and computes the new assignment of work to processes asynchronously with the main application computation. We decouple the load balancing algorithm from the application and run it on potentially fewer, separate processors. In this Multiple Program Multiple Data (MPMD) configuration, the load balancing algorithm can execute concurrently with the application and with higher parallel efficiency than if it were run on the same processors as the simulation. Work is reassigned lazily as new assignments become available, and the application need not wait for the load balancing algorithm to complete. We show that we can save resources by running a load balancing algorithm at higher parallel efficiency on a smaller number of processors. Using our framework, we explore the trade-offs of load balancing configurations and demonstrate a performance improvement of up to 46%.
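    The core "when to balance" decision described above can be stated as a simple cost comparison: correct the imbalance only when the time the slowest processor costs the others over the coming iterations exceeds the cost of running the balancer itself. A minimal sketch, with assumed cost terms rather than the dissertation's actual model:

        def should_rebalance(loads, iters_ahead, balance_cost):
            """loads: per-processor work for one iteration, in seconds."""
            t_max = max(loads)                # SPMD step time = slowest rank
            t_avg = sum(loads) / len(loads)   # ideal balanced step time
            imbalance_loss = (t_max - t_avg) * iters_ahead
            return imbalance_loss > balance_cost

        # Example: 4 ranks, 100 future iterations, balancer costs 5 s to run.
        # (1.4 - 1.1) * 100 = 30 s of waiting recovered for a 5 s investment.
        should_rebalance([1.0, 1.0, 1.0, 1.4], iters_ahead=100, balance_cost=5.0)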

    Adaptive structured parallelism

    Algorithmic skeletons abstract commonly-used patterns of parallel computation, communication, and interaction. Parallel programs are expressed by interweaving parameterised skeletons, analogously to the way in which structured sequential programs are developed using well-defined constructs. Skeletons provide top-down design composition and control inheritance throughout the program structure. Based on the algorithmic skeleton concept, structured parallelism provides a high-level parallel programming technique which allows the conceptual description of parallel programs whilst fostering platform independence and algorithm abstraction. By decoupling the algorithm specification from machine-dependent structural considerations, structured parallelism allows programmers to code programs regardless of how the computation and communications will be executed in the target platform. Meanwhile, large non-dedicated multiprocessing systems have long posed a challenge to known distributed-systems programming techniques as a result of the inherent heterogeneity and dynamism of their resources. Scant research has been devoted to using the structural information provided by skeletons to adaptively improve program performance based on resource utilisation. This thesis presents a methodology to improve skeletal parallel programming in heterogeneous distributed systems by introducing adaptivity through resource awareness. As we hypothesise that a skeletal program should be able to adapt to dynamic resource conditions over time using its structural forecasting information, we have developed ASPara: Adaptive Structured Parallelism. ASPara is a generic methodology for incorporating structural information into a parallel program at compilation time, which helps it adapt at execution time.
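    To make the skeleton idea concrete, the sketch below composes two classic skeletons, a task farm and a pipeline, behind which the execution details stay hidden. The names are illustrative and do not reproduce ASPara's interface.

        from concurrent.futures import ProcessPoolExecutor

        def farm(worker, inputs, nworkers=4):
            """Task-farm skeleton: apply `worker` to each input in parallel."""
            with ProcessPoolExecutor(max_workers=nworkers) as pool:
                return list(pool.map(worker, inputs))

        def pipeline(*stages):
            """Pipeline skeleton: compose stages into a single callable."""
            def run(x):
                for stage in stages:
                    x = stage(x)
                return x
            return run

    The programmer writes only sequential stage code; an adaptive runtime in the spirit of ASPara could then re-tune parameters such as nworkers as resource conditions change, using the structure the skeletons expose.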

    On the simulation and design of manycore CMPs

    The progression of Moore's Law has resulted in both embedded and performance computing systems which use an ever-increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular, as it is easier to guarantee hard-realtime constraints using individual cores dedicated to tasks than to use traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial-and-error exploration of the design space. This thesis tackles the problems encountered in the design of these large-scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, partitioning the NoC and using packet counting and cycle skipping reduce the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining less error, when compared to real hardware, than any similar simulator. Simulation speeds reach up to 370 MIPS (million target instructions per second), or 110 MHz, which is better than typical FPGA prototypes and approaches final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems to 64-1024-core processors, adding support for coherent architectures using the same packet-counting techniques, along with low-overhead context switching, to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation, further improving simulation speeds without sacrificing any accuracy. These innovations were leveraged to investigate significant novel energy-saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, named wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. It works by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or a traditional timer or external interrupt) the core resumes execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, as would be the case with a system running many tasks with a small degree of parallelism in each.
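    In simulator terms, the wait-on-address semantics just described can be modeled roughly as follows; this is an illustrative sketch, not the thesis's implementation. The core parks in a low-power state and is woken by coherency traffic on the watched line or by an interrupt.

        class Core:
            def __init__(self):
                self.state = "RUNNING"
                self.watch_addr = None

            def wait_on_address(self, addr):
                # Enter low-power idle: the pipeline and cache accesses stop,
                # so only static power accrues while the line is monitored.
                self.watch_addr = addr
                self.state = "IDLE"

            def on_coherency_event(self, addr):
                # An update or invalidation of the watched cache line resumes
                # execution, as would a timer or external interrupt.
                if self.state == "IDLE" and addr == self.watch_addr:
                    self.watch_addr = None
                    self.state = "RUNNING"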
    The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a large, diverse range of potential MPSoC designs. This data was used to train a series of machine-learning-based models, which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space: on average, Random Forest was able to predict a best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay, and area, Random Forest on average produced results within 93% of the optimum. In summary, this thesis improves upon the state of the art for cycle-accurate multicore simulation, introduces novel energy-saving changes to the ISA and microarchitecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design-space exploration required to bring a new manycore design to market.
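    The prediction flow described above could look roughly like the following sketch, assuming scikit-learn; the feature layout and numbers are placeholders, not the thesis's dataset.

        from sklearn.ensemble import RandomForestRegressor

        # Each row pairs MPSoC design parameters (e.g. core count, cache size)
        # with statistics gathered from two sample simulations of the workload.
        X_train = [[4, 32, 1.2, 0.8], [8, 64, 2.4, 1.5], [16, 128, 4.1, 2.9]]
        y_train = [10.2, 6.1, 4.8]    # measured runtime of the full workload

        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_train, y_train)

        # Rank unseen designs by predicted runtime and pick the best candidate.
        candidates = [[8, 32, 2.0, 1.1], [16, 64, 3.8, 2.2]]
        best = min(zip(model.predict(candidates), candidates))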

    NASA space station automation: AI-based technology review

    Research and development projects in automation for the Space Station are discussed. Artificial Intelligence (AI) based automation technologies are planned to enhance crew safety through a reduced need for EVA, increase crew productivity through the reduction of routine operations, increase Space Station autonomy, and augment Space Station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures.

    Transport Layer solution for bulk data transfers over Heterogeneous Long Fat Networks in Next Generation Networks

    This compendium thesis focuses its contributions on the learning and innovation of Next Generation Networks (NGNs). Contributions are proposed in different areas (Smart Cities, Smart Grids, Smart Campus, Smart Learning, Media, eHealth, Industry 4.0, among others) through the application and combination of different disciplines (Internet of Things, Building Information Modeling, Cloud Storage, Cybersecurity, Big Data, Future Internet, Digital Transformation). Specifically, sustainable comfort monitoring in the Smart Campus is detailed, which can be considered my most representative contribution within the conceptualization of Next Generation Networks. Within this innovative monitoring concept, different disciplines are integrated in order to offer information on people's comfort levels. This research demonstrates the long road remaining in the digital transformation of traditional sectors and NGNs. During this long learning process about NGNs across the different investigations, a problem was observed that affected the different application fields of NGNs transversally and that, depending on the service and its requirements, could have a critical impact on any of these sectors. This problem is the low performance of exchanges of large volumes of data over networks with high bandwidth capacity whose endpoints are geographically far apart, known as elephant networks or Long Fat Networks (LFNs). Specifically, it critically affects the massive exchange of data between Cloud regions (the Cloud Data Sharing use case). That is why this use case and the different alternatives at the transport-protocol level were studied: the performance and operational problems suffered by layer 4 protocols are analyzed, and it is observed why these traditional protocols are not capable of achieving optimal performance. Given this situation, it is hypothesized that the introduction of mechanisms that analyze network metrics and efficiently exploit the network's capacity improves the performance of Transport Layer protocols over Heterogeneous Long Fat Networks during bulk data transfers. First, the Adaptive and Aggressive Transport Protocol (AATP) is designed, an adaptive and efficient transport protocol with the aim of maximizing performance over this type of elephant network. AATP is implemented and tested in a network simulator and a testbed under different situations and conditions for its validation. Once AATP was designed, implemented, and tested successfully, the protocol itself was improved into Enhanced-AATP to raise its performance over heterogeneous elephant networks; to this end, a mechanism based on the Jitter Ratio was designed to differentiate such networks. In addition, to upgrade the protocol's behavior, its fairness system was improved for the fair distribution of resources among concurrent Enhanced-AATP flows. Finally, this evolution was implemented in the network simulator and a set of tests was carried out. At the end of this thesis, it is concluded that Next Generation Networks still have a long way to go and much to improve due to the digital transformation of society and the appearance of disruptive new technology. Furthermore, it is confirmed that the introduction of specific mechanisms in the conception and operation of transport protocols improves their performance over Heterogeneous Long Fat Networks.
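    The hypothesis above lends itself to a compact illustration: adapt the sending rate from a measured jitter ratio, so that rising jitter (suggesting queue build-up) triggers a back-off while a stable long-fat path is probed toward capacity. This is a hedged sketch of the general idea only; the thresholds, factors, and names are assumptions, not AATP's actual specification.

        def next_rate(rate, jitter_ratio, capacity,
                      threshold=0.1, backoff=0.8, growth=1.05):
            """jitter_ratio: observed delay variation relative to the base RTT."""
            if jitter_ratio > threshold:
                # Rising jitter suggests queues are building: back off.
                return rate * backoff
            # Stable jitter on a long fat path: probe toward link capacity.
            return min(rate * growth, capacity)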