22 research outputs found

    A Visual Analytics System for Optimizing Communications in Massively Parallel Applications

    Get PDF
    Current and future supercomputers have tens of thousands of compute nodes interconnected with high-dimensional networks and complex network topologies for improved performance. Application developers are required to write scalable parallel programs in order to achieve high throughput on these machines. Application performance is largely determined by efficient inter-process communication. A common way to analyze and optimize performance is to profile parallel codes to identify communication bottlenecks. However, understanding gigabytes of profile data is not a trivial task. In this paper, we present a visual analytics system for identifying the scalability bottlenecks and improving the communication efficiency of massively parallel applications. The visualization methods in this system are designed to help analysts comprehend large-scale and varied communication patterns on thousands of nodes in complex networks such as the 5D torus and the dragonfly. We also present efficient rerouting and remapping algorithms that can be coupled with our interactive visual analytics design for performance optimization. We demonstrate the utility of our system with several case studies using three benchmark applications on two leading supercomputers. The mapping suggested by our system led to a 38% improvement in hop-bytes for the MiniAMR application on 4,096 MPI processes. This research has been sponsored in part by the U.S. National Science Foundation through grant IIS-1320229, and the U.S. Department of Energy through grants DE-SC0012610 and DE-SC0014917. This research has been funded in part and used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-06CH11357. This work was supported in part by the DOE Office of Science, ASCR, under award numbers 57L38, 57L32, 57L11, 57K50, and 508050.
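    Hop-bytes, the metric improved above, weights each message's size by the number of network hops it traverses, so a lower value means communicating ranks sit closer together in the topology. Below is a minimal sketch of the computation, assuming a profiled traffic matrix and a topology-specific hops function (both hypothetical; the paper's own profiling format is not specified):

        from typing import Callable

        # Hop-bytes: each message's byte count weighted by the hops it travels.
        # `traffic` and `hops` stand in for a profiled communication matrix and
        # a topology-specific distance function.
        def hop_bytes(traffic: dict[tuple[int, int], int],
                      hops: Callable[[int, int], int]) -> int:
            return sum(nbytes * hops(src, dst)
                       for (src, dst), nbytes in traffic.items())

        # Example: ranks 0 and 1 exchange 1 MiB over a 3-hop route.
        print(hop_bytes({(0, 1): 1 << 20}, lambda a, b: 3))  # 3145728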

    Understanding Performance Variability on the Aries Dragonfly Network

    Full text link

    A Visual Analytics Framework for Reviewing Streaming Performance Data

    Full text link
    Understanding and tuning the performance of extreme-scale parallel computing systems demands a streaming approach due to the computational cost of applying offline algorithms to vast amounts of performance log data. Analyzing large streaming data is challenging because the rate at which data arrive and the limited time available to comprehend them make it difficult for analysts to sufficiently examine the data without missing important changes or patterns. To support streaming data analysis, we introduce a visual analytics framework comprising three modules: data management, analysis, and interactive visualization. The data management module collects various computing and communication performance metrics from the monitored system using streaming data processing techniques and feeds the data to the other two modules. The analysis module automatically identifies important changes and patterns at the required latency. In particular, we introduce a set of online and progressive analysis methods for not only controlling the computational costs but also helping analysts better follow the critical aspects of the analysis results. Finally, the interactive visualization module provides the analysts with a coherent view of the changes and patterns in the continuously captured performance data. Through a multi-faceted case study on performance analysis of parallel discrete-event simulation, we demonstrate the effectiveness of our framework for identifying bottlenecks and locating outliers. Comment: This is the author's preprint version that will be published in Proceedings of IEEE Pacific Visualization Symposium, 202
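    The abstract does not name the specific online algorithms; as one illustration of how an analysis step can run at constant cost per sample, here is a minimal sketch using Welford's online mean/variance update with a simple k-sigma outlier test (the metric and threshold are hypothetical):

        import math

        class OnlineStats:
            """Welford's online mean/variance: one O(1) update per sample, so
            per-metric cost stays constant regardless of stream length."""
            def __init__(self):
                self.n, self.mean, self.m2 = 0, 0.0, 0.0

            def update(self, x: float) -> None:
                self.n += 1
                delta = x - self.mean
                self.mean += delta / self.n
                self.m2 += delta * (x - self.mean)

            def is_outlier(self, x: float, k: float = 3.0) -> bool:
                # Flag samples more than k standard deviations from the mean.
                if self.n < 2:
                    return False
                return abs(x - self.mean) > k * math.sqrt(self.m2 / (self.n - 1))

        stats = OnlineStats()
        for x in (1.0, 1.1, 0.9, 1.05):
            stats.update(x)
        print(stats.is_outlier(5.0))  # True: far from the running mean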

    Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management

    Full text link
    Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units. This thesis claims that to improve the robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than are available in today's systems are needed, and that this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management. We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead. We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations. This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to a 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.
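    The abstract reports 98% diagnosis accuracy but does not spell out the pipeline; a minimal sketch of one plausible shape follows: windowed telemetry summarized into features that feed a scikit-learn classifier trained on labeled (injected) anomalies, with all data here synthetic:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def window_features(samples: np.ndarray) -> np.ndarray:
            """Summarize one monitoring window (timesteps x metrics)."""
            return np.concatenate([samples.mean(axis=0),
                                   samples.std(axis=0),
                                   np.percentile(samples, 95, axis=0)])

        # Toy training set: windows labeled 0 (healthy) or 1 (injected anomaly).
        rng = np.random.default_rng(0)
        X = np.stack([window_features(rng.normal(loc=lbl, size=(60, 8)))
                      for lbl in (0, 1) for _ in range(50)])
        y = np.repeat([0, 1], 50)

        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        # At runtime, classify the newest window with negligible overhead.
        print(clf.predict(window_features(rng.normal(size=(60, 8)))[None, :]))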

    Photonic Interconnection Networks for Exascale Computers

    Full text link
    In recent years, multiple research projects around the world have focused on the design of supercomputers able to reach the exascale computing barrier, with the aim of supporting the execution of applications that are important for our society in fields such as health, artificial intelligence, and meteorology. According to the growing trend in computational power in each supercomputer generation, this objective is expected to be reached in the coming years. However, achieving this goal requires addressing major challenges in the design and development of the system. One of the main ones is to achieve fast and efficient communication between the huge number of computational nodes and the memory systems. Photonic technology provides several advantages over current electrical networks, such as higher bandwidth in the links, greater network parallelism thanks to DWDM, and better cable management due to its much smaller size. In this thesis, a feasibility study and development of interconnection networks using photonic technology for future exascale systems has been carried out within the European project ExaNeSt. First, a characterization study of the network requirements of exascale applications was performed. The results of the analysis help understand the network requirements of exascale applications and thereby guide the design of the system network. This analysis considers three main parameters: the distribution of messages by size and type, the bandwidth consumption required throughout the execution, and the spatial communication patterns between the nodes. The study reveals the need for a fast and efficient interconnection network, since most communications consist of bursts of transmissions with an average message size below 50 KB. Next, this dissertation concentrates on identifying the main elements that differentiate photonic networks from electrical ones. We identify a sequence of steps in the design and implementation of a simulator, either i) dealing with photonic technology from scratch or ii) extending an existing electrical network simulator in order to model photonics. After that, two performance comparison studies between electrical networks and different configurations of photonic networks are presented using classical topologies. In the first study, carried out with both synthetic traffic and ExaNeSt traces on a torus, a fat tree, and a dragonfly, we found that photonic technology represents a noticeable improvement over electrical technology. Furthermore, the study shows that the parameter that most affects performance is the bandwidth of the photonic channel. The second study analyzes the performance of real applications in large-scale simulations on a jellyfish topology. Its results corroborate the conclusions of the first study, also revealing that photonic technology allows reducing the complexity of some topologies and, therefore, the cost of the network. These studies showed that the network was underutilized, mainly because the topologies designed for electrical networks do not take advantage of the features provided by photonic technology. For this reason, we propose Segment Switching, a switching strategy aimed at reducing the length of routes by implementing buffers at intermediate nodes along the path. Experimental results show that each of the studied topologies has different buffering requirements: for the torus, the more buffers in the network, the higher the performance; for the fat tree, the key parameter is the buffer size, with a configuration that places buffers on all switches achieving performance similar to one that places them only at the top level. In summary, this thesis studies the use of photonic technology for the networks of exascale systems and proposes taking advantage of the characteristics of this technology in current electrical network topologies. This thesis has been conceived from the work carried out by the Polytechnic University of Valencia in the ExaNeSt European project. Duro Gómez, J. (2021). Photonic Interconnection Networks for Exascale Computers [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/166796
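    Segment Switching is described only at a high level; the sketch below illustrates the idea behind that description, splitting an end-to-end route at buffered intermediate nodes so that each reserved segment is shorter than the full path (node IDs and the buffer set are hypothetical):

        def split_at_buffers(path: list[int], buffers: set[int]) -> list[list[int]]:
            """Split a route into segments ending at buffered nodes. Each
            segment can be switched independently, so the circuit that must be
            reserved at once shrinks from the full path to the longest segment."""
            segments, start = [], 0
            for i, node in enumerate(path[1:-1], start=1):
                if node in buffers:
                    segments.append(path[start:i + 1])
                    start = i
            segments.append(path[start:])
            return segments

        route = [0, 3, 7, 12, 15]            # a 4-hop route in some topology
        print(split_at_buffers(route, {7}))  # [[0, 3, 7], [7, 12, 15]]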

    Characterization and optimization of network traffic in cortical simulation

    Get PDF
    Considering the great variety of obstacles that exascale systems will face in the near future, this thesis pays particular attention to the interconnect and to power consumption. The data movement challenge involves the whole hierarchical organization of components in HPC systems, i.e., registers, cache, memory, and disks. Running scientific applications requires the most effective methods of data transport among the levels of this hierarchy. On current petaflop systems, memory access at all levels is the limiting factor in almost all applications. This drives the requirement for an interconnect that achieves adequate rates of data transfer, or throughput, and reduces time delays, or latency, between the levels. Power consumption is identified as the largest hardware research challenge: the annual power cost to operate an exascale system built with current technology would be above $2.5B per year. Research into alternative power-efficient computing devices is mandatory for the procurement of future HPC systems. In this thesis, a preliminary approach is offered to the critical process of co-design, defined as the simultaneous design of both hardware and software to implement a desired function. This process both integrates all components of the Exascale initiative and illuminates the trade-offs that must be made within this complex undertaking.
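    The abstract frames interconnect adequacy in terms of throughput and latency between hierarchy levels. A standard first-order cost model (an illustration, not taken from the thesis) makes the trade-off concrete:

        def transfer_time(size_bytes: float, latency_s: float, bw_bytes_s: float) -> float:
            """First-order model: time = latency + size / bandwidth.
            Small transfers are latency-bound; large ones bandwidth-bound."""
            return latency_s + size_bytes / bw_bytes_s

        # Hypothetical hierarchy levels: (latency in s, bandwidth in bytes/s).
        levels = {"cache": (1e-9, 1e12), "memory": (1e-7, 1e11), "network": (1e-6, 1e10)}
        for name, (lat, bw) in levels.items():
            print(f"{name}: {transfer_time(64 * 1024, lat, bw):.2e} s")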

    Intelligent middleware for HPC systems to improve performance and energy cost efficiency

    Full text link
    High-performance computing (HPC) systems play an essential role in large-scale scientific computations. As the number of nodes in HPC systems continues to increase, their power consumption leads to larger energy costs. The energy costs pose a financial burden on maintaining HPC systems, which will be more challenging on future extreme-scale systems where the number of nodes and power consumption are expected to grow further. To support this growth, higher degrees of network and memory resource sharing are implemented, causing a substantial increase in performance variation and degradation. These challenges call for innovations in HPC system middleware that reduce energy cost without trading off performance. By taking the performance of an HPC system as a first-order constraint, this thesis establishes that HPC systems can participate in demand response programs while providing performance guarantees through a novel design of the middleware. Well-designed middleware also enables enhanced performance by mitigating resource contention induced by energy or cost restrictions. This thesis aims to realize these goals through two complementary approaches. First, it proposes novel policies for HPC systems to enable their participation in emerging power markets, where participants reduce their energy costs by following market requirements. Our policies guarantee that the Quality-of-Service (QoS) of jobs does not drop below given constraints and systematically optimize cost reduction based on large deviation analysis in queueing theory. Through experiments on a real-world cluster whose power consumption is regulated to follow a dynamically changing power target, this thesis demonstrates that HPC systems can participate in emerging power programs without violating the QoS constraints of jobs. Second, this thesis proposes novel resource management strategies to improve the performance of HPC systems. Better resource management can mitigate the contention that causes performance degradation and poor system utilization. To resolve network contention, we design an intelligent job allocation policy for HPC systems that incorporate the state-of-the-art dragonfly network topology. Our allocation policy mitigates network contention, reduces network communication latency, and consequently improves the performance of the systems. As some of the latest HPC systems support the collection of high-granularity network performance metrics at runtime, we also propose a method to quantify the impact of network congestion and demonstrate that a network-data-driven job allocation policy improves HPC performance by avoiding network traffic hot spots.
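    The allocation policy itself is not detailed in the abstract; a minimal sketch in its spirit is a greedy heuristic that places a job in the dragonfly group with enough free nodes and the lowest measured link load (the group and load structures are hypothetical):

        # Hypothetical congestion-aware placement: choose the dragonfly group
        # with enough free nodes and the lowest measured link load, a simple
        # proxy for avoiding network traffic hot spots.
        def place_job(n_nodes: int,
                      free_nodes: dict[int, int],       # group id -> free nodes
                      link_load: dict[int, float]) -> int | None:
            fitting = [g for g, free in free_nodes.items() if free >= n_nodes]
            if not fitting:
                return None  # no single group fits; a real policy might split
            return min(fitting, key=lambda g: link_load[g])

        print(place_job(4, {0: 8, 1: 2, 2: 6}, {0: 0.9, 1: 0.1, 2: 0.3}))  # 2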

    Net Topo: a Network Topology Abstraction for Scheduling Parallel and Distributed Applications

    Get PDF
    TCC (undergraduate thesis) - Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação. Work partitioning in distributed environments is a relevant problem in applications such as seismic simulations, molecular dynamics, and weather forecasting, since these exhibit dynamic behavior that generates load imbalance in the system. One way to address this problem is to use dynamic load balancers, whose role is to reduce application execution time through a more homogeneous distribution of tasks. A task allocation that takes the network topology into account can reduce communication latencies and contention effects by placing tasks that communicate close to one another. However, few load balancers consider the system topology dynamically, because it is difficult for the balancer to obtain network topology information, such as distance and proximity between tasks, and to take it into account when remapping tasks. This work developed Net Topo, a network topology abstraction that eases access to network topology information and its use for task distribution. It offers proximity and distance functions between nodes in the network, with a focus on improving work partitioning. Several options for initializing the structure were implemented, and a way to archive the topology information was proposed. Net Topo was validated through tests with a synthetic load balancing benchmark. A load balancing strategy was modified to use the proposed abstraction and, when compared with the original, not only incurred negligible overhead but also showed benefits in the initialization and speed of the structure. By archiving the topology information for future runs, Net Topo enabled a 97% reduction in topology initialization time. The performance of the abstraction was evaluated at different scales, showing that Net Topo achieves good scalability in most of its functions.
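    Net Topo's actual API is not given in the abstract, which mentions distance and proximity functions plus archived topology information; below is a minimal sketch of what such an interface might look like, using BFS hop distance and a JSON cache to mimic the archived initialization (all names are hypothetical):

        import json
        import pathlib
        from collections import deque

        class NetTopo:
            """Sketch of a topology abstraction: hop distance between nodes,
            with the distance table archived to disk for future runs."""
            def __init__(self, adjacency: dict[int, list[int]], cache="topo.json"):
                self.adj = adjacency
                path = pathlib.Path(cache)
                if path.exists():  # archived topology: skip recomputation
                    self.dist = {tuple(map(int, k.split(","))): v
                                 for k, v in json.loads(path.read_text()).items()}
                else:
                    self.dist = {(s, d): h for s in adjacency
                                 for d, h in self._bfs(s).items()}
                    path.write_text(json.dumps(
                        {f"{s},{d}": h for (s, d), h in self.dist.items()}))

            def _bfs(self, src: int) -> dict[int, int]:
                hops, queue = {src: 0}, deque([src])
                while queue:
                    u = queue.popleft()
                    for v in self.adj[u]:
                        if v not in hops:
                            hops[v] = hops[u] + 1
                            queue.append(v)
                return hops

            def distance(self, a: int, b: int) -> int:
                return self.dist[(a, b)]

        topo = NetTopo({0: [1], 1: [0, 2], 2: [1]})
        print(topo.distance(0, 2))  # 2 hops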