6 research outputs found

    Shift-Based Parallel Image Compositing on InfiniBand Fat-Trees

    Parallel image compositing has been widely studied over the past 20 years, as it is one of the most crucial elements, if not the most crucial, in the implementation of a scalable parallel rendering system. Many algorithms have been proposed and implemented on a large variety of supercomputers. Among the existing supercomputers, InfiniBand™ (IB) PC clusters, and their associated fat-tree topology, are clearly becoming the dominant architecture, as they provide the scalability, high bandwidth and low latency required by the most demanding parallel applications. Surprisingly, very few efforts have been devoted to the implementation and performance evaluation of parallel image compositing algorithms on this kind of architecture. In this paper, we propose a new parallel image compositing algorithm, called Shift-Based, which relies on a well-known communication pattern called shift permutation. Indeed, shift permutation is one of the possible ways to get the maximum cross bisectional bandwidth provided by an IB fat-tree cluster. We show that our Shift-Based algorithm scales on any number of processing nodes (with peak performance on specific counts), allows overlapping communications with computations and exhibits contention-free network communications. This is demonstrated with the image compositing of very high resolution images at interactive frame rates.
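
The shift permutation mentioned in this abstract can be illustrated with a minimal sketch. It shows only the generic communication pattern (in stage k, node i sends to node (i + k) mod N), not the Shift-Based compositing algorithm itself; the function names are hypothetical and chosen for illustration.

```python
def shift_permutation(num_nodes, shift):
    """Return the (source, destination) pairs of one shift stage.

    Every node sends to the node `shift` positions further along,
    wrapping around, so each node sends once and receives once.
    """
    return [(src, (src + shift) % num_nodes) for src in range(num_nodes)]


def shift_schedule(num_nodes):
    """Return all non-trivial shift stages (shift = 1 .. num_nodes - 1).

    Taken together, the stages cover every ordered pair of distinct
    nodes exactly once.
    """
    return [shift_permutation(num_nodes, k) for k in range(1, num_nodes)]
```

Each stage is a permutation of the nodes, which is what makes the pattern attractive on a fat-tree. Whether a given stage is actually contention-free on a real cluster depends on the routing tables and on how nodes are ordered, which this sketch does not model.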

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. 
This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately as all previous solutions do.
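
As background for the channel dependency graph named in this abstract, the following minimal sketch builds such a graph from a set of routed paths and checks it for cycles, the classic sufficient condition for deadlock freedom in lossless networks. It is not the thesis's routing algorithm, which computes paths on the graph itself; the representation of channels as `(from_switch, to_switch)` tuples and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_cdg(paths):
    """Build a channel dependency graph from routed paths.

    Each path is a list of channels. A packet holding one channel while
    requesting the next creates a dependency edge between the two.
    """
    cdg = defaultdict(set)
    for path in paths:
        for held, requested in zip(path, path[1:]):
            cdg[held].add(requested)
    return cdg


def has_cycle(cdg):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every channel starts WHITE (unvisited)

    def visit(channel):
        color[channel] = GREY
        for nxt in cdg.get(channel, ()):
            if color[nxt] == GREY:
                return True  # back edge: cyclic dependency
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(cdg))


# Two-hop clockwise routes on a unidirectional ring of four switches.
ring_paths = [
    [(0, 1), (1, 2)],
    [(1, 2), (2, 3)],
    [(2, 3), (3, 0)],
    [(3, 0), (0, 1)],
]
```

With all four ring routes the graph contains a cycle, so the routing could deadlock on a single virtual channel; dropping any one route breaks the cycle. Moving some paths to a second virtual channel is the usual remedy, which is why the number of available virtual channels matters.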

    Control de Congestión Eficiente para Redes HPC con Encaminamiento Adaptativo (Efficient Congestion Control for HPC Networks with Adaptive Routing)

    The interconnection network is the key element in high-performance computing (HPC) clusters and data centers (DC), where thousands of nodes must communicate quickly and reliably. Network performance depends on several design choices, such as the topology, the routing algorithm, the switch architecture, etc. Highly efficient routing algorithms, either deterministic or adaptive, have been proposed in the literature to balance traffic flows intelligently depending on the network topology, but their performance drops in scenarios where congestion and its negative effects (for example, HoL blocking) appear. In particular, in scenarios where congestion is intense and persistent, HoL blocking can drastically degrade the performance of adaptive routing algorithms, since they can spread congested traffic flows across all available routes. Moreover, as we have shown in previous studies, the spreading of congested flows can deteriorate the performance of the static queuing schemes used to reduce HoL blocking by separating flows into different queues of the switch buffer. Indeed, since these schemes rely on a static criterion, defined before traffic is injected into the network, they cannot prevent congested and non-congested flows from sharing queues when combined with adaptive routing. In this work, we propose using some existing static queuing schemes together with dynamic virtual channel (VC) assignment to isolate in a single VC the flows whose routes have been chosen adaptively, in order to prevent the impact of congestion from spreading across several routes. 
Basically, adapted flows are moved to a special adapted-flow channel (AFC), so that they do not interact with the flows assigned to other VCs by the static queuing scheme. In this way, the HoL blocking that adapted flows could cause to non-adapted flows is avoided, even if congested flows have spread across several routes. On the other hand, the static queuing scheme will reduce, without any interference, the HoL blocking that may appear among non-adapted flows. To evaluate our proposal, we have run simulation experiments modelling large interconnection networks based on the fat-tree topology. From the results obtained, we can conclude that our technique efficiently and significantly reduces the impact of HoL blocking in interconnection networks that use adaptive routing and queuing schemes when congestion appears.
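
The virtual channel assignment described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the adapted-flow channel (AFC) is the highest-numbered VC and that the static queuing scheme maps a packet to a VC by destination modulo the number of remaining VCs. The abstract does not name the static scheme, so that mapping and the function name `select_vc` are hypothetical.

```python
def select_vc(dest, adapted, num_vcs):
    """Pick a virtual channel for a packet.

    Assumptions (not from the source): the last VC is reserved as the
    adapted-flow channel (AFC), and the static queuing scheme is a
    simple destination-modulo mapping over the other VCs.
    """
    if num_vcs < 2:
        raise ValueError("need at least one static VC plus the AFC")
    afc = num_vcs - 1
    if adapted:
        # The packet left its deterministic route: isolate it in the AFC.
        return afc
    # Non-adapted packets keep the static mapping and never use the AFC.
    return dest % (num_vcs - 1)
```

With four VCs, a non-adapted packet for destination 10 lands in VC 1, while the same packet is placed in VC 3 once it has been adaptively rerouted, so it cannot block non-adapted flows in VCs 0 to 2.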

    Study of the data acquisition network for the triggerless data acquisition of the LHCb experiment and new particle track reconstruction strategies for the LHCb upgrade

    The LHCb experiment will receive a major upgrade by the end of February 2021. This upgrade will allow the recording of proton-proton collision data at $\sqrt{s} = 14\ \text{TeV}$ with an instantaneous luminosity of $2 \cdot 10^{33}\ \text{cm}^{-2}\text{s}^{-1}$, making possible measurements of unprecedented precision in the $b$- and $c$-quark flavour sectors. To take advantage of the increased luminosity, the data acquisition system will receive a substantial upgrade. The upgraded system will be capable of processing the full collision rate of $30\ \text{MHz}$, without any low-level hardware preselection. This new design constraint poses a non-trivial technological challenge, both from a networking and a computing point of view. A possible design of a $32\ \text{Tb/s}$ data acquisition network is presented, and low-level network simulations are used to validate the design. Those simulations use an accurate behavioural model developed and optimised for this specific purpose. To perform the online reconstruction of the full $30\ \text{MHz}$ $pp$ collision rate, it is mandatory to optimise the reconstruction algorithms using a computing and physics approach. A new parametrisation of the charged particles' bending generated by the dipole of the LHCb experiment is presented. The accuracy of the model is tested against Monte Carlo data. This strategy can reduce by a factor of four the size of the search windows needed in the SciFi sub-detector. The LookingForward algorithm in the Allen framework uses this model.

    A HoL-blocking aware mechanism for selecting the upward path in fat-tree topologies

    The final publication is available at Springer via http://link.springer.com/article/10.1007%2Fs11227-014-1303-x
    Large cluster-based machines require efficient high-performance interconnection networks. Routing is a key design issue of interconnection networks. Adaptive routing usually outperforms deterministic routing at the expense of introducing out-of-order packet delivery. Many of the commodity interconnects for clusters are based on fat-trees. The adaptive routing algorithm commonly used in fat-trees is composed of a fully adaptive upward subpath, followed by a deterministic downward subpath. As the latter is determined by the former, choosing the most adequate upward path for each packet is critical in fat-trees to achieve a good performance. In this paper, we present a mechanism for selecting the upward path in fat-trees, which enables optimum use of the available network resources to achieve a high network throughput. The proposed path selection is destination based, which allows reducing the head-of-line blocking effect. Indeed, the proposed mechanism can be used either as a selection function (the provided path is used as the preferred one), or as a deterministic routing algorithm (the path is the only possible one). The results show that the resulting selection function outperforms any other known one. Moreover, the proposed deterministic routing algorithm can achieve a similar, or even higher, level of performance than adaptive routing, while providing in-order packet delivery and a simpler switch implementation.
    This work was supported by the Spanish Ministerio de Ciencia e Innovacion (MICINN) and jointly financed with Plan E funds, under Grant TIN2009-14475-C04 as well as by Consolider-Ingenio 2010 under Grant CSD2006-00046.
    Gómez Requena, C.; Gilabert Villamón, F.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2015). A HoL-blocking aware mechanism for selecting the upward path in fat-tree topologies. Journal of Supercomputing. 71(7):2339-2364. https://doi.org/10.1007/s11227-014-1303-x
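
To make the idea of a destination-based upward path concrete, here is a minimal sketch of one common scheme for a k-ary fat-tree, in which the upward port at each level is taken from one base-k digit of the destination identifier. This is an assumption chosen for illustration and is not confirmed by the abstract to be the paper's exact mechanism; the function names are hypothetical.

```python
def up_port(dest, level, k):
    """Upward port chosen at a switch of the given level (0 = leaf level).

    Assumed scheme: use the base-k digit of the destination identifier
    that corresponds to the switch level.
    """
    return (dest // k**level) % k


def upward_ports(dest, levels, k):
    """Upward ports taken at levels 0 .. levels - 1 for one destination."""
    return [up_port(dest, lvl, k) for lvl in range(levels)]
```

For k = 4 and destination 27 (digits 1, 2, 3 in base 4, most significant first), `upward_ports(27, 3, 4)` gives `[3, 2, 1]`. Because the choice depends only on the destination, all packets for one destination leave a given switch through the same upward port, which keeps delivery in order when the scheme is used as a deterministic routing algorithm, while packets for different destinations are spread over different ports.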