533 research outputs found

    Assessing the Suitability of King Topologies for Interconnection Networks

    Get PDF
    In the late years many different interconnection networks have been used with two main tendencies. One is characterized by the use of high-degree routers with long wires while the other uses routers of much smaller degree. The latter rely on two-dimensional mesh and torus topologies with shorter local links. This paper focuses on doubling the degree of common 2D meshes and tori while still preserving an attractive layout for VLSI design. By adding a set of diagonal links in one direction, diagonal networks are obtained. By adding a second set of links, networks of degree eight are built, named king networks. This research presents a comprehensive study of these networks which includes a topological analysis, the proposal of appropriate routing procedures and an empirical evaluation. King networks exhibit a number of attractive characteristics which translate to reduced execution times of parallel applications. For example, the execution times NPB suite are reduced up to a 30 percent. In addition, this work reveals other properties of king networks such as perfect partitioning that deserves further attention for its convenient exploitation in forthcoming high-performance parallel systems

    ATM layouts with bounded hop count and congestion

    Full text link

    Slim Fly: A Cost Effective Low-Diameter Network Topology

    Full text link
    Abstract—We introduce a high-performance cost-effective net-work topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed cost and power model. Slim Fly enables constructing cost effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations. I

    Efficient bypass mechanisms for low latency networks on-chip

    Get PDF
    RESUMEN: La importancia de las redes en-chip en los procesadores multi-núcleo es cada vez mayor. Los routers con baipás son una solución eficiente para reducir la latencia de estas redes. Existen dos tipos de redes con baipás: single-hop y multi-hop. Las redes con baipás single-hop minimizan la latencia individual de cada router al asignar los recursos del router con antelación a la recepción de los paquetes. Las redes con baipás multi-hop, conocidas como SMART, permiten que los paquetes atraviesen múltiples routers en un único ciclo. La primera propuesta de esta tesis es Non-Empty Buffer Bypass (NEBB), un mecanismo que incrementa la utilización del baipás de tipo single-hop, eliminando la necesidad de usar canales virtuales. Para redes con baipás multi-hop propone SMART++ y S-SMART++. SMART++ elimina la necesidad de SMART de usar una gran cantidad de canales virtuales para aprovechar el ancho de banda de la red, permitiendo el diseño de configuraciones de bajo coste. S-SMART++ hace uso de la asignación de recursos de forma especulativa para preparar el baipás de tipo multi-hop. Este mecanismo reduce la latencia y su dependencia con la longitud máxima de los saltos de tipo multi-hop, aspecto clave para su viabilidad en diseños reales. La contribución final es un conjunto de herramientas de código abierto llamada Bypass Simulation Toolset (BST) compuesto por versiones extendidas de BookSim y OpenSMART, una API para integrar BookSim en otros simuladores y una serie de scripts para facilitar el diseño y evaluación de este tipo de redes.ABSTRACT: Networks on-Chip (NoCs) are becoming more important in many-core processors as the number of cores grows. Bypass routers are an efficient solution that skips pipeline stages. There are two types of bypass mechanisms: single-hop and multi-hop bypass. Single-hop bypass minimizes the router delay by skipping allocation stages in each hop. Multi-hop bypass, called SMART, minimizes the effective number of hops by traversing multiple routers in a single cycle. The first proposal of this dissertation is Non-Empty Buffer Bypass (NEBB) for single-hop bypass, which increases the bypass utilization without requiring VCs to match traditional bypass routers. It proposes SMART++ and S-SMART++ for multi-hop bypass. SMART++ removes the requirement of using multiple VCs of SMART to exploit the bandwidth of the network, enabling low-cost configurations. S-SMART++ relies on speculative allocation to set up multi-hop bypass paths. Thus, it reduces latency and its dependency with the maximum length of multi-hops, relaxing the requirements to integrate multi-hop bypass in real designs. The final contribution is an open-source set of tools to simulate bypass NoCs called Bypass Simulation Toolset (BST) conformed by extended versions of BookSim and OpenSMART, an API to integrate BookSim in other simulators, and scripts to simplify the designing and evaluation of such NoCs.This work was supported by the Spanish Ministry of Science, Innovation and Universities, FPI grant BES-2017-079971, and contracts TIN2010-21291-C02-02, TIN2013- 46957-C2-2-P, TIN2015-65316-P, TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIC PID2019-105660RB-C22; the European HiPEAC Network of Excellence; the European Community's Seventh Framework Programme (FP7/2007-2013), under the Mont-Blanc 1 and 2 projects (grant agreements n 288777 and 610402); the European Union's Horizon 2020 research and innovation programme under the Mont-Blanc 3 project (grant agreement nº 671697). Bluespec Inc. provided access to Bluespec tools

    Automatic synthesis and optimization of chip multiprocessors

    Get PDF
    The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed

    Automating Topology Aware Mapping for Supercomputers

    Get PDF
    Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap and communication on the network is becoming the bottleneck for scaling of parallel applications. Network contention, specifically, is becoming an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention. Most parallel applications have a certain communication topology. Mapping of tasks in a parallel application based on their communication graph, to the physical processors on a machine can potentially lead to performance improvements. Mapping of the communication graph for an application on to the interconnect topology of a machine while trying to localize communication is the research problem under consideration. The farther different messages travel on the network, greater is the chance of resource sharing between messages. This can create contention on the network for networks commonly used today. Evaluative studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Realizing this fact, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placement of communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention. Performance improvements through topology aware placement for applications such as NAMD and OpenAtom are used to motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for automatic mapping of parallel applications to relieve the application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms. The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with capabilities to choose the best mapping for a problem with a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques which will be imperative for machines of the future.published or submitted for publicationnot peer reviewe

    NoCo: ILP-based worst-case contention estimation for mesh real-time manycores

    Get PDF
    Manycores are capable of providing the computational demands required by functionally-advanced critical applications in domains such as automotive and avionics. In manycores a network-on-chip (NoC) provides access to shared caches and memories and hence concentrates most of the contention that tasks suffer, with effects on the worst-case contention delay (WCD) of packets and tasks' WCET. While several proposals minimize the impact of individual NoC parameters on WCD, e.g. mapping and routing, there are strong dependences among these NoC parameters. Hence, finding the optimal NoC configurations requires optimizing all parameters simultaneously, which represents a multidimensional optimization problem. In this paper we propose NoCo, a novel approach that combines ILP and stochastic optimization to find NoC configurations in terms of packet routing, application mapping, and arbitration weight allocation. Our results show that NoCo improves other techniques that optimize a subset of NoC parameters.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015- 65316-P and the HiPEAC Network of Excellence. It also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (agreement No. 772773). Carles Hernández is jointly supported by the MINECO and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the Spanish Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Enrico Mezzetti has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva-Incorporaci´on postdoctoral fellowship number IJCI-2016-27396.Peer ReviewedPostprint (author's final draft

    Constructing virtual 5-dimensional tori out of lower-dimensional network cards

    Full text link
    [EN] In the Top500 and Graph500 lists of the last years, some of the most powerful systems implement a torus topology to interconnect themillions of computing nodes they include. Some of these torus networks are of five or six dimensions, which implies an additional difficulty as the node degree increases. In previous works, we proposed and evaluated the nD Twin (nDT) torus topology to virtually increase the dimensions a torus is able to implement. We showed that this new topology reduces the distances between nodes, increasing, therefore, global network performance. In this work, we present how to build a 5DT torus network using a specific commercial 6-port network card (EXTOLL card) to interconnect those nodes. We show, using the same number of cards, that the performance of the 5DT torus network we are able to implement using our proposal is higher than the performance of the 3D torus network for the same number of compute nodes.Spanish MINECO; European Commission, Grant/Award Number: TIN2015-66972-C5-1-R and TIN2015-66972-C5-2-R; JCCM, Grant/Award Number: PEII-2014-028-P; Spanish MICINN, Grant/Award Number: FJCI-2015-26080Andújar-Muñoz, FJ.; Villar, JA.; Sanchez Garcia, JL.; Alfaro Cortes, FJ.; Duato Marín, JF.; Fröning, H. (2017). Constructing virtual 5-dimensional tori out of lower-dimensional network cards. Concurrency and Computation Practice and Experience. 1-17. https://doi.org/10.1002/cpe.4361S11

    How Graph Theory can help Communications Engineering

    Get PDF
    International audienceWe give an overview of different aspects of graph theory which can be applied in communication engineering, not trying to present immediate results to be applied neither a complete survey of results, but to give a flavor of how graph theory can help this field. We deal in this paper with network topologies, resource competition, state transition diagrams and specific models for optical networks
    • …
    corecore