259 research outputs found

    High-speed, economical design implementation of transit network router

    Get PDF
    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995.Includes bibliographical references (p. 88-90).by Kazuhiro Hara.M.S

    Predictive and distributed routing balancing (PR-DRB) : high speed interconnection networks

    Get PDF
    Current parallel applications running on clusters require the use of an interconnection network to perform communications among all computing nodes available. Imbalance of communications can produce network congestion, reducing throughput and increasing latency, degrading the overall system performance. On the other hand, parallel applications running on these networks posses representative stages which allow their characterization, as well as repetitive behavior that can be identified on the basis of this characterization. This work presents the Predictive and Distributed Routing Balancing (PR-DRB), a new method developed to gradually control network congestion, based on paths expansion, traffic distribution and effective traffic load, in order to maintain low latency values. PR-DRB monitors messages latencies on intermediate routers, makes decisions about alternative paths and record communication pattern information encountered during congestion situation. Based on the concept of applications repetitiveness, best solution recorded are reapplied when saved communication pattern re-appears. Traffic congestion experiments were conducted in order to evaluate the performance of the method, and improvements were observed.Les aplicacions paral·leles actuals en els Clústers requereixen l'ús d'una xarxa d'interconnexió per comunicar a tots els nodes de còmput disponibles. El desequilibri en la càrrega de comunicacions pot congestionar la xarxa, incrementant la latència i disminuint el throughput, degradant el rendiment total del sistema. D'altra banda, les aplicacions paral·leles que s'executen sobre aquestes xarxes contenen etapes representatives durant la seva execució les quals permeten caracteritzar-les, a més d'extraure un comportament repetitiu que pot ser identificat en base a aquesta caracterització. Aquest treball presenta el Balanceig Predictiu de Encaminament Distribuït (PR-DRB), un nou mètode desenvolupat per controlar la congestió a la xarxa en forma gradual, basat en l'expansió de camins, la distribució de trànsit i càrrega efectiva actual per tal de mantenir una latència baixa. PR-DRB monitoritza la latència dels missatges en els encaminadors, pren decisions sobre els camins alternatius a utilitzar i registra la informació de la congestió sobre la base del patró de comunicacions detectat, utilitzant com a concepte base la repetitivitat de les aplicacions per després tornar a aplicar la millor solució quan aquest patró es repeteixi. Experiments de trànsit amb congestió van ser portats a terme per avaluar el rendiment del mètode, els quals van mostrar la bondat del mateix.Las aplicaciones paralelas actuales en los Clústeres requieren el uso de una red de interconexión para comunicar a todos los nodos de cómputo disponibles. El desbalance en la carga de comunicaciones puede congestionar la red, incrementando la latencia y disminuyendo el throughput, degradando el rendimiento total del sistema. Por otro lado, las aplicaciones paralelas que corren sobre estas redes contienen etapas representativas durante su ejecución las cuales permiten caracterizarlas, además de un comportamiento repetitivo que puede ser identificado en base a dicha caracterización. Este trabajo presenta el Balanceo Predictivo de Encaminamiento Distribuido (PR-DRB), un nuevo método desarrollado para controlar la congestión en la red en forma gradual; basado en la expansión de caminos, la distribución de tráfico y carga efectiva actual, a fin de mantener una latencia baja. PR-DRB monitorea la latencia de los mensajes en los encaminadores, toma decisiones sobre los caminos alternativos a utilizar y registra la información de la congestión en base al patrón de comunicaciones detectado, usando como concepto base la repetitividad de las aplicaciones para luego volver a aplicar la mejor solución cuando dicho patrón se repita. Experimentos de tráfico con congestión fueron llevados a cabo para evaluar el rendimiento del método, los cuales mostraron la bondad del mismo

    A composable approach to design of newer techniques for large-scale denial-of-service attack attribution

    Get PDF
    Since its early days, the Internet has witnessed not only a phenomenal growth, but also a large number of security attacks, and in recent years, denial-of-service (DoS) attacks have emerged as one of the top threats. The stateless and destination-oriented Internet routing combined with the ability to harness a large number of compromised machines and the relative ease and low costs of launching such attacks has made this a hard problem to address. Additionally, the myriad requirements of scalability, incremental deployment, adequate user privacy protections, and appropriate economic incentives has further complicated the design of DDoS defense mechanisms. While the many research proposals to date have focussed differently on prevention, mitigation, or traceback of DDoS attacks, the lack of a comprehensive approach satisfying the different design criteria for successful attack attribution is indeed disturbing. Our first contribution here has been the design of a composable data model that has helped us represent the various dimensions of the attack attribution problem, particularly the performance attributes of accuracy, effectiveness, speed and overhead, as orthogonal and mutually independent design considerations. We have then designed custom optimizations along each of these dimensions, and have further integrated them into a single composite model, to provide strong performance guarantees. Thus, the proposed model has given us a single framework that can not only address the individual shortcomings of the various known attack attribution techniques, but also provide a more wholesome counter-measure against DDoS attacks. Our second contribution here has been a concrete implementation based on the proposed composable data model, having adopted a graph-theoretic approach to identify and subsequently stitch together individual edge fragments in the Internet graph to reveal the true routing path of any network data packet. The proposed approach has been analyzed through theoretical and experimental evaluation across multiple metrics, including scalability, incremental deployment, speed and efficiency of the distributed algorithm, and finally the total overhead associated with its deployment. We have thereby shown that it is realistically feasible to provide strong performance and scalability guarantees for Internet-wide attack attribution. Our third contribution here has further advanced the state of the art by directly identifying individual path fragments in the Internet graph, having adopted a distributed divide-and-conquer approach employing simple recurrence relations as individual building blocks. A detailed analysis of the proposed approach on real-life Internet topologies with respect to network storage and traffic overhead, has provided a more realistic characterization. Thus, not only does the proposed approach lend well for simplified operations at scale but can also provide robust network-wide performance and security guarantees for Internet-wide attack attribution. Our final contribution here has introduced the notion of anonymity in the overall attack attribution process to significantly broaden its scope. The highly invasive nature of wide-spread data gathering for network traceback continues to violate one of the key principles of Internet use today - the ability to stay anonymous and operate freely without retribution. In this regard, we have successfully reconciled these mutually divergent requirements to make it not only economically feasible and politically viable but also socially acceptable. This work opens up several directions for future research - analysis of existing attack attribution techniques to identify further scope for improvements, incorporation of newer attributes into the design framework of the composable data model abstraction, and finally design of newer attack attribution techniques that comprehensively integrate the various attack prevention, mitigation and traceback techniques in an efficient manner

    The MANGO clockless network-on-chip: Concepts and implementation

    Get PDF

    Characterization and optimization of network traffic in cortical simulation

    Get PDF
    Considering the great variety of obstacles the Exascale systems have to face in the next future, a deeper attention will be given in this thesis to the interconnect and the power consumption. The data movement challenge involves the whole hierarchical organization of components in HPC systems — i.e. registers, cache, memory, disks. Running scientific applications needs to provide the most effective methods of data transport among the levels of hierarchy. On current petaflop systems, memory access at all the levels is the limiting factor in almost all applications. This drives the requirement for an interconnect achieving adequate rates of data transfer, or throughput, and reducing time delays, or latency, between the levels. Power consumption is identified as the largest hardware research challenge. The annual power cost to operate the system would be above 2.5 B$ per year for an Exascale system using current technology. The research for alternative power-efficient computing device is mandatory for the procurement of the future HPC systems. In this thesis, a preliminary approach will be offered to the critical process of co-design. Co-desing is defined as the simultaneos design of both hardware and software, to implement a desired function. This process both integrates all components of the Exascale initiative and illuminates the trade-offs that must be made within this complex undertaking

    Energy-Efficient Interconnection Networks for High-Performance Computing

    Get PDF
    In recent years, energy has become one of the most important factors for de- signing and operating large scale computing systems. This is particularly true in high-performance computing, where systems often consist of thousands of nodes. Especially after the end of Dennard’s scaling, the demand for energy- proportionality in components, where energy is depending linearly on utilization, increases continuously. As the main contributor to the overall power consumption, processors have received the main attention so far. The increasing energy proportionality of processors, however, shifts the focus to other components such as interconnection networks. Their share of the overall power consumption is expected to increase to 20% or more while other components further increase their efficiency in the near future. Hence, it is crucial to improve energy proportionality in interconnection networks likewise to reduce overall power and energy consumption. To facilitate these attempts, this work provides comprehensive studies about energy saving in interconnection networks at different levels. First, interconnection networks differ fundamentally from other components in their underlying technology. To gain a deeper understanding of these differences and to identify targets for energy savings, this work provides a detailed power analysis of current network hardware. Furthermore, various applications at different scales are analyzed regarding their communication patterns and locality properties. The findings show that communication makes up only a small fraction of the execution time and networks are actually idling most of the time. Another observation is that point-to-point communication often only occurs within various small subsets of all participants, which indicates that a coordinated mapping could further decrease network traffic. Based on these studies, three different energy-saving policies are designed, which all differ in their implementation and focus. Then, these policies are evaluated in an event-based, power-aware network simulator. While two policies that operate completely local at link level, enable significant energy savings of more than 90% in most analyses, the hybrid one does not provide further benefits despite significant additional design effort. Additionally, these studies include network design parameters, such as transition time between different link configurations, as well as the three most common topologies in supercomputing systems. The final part of this work addresses the interactions of congestion management and energy-saving policies. Although both network management strategies aim for different goals and use opposite approaches, they complement each other and can increase energy efficiency in all studies as well as improve the performance overhead as opposed to plain energy saving

    Improving the Scalability of High Performance Computer Systems

    Full text link
    Improving the performance of future computing systems will be based upon the ability of increasing the scalability of current technology. New paths need to be explored, as operating principles that were applied up to now are becoming irrelevant for upcoming computer architectures. It appears that scaling the number of cores, processors and nodes within an system represents the only feasible alternative to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves the communication for inter node communication within compute clusters. By improving the latency by an order of magnitude over existing solutions the cost of communication is considerably reduced. This enables to exploit fine grain parallelism within applications, thereby, extending the scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel layer software utilizing off-the-shelf AMD processors. We present a proof-of-concept implementation and real world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes. The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed, that accelerates the design and implementation substantially. Due to its flexibility and modularity a large application space is covered ranging from Systems-on-chip, to high performance many-core processors. The Network-on-Chip compiler enables to generate complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework enables to build large designs while minimizing development and verification efforts. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by the introduction of a protocol agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability. The third part of the dissertation addresses the Processor-Memory Interface within computer architectures. The increasing compute power of many-core processors, leads to an equally growing demand for more memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue we propose a memory extension technique that attaches large amounts of DRAM memory to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Therefore, applications can utilize the memory extension by applying regular shared memory programming techniques. By supporting daisy chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design

    Small-world interconnection networks for large parallel computer systems

    Get PDF
    The use of small-world graphs as interconnection networks of multicomputers is proposed and analysed in this work. Small-world interconnection networks are constructed by adding (or modifying) edges to an underlying local graph. Graphs with a rich local structure but with a large diameter are shown to be the most suitable candidates for the underlying graph. Generation models based on random and deterministic wiring processes are proposed and analysed. For the random case basic properties such as degree, diameter, average length and bisection width are analysed, and the results show that a fast transition from a large diameter to a small diameter is experienced when the number of new edges introduced is increased. Random traffic analysis on these networks is undertaken, and it is shown that although the average latency experiences a similar reduction, networks with a small number of shortcuts have a tendency to saturate as most of the traffic flows through a small number of links. An analysis of the congestion of the networks corroborates this result and provides away of estimating the minimum number of shortcuts required to avoid saturation. To overcome these problems deterministic wiring is proposed and analysed. A Linear Feedback Shift Register is used to introduce shortcuts in the LFSR graphs. A simple routing algorithm has been constructed for the LFSR and extended with a greedy local optimisation technique. It has been shown that a small search depth gives good results and is less costly to implement than a full shortest path algorithm. The Hilbert graph on the other hand provides some additional characteristics, such as support for incremental expansion, efficient layout in two dimensional space (using two layers), and a small fixed degree of four. Small-world hypergraphs have also been studied. In particular incomplete hypermeshes have been introduced and analysed and it has been shown that they outperform the complete traditional implementations under a constant pinout argument. Since it has been shown that complete hypermeshes outperform the mesh, the torus, low dimensional m-ary d-cubes (with and without bypass channels), and multi-stage interconnection networks (when realistic decision times are accounted for and with a constant pinout), it follows that incomplete hypermeshes outperform them as well

    Architectural Support for Efficient Communication in Future Microprocessors

    Get PDF
    Traditionally, the microprocessor design has focused on the computational aspects of the problem at hand. However, as the number of components on a single chip continues to increase, the design of communication architecture has become a crucial and dominating factor in defining performance models of the overall system. On-chip networks, also known as Networks-on-Chip (NoC), emerged recently as a promising architecture to coordinate chip-wide communication. Although there are numerous interconnection network studies in an inter-chip environment, an intra-chip network design poses a number of substantial challenges to this well-established interconnection network field. This research investigates designs and applications of on-chip interconnection network in next-generation microprocessors for optimizing performance, power consumption, and area cost. First, we present domain-specific NoC designs targeted to large-scale and wire-delay dominated L2 cache systems. The domain-specifically designed interconnect shows 38% performance improvement and uses only 12% of the mesh-based interconnect. Then, we present a methodology of communication characterization in parallel programs and application of characterization results to long-channel reconfiguration. Reconfigured long channels suited to communication patterns enhance the latency of the mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally, we discuss an adaptive data compression technique that builds a network-wide frequent value pattern map and reduces the packet size. In two examined multi-core systems, cache traffic has 69% compressibility and shows high value sharing among flows. Compression-enabled NoC improves the latency by up to 63% and saves energy consumption by up to 12%
    corecore