38 research outputs found

    Quarc: an architecture for efficient on-chip communication

    Get PDF
    The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, inhibits the multicast messages to be delivered as efficiently by networks on chip. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation, while conforming to resource constraints of the network on chip paradigm. Multicast communication in majority of these networks on chip is implemented by establishing a connection between source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and, therefore, is not desirable; in particular in latency sensitive services such as cache coherence. To address high performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low power, high performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface. The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that the Quarc architecture is a highly efficient architecture. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication

    A performance model of communication in the quarc NoC

    Get PDF
    Networks on-chip (NoC) emerged as a promising communication medium for future MPSoC development. To serve this purpose, the NoCs have to be able to efficiently exchange all types of traffic including the collective communications at a reasonable cost. The Quarc NoC is introduced as a NOC which is highly efficient in performing collective communication operations such as broadcast and multicast. This paper presents an introduction to the Quarc scheme and an analytical model to compute the average message latency in the architecture. To validate the model we compare the model latency prediction against the results obtained from discrete-event simulations

    Quarc: a novel network-on-chip architecture

    Get PDF
    This paper introduces the Quarc NoC, a novel NoC architecture inspired by the Spidergon NoC. The Quarc scheme significantly outperforms the Spidergon NoC through balancing the traffic which is the result of the modifications applied to the topology and the routing elements.The proposed architecture is highly efficient in performing collective communication operations including broadcast and multicast. We present the topology, routing discipline and switch architecture for the Quarc NoC and demonstrate the performance with the results obtained from discrete event simulations

    Adaptive code division multiple access protocol for wireless network-on-chip architectures

    Get PDF
    Massive levels of integration following Moore\u27s Law ushered in a paradigm shift in the way on-chip interconnections were designed. With higher and higher number of cores on the same die traditional bus based interconnections are no longer a scalable communication infrastructure. On-chip networks were proposed enabled a scalable plug-and-play mechanism for interconnecting hundreds of cores on the same chip. Wired interconnects between the cores in a traditional Network-on-Chip (NoC) system, becomes a bottleneck with increase in the number of cores thereby increasing the latency and energy to transmit signals over them. Hence, there has been many alternative emerging interconnect technologies proposed, namely, 3D, photonic and multi-band RF interconnects. Although they provide better connectivity, higher speed and higher bandwidth compared to wired interconnects; they also face challenges with heat dissipation and manufacturing difficulties. On-chip wireless interconnects is one other alternative proposed which doesn\u27t need physical interconnection layout as data travels over the wireless medium. They are integrated into a hybrid NOC architecture consisting of both wired and wireless links, which provides higher bandwidth, lower latency, lesser area overhead and reduced energy dissipation in communication. However, as the bandwidth of the wireless channels is limited, an efficient media access control (MAC) scheme is required to enhance the utilization of the available bandwidth. This thesis proposes using a multiple access mechanism such as Code Division Multiple Access (CDMA) to enable multiple transmitter-receiver pairs to send data over the wireless channel simultaneously. It will be shown that such a hybrid wireless NoC with an efficient CDMA based MAC protocol can significantly increase the performance of the system while lowering the energy dissipation in data transfer. In this work it is shown that the wireless NoC with the proposed CDMA based MAC protocol outperformed the wired counterparts and several other wireless architectures proposed in literature in terms of bandwidth and packet energy dissipation. Significant gains were observed in packet energy dissipation and bandwidth even with scaling the system to higher number of cores. Non-uniform traffic simulations showed that the proposed CDMA-WiNoC was consistent in bandwidth across all traffic patterns. It is also shown that the CDMA based MAC scheme does not introduce additional reliability concerns in data transfer over the on-chip wireless interconnects

    The MANGO clockless network-on-chip: Concepts and implementation

    Get PDF

    Floorplan-Aware High Performance NoC Design

    Full text link
    Las actuales arquitecturas de m�ltiples n�cleos como los chip multiprocesadores (CMP) y soluciones multiprocesador para sistemas dentro del chip (MPSoCs) han adoptado a las redes dentro del chip (NoC) como elemento -ptimo para la inter-conexi-n de los diversos elementos de dichos sistemas. En este sentido, fabricantes de CMPs y MPSoCs han adoptado NoCs sencillas, generalmente con una topolog'a en malla o anillo, ya que son suficientes para satisfacer las necesidades de los sistemas actuales. Sin embargo a medida que los requerimientos del sistema -- baja latencia y alto rendimiento -- se hacen m�s exigentes, estas redes tan simples dejan de ser una soluci-n real. As', la comunidad investigadora ha propuesto y analizado NoCs m�s complejas. No obstante, estas soluciones son m�s dif'ciles de implementar -- especialmente los enlaces largos -- haciendo que este tipo de topolog'as complejas sean demasiado costosas o incluso inviables. En esta tesis, presentamos una metodolog'a de dise-o que minimiza la p�rdida de prestaciones de la red debido a su implementaci-n real. Los principales problemas que se encuentran al implementar una NoC son los conmutadores y los enlaces largos. En esta tesis, el conmutador se ha hecho modular, es decir, formado como uni-n de m-dulos m�s peque-os. En nuestro caso, los m-dulos son id�nticos, donde cada m-dulo es capaz de arbitrar, conmutar, y almacenar los mensajes que le llegan. Posteriormente, flexibilizamos la colocaci-n de estos m-dulos en el chip, permitiendo que m-dulos de un mismo conmutador est�n distribuidos por el chip. Esta metodolog'a de dise-o la hemos aplicado a diferentes escenarios. Primeramente, hemos introducido nuestro conmutador modular en NoCs con topolog'as conocidas como la malla 2D. Los resultados muestran como la modularidad y la distribuci-n del conmutador reducen la latencia y el consumo de potencia de la red. En segundo lugar, hemos utilizado nuestra metodolog'a de dise-o para implementar un crossbar distribuidRoca Pérez, A. (2012). Floorplan-Aware High Performance NoC Design [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/17844Palanci

    Interconnects architectures for many-core era using surface-wave communication

    Get PDF
    PhD ThesisNetworks-on-chip (NoCs) is a communication paradigm that has emerged aiming to address on-chip communication challenges and to satisfy interconnection demands for chip-multiprocessors (CMPs). Nonetheless, there is continuous demand for even higher computational power, which is leading to a relentless downscaling of CMOS technology to enable the integration of many-cores. However, technology downscaling is in favour of the gate nodes over wires in terms of latency and power consumption. Consequently, this has led to the era of many-core processors where power consumption and performance are governed by inter-core communications rather than core computation. Therefore, NoCs need to evolve from being merely metalbased implementations which threaten to be a performance and power bottleneck for many-core efficiency and scalability. To overcome such intensified inter-core communication challenges, this thesis proposes a novel interconnect technology: the surface-wave interconnect (SWI). This new RF-based on-chip interconnect has notable characteristics compared to cutting-edge on-chip interconnects in terms of CMOS compatibility, high speed signal propagation, low power dissipation, and massive signal fan-out. Nonetheless, the realization of the SWI requires investigations at different levels of abstraction, such as the device integration and RF engineering levels. The aim of this thesis is to address the networking and system level challenges and highlight the potential of this interconnect. This should encourage further research at other levels of abstraction. Two specific system-level challenges crucial in future many-core systems are tackled in this study, which are cross-the-chip global communication and one-to-many communication. This thesis makes four major contributions towards this aim. The first is reducing the NoC average-hop count, which would otherwise increase packet-latency exponentially, by proposing a novel hybrid interconnect architecture. This hybrid architecture can not only utilize both regular metal-wire and SWI, but also exploits merits of both bus and NoC architectures in terms of connectivity compared to other general-purpose on-chip interconnect architectures. The second contribution addresses global communication issues by developing a distance-based weighted-round-robin arbitration (DWA) algorithm. This technique prioritizes global communication to be send via SWI short-cuts, which offer more efficient power dissipation and faster across-the-chip signal propagation. Results obtained using a cycleaccurate simulator demonstrate the effectiveness of the proposed system architecture in terms of significant power reduction, considervii able average delay reduction and higher throughput compared to a regular NoC. The third contribution is in handling multicast communications, which are normally associated with traffic overload, hotspots and deadlocks and therefore increase, by an order of magnitude the power consumption and latency. This has been achieved by proposing a novel routing and centralized arbitration schemes that exploits the SWI0s remarkable fan-out features. The evaluation demonstrates drastic improvements in the effectiveness of the proposed architecture in terms of power consumption ( 2-10x) and performance ( 22x) but with negligible hardware overheads ( 2%). The fourth contribution is to further explore multicast contention handling in a flexible decentralized manner, where original techniques such as stretch-multicast and ID-tagging flow control have been developed. A comparison of these techniques shows that the decentralized approach is superior to the centralized approach with low traffic loads, while the latter outperforms the former near and after NoC saturation

    Using Proportional-Integral-Differential approach for Dynamic Traffic Prediction in Wireless Network-on-Chip

    Get PDF
    The massive integration of cores in multi-core system has enabled chip designer to design systems while meeting the power performance demands of the applications. Wireless interconnection has emerged as an energy efficient solution to the challenges of multi-hop communication over the wireline paths in conventional Networks-on-Chips (NoCs). However, to ensure the full benefits of this novel interconnect technology, design of simple, fair and efficient Medium Access Control (MAC) mechanism to grant access to the on-chip wireless communication channel is needed. Moreover, to adapt to the varying traffic demands from the applications running on a multicore environment, MAC mechanisms should dynamically adjust the transmission slots of the wireless interfaces (WIs). To ensure an efficient utilization of the wireless medium in a Wireless NoC (WiNoC), in this work we present the design of prediction model that is used by two dynamic MAC mechanism to predict the traffic demand of the WIs and respond accordingly by adjusting transmission slots of the WIs. Through system level simulations, we show that the traffic aware MAC mechanisms are more energy efficient as well as capable of sustaining higher data bandwidth in WiNoCs

    Multistage Packet-Switching Fabrics for Data Center Networks

    Get PDF
    Recent applications have imposed stringent requirements within the Data Center Network (DCN) switches in terms of scalability, throughput and latency. In this thesis, the architectural design of the packet-switches is tackled in different ways to enable the expansion in both the number of connected endpoints and traffic volume. A cost-effective Clos-network switch with partially buffered units is proposed and two packet scheduling algorithms are described. The first algorithm adopts many simple and distributed arbiters, while the second approach relies on a central arbiter to guarantee an ordered packet delivery. For an improved scalability, the Clos switch is build using a Network-on-Chip (NoC) fabric instead of the common crossbar units. The Clos-UDN architecture made with Input-Queued (IQ) Uni-Directional NoC modules (UDNs) simplifies the input line cards and obviates the need for the costly Virtual Output Queues (VOQs). It also avoids the need for complex, and synchronized scheduling processes, and offers speedup, load balancing, and good path diversity. Under skewed traffic, a reliable micro load-balancing contributes to boosting the overall network performance. Taking advantage of the NoC paradigm, a wrapped-around multistage switch with fully interconnected Central Modules (CMs) is proposed. The architecture operates with a congestion-aware routing algorithm that proactively distributes the traffic load across the switching modules, and enhances the switch performance under critical packet arrivals. The implementation of small on-chip buffers has been made perfectly feasible using the current technology. This motivated the implementation of a large switching architecture with an Output-Queued (OQ) NoC fabric. The design merges assets of the output queuing, and NoCs to provide high throughput, and smooth latency variations. An approximate analytical model of the switch performance is also proposed. To further exploit the potential of the NoC fabrics and their modularity features, a high capacity Clos switch with Multi-Directional NoC (MDN) modules is presented. The Clos-MDN switching architecture exhibits a more compact layout than the Clos-UDN switch. It scales better and faster in port count and traffic load. Results achieved in this thesis demonstrate the high performance, expandability and programmability features of the proposed packet-switches which makes them promising candidates for the next-generation data center networking infrastructure

    A time-predictable many-core processor design for critical real-time embedded systems

    Get PDF
    Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded system, e.g. energy harvesting solar panels in satellites, steering and breaking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response of its inputs; the latter ensures that its operations are performed within a predefined time budget. CRTES aim at increasing the number and complexity of functions. Examples include the incorporation of \smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and integrate multiple systems into a single platform. The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demand and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing a better performance per watt as cores are maintained simpler with respect to complex single-core processors. Moreover, the parallelization capabilities allow scheduling multiple functions into the same processor, maximizing the hardware utilization. However, the use of multi- and many-cores in CRTES also brings a number of challenges related to provide evidence about the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence about the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis contributes to advance the state-of-the-art towards the exploitation of parallel capabilities of many-cores in CRTES contributing in two different computing domains. From the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. From the software domain, we present efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates of CRTES.Los sistemas críticos empotrados de tiempo real (en ingles Critical Real-Time Embedded Systems, CRTES) se encargan de controlar partes fundamentales de los sistemas integrados, e.g. obtención de la energía de los paneles solares en satélites, la dirección y frenado en automóviles, o el control de vuelo en aviones. Para hacerlo, CRTES requieren fuerte evidencias del correcto comportamiento funcional y temporal. El primero garantiza que el sistema funciona correctamente en respuesta de sus entradas; el último asegura que sus operaciones se realizan dentro de unos limites temporales establecidos previamente. El objetivo de los CRTES es aumentar el número y la complejidad de las funciones. Algunos ejemplos incluyen los sistemas inteligentes de asistencia a la conducción en automóviles modernos o los sistemas avanzados de prevención de colisiones en vehiculos aereos no tripulados. Todas estas nuevas características, implementadas en software,conducen a un crecimiento exponencial tanto en los requerimientos de rendimiento como en la complejidad de desarrollo de software. Además, existe una gran necesidad de integrar múltiples funciones en una sóla plataforma para así reducir el número de unidades de procesamiento, cumplir con requisitos de peso y espacio, etc. En general, hay una clara necesidad de aumentar la potencia de cómputo de los actuales CRTES para soportar nueva funcionalidades sofisticadas y complejas e integrar múltiples sistemas en una sola plataforma. El uso de arquitecturas multi- y many-core se ve cada vez más en la industria CRTES como la solución para hacer frente a la demanda de mayor rendimiento y las limitaciones de costes de los futuros CRTES. Las arquitecturas many-core proporcionan un mayor rendimiento explotando el paralelismo de aplicaciones al tiempo que proporciona un mejor rendimiento por vatio ya que los cores se mantienen más simples con respecto a complejos procesadores de un solo core. Además, las capacidades de paralelización permiten programar múltiples funciones en el mismo procesador, maximizando la utilización del hardware. Sin embargo, el uso de multi- y many-core en CRTES también acarrea ciertos desafíos relacionados con la aportación de evidencias sobre el correcto funcionamiento del sistema, especialmente en el ámbito temporal. Por eso, a pesar de las ventajas de los procesadores many-core y del hecho de que éstos son una realidad en los sitemas integrados (por ejemplo Kalray MPPA, Freescale NXP P4080, TI Keystone II), su uso en CRTES aún precisa de la búsqueda de métodos eficientes para proveer evidencias fiables sobre el correcto funcionamiento del sistema. Esta tesis ahonda en el uso de procesadores many-core en CRTES como un medio para satisfacer los requisitos de rendimiento de aplicaciones complejas mientras proveen las garantías de tiempo necesarias. Para ello, esta tesis contribuye en el avance del estado del arte hacia la explotación de many-cores en CRTES en dos ámbitos de la computación. En el ámbito del hardware, esta tesis propone nuevos diseños many-core que posibilitan garantías de tiempo fiables y precisas. En el ámbito del software, la tesis presenta técnicas eficientes para la planificación de tareas y el análisis de tiempo para aprovechar las capacidades de paralelización en arquitecturas many-core, y también para derivar estimaciones de peor tiempo de ejecución (Worst-Case Execution Time, WCET) fiables y precisas
    corecore