26 research outputs found

    Beyond the socket: NUMA-aware GPUs

    Get PDF
    GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.We would like to thank anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925)Peer ReviewedPostprint (published version

    ECMO for COVID-19 patients in Europe and Israel

    Get PDF
    Since March 15th, 2020, 177 centres from Europe and Israel have joined the study, routinely reporting on the ECMO support they provide to COVID-19 patients. The mean annual number of cases treated with ECMO in the participating centres before the pandemic (2019) was 55. The number of COVID-19 patients has increased rapidly each week reaching 1531 treated patients as of September 14th. The greatest number of cases has been reported from France (n = 385), UK (n = 193), Germany (n = 176), Spain (n = 166), and Italy (n = 136) .The mean age of treated patients was 52.6 years (range 16–80), 79% were male. The ECMO configuration used was VV in 91% of cases, VA in 5% and other in 4%. The mean PaO2 before ECMO implantation was 65 mmHg. The mean duration of ECMO support thus far has been 18 days and the mean ICU length of stay of these patients was 33 days. As of the 14th September, overall 841 patients have been weaned from ECMO support, 601 died during ECMO support, 71 died after withdrawal of ECMO, 79 are still receiving ECMO support and for 10 patients status n.a. . Our preliminary data suggest that patients placed on ECMO with severe refractory respiratory or cardiac failure secondary to COVID-19 have a reasonable (55%) chance of survival. Further extensive data analysis is expected to provide invaluable information on the demographics, severity of illness, indications and different ECMO management strategies in these patients

    Routing table minimization for irregular mesh NoCs

    No full text
    The majority of current Network on Chip (NoC) architectures employ mesh topology and use simple static routing, to reduce power and area. However, regular mesh topology is unrealistic due to variations in module sizes and shapes, and is not suitable for application-specific NoCs. Consequently, simplistic routing techniques such as XY routing are inadequate, raising the need for low cost alternatives which can work in irregular mesh networks. In this paper we present a novel technique for reducing the total hardware cost of routing tables for both source and distributed routing approaches. The proposed technique is based on applying a fixed routing function combined with minimal deviation tables that are used only when the routing decisions for a given destination deviate from the predefined routing function. We apply this methodology to compare three hardware efficient routing methods for irregular mesh topology NoCs. For each method, we develop path selection algorithms that minimize the overall cost of routing tables. Finally, we demonstrate by simulations on random and specific real application network instances a significant cost saving compared to standard solutions, and examine the scaling of cost savings with growing NoC size. 1 1

    Cost Considerations in Network on Chip

    No full text
    Abstract — Systems on Chip (SoCs) require efficient inter-module interconnection providing for the required communications at a low cost. We analyze the generic cost in area and power of Networks on Chip (NoCs) and alternative interconnect architectures: a shared bus, a segmented bus and a point-to-point interconnect. For each architecture we derive analytical expressions for area, power dissipation and operating frequency as well as asymptotic limits of these functions. The analysis quantifies the intuitive NoC scalability advantages. Next we turn to NoC cost optimization. We explore cost tradeoffs between the number of buffers and the link speed. We use a reference architecture, termed QNoC (Quality-of-Service NoC), which is based on a grid of wormhole switches, shortest path routing and multiple QoS classes. Two traffic scenarios are considered, one dominated by short packets sensitive to queuing delays and the other dominated by large block-transfers. Our simulations show that network cost can be minimized while maintaining quality of service, by trading off buffers with links in the first scenario but not in the second. Index Terms — network on chip, scalable interconnect, wormhole buffering, cost minimization. 1

    Abstract QNoC: QoS architecture and design process for network on chip

    No full text
    We define Quality of Service (QoS) and cost model for communications in Systems on Chip (SoC), and derive related Network on Chip (NoC) architecture and design process. SoC inter-module communication traffic is classified into four classes of service: signaling (for inter-module control signals); real-time (representing delay-constrained bit streams); RD/WR (modeling short data access) and block-transfer (handling large data bursts). Communication traffic of the target SoC is analyzed (by means of analytic calculations and simulations), and QoS requirements (delay and throughput) for each service class are derived. A customized Quality-of-Service NoC (QNoC) architecture is derived by modifying a generic network architecture. The customization process minimizes the network cost (in area and power) while maintaining the required QoS. The generic network is based on a two-dimensional planar mesh and fixed shortest path (X –Y based) multi-class wormhole routing. Once communication requirements of the target SoC are identified, the network is customized as follows: The SoC modules are placed so as to minimize spatial traffic density, unnecessary mesh links and switching nodes are removed, and bandwidth is allocated to the remaining links and switches according to their relative load so that link utilization is balanced. The result is a low cost customized QNoC for the target SoC which guarantees that QoS requirements are met
    corecore