134 research outputs found

    Virtualizing Network Processors

    This paper considers the problem of virtualizing the resources of a network processor (NP) so that multiple third parties can execute their own virtual router software on a single physical router at the same time. Our broad interest is in designing such a router capable of supporting virtual networking. We discuss the issues and challenges involved in this virtualization, and then describe specific techniques for virtualizing both the control and data-plane processors on NPs. For Intel IXP NPs in particular, we present a dynamic, macro-based technique for virtualization that allows multiple virtual routers to run on multiple data-plane processors (or micro-engines) while maintaining memory isolation and enforcing memory bandwidth allocations.
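
    As a rough illustration of the data-plane virtualization described above, the sketch below checks each memory access of a virtual router against a private address window and a per-router memory bandwidth budget. This is a minimal sketch under assumed interfaces: the class, the region table, and the token-bucket policy are illustrative, not the paper's macro-based mechanism or any Intel IXP API.

```python
# Minimal sketch (assumed interfaces): per-virtual-router memory isolation
# and bandwidth accounting in the spirit of the techniques described above.
import time

class VirtualRouterGuard:
    def __init__(self):
        self.regions = {}   # router_id -> (base, size): its private memory window
        self.buckets = {}   # router_id -> [tokens, bytes_per_sec, last_refill]

    def register(self, router_id, base, size, bytes_per_sec):
        self.regions[router_id] = (base, size)
        self.buckets[router_id] = [bytes_per_sec, bytes_per_sec, time.monotonic()]

    def check_access(self, router_id, addr, nbytes):
        """Allow the access only if it stays inside the router's own region
        and within its memory bandwidth allocation (token bucket)."""
        base, size = self.regions[router_id]
        if not (base <= addr and addr + nbytes <= base + size):
            return False                                  # isolation violated
        tokens, rate, last = self.buckets[router_id]
        now = time.monotonic()
        tokens = min(rate, tokens + (now - last) * rate)  # refill, capped at 1 s worth
        if tokens < nbytes:
            return False                                  # bandwidth budget exhausted
        self.buckets[router_id] = [tokens - nbytes, rate, now]
        return True

guard = VirtualRouterGuard()
guard.register("vr0", base=0x1000, size=4096, bytes_per_sec=1_000_000)
print(guard.check_access("vr0", 0x1800, 64))   # True: in range and within budget
print(guard.check_access("vr0", 0x2400, 64))   # False: outside vr0's window
```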

    Last-Level Cache Partitioning via Memory Virtual Channels

    Ph.D. dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, February 2023. Advisor: 김장우. Ensuring fairness or providing isolation between multiple workloads with distinct characteristics that are collocated on a single, shared-memory system is a challenge. Recent multicore processors provide last-level cache (LLC) hardware partitioning to support isolation, with the cache partitioning often specified by the user. While more LLC capacity often results in higher performance, in this dissertation we identify that a workload allocated more LLC capacity can exhibit worse performance in real-machine experiments, which we refer to as MiW (more is worse). Through various controlled experiments, we identify that the co-running workload with less LLC capacity incurs more frequent LLC misses; it stresses the main memory system shared by both workloads and degrades the performance of the former workload even though LLC partitioning is in place (a balloon effect). To resolve this problem, we propose virtualizing the data path of the main memory controllers and dedicating memory virtual channels (mVCs) to each group of applications grouped for LLC partitioning. mVCs can further fine-tune group performance by differentiating buffer sizes among channels. They can also reduce total system cost by allowing latency-critical and throughput-oriented workloads to run together on shared machines whose performance criteria could otherwise be met only on dedicated machines. Experiments on a simulated chip multiprocessor show that our proposals effectively eliminate the MiW phenomenon, providing additional opportunities for workload consolidation in a datacenter. Our case study demonstrates a potential 21.8% reduction in machine count with mVCs; without them, the consolidated configuration would violate its service-level objective (SLO).

    Contents:
    1. Introduction
      1.1 Research Contributions
      1.2 Outline
    2. Background
      2.1 Cache Hierarchy and Policies
      2.2 Cache Partitioning
      2.3 Benchmarks
        2.3.1 Working Set Size
        2.3.2 Top-down Analysis
        2.3.3 Profiling Tools
    3. More-is-Worse Phenomenon
      3.1 More LLC Leading to Performance Drop
      3.2 Synthetic Workload Evaluation
      3.3 Impact on Latency-critical Workloads
      3.4 Workload Analysis
      3.5 The Root Cause of the MiW Phenomenon
      3.6 Limitations of Existing Solutions
        3.6.1 Memory Bandwidth Throttling
        3.6.2 Fairness-aware Memory Scheduling
    4. Virtualizing Memory Channels
      4.1 Memory Virtual Channel (mVC)
      4.2 mVC Buffer Allocation Strategies
      4.3 Evaluation
        4.3.1 Experimental Setup
        4.3.2 Reproducing Hardware Results
        4.3.3 Mitigating MiW through mVC
        4.3.4 Evaluation on Four Groups
        4.3.5 Potentials for Operating Cost Savings with mVC
    5. Related Work
      5.1 Component-wise QoS/Fairness for Shared Resources
      5.2 Holistic Approaches to QoS/Fairness
      5.3 MiW on Recent Architectures
    6. Conclusion
      6.1 Discussion
      6.2 Future Work
    Bibliography
    Abstract (in Korean)
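
    The sketch below illustrates the central idea of the dissertation above: giving each LLC-partition group its own dedicated request buffer (an mVC) in the memory controller so that a miss-heavy group's back-pressure stays within its own channel. The buffer sizes, the round-robin issue policy, and all names are assumptions for illustration, not the dissertation's simulator.

```python
# Minimal sketch (assumed policy and sizes): per-group memory virtual channels.
from collections import deque

class MemoryControllerWithMVCs:
    def __init__(self, buffer_sizes):
        # buffer_sizes: {group_id: entries}; differentiating sizes is the tuning knob
        self.queues = {g: deque() for g in buffer_sizes}
        self.capacity = dict(buffer_sizes)
        self.order = list(buffer_sizes)
        self.next_idx = 0

    def enqueue(self, group_id, request):
        """Reject the request if the group's own buffer is full, so a miss-heavy
        group throttles only itself instead of crowding out other groups."""
        q = self.queues[group_id]
        if len(q) >= self.capacity[group_id]:
            return False
        q.append(request)
        return True

    def issue(self):
        """Issue one request per call, round-robin across the groups' channels."""
        for _ in range(len(self.order)):
            g = self.order[self.next_idx]
            self.next_idx = (self.next_idx + 1) % len(self.order)
            if self.queues[g]:
                return g, self.queues[g].popleft()
        return None

mc = MemoryControllerWithMVCs({"latency_critical": 8, "throughput": 32})
mc.enqueue("throughput", "rd 0xA0")
mc.enqueue("latency_critical", "rd 0x10")
print(mc.issue())   # neither group can starve the other's channel
```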

    Towards the Teraflop CFD

    We survey current projects in the area of parallel supercomputers. The machines considered here will become commercially available in the 1990-1992 time frame. All are suitable for exploring the critical issues in applying parallel processors to large-scale scientific computations, in particular CFD calculations. This chapter presents an overview of the surveyed machines and a detailed analysis of the various architectural and technology approaches taken. Particular emphasis is placed on the feasibility of a Teraflops capability following the paths proposed by various developers.
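
    A back-of-envelope calculation of the kind such feasibility arguments rest on is sketched below; the assumed per-processor rate is illustrative and is not a figure taken from the survey.

```python
# Rough scaling arithmetic (assumed per-node rate): how many processors a
# Teraflops machine needs at a given sustained rate per processor.
target_flops = 1e12              # 1 Teraflops
per_node_flops = 100e6           # assume 100 MFLOPS sustained per processor
nodes_needed = target_flops / per_node_flops
print(f"{nodes_needed:,.0f} processors")   # 10,000 processors at 100 MFLOPS each
```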

    Packet Switched vs. Time Multiplexed FPGA Overlay Networks

    Dedicated, spatially configured FPGA interconnect is efficient for applications that require high-throughput connections between processing elements (PEs) but only a limited degree of PE interconnectivity (e.g., wiring up gates and datapaths). Applications that virtualize PEs may require a large number of distinct PE-to-PE connections (e.g., using one PE to simulate hundreds of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE's operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources to all possible connections is costly and inefficient. Alternatively, we can time-share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g., area, route latency, route quality) between time-multiplexed and packet-switched networks overlaid on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. At equivalent area, packet switching is up to 2× faster for small-area designs while time multiplexing is up to 5× faster for larger-area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet switching; however, when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time multiplexing.
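
    The toy model below contrasts the two sharing styles compared above: a time-multiplexed link repeats a fixed slot table even when scheduled connections are idle, while a packet-switched link lets any waiting packet use the cycle. The slot table, traffic, and one-packet-per-cycle arbitration are assumptions for illustration, not the authors' overlay networks.

```python
# Toy comparison (assumed traffic and policies) of time-multiplexed vs.
# packet-switched link sharing.
from collections import deque

def time_multiplexed(slot_table, active, cycles):
    """slot_table: connection id per slot; active: connections with data."""
    delivered = 0
    for cycle in range(cycles):
        conn = slot_table[cycle % len(slot_table)]
        if conn in active:
            delivered += 1      # the scheduled slot is used
        # otherwise the slot is wasted even if other traffic is waiting
    return delivered

def packet_switched(pending, cycles):
    """pending: FIFO of packets; one packet can be issued every cycle."""
    q = deque(pending)
    delivered = 0
    for _ in range(cycles):
        if q:
            q.popleft()
            delivered += 1      # any waiting packet can take the cycle
    return delivered

slots = ["c0", "c1", "c2", "c3", "c4"]              # 5 statically scheduled connections
active = {"c0", "c2"}                                # only 40% of them carry traffic
print(time_multiplexed(slots, active, cycles=10))    # 4: idle slots go unused
print(packet_switched(["c0"] * 5 + ["c2"] * 5, 10))  # 10: every cycle moves a packet
```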

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application-Specific Programmable Circuits (ASPCs), which are developer-programmable rather than user-programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by the hardware reconfiguration required when the problem size exceeds the available reconfigurable resources; these problems are expected to become more serious as FPGA chip sizes increase. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems significantly reduce runtime device reconfiguration by employing user-programmable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages over ASPCs for our data-parallel applications; they are simpler and more scalable, and have lower verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever-improving FPGAs. The innovative nature of this work has the potential to guide research in this emerging field of high-performance, low-cost reconfigurable computing. The greatest advantage of reconfigurable logic lies in its high degree of hardware customization and reconfiguration, which allows resources to be reused to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the w-TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performance-energy objectives are proposed. A system-level energy model for HERA, based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton's method is proposed and employed to verify the methodology.
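
    The snippet below sketches the compile-time half of the two-phase resource management described above: tasks in a weighted task flow graph carry a preferred computing mode, and the PE budget is split across modes accordingly. A greedy proportional split stands in for the ILP formulation, and all task weights, modes, and area figures are assumptions for illustration.

```python
# Minimal sketch (assumed numbers): splitting a PE budget between SIMD and MIMD
# tasks of a weighted task flow graph; a greedy proportional split stands in
# for the ILP-based architecture synthesis.
def provision_pes(tasks, area_budget, area_per_pe):
    """tasks: list of (name, weight, mode). Returns PEs allocated per mode."""
    demand = {"SIMD": 0.0, "MIMD": 0.0}
    for _name, weight, mode in tasks:
        demand[mode] += weight
    total = sum(demand.values()) or 1.0
    max_pes = area_budget // area_per_pe
    # split the PE budget in proportion to each mode's weighted demand
    return {m: max(1, round(max_pes * demand[m] / total)) for m in demand}

tfg = [("mmm_block", 8.0, "SIMD"),   # regular matrix-matrix multiply kernels
       ("lu_pivot",  2.0, "MIMD"),   # irregular pivot search
       ("lu_update", 6.0, "SIMD")]
print(provision_pes(tfg, area_budget=32_000, area_per_pe=2_000))
# {'SIMD': 14, 'MIMD': 2} under these assumed weights and areas
```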