
    A scalable instruction queue design using dependence chains

    Increasing the number of instruction queue (IQ) entries in a dynamically scheduled processor exposes more instruction-level parallelism, leading to higher performance. However, increasing a conventional IQ’s physical size leads to larger latencies and slower clock speeds. We introduce a new IQ design that divides a large queue into small segments, which can be clocked at high frequencies. We use dynamic dependence-based scheduling to promote instructions from segment to segment until they reach a small issue buffer. Our segmented IQ is designed specifically to accommodate variable-latency instructions such as loads. Despite roughly similar circuit complexity, simulation results indicate that our segmented instruction queue with 512 entries and 128 chains improves performance by up to 69% over a 32-entry conventional instruction queue for SPECint 2000 benchmarks, and up to 398% for SPECfp 2000 benchmarks. The segmented IQ achieves from 55% to 98% of the performance of a monolithic 512-entry queue while providing the potential for much higher clock speeds.
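    The segment-to-segment promotion described above can be sketched as a toy model. This is purely illustrative (the class and method names are ours, and dependence-based ordering and variable instruction latencies are omitted): instructions enter the tail segment and advance one segment per cycle toward a small issue buffer whenever the next segment has room.

```python
# Illustrative toy model of a segmented instruction queue (not the paper's
# actual design): instructions drain from the tail segment toward a small
# issue buffer, one segment hop per promotion cycle, when space permits.
from collections import deque

class SegmentedIQ:
    def __init__(self, num_segments, segment_size):
        # segments[0] is closest to the issue buffer; segments[-1] is the tail
        self.segments = [deque() for _ in range(num_segments)]
        self.segment_size = segment_size
        self.issue_buffer = deque()

    def insert(self, insn):
        # Newly dispatched instructions enter the tail segment
        self.segments[-1].append(insn)

    def promote(self):
        # Head of segment 0 moves into the issue buffer if it has room
        if self.segments[0] and len(self.issue_buffer) < self.segment_size:
            self.issue_buffer.append(self.segments[0].popleft())
        # Each other segment forwards its head toward the issue buffer
        for i in range(1, len(self.segments)):
            if self.segments[i] and len(self.segments[i - 1]) < self.segment_size:
                self.segments[i - 1].append(self.segments[i].popleft())
```

    In the real design, promotion order is driven by dependence chains rather than simple FIFO position, which is what lets dependent instructions arrive at the issue buffer just as their producers complete.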

    A Simple Integrated Network Interface for High-Bandwidth Servers

    High-bandwidth TCP/IP networking places a significant burden on end hosts. We argue that this issue should be addressed by integrating simple network interface controllers (NICs) more closely with host CPUs, not by pushing additional computation out to the NICs. We present a simple integrated NIC design (SINIC) that is significantly less complex and more flexible than a conventional DMA-descriptor-based NIC but performs as well or better than the conventional NIC when both are integrated onto the processor die. V-SINIC, an extended version of SINIC, provides virtual per-packet registers, enabling packet-level parallel processing while maintaining a FIFO model. V-SINIC also enables deferring the copy of the packet payload on receive, which we exploit to implement a zero-copy receive optimization in the Linux 2.6 kernel. This optimization improves bandwidth by over 50% on a receive-oriented microbenchmark.
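    The deferred-copy idea can be illustrated roughly as follows (the names and structure here are hypothetical, not SINIC's actual interface): the header is consumed immediately for protocol processing, while the payload is only referenced in place until its final destination is known, so it is moved at most once.

```python
# Hypothetical sketch of deferring the payload copy on receive.
# A real design would expose this via per-packet NIC registers; here a
# Packet simply carries a reference into the receive buffer rather than
# an eagerly copied payload.

class Packet:
    def __init__(self, header, payload_ref):
        self.header = header            # examined immediately by the stack
        self.payload_ref = payload_ref  # left in the receive buffer for now

def deliver(packet, user_buffers):
    # Protocol processing needs only the header to find the destination...
    dest = packet.header["dest_socket"]
    # ...so the payload moves once, directly into the user's buffer,
    # skipping the intermediate kernel copy (the zero-copy receive path).
    user_buffers[dest] = packet.payload_ref
    return dest
```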

    SSDExplorer: A virtual platform for SSD simulations

    Solid State Drives (SSDs) are becoming more and more popular, driven by the relentless growth of high-performance computing and cloud applications. Developing an SSD architecture requires analyzing a set of trade-offs that, if properly understood, can tighten the SSD design space, thus reducing the prototyping effort. Although SSD hardware prototyping platforms are the best way to capture realistic system behaviors, they inherently suffer from a lack of flexibility. To tackle this challenge and to identify the optimum design under a given set of constraints, the SSD research community is increasingly relying on sophisticated software tools for modeling and simulating SSD platforms. In the first part of this chapter the authors take a careful look at both the literature and available simulation tools, including VSSIM, NANDFlashSim, and DiskSim. All of these solutions are benchmarked against the performance of real SSDs, including an OCZ VERTEX 120 GB drive and an NVRAM card used in a large enterprise storage platform, measured under different traffic workloads. The pros and cons of each simulator are analyzed, pointing out which kinds of answers each of them can give and at what price. The second part of the chapter is devoted to an advanced simulator named “SSDExplorer”, a fine-grained SSD virtual platform that was developed with the following goals in mind

    End-to-end performance forecasting


    Performance analysis of system overheads in TCP/IP workloads

    Current high-performance computer systems are unable to saturate the latest available high-bandwidth networks such as 10 Gigabit Ethernet. A key obstacle in achieving 10 gigabits per second is the high overhead of communication between the CPU and network interface controller (NIC), which typically resides on a standard I/O bus with high access latency. Using several network-intensive benchmarks, we investigate the impact of this overhead by analyzing the performance of hypothetical systems in which the NIC is more closely coupled to the CPU, including integration on the CPU die. We find that systems with high-latency NICs spend a significant amount of time in the device driver. NIC integration can substantially reduce this overhead, providing significant throughput benefits when other CPU processing is not a bottleneck. NIC integration also enables cache placement of DMA data. This feature has tremendous benefits when payloads are touched quickly, but can potentially harm performance in other situations due to cache pollution.

    The M5 simulator: Modeling networked systems

    TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects’ ability to explore new designs for network I/O. We have developed the M5 simulator specifically to enable research in this area. In addition to typical architecture simulator attributes, M5 provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. Our experience in simulating network workloads revealed some unexpected interactions between TCP and the common simulation acceleration techniques of sampling and warm-up. We have successfully validated M5’s simulated performance results against real machines, indicating that our models and methodology adequately capture the salient characteristics of these systems. M5’s usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several other academic and commercial groups. Keywords: computer architecture, simulation, simulation software, interconnected systems

    Analyzing NIC Overheads in Network-Intensive Workloads

    Modern high-bandwidth networks place a significant strain on host I/O subsystems. However, despite the practical ubiquity of TCP/IP over Ethernet for high-speed networking, the vast majority of end-host networking research continues in the current paradigm of the network interface as a generic peripheral device. As a result, proposed optimizations focus on purely software changes, or on moving some of the computation from the primary CPU to the off-chip network interface controller (NIC). We look at an alternative approach: leave the kernel TCP/IP stack unchanged, but eliminate bottlenecks by closer attachment of the NIC to the CPU and memory system.