57 research outputs found

    Doctor of Philosophy

    As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security, and resource-management functionality. However, many system-level functions, such as encryption, packet inspection, and error correction, are computationally expensive and require substantial computing power. Moreover, today's application workloads have reached gigabyte and terabyte scales, which demand even more computing power. To meet the rapidly growing demand for computing power at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented, whereas GPUs are designed for long-running computation and large-scale data processing and are therefore throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing developed while building two general frameworks for integrating GPU computing into storage and network packet processing. The principles are generic design techniques and abstractions that address common system-level GPU computing challenges. They have been evaluated in concrete cases, including storage and network packet-processing applications augmented with GPU computing. The significant performance improvements found in the evaluation show the effectiveness and efficiency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, introducing the state of the art in both applications and techniques, as well as their future potential.
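    A recurring technique in this line of work for bridging latency-oriented system code and a throughput-oriented GPU is to batch many small requests before offloading them. The sketch below illustrates that idea in plain C under assumed names (offload_batch() stands in for an actual GPU launch); it is a generic illustration, not the dissertation's framework.

```c
/* Minimal request-batching sketch: accumulate small, latency-oriented
 * requests and hand them to a throughput-oriented offload engine in
 * bulk.  offload_batch() is a hypothetical stand-in for a real GPU
 * launch (e.g., a CUDA kernel invocation). */
#include <stddef.h>
#include <stdio.h>

#define BATCH_SIZE 64

struct request {
    const void *data;
    size_t      len;
};

static struct request batch[BATCH_SIZE];
static size_t         batched = 0;

/* Hypothetical offload entry point: process n requests at once. */
static void offload_batch(struct request *reqs, size_t n)
{
    (void)reqs;
    printf("offloading %zu requests\n", n);
    /* ... copy to device, launch kernel, copy results back ... */
}

/* Called on the latency-oriented path; flushes when the batch is full. */
void submit_request(const void *data, size_t len)
{
    batch[batched].data = data;
    batch[batched].len  = len;
    if (++batched == BATCH_SIZE) {
        offload_batch(batch, batched);
        batched = 0;
    }
}

int main(void)
{
    static char payload[256];
    for (int i = 0; i < 200; i++)
        submit_request(payload, sizeof(payload));  /* flushes 3 full batches */
    return 0;
}
```

    A production batcher would also flush on a timeout, so that a partially filled batch does not hold back latency-sensitive callers: larger batches improve GPU utilization, while the timeout bounds the added queueing delay.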

    Doctor of Philosophy

    With the explosion of chip transistor counts, the semiconductor industry has struggled to continue scaling computing performance in line with historical trends. In recent years, the de facto way to use the excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increasing portion of main memory. These large caches enabled the continued scaling of single-threaded performance, which had not yet reached the limit of instruction-level parallelism (ILP). As we approach the limits of parallelism within a single-threaded application, new approaches such as chip multiprocessors (CMPs) have become popular for scaling performance via thread-level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where both single-threaded and multithreaded performance have often been ignored by computer architects. We propose that novel hardware/OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads, and that this OS contribution grows to a significant fraction of total instructions when executing several common datacenter applications. We demonstrate that architectural improvements have had little to no effect on OS performance over the last 15 years, leaving ample room for improvement. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider a separate operating system processor (OSP) operating concurrently with general-purpose processors (GPPs) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider segregating existing caching structures to decrease cache interference between the OS and applications. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache-topology aware, which, in turn, improves the performance and scalability of many-threaded applications.
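    The claim that the OS accounts for a nontrivial share of executed instructions can be sanity-checked on Linux by counting kernel-mode and user-mode instructions separately with the perf_event interface. The sketch below is an illustrative measurement harness, not the dissertation's methodology; run_workload() is a hypothetical placeholder, and the counters require suitable perf_event_paranoid settings or privileges.

```c
/* Count kernel-mode vs. user-mode retired instructions for a workload
 * using the Linux perf_event interface. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(int exclude_user, int exclude_kernel)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_user = exclude_user;
    attr.exclude_kernel = exclude_kernel;
    /* pid = 0, cpu = -1: this process, any CPU. */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

static void run_workload(void)
{
    /* Hypothetical placeholder for the application under study. */
    char buf[64];
    for (int i = 0; i < 10000; i++)
        read(-1, buf, sizeof(buf));   /* deliberately cheap syscalls */
}

int main(void)
{
    int kfd = open_counter(1, 0);   /* kernel-only instructions */
    int ufd = open_counter(0, 1);   /* user-only instructions   */
    if (kfd < 0 || ufd < 0) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(kfd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(ufd, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();
    ioctl(kfd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(ufd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t kernel_insns = 0, user_insns = 0;
    read(kfd, &kernel_insns, sizeof(kernel_insns));
    read(ufd, &user_insns, sizeof(user_insns));
    printf("kernel: %llu  user: %llu  OS share: %.1f%%\n",
           (unsigned long long)kernel_insns,
           (unsigned long long)user_insns,
           100.0 * kernel_insns / (kernel_insns + user_insns + 1.0));
    return 0;
}
```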

    A method for improving TCP/IP stack performance by offloading checksum calculation to the GPU

    The size of Ethernet frames is increasing due to the use of the Ethernet Jumbo Frame option, especially in closed network environments such as data center networks. Larger frames raise the overhead of checksum calculation in TCP/IP protocol processing, which increases CPU load. In this report, we propose a scheme that reduces CPU load and improves data transmission throughput by offloading checksum calculation to a Graphics Processing Unit (GPU) with high-bandwidth memory. The scheme consists of two methods: a packet queueing method that improves packet transfer efficiency between the CPU and the GPU, and a packet distribution method that processes multiple packets concurrently across the GPU's multiprocessors. We evaluate the scheme with a simple user-land implementation of the checksum calculation and show that GPU offloading improves TCP data transmission throughput by up to 13%, almost matching the case in which checksum calculation is skipped entirely. 坪内佑樹, 長谷川剛, 谷口義明, 中野博隆, 松岡茂登, "A method for improving TCP/IP stack performance by offloading checksum calculation to the GPU" (in Japanese), IEICE Technical Report, Vol. 113, No. 244, pp. 67-72, IEICE, 201
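    For reference, the per-packet work being offloaded is the standard Internet checksum used by TCP/IP (RFC 1071): a 16-bit one's-complement sum over the data. A minimal CPU version is sketched below in C; in the proposed scheme this loop runs on the GPU, with many queued packets checksummed in parallel across the multiprocessors.

```c
/* Internet checksum (RFC 1071): one's-complement sum of 16-bit words.
 * This is the per-packet kernel that the paper offloads to the GPU. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    /* Sum 16-bit words (network byte order). */
    while (len > 1) {
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    /* Pad a trailing odd byte with zero. */
    if (len == 1)
        sum += (uint32_t)p[0] << 8;

    /* Fold the carries back into 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}

int main(void)
{
    const uint8_t pkt[] = { 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46 };
    printf("checksum = 0x%04x\n", inet_checksum(pkt, sizeof(pkt)));
    return 0;
}
```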

    A data-driven study of operating system energy-performance trade-offs towards system self optimization

    This dissertation is motivated by an intersection of changes occurring in modern software and hardware, driven by increasing application performance and energy requirements while Moore's Law and Dennard scaling face diminishing returns. To address these challenging requirements, new features are increasingly being packed into hardware to support new offloading capabilities, along with more complex software policies to manage those features. This is leading to an exponential explosion in the number of possible configurations of both software and hardware that could meet the requirements. For network-based applications, this thesis demonstrates how these complexities can be tamed by identifying and exploiting the characteristics of the underlying system through a rigorous and novel experimental study. It shows how one can simplify this control problem in practical settings by cutting across the complexity with mechanisms that exploit two fundamental properties of network processing. Using the common request-response network processing model, the thesis finds that controlling 1) the rate of network interrupts and 2) the speed at which each request is then executed enables the software and hardware to be characterized in a stable and well-structured manner. Specifically, a network device's interrupt delay feature is used to control the rate of incoming and outgoing network requests, and the processor's frequency setting is used to control the speed of instruction execution. This experimental study, conducted over 340 unique combinations of the two mechanisms across 2 OSes and 4 applications, finds that optimizing these settings in an application-specific way can yield performance improvements of over 2X while also improving energy efficiency by over 2X.
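    On Linux, the two control knobs described above correspond to standard interfaces: the NIC's interrupt delay is exposed through the ethtool coalescing parameters, and instruction execution speed through the cpufreq sysfs files. The sketch below shows one plausible way to set both from C; the interface name, CPU number, values, and paths are illustrative, root privileges are assumed, and this is a sketch of the mechanisms rather than the dissertation's tooling.

```c
/* Set a NIC's receive interrupt delay (ethtool coalescing) and cap a
 * core's clock frequency (cpufreq sysfs).  Names and values are
 * illustrative; both operations normally require root. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static int set_rx_coalesce_usecs(const char *ifname, unsigned usecs)
{
    struct ethtool_coalesce ec;
    struct ifreq ifr;
    int ret = -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ec;

    /* Read current settings, then change only the RX interrupt delay. */
    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = usecs;
        ret = ioctl(fd, SIOCETHTOOL, &ifr);
    }
    close(fd);
    return ret;
}

static int set_max_freq_khz(int cpu, unsigned khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%u\n", khz);
    return fclose(f);
}

int main(void)
{
    /* One example point in the configuration space: 50 us interrupt
     * delay on eth0, 1.2 GHz frequency cap on core 0. */
    if (set_rx_coalesce_usecs("eth0", 50) < 0)
        perror("ethtool coalesce");
    if (set_max_freq_khz(0, 1200000) < 0)
        perror("cpufreq");
    return 0;
}
```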

    Fast algorithm for real-time rings reconstruction

    The GAP project is dedicated to studying the application of GPUs in several contexts in which real-time response is important for decision making. The definition of real-time depends on the application under study, ranging from response times of a few microseconds (μs) up to several hours for very compute-intensive tasks. At this conference we presented our work on low-level triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, as well as specific applications to nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from dedicated solutions to reduce the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we present an original algorithm developed for trigger applications that accelerates ring reconstruction in RICH detectors when seeds for the reconstruction cannot be obtained from external trackers.
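    For background, the textbook way to fit a ring without seeds from an external tracker is an algebraic least-squares circle fit over the hit coordinates (the Kåsa method), which reduces to a 3x3 linear system and is cheap enough to run per candidate. The C sketch below implements that standard fit; it is illustrative background, not the trigger algorithm developed by the GAP project.

```c
/* Algebraic (Kåsa) least-squares circle fit: given n hit positions,
 * recover center (a, b) and radius r without any seed. */
#include <math.h>
#include <stdio.h>

static double det3(const double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

int fit_circle(const double *x, const double *y, int n,
               double *a, double *b, double *r)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    double sz = 0, sxz = 0, syz = 0;
    for (int i = 0; i < n; i++) {
        double z = x[i] * x[i] + y[i] * y[i];
        sx += x[i];  sy += y[i];  sz += z;
        sxx += x[i] * x[i];  syy += y[i] * y[i];  sxy += x[i] * y[i];
        sxz += x[i] * z;     syz += y[i] * z;
    }
    /* Normal equations for x^2 + y^2 + B x + C y + D = 0. */
    double M[3][3] = { { sxx, sxy, sx },
                       { sxy, syy, sy },
                       { sx,  sy,  (double)n } };
    double rhs[3] = { -sxz, -syz, -sz };
    double d = det3(M);
    if (n < 3 || fabs(d) < 1e-12)
        return -1;                      /* degenerate hit configuration */

    double sol[3];
    for (int k = 0; k < 3; k++) {       /* Cramer's rule for B, C, D */
        double Mk[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                Mk[i][j] = (j == k) ? rhs[i] : M[i][j];
        sol[k] = det3(Mk) / d;
    }
    *a = -sol[0] / 2.0;                 /* center x */
    *b = -sol[1] / 2.0;                 /* center y */
    *r = sqrt(*a * *a + *b * *b - sol[2]);
    return 0;
}

int main(void)
{
    /* Four hits on a circle of radius 2 centered at (1, -1). */
    const double x[] = { 3.0, 1.0, -1.0,  1.0 };
    const double y[] = { -1.0, 1.0, -1.0, -3.0 };
    double a, b, r;
    if (fit_circle(x, y, 4, &a, &b, &r) == 0)
        printf("center (%.2f, %.2f), radius %.2f\n", a, b, r);
    return 0;
}
```

    On a GPU, many such fits can be evaluated concurrently, with one candidate ring handled per thread group.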

    High performance communication on reconfigurable clusters

    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor limiting the performance of HPC, but it can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interaction, and even to bypass the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between the computation kernel and the network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide such an application-aware communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels of the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that adding application-awareness to the router configuration substantially improves the network performance of FPGA clusters. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm, which shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that our design is faster than CPUs and GPUs by at least one order of magnitude, achieving strong scaling for the target applications. Surprisingly, the FPGA cluster performance is similar to that of an ASIC cluster. We also implement the 3D FFT on another multi-FPGA platform, the Microsoft Catapult II cloud; its performance is likewise comparable or superior to that of CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics simulation (MD). We model MD on both FPGA clouds and clusters, and find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the μs/day range on a commodity cloud.
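    As a rough illustration of what statically-determined paths look like on such a network, the C sketch below computes a fixed dimension-ordered route on a small 3D torus, the kind of deterministic path an offline scheduler can assign time slots to. The torus size and routing order are arbitrary choices for the example, and this is not the offline collective routing algorithm proposed in the dissertation.

```c
/* Dimension-ordered (X, then Y, then Z) routing on a 3D torus,
 * taking the shorter wraparound direction in each dimension. */
#include <stdio.h>

#define DIM 4   /* illustrative: a 4 x 4 x 4 torus */

/* Signed hop count along one dimension, using wraparound if shorter. */
static int hops(int from, int to)
{
    int d = (to - from + DIM) % DIM;
    return (d <= DIM / 2) ? d : d - DIM;   /* negative = go backwards */
}

/* Print the nodes visited from src to dst, one dimension at a time.
 * A statically-scheduled router would precompute these paths (and
 * their time slots) offline. */
static void route(const int src[3], const int dst[3])
{
    int cur[3] = { src[0], src[1], src[2] };
    printf("(%d,%d,%d)", cur[0], cur[1], cur[2]);
    for (int dim = 0; dim < 3; dim++) {
        int h = hops(cur[dim], dst[dim]);
        int step = (h > 0) ? 1 : -1;
        for (; h != 0; h -= step) {
            cur[dim] = (cur[dim] + step + DIM) % DIM;
            printf(" -> (%d,%d,%d)", cur[0], cur[1], cur[2]);
        }
    }
    printf("\n");
}

int main(void)
{
    int src[3] = { 0, 0, 0 }, dst[3] = { 3, 1, 2 };
    route(src, dst);   /* X first: wraps 0 -> 3 in a single hop */
    return 0;
}
```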

    Network-Compute Co-Design for Distributed In-Memory Computing

    The booming popularity of online services is rapidly raising the demands on modern datacenters. In order to cope with the data deluge, growing user bases, and tight quality-of-service constraints, service providers deploy massive datacenters with tens to hundreds of thousands of servers, keeping petabytes of latency-critical data memory-resident. Such data distribution and the multi-tiered nature of the software used by feature-rich services result in frequent inter-server communication and remote memory access over the network. Hence, networking takes center stage in datacenters. In response to growing internal datacenter network traffic, networking technology is rapidly evolving. Lean user-level protocols, like RDMA, and high-performance fabrics have started making their appearance, dramatically reducing datacenter-wide network latency and offering unprecedented per-server bandwidth. At the same time, the end of Dennard scaling is grinding processor performance improvements to a halt. The net result is a growing mismatch between per-server network and compute capabilities: it will soon be difficult for a server processor to utilize all of its available network bandwidth. Restoring balance between network and compute capabilities requires tighter co-design of the two. The network interface (NI) is of particular interest, as it lies on the boundary of network and compute. In this thesis, we focus on the design of an NI for a lightweight RDMA-like protocol and its full integration with modern manycore server processors. The NI's capabilities scale with both the increasing network bandwidth and the growing number of cores on modern server processors. Leveraging our architecture's integrated NI logic, we introduce new functionality at the network endpoints that yields performance improvements for distributed systems. Such additions include new network operations with stronger semantics tailored to common application requirements, and integrated logic for balancing network load across a modern processor's multiple cores. We make the case that exposing richer, end-to-end semantics to the NI is a unique enabler for optimizations that can reduce software complexity and remove significant load from the processor, helping maintain balance between the two valuable resources of network and compute. Overall, network-compute co-design addresses the challenges associated with this emerging technological mismatch between compute and networking capabilities, yielding significant performance improvements for distributed memory systems.
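    One of the NI-level additions mentioned above is integrated logic for balancing network load across a processor's cores. The C sketch below shows the simplest form of such a dispatch decision, steering each arriving request to the core with the shortest queue; the per-core occupancy array is an assumed stand-in for state an integrated NI could observe, and this is a schematic illustration rather than the thesis's NI design.

```c
/* Schematic NI dispatch: steer each incoming request to the core with
 * the fewest pending requests.  The occupancy array stands in for the
 * per-core queue state an integrated NI could observe directly. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 16

static uint32_t occupancy[NUM_CORES];   /* pending requests per core */

/* Pick the least-loaded core for a new request. */
static int pick_core(void)
{
    int best = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (occupancy[c] < occupancy[best])
            best = c;
    return best;
}

static void dispatch_request(uint64_t req_id)
{
    int core = pick_core();
    occupancy[core]++;      /* decremented when the core completes it */
    printf("request %llu -> core %d\n", (unsigned long long)req_id, core);
}

int main(void)
{
    for (uint64_t i = 0; i < 8; i++)
        dispatch_request(i);
    return 0;
}
```

    Unlike static flow hashing, load-aware steering of this kind tolerates skewed request service times, which is exactly the sort of end-to-end information an on-chip NI is well placed to exploit.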

    Building Evolvable Networks: Flexible and Predictable Packet Processing

    Software packet-processing platforms--network devices running on general-purpose servers--are emerging as a compelling alternative to traditional high-end switches and routers running on specialized hardware. Their promise is to enable the fast deployment of new, sophisticated kinds of packet processing without the need to buy and deploy expensive new equipment. This would allow us to transform the current Internet into a programmable network, one that can evolve over time and provide better service to users. In order to become a credible alternative to hardware platforms, software packet processing needs to offer not just flexibility, but also a competitive level of performance and, equally important, predictability. Recent work has demonstrated high performance for software platforms, but only for simple, conventional workloads like packet forwarding and IP routing, and only for systems where all the processing cores handle the same type and amount of traffic and run identical code, a critical simplifying assumption. One challenge is to achieve high and predictable performance when a software platform runs a diverse set of applications and serves multiple clients with different needs. Another challenge is to offer such flexibility without the risk of disrupting the network by introducing bugs, unpredictable performance, or security vulnerabilities. In this thesis we focus on how to design software packet-processing systems that achieve both high performance and predictability, while maintaining ease of programmability. First, we identify the main factors that affect packet-processing performance on a modern multicore server--cache misses, cache contention, and load balancing across processing cores--and show how to parallelize the functionality across the available cores in order to maximize throughput. Second, we analyze how contention for shared resources--caches, memory controllers, buses--affects performance in a system that runs a diverse set of packet-processing applications. The key observation is that contention for the last-level cache is the dominant contention factor, and the performance drop suffered by a given application is mostly determined by the number of cache references per second performed by the competing applications. We leverage this observation to show that such a system can provide predictable performance in the face of resource contention. Third, we present the result of working iteratively on two tasks: designing a domain-specific verification tool for packet-processing software, while identifying a minimal set of restrictions that packet-processing software must satisfy in order to be verification-friendly. The main insight is that packet-processing software is a good fit for verification because it typically consists of distinct pieces of code that share limited mutable state, and we can leverage this domain specificity to sidestep fundamental scalability challenges in software verification. We demonstrate that it is feasible to automatically prove useful properties of software dataplanes to ensure smooth network operation. Based on our results, we conclude that we can design software network equipment that combines flexibility and predictability.
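    The parallelization step described above, spreading packets across cores to maximize throughput, is commonly done by hashing each packet's flow identifier so that a flow's packets stay on one core and its state stays in one cache hierarchy. The C sketch below shows that idea with a simple 5-tuple hash; the hash function and tuple layout are illustrative stand-ins (for example, for NIC RSS hashing), not the dissertation's load-balancing scheme.

```c
/* Hash-based packet steering: map each flow's 5-tuple to one core so
 * per-flow state is touched by a single cache hierarchy.  FNV-1a is an
 * illustrative stand-in for hardware RSS hashing. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define NUM_CORES 8

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint32_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static int core_for_flow(const struct five_tuple *ft)
{
    /* Copy the fields into a contiguous key to avoid hashing padding. */
    uint8_t key[13];
    memcpy(key + 0,  &ft->src_ip,   4);
    memcpy(key + 4,  &ft->dst_ip,   4);
    memcpy(key + 8,  &ft->src_port, 2);
    memcpy(key + 10, &ft->dst_port, 2);
    key[12] = ft->proto;
    return fnv1a(key, sizeof(key)) % NUM_CORES;
}

int main(void)
{
    struct five_tuple ft = { 0x0a000001, 0x0a000002, 12345, 80, 6 };
    printf("flow -> core %d\n", core_for_flow(&ft));
    return 0;
}
```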