332 research outputs found
The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significanttly improves performance over conventional virtual memory
Recommended from our members
Languages and Compilers for Writing Efficient High-Performance Computing Applications
Many everyday applications, such as web search, speech recognition, and weather prediction, are executed on high-performance systems containing thousands of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). These applications can be written in either low-level programming languages, such as NVIDIA CUDA, or domain specific languages, like Halide for image processing and PyTorch for machine learning programs. Despite the popularity of these languages, there are several challenges that programmers face when developing efficient high-performance computing applications. First, since every hardware support a different low-level programming model, to utilize new hardware programmers need to rewrite their applications in another programming language. Second, writing efficient code involves restructuring the computation to ensure (i) regular memory access patterns, (ii) non-divergent control flow, and (iii) complete utilization of different programmer managed caches. Furthermore, since these low-level optimizations are known only to hardware experts, it is difficult for a domain expert to write optimized code for new computations. Third, existing domain specific languages suffer from optimization barriers in the language constructs that prevent new optimizations and hence, these languages provide sub-optimal performance. To address these challenges this thesis presents the following novel abstractions and compiler techniques for writing image processing and machine learning applications that can run efficiently on a variety of high-performance systems. First, this thesis presents techniques to optimize image processing programs on GPUs using the features of modern GPUs. These techniques improve the concurrency and register usage of generated code to provide better performance than the state-of-the-art. Second, this thesis presents NextDoor, which is the first system to provide an abstraction for writing graph sampling applications and efficiently executing these applications on GPUs. Third, this thesis presents CoCoNet, which is a domain specific language to co-optimize communication and computation in distributed machine learning workloads. By breaking the optimization barriers in existing domain specific languages, these techniques help programmers write correct and efficient code for diverse high-performance computing workloads
Efficient GPU Tree Walks for Effective Distributed N-Body Simulations
N-body problems, such as simulating the motion of stars in a galaxy, are popularly solved using tree codes like Barnes-Hut. ChaNGa is a best-of-breed n-body platform that uses an asymptotically-efficient tree traversal strategy known as a dual-tree walk to quickly determine which bodies need to interact with each other to provide an accurate simulation result. However, this strategy does not work well on GPUs, due to the highly-irregular nature of the dual-tree algorithm. On GPUs, ChaNGa uses a hybrid strategy where the CPU performs the tree walk to determine which bodies interact while the GPU performs the force computation. In this paper, we show that a highly-optimized single-tree walk approach is able to achieve better GPU performance by significantly accelerating the tree walk and reducing CPU/GPU communication. Our experiments show that this new design can achieve a 8.25× speedup over baseline ChaNGa using a one node, one process per node configuration
- …