
    Extending BORPH for shared memory reconfigurable computers

    In this paper we extend BORPH for shared-memory reconfigurable computers. BORPH is an operating system designed for FPGA-based reconfigurable computers, and it introduced the concept of a hardware process in contrast to a software process. With our extension, hardware processes can communicate with other processes through shared memory. In our system, the program of a hardware process is not just a hardware design but also the software program running on an embedded processor in the FPGA. Our experiments show that the overhead of managing shared memory segments is acceptable, and that with independent virtual memory access the bandwidth of repeated shared memory accesses is high.
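    The communication model described here resembles ordinary interprocess shared memory. As a rough illustration only, the following C sketch shows one process creating and writing a named segment via POSIX shm that another process could also map; the segment name and message are hypothetical, and this is not BORPH's actual hardware-process API.

        /* Minimal sketch of shared-memory communication between processes,
         * using POSIX shm as a stand-in; names below are hypothetical. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void) {
            const size_t len = 4096;
            /* Create or open a named segment that another process can also map. */
            int fd = shm_open("/borph_demo_seg", O_CREAT | O_RDWR, 0600);
            if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

            char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (seg == MAP_FAILED)
                return 1;

            /* Both sides read and write the mapping directly, with no copying
             * through pipes or sockets. */
            strcpy(seg, "request: start filter kernel");
            printf("wrote to shared segment: %s\n", seg);

            munmap(seg, len);
            close(fd);
            return 0;
        }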

    Memory-savvy distributed interactive ray tracing

    Interactive ray tracing in a cluster environment requires paying close attention to the constraints of a loosely coupled distributed system. To render large scenes interactively, memory limits and network latency must be addressed efficiently. In this paper, we improve on previous systems by moving to a page-based distributed shared memory layer, resulting in faster and easier access to a shared memory space. The technique is designed to take advantage of the large virtual memory space provided by 64-bit machines. We also examine task reuse through decentralized load balancing and primitive reorganization to complement the shared memory system. These techniques improve memory coherence and are valuable when physical memory is limited.

    MARACAS: a real-time multicore VCPU scheduling framework

    This paper describes MARACAS, a multicore scheduling and load-balancing framework that addresses shared cache and memory bus contention. It builds upon prior work centered around the concept of virtual CPU (VCPU) scheduling. Threads are associated with VCPUs that have periodically replenished time budgets, and VCPUs are guaranteed to receive their periodic budgets even if they are migrated between cores. A load-balancing algorithm maps VCPUs to cores so that surplus CPU cycles are distributed fairly after VCPU timing guarantees have been met. MARACAS uses surplus cycles to throttle the execution of threads running on specific cores when memory contention exceeds a certain threshold, enabling threads on other cores to make better progress without interference from co-runners. Our scheduling framework features a novel memory-aware scheduling approach that uses performance counters to derive an average memory request latency, and we show that latency-based memory throttling is more effective than rate-based memory access control in reducing bus contention. MARACAS also supports cache-aware scheduling and migration using page recoloring to improve performance isolation amongst VCPUs. Experiments show how MARACAS reduces multicore resource contention, leading to improved task progress.
    http://www.cs.bu.edu/fac/richwest/papers/rtss_2016.pdf
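    The latency-based throttling idea can be sketched in a few lines of C. The per-core latency reading below is simulated (the paper derives it from hardware performance counters), and the threshold, function names, and throttling mechanism are illustrative assumptions rather than MARACAS's actual interfaces.

        /* Sketch: throttle threads on a core while its average memory request
         * latency exceeds a contention threshold, so co-runners elsewhere can
         * make progress. All names and values are hypothetical. */
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NUM_CORES            4
        #define LATENCY_THRESHOLD_NS 300  /* hypothetical contention threshold */

        /* Stand-in for reading a per-core performance counter. */
        static uint64_t read_avg_mem_latency_ns(int core) {
            return (uint64_t)(rand() % 600);
        }

        static void set_core_throttled(int core, bool on) {
            printf("core %d: %s\n", core, on ? "throttled" : "running");
        }

        int main(void) {
            for (int core = 0; core < NUM_CORES; core++) {
                uint64_t lat = read_avg_mem_latency_ns(core);
                /* Throttle only while measured contention persists. */
                set_core_throttled(core, lat > LATENCY_THRESHOLD_NS);
            }
            return 0;
        }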

    Reducing Host Load, Network Load and Latency in a Distributed Shared Memory

    Mether is a Distributed Shared Memory (DSM) that runs on Sun workstations under the SunOS 4.0 operating system. User programs access the Mether address space in a way indistinguishable from other memory. Mether had a number of performance problems which we had also seen on a distributed shared memory called MemNet[2]. In this paper we discuss changes we made to Mether, and protocols we developed for using Mether, that minimize host load, network load, and latency. An interesting (and unexpected) result was that for one problem we studied, the best protocol for Mether is identical to the best protocol for MemNet[6]. The changes to Mether involve exposing an inconsistent store to the application and making access to the consistent and inconsistent versions very convenient; providing both demand-driven and data-driven semantics for updating pages; and allowing the user to specify that only a small subset of a page need be transferred. All of these operations are encoded in a few address bits of the Mether virtual address.
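    Encoding access semantics in address bits means the same page can be reached through different address aliases, each selecting a different protocol behavior. The C sketch below illustrates the idea under assumed bit positions and macro names; it is not Mether's real address layout.

        /* Sketch: aliases of the same page select consistent vs. inconsistent
         * access and full-page vs. short (subset) transfer. Bit positions and
         * names are hypothetical. */
        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        #define METHER_INCONSISTENT_BIT (UINT64_C(1) << 40)  /* possibly stale copy  */
        #define METHER_SHORT_PAGE_BIT   (UINT64_C(1) << 41)  /* transfer page prefix */

        static uint64_t mether_alias(uint64_t addr, uint64_t flags) {
            /* Same backing page; the alias chosen selects the access semantics. */
            return addr | flags;
        }

        int main(void) {
            uint64_t page = UINT64_C(0x100000000);  /* illustrative address */
            printf("consistent access:        %#" PRIx64 "\n", page);
            printf("inconsistent, short page: %#" PRIx64 "\n",
                   mether_alias(page, METHER_INCONSISTENT_BIT | METHER_SHORT_PAGE_BIT));
            return 0;
        }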

    LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic [Extended Version]

    CORRECTION: The authors for entry [4] in the references should have been "E. S. Chung, J. C. Hoe, and K. Mai".
    Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers, by contrast, expect a programming environment to include automatic memory management: virtual memory provides the illusion of very large arrays, and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple independent memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements for them. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model.
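    The key property is that a scratchpad presents the same interface as an on-die RAM block, so client logic is unchanged whether data sits in block RAM or in a cached off-chip backing store. The sketch below expresses that idea in C purely for illustration (LEAP itself targets FPGA RTL); the interface struct and names are assumptions, not LEAP's API.

        /* Sketch: one read/write interface, interchangeable back ends. */
        #include <stdint.h>
        #include <stdio.h>

        typedef struct {
            uint32_t (*read)(uint32_t addr);
            void     (*write)(uint32_t addr, uint32_t data);
        } mem_ifc_t;  /* same shape for a BRAM or a scratchpad back end */

        static uint32_t backing[1024];
        static uint32_t sp_read(uint32_t a)              { return backing[a]; }
        static void     sp_write(uint32_t a, uint32_t d) { backing[a] = d; }

        int main(void) {
            mem_ifc_t scratchpad = { sp_read, sp_write };  /* plug-in replacement */
            scratchpad.write(7, 42);
            printf("%u\n", scratchpad.read(7));
            return 0;
        }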

    Design and implementation of page based distributed shared memory in distributed database systems

    This project is a simulation of page-based distributed shared memory, originally called IVY, proposed by Li in 1986[3] and then by Li and Hudak in 1989[4]. The 'Page Based Distributed Shared Memory System' consists of a collection of clients or workstations connected to a server by a Local Area Network. The server contains a shared memory segment within which the distributed database is located. The shared memory segment is divided into pages, hence the name 'Page Based Distributed Shared Memory System', where each page represents a table within the distributed database.

    In the simplest variant, each page is present on exactly one machine. A reference to a local page is done at full memory speed. An attempt to reference a page on a different machine causes a page fault, which is trapped by the software. The software then sends a message to the remote machine, which finds the needed page and sends it to the requesting process. The fault is then restarted and can now complete, which is achieved with the help of an Inter Process Communication (IPC) library. In essence, this design is similar to traditional virtual memory systems: when a process touches a nonresident page, a fault occurs and the operating system fetches the page and maps it in. The difference here is that instead of getting the page from the disk, the software gets it from another processor over the network. To the user process, however, the system looks very much like a traditional multiprocessor, with multiple processes free to read and write the shared memory at will. All communication and synchronization is done via the memory, with no communication visible to the user process.

    The approach is not to share the entire address space, but only a selected portion of it, namely just those variables or data structures that need to be used by more than one process. With respect to a distributed database system, in this model the shared variables represent the pages or tables within the shared memory segment. One does not think of each machine as having direct access to an ordinary memory but rather to a collection of shared variables, giving a higher level of abstraction. This approach greatly reduces the amount of data that must be shared, but in most cases considerable information about the shared data is available, such as their types, which helps optimize the implementation. Page-based distributed shared memory takes a normal linear address space and allows the pages to migrate dynamically over the network on demand. Processes can access all of memory using normal read and write instructions and are not aware of when page faults or network transfers occur. Accesses to remote data are detected and protected by the memory management unit. In order to facilitate optimization, the shared variables or tables are replicated on multiple machines. Potentially, reads can be done locally without any network traffic, and writes are done using a multicopy update protocol. This protocol is widely used in distributed database systems.

    The main purpose of this simulation is to discuss the issues in a Distributed Database System and how they can be overcome with the help of a Page Based Distributed Shared Memory System. In a Distributed Database System, multiple clients can read data from or write data to the database. The main issue in such a system is achieving consistency. Consistency is defined as follows: any read to a location within the database returns the value stored by the most recent write operation to that location within the database. Since multiple clients are trying to perform various operations on the same database, it is difficult to ensure that the requesting client gets the most recent copy of the database.
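    The fault-and-fetch path described above can be demonstrated on a single machine with a protection fault handler. The C sketch below marks a page non-resident with PROT_NONE, traps the access in a SIGSEGV handler, "fetches" the page (a real DSM would request it from the owning node over the network), and lets the faulting access restart. The names and single-page setup are illustrative only, not IVY's implementation.

        /* Sketch of the DSM page-fault path: trap, fetch, map in, restart. */
        #include <signal.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static char *dsm_base;
        static long page_size;

        static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx) {
            (void)sig; (void)ctx;
            char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1));
            /* A real DSM would message the page's owner here. */
            mprotect(page, page_size, PROT_READ | PROT_WRITE);
            memset(page, 'A', page_size);   /* stand-in for the remote copy */
        }

        int main(void) {
            page_size = sysconf(_SC_PAGESIZE);
            dsm_base = mmap(NULL, page_size, PROT_NONE,   /* non-resident at first */
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct sigaction sa = {0};
            sa.sa_sigaction = dsm_fault_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGSEGV, &sa, NULL);

            printf("first byte after fault-in: %c\n", dsm_base[0]);  /* faults once */
            return 0;
        }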

    Optimizing virtual machine scheduling in NUMA multicore systems

    An increasing number of new multicore systems use the Non-Uniform Memory Access (NUMA) architecture due to its scalable memory performance. However, the complex interplay among data locality, contention on shared on-chip memory resources, and cross-node data sharing overhead makes the delivery of optimal and predictable program performance difficult. Virtualization further complicates the scheduling problem: due to abstract and inaccurate mappings from virtual hardware to machine hardware, program and system-level optimizations are often not effective within virtual machines. We find that the penalty to access the "uncore" memory subsystem is an effective metric to predict program performance in NUMA multicore systems. Based on this metric, we add NUMA awareness to virtual machine scheduling. We propose a Bias Random vCPU Migration (BRM) algorithm that dynamically migrates vCPUs to minimize the system-wide uncore penalty. We have implemented the scheme in the Xen virtual machine monitor. Experimental results on a two-way Intel NUMA multicore system with various workloads show that BRM is able to improve application performance by up to 31.7% compared with the default Xen credit scheduler. Moreover, BRM achieves predictable performance with, on average, no more than 2% runtime variations.
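    One migration step of a bias-random scheme in this spirit can be sketched as follows: nodes whose uncore penalty exceeds the system average migrate a vCPU away with a probability that grows with the excess. The penalty values, data structures, and probabilities below are illustrative assumptions, not Xen's or the paper's actual implementation.

        /* Sketch of a single bias-random vCPU migration step. */
        #include <stdio.h>
        #include <stdlib.h>

        #define NUM_NODES 2

        static double uncore_penalty[NUM_NODES] = {1.8, 1.1};  /* hypothetical metric */

        static void migrate_one_vcpu(int from, int to) {
            printf("migrating a vCPU from node %d to node %d\n", from, to);
        }

        int main(void) {
            double avg = 0;
            for (int n = 0; n < NUM_NODES; n++) avg += uncore_penalty[n];
            avg /= NUM_NODES;

            for (int n = 0; n < NUM_NODES; n++) {
                double excess = uncore_penalty[n] - avg;
                if (excess <= 0) continue;
                /* Bias: the larger the excess penalty, the likelier a migration. */
                if ((double)rand() / RAND_MAX < excess / uncore_penalty[n])
                    migrate_one_vcpu(n, (n + 1) % NUM_NODES);
            }
            return 0;
        }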