24 research outputs found

    Feasibility of optically interconnected parallel processors using wavelength division multiplexing

    Full text link

    Cache-coherent distributed shared memory: perspectives on its development and future challenges

    Full text link

    Lock Prediction to Reduce the Overhead of Synchronization Primitives

    Get PDF
The advent of chip multi-processors has led to an increase in computational performance in recent years, and employing efficient parallel algorithms has become important to harness the full potential of multiple cores. One of the major productivity limitations in parallel programming arises from the use of synchronization primitives, which enforce mutual exclusion on critical-section data. Most shared-memory multiprocessor architectures provide hardware support for mutually exclusive access to shared data structures through lock and unlock operations, implemented as instructions that atomically read and then write a single memory location. A good synchronization technique should reduce network bandwidth, have low lock-acquisition latency, and be fair in granting requests. In a typical directory-controller-based locking scheme, each thread communicates with the directory controller for every lock request and lock release. The overhead of this design includes communication with the directory controller at each step of lock acquisition, which causes high-latency transactions; a significant amount of time is therefore spent on communication rather than on the actual operation. Previous work has focused on reducing communication to the home node through various techniques. One technique of interest is the Implicit Queue on Lock Bit technique (IQOLB), in which the lock is forwarded directly to the requestor from the thread currently holding it, without communication through the home node. Its limitations are that forwarding can take place only after the current lock holder has received information about the new requestor from the home node, and that the cache coherence protocol must be modified to distinguish a regular memory read request from a synchronization request. Very little research has been performed in the area of lock prediction. Based on our data analysis, we believe that lock communication is predictable and that prediction can improve performance significantly. This research focuses on predicting the sequence in which locks are acquired, so that the thread currently holding a lock can preemptively invalidate the locked cache line and forward it to the subsequent requestor, reducing the time taken to acquire the lock. The predictor is adaptive: whenever a lock is biased toward a thread, the line remains in that thread's cache and no invalidation takes place. The benefits of the technique include a reduction in the number of messages exchanged with the home node without any modification to the cache coherence protocol (it does not distinguish a regular memory read request from a synchronization request). Evaluation of the lock predictor on the PARSEC benchmark suite shows an average improvement in overall performance of 9% over the base case.
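To make the acquisition-sequence idea concrete, the sketch below shows a minimal per-lock predictor in C++. It assumes a simple last-pattern scheme, learning for each lock which thread tends to acquire it after the current holder, and it declines to predict when the lock appears biased toward its holder; the names (`LockPredictor`, `recordAcquire`, `predictNext`) and the bias heuristic are illustrative assumptions, not the paper's actual design.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical sketch of a "next acquirer" predictor for locks, assuming a
// last-pattern scheme: for each lock address, remember which thread tended
// to acquire it after the current holder.
class LockPredictor {
public:
    // Called when `thread` successfully acquires the lock at `lockAddr`.
    void recordAcquire(uint64_t lockAddr, int thread) {
        auto& st = table_[lockAddr];
        if (st.lastHolder >= 0) {
            // Learn the observed transition: lastHolder -> thread.
            st.successor[st.lastHolder] = thread;
        }
        st.lastHolder = thread;
    }

    // Predict who will request the lock next, given the current holder.
    // Returns nothing when the lock looks biased toward the holder, so the
    // cached line stays where it is and no invalidation is triggered.
    std::optional<int> predictNext(uint64_t lockAddr, int holder) const {
        auto it = table_.find(lockAddr);
        if (it == table_.end()) return std::nullopt;
        auto s = it->second.successor.find(holder);
        if (s == it->second.successor.end() || s->second == holder)
            return std::nullopt;  // biased lock: keep the line local
        return s->second;         // preemptively forward the line here
    }

private:
    struct State {
        int lastHolder = -1;
        std::unordered_map<int, int> successor;  // holder -> predicted next
    };
    std::unordered_map<uint64_t, State> table_;
};
```

In this model, the runtime would call `recordAcquire` on each successful acquisition, and the holder (or the coherence hardware) would consult `predictNext` at release time to decide whether to invalidate the locked cache line early and forward it to the predicted requestor.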

    Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    Get PDF
Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular approach to programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum designed HPF with the intent of supporting data parallelism in Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align data structures across the distributed memory modules in the system. Fortran-S takes a different approach, in which data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++, which is based on a concurrent-aggregate parallel model.
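For readers unfamiliar with the concurrent-aggregate model, the sketch below illustrates the core idea in plain C++ threads: a single method invocation on an aggregate executes element-wise in parallel. This is standard C++ for illustration only, not actual pC++ syntax; the `Aggregate` class and `applyParallel` method are invented names.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch of the concurrent-aggregate idea: an operation applied
// to the aggregate runs on every element concurrently. A real runtime would
// map elements onto processors instead of spawning one thread per element.
template <typename T>
class Aggregate {
public:
    explicit Aggregate(std::size_t n) : elements_(n) {}

    // Apply `f` to each element in parallel, then wait for all to finish.
    void applyParallel(const std::function<void(T&)>& f) {
        std::vector<std::thread> workers;
        workers.reserve(elements_.size());
        for (auto& e : elements_)
            workers.emplace_back([&f, &e] { f(e); });
        for (auto& w : workers) w.join();
    }

private:
    std::vector<T> elements_;
};

int main() {
    Aggregate<double> grid(8);
    grid.applyParallel([](double& x) { x = 42.0; });  // element-wise update
}
```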

    Mobile Home Node: Improving Directory Cache Coherence Performance in NoCs via Exploitation of Producer-Consumer Relationships

    Get PDF
The implementation of multiple processors on a single chip has been made possible by advancements in process technology. The benefits of having multiple cores on a single chip bring with them a new set of constraints for maintaining fast and consistent memory accesses. Cache coherence protocols are needed to maintain the consistency of shared memory across individual caches. Current cache coherence protocols are either snoop-based, which is not scalable but provides fast access for a small number of cores, or directory-based, which uses a directory as the ordering point and provides scalability at the cost of relatively slower access. Our focus is on improving the memory access time of the scalable directory protocol. We have observed that most memory requests follow a pattern wherein one processor, which we dub the Producer, repeatedly writes to a particular memory location, while a subset of the remaining cores, which we dub the Consumers, repeatedly read the data from that same location. Our implementation exploits this relationship to provide direct cache-to-cache transfers and minimize access time by avoiding the indirection through the directory: we temporarily move the directory to the Producer node so that a Consumer can request the cache line directly from the Producer. Our technique improves memory access time by 13 percent and reduces network traffic by 30 percent over a standard directory coherence protocol, with very little area overhead.
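A minimal software model of the migration idea, assuming a simple consecutive-writes heuristic to detect the Producer, is sketched below. The class, method names, and threshold are illustrative assumptions; the paper's mechanism operates inside the coherence hardware, whereas this sketch only models which node services a request.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal model of a "mobile home node": once a cache line's writes are
// dominated by one core (the Producer), the directory entry migrates to that
// core, so Consumers request the line from it directly instead of going
// through the home node. Threshold and names are assumptions.
struct DirEntry {
    int homeNode = 0;       // static home node of the line
    int owner = -1;         // core holding the dirty copy
    int consecutiveWrites = 0;
    bool migrated = false;  // directory temporarily lives at `owner`
};

class MobileDirectory {
public:
    // A write from `core`: track producer behavior and migrate the entry
    // once the same core writes repeatedly.
    void onWrite(uint64_t line, int core) {
        auto& e = entries_[line];
        e.consecutiveWrites = (core == e.owner) ? e.consecutiveWrites + 1 : 1;
        e.owner = core;
        if (e.consecutiveWrites >= kMigrateThreshold)
            e.migrated = true;  // directory state now kept at the Producer
    }

    // A read: returns the node the request must travel to. With migration,
    // Consumers go straight to the Producer (two hops instead of the
    // three-hop home-node indirection).
    int onRead(uint64_t line) const {
        auto it = entries_.find(line);
        if (it == entries_.end()) return -1;       // home/memory services it
        const DirEntry& e = it->second;
        return e.migrated ? e.owner : e.homeNode;  // direct vs. indirect path
    }

private:
    static constexpr int kMigrateThreshold = 4;    // assumed heuristic
    std::unordered_map<uint64_t, DirEntry> entries_;
};
```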

    A two-level directory architecture for highly scalable cc-NUMA multiprocessors

    Full text link