Parallelization Strategies for Density Matrix Renormalization Group Algorithms on Shared-Memory Systems
Shared-memory parallelization (SMP) strategies for density matrix
renormalization group (DMRG) algorithms enable the treatment of complex systems
in solid state physics. We present two different approaches by which
parallelization of the standard DMRG algorithm can be accomplished in an
efficient way. The methods are illustrated with DMRG calculations of the
two-dimensional Hubbard model and the one-dimensional Holstein-Hubbard model on
contemporary SMP architectures. The parallelized code shows good scalability up
to at least eight processors and allows us to solve problems which exceed the
capability of sequential DMRG calculations. Comment: 18 pages, 9 figures
Performance Comparison of Various STM Concurrency Control Protocols Using Synchrobench
Writing concurrent programs for shared memory multiprocessor systems is a nightmare. This hinders users from exploiting the full potential of multiprocessors. STM (Software Transactional Memory) is a promising concurrent programming paradigm that addresses the woes of programming for multiprocessor systems.
In this paper, we implement the BTO (Basic Timestamp Ordering), SGT (Serialization Graph Testing) and MVTO (Multi-Version Timestamp Ordering) concurrency control protocols and build an STM library to evaluate their performance. The STM follows the deferred write approach. A SET data structure is implemented using the transactions of our STM library, and this transactional SET is used as a test application to evaluate the STM. The performance of the protocols is rigorously compared against the linked-list module of the Synchrobench benchmark, which implements the SET data structure using lazy-list, lock-free list, lock-coupling list and ESTM (Elastic Software Transactional Memory).
Our analysis shows that, for more than 60 threads and an update rate of 70%, BTO takes 17% to 29% and 6% to 24% less CPU time per thread than the lazy-list and the lock-coupling list, respectively. MVTO takes 13% to 24% and 3% to 24% less CPU time per thread than the lazy-list and the lock-coupling list, respectively. BTO and MVTO have similar per-thread CPU time, and both outperform SGT by 9% to 36%.
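The timestamp-ordering rules behind BTO, combined with the deferred write approach mentioned in the abstract, can be illustrated with a minimal single-threaded Python sketch. It is not the authors' library: the names `Store`, `Txn`, `Item` and `Abort` are hypothetical, concurrency is modelled only by the order of calls, and a real STM would additionally need to synchronise validation and write-back.

```python
import itertools


class Abort(Exception):
    """Raised when an operation would violate timestamp order."""


class Item:
    def __init__(self):
        self.value = None
        self.read_ts = 0   # largest timestamp that has read this item
        self.write_ts = 0  # largest timestamp that has written this item


class Store:
    """Hypothetical shared store that hands out increasing timestamps."""

    def __init__(self):
        self._clock = itertools.count(1)
        self.items = {}

    def begin(self):
        return Txn(self, next(self._clock))

    def item(self, key):
        return self.items.setdefault(key, Item())


class Txn:
    def __init__(self, store, ts):
        self.store = store
        self.ts = ts
        self.write_buffer = {}  # deferred writes, applied only at commit

    def read(self, key):
        if key in self.write_buffer:          # read our own pending write
            return self.write_buffer[key]
        item = self.store.item(key)
        if self.ts < item.write_ts:           # a younger txn already wrote
            raise Abort(f"read of {key!r} arrives too late")
        item.read_ts = max(item.read_ts, self.ts)
        return item.value

    def write(self, key, value):
        self.write_buffer[key] = value        # no shared state touched yet

    def commit(self):
        # Validate every buffered write before applying any of them.
        for key in self.write_buffer:
            item = self.store.item(key)
            if self.ts < item.read_ts or self.ts < item.write_ts:
                raise Abort(f"write of {key!r} arrives too late")
        for key, value in self.write_buffer.items():
            item = self.store.item(key)
            item.value = value
            item.write_ts = self.ts
```

In this sketch an older transaction that tries to read an item already written by a younger one, or to commit a write to an item already read by a younger one, raises `Abort` and would have to restart with a fresh timestamp.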
Persistent Memory Programming Abstractions in Context of Concurrent Applications
The advent of non-volatile memory (NVM) technologies such as PCM, STT,
memristors and Fe-RAM is believed to enhance system performance by narrowing
the gap between memory and storage, doing away with the traditional memory
hierarchy. These technologies are considered to offer DRAM-like performance
and disk-like persistence. They would therefore also provide
significant performance benefits for big data applications by allowing
in-memory processing of large data with the lowest latency to persistence.
Leveraging the performance benefits of this memory-centric computing technology
through traditional memory programming is not trivial, and the challenges
are aggravated for parallel/concurrent applications. To this end, several
programming abstractions have been proposed, such as NVthreads, Mnemosyne and
Intel's NVML. However, choosing a programming abstraction that is easy
to program with, ensures consistency, and balances the various
software and architectural trade-offs remains an open question and an active
area of research for the NVM community.
We study the NVthreads, Mnemosyne and NVML libraries by building concurrent
and persistent set and open-addressed hash-table data structure applications. In
this process, we explore and report various trade-offs and hidden costs involved
in building concurrent applications for persistence in terms of achieving
efficiency, consistency and ease of programming with these NVM programming
abstractions. Finally, we evaluate the performance of the set and hash-table
data structure applications. We observe that NVML is the easiest to program with
but the least efficient, whereas Mnemosyne is the most performance-friendly but
requires significant programming effort to build concurrent and persistent
applications. Comment: Accepted in HiPC SRS 201
Using Lock Servers to Scale Real-Time Locking Protocols: Chasing Ever-Increasing Core Counts
During the past decade, parallelism-related issues have been at the forefront of real-time systems research due to the advent of multicore technologies. In the coming years, such issues will loom ever larger due to increasing core counts. Having more cores means a greater potential exists for platform capacity loss when the available parallelism cannot be fully exploited. In this paper, such capacity loss is considered in the context of real-time locking protocols. In this context, lock nesting becomes a key concern as it can result in transitive blocking chains that unnecessarily force tasks to execute sequentially. Such chains can be quite long on a larger machine. Contention-sensitive real-time locking protocols have been proposed as a means of "breaking" transitive blocking chains, but such protocols tend to have high overhead due to more complicated lock/unlock logic. To reduce such overhead, the usage of lock servers is considered herein. In particular, four specific lock-server paradigms are proposed and many nuances concerning their deployment are explored. Experiments are presented that show that, by executing cache hot, lock servers can enable reductions in lock/unlock overhead of up to 86%. Such reductions make contention-sensitive protocols a viable approach in practice.
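The general idea of a lock server, a dedicated thread that executes all lock/unlock logic on behalf of its clients, can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `LockServer` and its methods are hypothetical, the sketch does not correspond to any of the four paradigms proposed in the paper, and it models neither real-time priorities, lock nesting nor contention sensitivity.

```python
import collections
import queue
import threading


class LockServer:
    """Hypothetical lock server: one thread runs all lock/unlock logic."""

    def __init__(self):
        self._requests = queue.Queue()
        self._held = set()  # resources currently locked
        self._waiters = collections.defaultdict(collections.deque)
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Only this thread touches _held and _waiters, so the lock state
        # itself needs no further synchronisation.
        while True:
            op, resource, done = self._requests.get()
            if op == "lock":
                if resource in self._held:
                    self._waiters[resource].append(done)  # requester blocks
                else:
                    self._held.add(resource)
                    done.set()                            # grant immediately
            elif op == "unlock":
                if self._waiters[resource]:
                    # Hand the lock to the longest-waiting requester (FIFO).
                    self._waiters[resource].popleft().set()
                else:
                    self._held.discard(resource)
                done.set()
            else:  # "stop"
                done.set()
                return

    def _call(self, op, resource=None):
        done = threading.Event()
        self._requests.put((op, resource, done))
        done.wait()  # returns once the server has processed the request

    def lock(self, resource):
        self._call("lock", resource)

    def unlock(self, resource):
        self._call("unlock", resource)

    def stop(self):
        self._call("stop")
```

A client brackets its critical section with `server.lock("r")` and `server.unlock("r")`. Because every request is handled by the same thread, the lock bookkeeping stays local to that thread, which is the intuition behind the cache-hot execution the abstract refers to.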
Sound Static Deadlock Analysis for C/Pthreads (Extended Version)
We present a static deadlock analysis approach for C/pthreads. The design of
our method has been guided by the requirement to analyse real-world code. Our
approach is sound (i.e., misses no deadlocks) for programs that have defined
behaviour according to the C standard, and precise enough to prove
deadlock-freedom for a large number of programs. The method consists of a
pipeline of several analyses that build on a new context- and thread-sensitive
abstract interpretation framework. We further present a lightweight dependency
analysis to identify statements relevant to deadlock analysis and thus speed up
the overall analysis. In our experimental evaluation, we succeeded in proving
deadlock-freedom for 262 programs from the Debian GNU/Linux distribution,
totalling 2.6 MLOC, in less than 11 hours.
Supporting Nested Resources in MrsP
The original MrsP proposal presented a new multiprocessor resource sharing protocol based on the properties and behaviour of the Priority Ceiling Protocol, supported by a novel helping mechanism. While this approach proved to be as simple and elegant as the single processor protocol, the implications with regard to nested resources were identified as requiring further clarification. In this work we present a complete approach to nested resource behaviour and analysis for the MrsP protocol.
Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems
Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of the parallel programming models that aim at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming.
The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error-prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends.
The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM designs, do not always meet the programmer's expectations, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partitions hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines.
The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if" serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, is transparent to the underlying TM implementation, and can be used for a broad set of C/C++ TM programs. Based on this new definition, we propose T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past.
Shared Hash Tables in Parallel Model Checking
In light of the recent shift towards shared-memory systems in parallel explicit model checking, we explore the relative advantages and disadvantages of shared versus private hash tables. Since the use of shared state storage allows for techniques unavailable in distributed memory, these are evaluated, both theoretically and practically, in a prototype implementation. Experimental data is presented to assess the practical utility of those techniques, compared to static partitioning of the state space, which is more traditional in distributed-memory algorithms.
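The contrast between static partitioning and a shared table can be made concrete with a small Python sketch. It is an illustration only, not the paper's prototype: the names `owner`, `PartitionedStore` and `SharedStore` are hypothetical, states are assumed to be byte strings, and a single coarse lock stands in for whatever synchronisation a real shared hash table would use.

```python
import threading
import zlib


def owner(state: bytes, n_workers: int) -> int:
    """Static partitioning: a fixed hash decides which worker owns a state."""
    return zlib.crc32(state) % n_workers


class PartitionedStore:
    """One private table per worker. In a distributed setting, a state
    generated by a non-owner would have to be sent to its owner."""

    def __init__(self, n_workers: int):
        self.tables = [set() for _ in range(n_workers)]

    def insert(self, state: bytes) -> bool:
        table = self.tables[owner(state, len(self.tables))]
        if state in table:
            return False
        table.add(state)
        return True  # state was new


class SharedStore:
    """One table shared by all workers; any worker may insert any state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def insert(self, state: bytes) -> bool:
        with self._lock:
            if state in self._seen:
                return False
            self._seen.add(state)
            return True  # state was new
```

In both variants `insert` returns `True` only the first time a state is seen, which is what an explicit-state search needs to decide whether the state must still be explored.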