Improvements in Hardware Transactional Memory for GPU Architectures
In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and programmers use it to speed up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based on the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key to providing an effective synchronization mechanism at the local memory level.
After a quick description of the main features of our HTM design for GPU local memory, in this work we gather a number of proposals aimed at improving the mechanisms with the highest impact on performance. Firstly, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Secondly, the conflict detection mechanism is optimized depending on application characteristics, such as the read/write sets, the probability of conflict between transactions and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is the task of the compiler and runtime to determine which ones are more important for a given application. This work includes a discussion of the analysis needed to choose the best configuration.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
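As a rough illustration of conflict detection based on read/write sets, the sketch below checks two transactions in software. It is not the hardware mechanism from the abstract: the function name `conflicts` and the representation of a transaction as a pair of Python sets are assumptions made here for clarity only.

```python
def conflicts(tx_a, tx_b):
    """Return True if two transactions conflict.

    Each transaction is a hypothetical (read_set, write_set) pair of
    sets of memory addresses; this layout is an assumption.
    """
    reads_a, writes_a = tx_a
    reads_b, writes_b = tx_b
    # Two read-only transactions can never conflict with each other.
    if not writes_a and not writes_b:
        return False
    # A conflict exists when one transaction writes an address that
    # the other one reads or writes.
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)


read_only_1 = ({0, 1}, set())
read_only_2 = ({1, 2}, set())
writer = ({3}, {1})

print(conflicts(read_only_1, read_only_2))  # both read-only
print(conflicts(read_only_1, writer))       # writer updates address 1
```

The early exit for read-only pairs mirrors the observation in the abstract that read-only transactions are one of the application characteristics a detection scheme can exploit.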
Synchronizing multivariate financial time series
Prices or returns of financial assets are most often collected in local times of the trading markets. The need to synchronize multivariate time series of financial prices or returns is motivated by the fact that information continues to flow for closed markets while others are still open. We propose here a synchronization technique which takes this into account. Besides the nice interpretation of synchronization, the method potentially increases the predictive performance of any reasonable model and is more appropriate for the calculation of portfolio risk measures such as the expected shortfall. We found empirically that this was the case for the CCC-GARCH(1,1) model for a 7-dimensional time series of daily exchange rate returns. Since multivariate analysis is generally important for analyzing time-changing portfolios and for better portfolio predictions (even when portfolio weights are time-constant), synchronization is a valuable technique for a variety of problems with multivariate financial data.
A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures
Irregular computations on unstructured data are an important class of
problems for parallel programming. Graph coloring is often an important
preprocessing step, e.g. as a way to perform dependency analysis for safe
parallel execution. The total run time of a coloring algorithm adds to the
overall parallel overhead of the application whereas the number of colors used
determines the amount of exposed parallelism. A fast and scalable coloring
algorithm using as few colors as possible is vital for the overall parallel
performance and scalability of many irregular applications that depend upon
runtime dependency analysis.
Catalyurek et al. have proposed a graph coloring algorithm which relies on
speculative, local assignment of colors. In this paper we present an improved
version which runs even more optimistically with less thread synchronization
and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We
show that the new technique scales better on multi-core and many-core systems
and performs up to 1.5x faster than its predecessor on graphs with high-degree
vertices, while keeping the number of colors at the same near-optimal levels.
Comment: To appear in the proceedings of Euro Par 201
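The general idea of speculative coloring with conflict resolution can be sketched as follows. This is a sequential simulation written for illustration, not the algorithm of Catalyurek et al. nor the improved version from the paper: the function name `speculative_coloring`, the adjacency-dictionary input, and the rule that the higher-numbered vertex of a conflicting pair is recolored are all assumptions.

```python
def speculative_coloring(adj):
    """Color a graph given as {vertex: set_of_neighbors}.

    Returns {vertex: color}. Vertices must be mutually comparable
    (e.g. integers) so that conflicts can be broken by vertex id.
    """
    color = {}
    pending = sorted(adj)
    while pending:
        # Phase 1 (speculative): each pending vertex picks the smallest
        # color absent from its neighbors in a snapshot of the coloring,
        # imitating threads that do not see each other's new choices.
        snapshot = dict(color)
        for v in pending:
            used = {snapshot[u] for u in adj[v] if u in snapshot}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # Phase 2 (conflict detection): when two adjacent vertices ended
        # up with the same color, the one with the larger id is
        # scheduled for recoloring in the next round.
        pending = [
            v for v in pending
            if any(color[u] == color[v] and u < v for u in adj[v])
        ]
    return color


# Hypothetical example graph: a triangle (0, 1, 2) with a pendant vertex 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(speculative_coloring(graph))
```

Each round removes at least the lowest-numbered pending vertex from the work list, so the loop terminates. The number of rounds needed is one way to picture the conflicts that the abstract says the improved algorithm reduces.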
Static local coordination avoidance for distributed objects
In high-throughput distributed systems, such as large-scale banking infrastructure, synchronization between actors becomes a bottleneck in high-contention scenarios. This results in delays for users and reduces opportunities for scaling such systems. This paper proposes Static Local Coordination Avoidance, which analyzes application invariants at compile time to detect whether messages are independent, so that synchronization at run time is avoided and parallelism is increased. Analysis shows that in industry scenarios up to 60% of operations are independent. Initial performance evaluation shows that, in comparison to a standard 2-phase commit baseline, throughput is increased and latency is reduced. As a result, scalability bottlenecks in high-contention scenarios in distributed actor systems are reduced for independent messages.
STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
Synchronous local stochastic gradient descent (local SGD) suffers from some
workers being idle and random delays due to slow and straggling workers, as it
waits for the workers to complete the same amount of local updates. In this
paper, to mitigate stragglers and improve communication efficiency, a novel
local SGD strategy, named STSyn, is developed. The key point is to wait for the
fastest workers, while keeping all the workers computing continually at
each synchronization round, and making full use of any effective (completed)
local update of each worker regardless of stragglers. An analysis of the
average wall-clock time, average number of local updates and average number of
uploading workers per round is provided to gauge the performance of STSyn. The
convergence of STSyn is also rigorously established even when the objective
function is nonconvex. Experimental results show the superiority of the
proposed STSyn against state-of-the-art schemes through utilization of the
straggler-tolerant technique and additional effective local updates at each
worker, and the influence of system parameters is studied. By waiting for
faster workers and allowing heterogeneous synchronization with different
numbers of local updates across workers, STSyn provides substantial
improvements both in time and communication efficiency.
Comment: 12 pages, 10 figures, submitted for transaction publicatio
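The timing benefit of not waiting for stragglers can be illustrated with a toy model of a single synchronization round. This is a simplified sketch, not the STSyn scheme as specified in the paper: the function name `straggler_tolerant_round`, the parameters `k` (how many of the fastest workers to wait for) and `u` (local updates those workers must finish), and the fixed per-update times are all assumptions.

```python
def straggler_tolerant_round(update_times, k, u):
    """Simulate one round under a toy straggler-tolerant rule.

    update_times: hypothetical time each worker needs per local update.
    k: number of fastest workers the round waits for (assumed parameter).
    u: local updates those k workers must complete (assumed parameter).
    Returns (round_time, completed_updates_per_worker).
    """
    # Time at which each worker would finish u local updates.
    finish_times = sorted(t * u for t in update_times)
    # The round ends as soon as the k-th fastest worker has finished.
    round_time = finish_times[k - 1]
    # All workers compute continually until then, so each one
    # contributes however many full updates fit into the round.
    completed = [int(round_time // t) for t in update_times]
    return round_time, completed


times = [1, 2, 5]  # the third worker is the straggler
print(straggler_tolerant_round(times, k=2, u=4))          # wait for 2 fastest
print(straggler_tolerant_round(times, k=len(times), u=4)) # wait for everyone
```

With these example numbers, waiting for only the two fastest workers ends the round at time 8 instead of 20, and the straggler still contributes its one completed update.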