2,592 research outputs found
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods
Master of Science
thesisTo minimize resource consumption and maximize performance, computer architecture research has been investigating approaches that may compute inaccurate solutions. Such hardware inaccuracies may induce a wide variety of program behaviors which are not obs
Improving Transactional Memory Performance for Irregular Applications
Postprint de autor publicado posteriormente con este DOI:http://dx.doi.org/10.1016/j.procs.2015.05.398Transactional memory (TM) offers optimistic concurrency support in modern multicore archi-
tectures, helping the programmers to extract parallelism in irregular applications when data
dependence information is not available before runtime. In fact, recent research focus on ex-
ploiting thread-level parallelism using TM approaches. However, the proposed techniques are
of general use, valid for any type of application.
This work presents ReduxSTM, a software TM system specially designed to extract maxi-
mum parallelism from irregular applications. Commit management and conflict detection are
tailored to take advantage of both, sequential transaction ordering to assure correct results,
and privatization of reduction patterns, a very frequent memory access pattern in irregular
applications. Both techniques are used to avoid unnecessary transaction aborts.
A function in 300.twolf package from SPEC CPU2000 was taken as a motivating irregular
program. This code was parallelized using ReduxSTM and an ordered version of TinySTM,
a state-of-the-art TM system. Experimental evaluation shows that ReduxTM exploits more
parallelism from the sequential program and obtains better performance than the other system.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Algorithm Diversity for Resilient Systems
Diversity can significantly increase the resilience of systems, by reducing
the prevalence of shared vulnerabilities and making vulnerabilities harder to
exploit. Work on software diversity for security typically creates variants of
a program using low-level code transformations. This paper is the first to
study algorithm diversity for resilience. We first describe how a method based
on high-level invariants and systematic incrementalization can be used to
create algorithm variants. Executing multiple variants in parallel and
comparing their outputs provides greater resilience than executing one variant.
To prevent different parallel schedules from causing variants' behaviors to
diverge, we present a synchronized execution algorithm for DistAlgo, an
extension of Python for high-level, precise, executable specifications of
distributed algorithms. We propose static and dynamic metrics for measuring
diversity. An experimental evaluation of algorithm diversity combined with
implementation-level diversity for several sequential algorithms and
distributed algorithms shows the benefits of algorithm diversity
Runtime-assisted cache coherence deactivation in task parallel programs
With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability.
This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model semantics to not require coherence and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system, and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size our proposal saves 86% of dynamic energy consumption in the directory without harming performance.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Unions Horizon 2020 research
and innovation programme (grant agreements 671697 and 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness
under Ramon y Cajal fellowship number RYC-2016-21104.Peer ReviewedPostprint (author's final draft
Strong Memory Consistency For Parallel Programming
Correctly synchronizing multithreaded programs is challenging, and errors can lead to program failures (e.g., atomicity violations). Existing memory consistency models rule out some possible failures, but are limited by depending on subtle programmer-defined locking code and by providing unintuitive semantics for incorrectly synchronized code. Stronger memory consistency models assist programmers by providing them with easier-to-understand semantics with regard to memory access interleavings in parallel code. This dissertation proposes a new strong memory consistency model based on ordering-free regions (OFRs), which are spans of dynamic instructions between consecutive ordering constructs (e.g. barriers). Atomicity over ordering-free
regions provides stronger atomicity than existing strong memory consistency models with competitive performance. Ordering-free regions also simplify programmer reasoning by limiting the potential for atomicity violations to fewer points in the program’s execution. This dissertation explores both software-only and hardware-supported systems that provide OFR serializability
SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering
URL to conference programIn the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip(NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36 and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application run time reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by- 13 mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power Architecture TM cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19 % of tile power.United States. Defense Advanced Research Projects Agency (DARPA UHPC grant at MIT (Angstrom))Center for Future Architectures ResearchMicroelectronics Advanced Research Corporation (MARCO)Semiconductor Research Corporatio
- …