10 research outputs found

    Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor

    Recent proposals for Chip Multiprocessors (CMPs) advocate speculative, or implicit, threading in which the hardware employs prediction to peel off instruction sequences (i.e., implicit threads) from the sequential execution stream and speculatively executes them in parallel on multiple processor cores. These proposals augment a conventional multiprocessor, which employs explicit threading, with the ability to handle implicit threads. Current proposals focus only on implicitly-threaded code sections. This paper identifies, for the first time, the issues in combining explicit and implicit threading. We present the Multiplex architecture to combine the two threading models. Multiplex exploits the similarities between implicit and explicit threading, and provides unified support for the two threading models without additional hardware. Multiplex groups a subset of protocol states in an implicitly-threaded CMP to provide a write-invalidate protocol for explicit threads. Using a fully-integrated compiler infrastructure for automatic generation of Multiplex code, this paper presents a detailed performance analysis for entire benchmarks, instead of just implicitly-threaded sections, as done in previous papers. We show that neither threading model alone performs consistently better than the other across the benchmarks. A CMP with four dual-issue CPUs achieves a speedup of 1.48 and 2.17 over one dual-issue CPU, using implicit-only and explicit-only threading, respectively. Multiplex matches or outperforms the better of the two threading models for every benchmark, and a four-CPU Multiplex achieves a speedup of 2.63. Our detailed analysis indicates that the dominant overheads in an implicitly-threaded CMP are speculation state overflow due to limited L1 cache capacity, and load imbalance and data dependences in fine-grain threads.
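The core idea of the implicit threading the abstract describes, peeling iterations off a sequential loop, running them optimistically in parallel, and squashing any that read data before an earlier iteration wrote it, can be sketched in a toy software model. This is an illustrative simulation only, not Multiplex's hardware mechanism; all names (`speculative_loop`, the read/write logging) are ours.

```python
# Toy model of thread-level speculation: loop iterations execute
# optimistically against the pre-loop memory state, logging reads and
# writes. At commit time (in program order), an iteration that read a
# location an earlier iteration wrote is squashed and replayed.

def speculative_loop(body, n, memory):
    """Run body(i, read, write) for i in 0..n-1 with speculation.

    memory is a dict {addr: value}; unwritten addresses read as 0.
    Returns the number of squashed (replayed) iterations."""
    logs = []
    for i in range(n):
        reads, writes = set(), {}
        def read(addr):
            reads.add(addr)
            # speculative read: own writes, else stale pre-loop memory
            return writes.get(addr, memory.get(addr, 0))
        def write(addr, val):
            writes[addr] = val
        body(i, read, write)
        logs.append((i, reads, writes))

    committed_writes = set()
    squashes = 0
    for i, reads, writes in logs:
        if reads & committed_writes:
            # dependence violation: replay against up-to-date memory
            squashes += 1
            writes = {}
            def read(addr):
                return memory.get(addr, 0)
            def write(addr, val):
                writes[addr] = val
            body(i, read, write)
        memory.update(writes)
        committed_writes |= set(writes)
    return squashes
```

A fully independent loop commits with zero squashes, while a loop whose iteration i reads iteration i-1's result squashes every iteration after the first, mirroring the fine-grain data-dependence overhead the abstract identifies.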

    Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors

    Speculative parallelization aggressively executes in parallel codes that cannot be fully parallelized by the compiler. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems. In this paper, we present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have designed the scheme in a hierarchical manner that largely abstracts away the internals of the node. We effectively utilize a speculative CMP as the building block for our scheme. Simulations show that the architecture proposed delivers good speedups at a modest hardware cost. For a set of important nonanalyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that support for per-word speculative state is required by our applications, or else the performance suffers greatly.
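The abstract's closing claim, that per-word speculative state is required or performance suffers, comes down to false dependence violations: with per-line tracking, two threads touching different words of the same cache line appear to conflict. A toy conflict counter makes the difference concrete (the function and trace format are ours, not the paper's):

```python
LINE_WORDS = 8  # assumed words per cache line, for illustration

def violations(accesses, per_word=True):
    """Count cross-thread read-after-write conflicts in a trace.

    accesses: list of (thread_id, "r" | "w", word_addr).
    With per_word=False, addresses are coalesced to line granularity,
    so disjoint words on the same line conflict spuriously."""
    def key(addr):
        return addr if per_word else addr // LINE_WORDS
    writers = {}    # tracked location -> last writing thread
    conflicts = 0
    for tid, op, addr in accesses:
        loc = key(addr)
        if op == "w":
            writers[loc] = tid
        elif loc in writers and writers[loc] != tid:
            conflicts += 1  # read of a location another thread wrote
    return conflicts
```

In a trace where thread 0 writes word 0 and thread 1 then reads word 1 (same line, different word), per-word tracking reports no conflict while per-line tracking reports one, which would trigger a needless squash.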

    A pattern language for parallelizing irregular algorithms

    Dissertation presented to the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Master in Engenharia Informática (Computer Engineering). In irregular algorithms, the data set's dependences and distribution cannot be statically predicted. This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control across multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature. This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze, and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms. Patterns capture formal representations of a tangible solution to a problem that arises in a well-defined context within a specific domain. We document these patterns in a pattern language, i.e., a set of inter-dependent patterns that compose well-documented template solutions that can be reused whenever a certain problem arises in a well-known context.
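The pattern at the heart of Galois-style irregular parallelism is the operator formulation: an operator is repeatedly applied to active elements drawn from a worklist, and applying it may activate further elements. The sketch below shows the pattern sequentially on single-source shortest paths; the function name and graph encoding are illustrative, not the Galois API.

```python
# Worklist / operator pattern: process active nodes until none remain.
# A parallel runtime such as Galois would speculatively apply the
# operator to many active nodes at once, rolling back applications
# whose neighborhoods conflict.
from collections import deque

def worklist_sssp(graph, source):
    """Single-source shortest paths via an explicit worklist.

    graph: {node: [(neighbor, edge_weight), ...]}. The operator relaxes
    one node's outgoing edges; any neighbor whose tentative distance
    improves is (re)activated."""
    dist = {source: 0}
    work = deque([source])
    while work:
        u = work.popleft()                 # take an active element
        for v, w in graph.get(u, []):      # operator: relax u's edges
            nd = dist[u] + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                work.append(v)             # activation: v changed
    return dist
```

The dynamic, data-dependent activation is exactly why such algorithms resist static parallelization: which elements are active, and which conflict, is only known at run time.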

    Cache-based Cross-iteration Coherence For Speculative Parallelization

    Maximal utilization of cores in multicore architectures is key to realizing the potential performance available from higher-density devices. In order to achieve scalable performance, parallelization techniques rely on carefully tuning speculative architecture support, the run-time environment, and software-based transformations. Hardware and software mechanisms have already been proposed to address this problem. They either require deep (and risky) changes to the existing hardware and cache coherence protocols, or exhibit poor performance scalability for a range of applications. The addition of cache tags as an enabler for data versioning, recently announced by the industry (i.e., IBM BlueGene/Q), could allow a better exploitation of parallelism at the microarchitecture level. In this paper, we present an execution model that supports both DOPIPE-based speculation and traditional speculative parallelization techniques. It is based on a simple cache-tagging approach for data versioning, which integrates smoothly with typical cache coherence protocols, requiring no changes to them. Experimental results, using SPEC and PARSEC benchmarks, reveal substantial speedups on a 24-core simulated CMP, while demonstrating improved scalability when compared to a software-only approach. © 2013 IEEE.
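Tag-based data versioning of the kind the abstract attributes to BlueGene/Q-style caches can be illustrated in miniature: each speculative write is held under a version tag; committing a tag publishes its lines to architectural state, aborting discards them. The class and method names below are ours, a sketch of the general mechanism rather than the paper's design.

```python
# Minimal model of cache-tag data versioning: speculative writes live
# in per-tag buffers until the owning speculation commits or aborts.

class VersionedCache:
    def __init__(self):
        self.committed = {}   # addr -> value (architectural state)
        self.versions = {}    # tag  -> {addr: value} speculative state

    def write(self, tag, addr, value):
        # buffer the write under this speculation's version tag
        self.versions.setdefault(tag, {})[addr] = value

    def read(self, tag, addr):
        # a speculative thread sees its own version first, then memory
        return self.versions.get(tag, {}).get(addr,
                                              self.committed.get(addr))

    def commit(self, tag):
        # publish this tag's writes to architectural state
        self.committed.update(self.versions.pop(tag, {}))

    def abort(self, tag):
        # discard this tag's writes; architectural state is untouched
        self.versions.pop(tag, None)
```

Because versions are isolated per tag, the scheme coexists with an ordinary write-invalidate coherence protocol: only committed state participates in coherence, which is what lets the paper's approach avoid protocol changes.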