Search CORE

1,448 research outputs found

Thread partitioning and value prediction for exploiting speculative thread-level parallelism

Author: González Colás Antonio María
Marcuello Pedro
Tubella Murgadas Jordi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2004
Field of study

Speculative thread-level parallelism has been recently proposed as a source of parallelism to improve the performance in applications where parallel threads are hard to find. However, the efficiency of this execution model strongly depends on the performance of the control and data speculation techniques. Several hardware-based schemes for partitioning the program into speculative threads are analyzed and evaluated. In general, we find that spawning threads associated to loop iterations is the most effective technique. We also show that value prediction is critical for the performance of all of the spawning policies. Thus, a new value predictor, the increment predictor, is proposed. This predictor is specially oriented for this kind of architecture and clearly outperforms the adapted versions of conventional value predictors such as the last value, the stride, and the context-based, especially for small-sized history tables.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

The "MIND" Scalable PIM Architecture

Author: Brodowicz Maciej
Sterling Thomas
Publication venue
Publication date: 01/01/2005
Field of study

MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

Caltech Authors

Analysis of network-on-chip topologies for cost-efficient chip multiprocessors

Author: Izu C.
Ortín-Obón M.
Suárez-Gracia D.
Villarroya-Gaudó M.
Viñals-Yúfera V.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

Abstract not availableMarta Ortín-Obón, Darío Suárez-Gracia, María Villarroya-Gaudó, Cruz Izu, Víctor Viñals-Yúfer

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Adelaide Research & Scholarship

Repositorio Universidad de Zaragoza

POSTER: Exploiting asymmetric multi-core processors with flexible system sofware

Author: Ayguadé Parra Eduard
Badia Sala Rosa Maria
Casas Guix Marc
Chronaki Kallia
Labarta Mancho Jesús José
Moreto Planas Miquel
Rico Alejandro
Valero Cortés Mateo
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016
Field of study

Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions on either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place in the runtime level, influencing the future research towards such solutions.This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Shared memory with hidden latency on a family of mesh-like networks

Author: Harris Tim J.
Publication venue: The University of Edinburgh
Publication date: 01/01/1995
Field of study

Edinburgh Research Archive

Distributed-Memory Breadth-First Search on Massive Graphs

Author: Asanovic Krste
Beamer Scott
Buluc Aydin
Madduri Kamesh
Patterson David
Publication venue
Publication date: 01/01/2017
Field of study

This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451

arXiv.org e-Print Archive

CiteSeerX

eScholarship - University of California

NCESPARC+: an implementation of a SPARC architecture with hardware support to multithreading for the multiplus multiprocessor

Author: Aude Júlio Salek
Barbosa M. A. S.
João Junior M.
Martins F. R. S.
Pinto S. B.
Young M.T.
Publication venue: Instituto Tércio Pacitti de Aplicações e Pesquisas Computacionais
Publication date: 31/12/1999
Field of study

NCESP ARC + is an implementation of the SP ARC v: 8 architecture with hardware support to a variable number of thread contexts, which is under development for use within the framework of the Multiplus distributed shared-memory multiprocessor. It is expected to provide an efficient and automatic mechanism to hide the latency of busy-waiting synchronization loops, cachecoherence protocol and remote memory access operations within the Multiplus multiprocessor. NCESPARC + performs context-switching in at most four processor cycles whenever there is an instruction cache miss, a data dependency in relation to the destination operand of a pending load instruction or a busy-waiting synchronization loop. It has a decoupled architecture which allows the main pipeline to process instructions from a given context while the Memory Interface Unit performs memory access operations related to that same context or to any other context. Results of simulation experiments show the impact of some architectural parameters on the NCESPARC + processor performance and demonstrate that the use of multiple thread contexts can e.ffectively produce a much better utilization of the processor when long latency operations are performed In addition, NCESPARC + processor performance with a single context is superior to that of a standard implementation of the SPARC architecture due to its decoupled architecture

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Pantheon

Bringing Real Processorsto Labs

Author: Gómez Requena Crispín
Gómez Requena María Engracia
Sahuquillo Borrás Julio
Publication venue: 'Wiley'
Publication date: 01/09/2015
Field of study

This is the accepted version of the following article: Gómez, C., Gómez, M. E. and Sahuquillo, J. (2015), Bringing real processors to labs. Comput Appl Eng Educ, 23: 724–732. , which has been published in final form at http://dx.doi.org/10.1002/cae.21645The architecture of current processors has experienced great changes in the last years, leading to sophisticated multithreaded multicore processors. The inherent complexity of such processors makes difficult to update processor teaching to include current commercial products, especially at lab sessions where simplistic simulators are usually used. However, instructors are forced to reduce this gap if they want to properly prepare students in this topic. Dealing with these complex concepts at labs does not only help reinforce theoretical concepts but also has a positive effect in the students motivation. This article presents amethodology designed for the study of current microprocessor mechanisms in a gradual way without overwhelming students. The methodology is based on the use of a detailed simulation framework, used both in the academia and in the industry, which accurately models features from current processors. Due to the huge simulator complexity, it is introduced through several learning phases. Qualitative and quantitative results demonstrate that students are able to develop skills in a detailed simulator in a reasonable time period and, at the same time they learn the details of complex architectural mechanisms of commercial microprocessors.Contract grant sponsor: Spanish Government; Contract grant number: TIN2012-38341-C04-01Gómez Requena, C.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2015). Bringing Real Processorsto Labs. Computer Applications in Engineering Education. 23(5):724-732. https://doi.org/10.1002/cae.21645S724732235D. Sanchez C. Kozyrakis ZSim: Fast and accurate microarchitectural simulation of thousand-core systems 2013 475 486U. Rafael J. Sahuquillo S. Petit P. Lopez Multi2Sim: A simulation framework to evaluate multicore-multithreaded processors 2007 62 68Aziz, S. M., Sicard, E., & Ben Dhia, S. (2010). Effective Teaching of the Physical Design of Integrated Circuits Using Educational Tools. IEEE Transactions on Education, 53(4), 517-531. doi:10.1109/te.2009.2031842Dexter, S. L., Anderson, R. E., & Becker, H. J. (1999). Teachers’ Views of Computers as Catalysts for Changes in Their Teaching Practice. Journal of Research on Computing in Education, 31(3), 221-239. doi:10.1080/08886504.1999.10782252Austin, T., Larson, E., & Ernst, D. (2002). SimpleScalar: an infrastructure for computer system modeling. Computer, 35(2), 59-67. doi:10.1109/2.982917T. E. Carlson W. Heirman L. Eeckhout Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation 2011 52http://www.multi2sim.orgS. Woo M. Ohara E. Torrie J. Singh A. Gupta The Splash-2 programs: Characterization and methodological considerations 1995 24 3

RiuNet

EM2: A Scalable Shared-Memory Multicore Architecture

Author: Devadas Srini
Khan Omer
Lis Mieszko
Publication venue
Publication date: 01/01/2010
Field of study

We introduce the Execution Migration Machine (EM2), a novel, scalable shared-memory architecture for large-scale multicores constrained by off-chip memory bandwidth. EM2 reduces cache miss rates, and consequently off-chip memory usage, by permitting only one copy of data to be stored anywhere in the system: when a thread wishes to access an address not locally cached on the core it is executing on, it migrates to the appropriate core and continues execution. Using detailed simulations of a range of 256-core configurations on the SPLASH-2 benchmark suite, we show that EM2 improves application completion times by 18% on the average while remaining competitive with traditional architectures in silicon area

CiteSeerX

DSpace@MIT