Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.

Comment: Transactions on Architecture and Code Optimization (2014).
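To make the notion concrete, here is a minimal sketch (my own illustration, not code from the article) of reuse distance computation and its use to estimate fully associative LRU misses; production tools use tree-based structures instead of this quadratic list scan.

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each access: the number of *distinct* addresses
    touched since the previous access to the same address (inf on first use)."""
    stack = OrderedDict()                # LRU stack; most recently used last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # distinct addresses accessed since the last use of addr
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(float("inf"))
            stack[addr] = None
    return dists

def lru_misses(trace, cache_size):
    """Miss count in a fully associative LRU cache of cache_size lines:
    an access misses iff its reuse distance is >= cache_size."""
    return sum(d >= cache_size for d in reuse_distances(trace))
```

For the trace a, b, a, c, b a 2-line cache misses 4 times (only the second access to "a" hits), while a 3-line cache sees only the 3 cold misses, which is the sense in which one pass over the distances characterizes every cache size at once.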
Architectural and Compiler Mechanisms for Accelerating Single Thread Applications on Multicore Processors.
Multicore systems have become the dominant mainstream computing platform. One of the biggest challenges going forward is how to efficiently utilize the ever-increasing computational power provided by multicore systems. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores. However, single-thread applications realize little to no gain from multicore systems.
This work investigates architectural and compiler mechanisms to automatically accelerate single thread applications on multicore processors by efficiently exploiting three types of parallelism across multiple cores: instruction level parallelism (ILP), fine-grain thread level parallelism (TLP), and speculative loop level parallelism (LLP).
A multicore architecture called Voltron is proposed to exploit different types of parallelism. Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, several in-order cores are coalesced to emulate a wide-issue VLIW processor. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. By executing fine-grain threads in parallel, Voltron provides coarse-grained out-of-order execution capability using in-order cores. Architectural mechanisms for speculative execution of loop iterations are also supported under the decoupled mode. Voltron can dynamically switch between two modes with low overhead to exploit the best form of available parallelism.
This dissertation also investigates compiler techniques to exploit different types of parallelism on the proposed architecture. First, this work proposes compiler techniques to manage multiple instruction streams to collectively function as a single logical stream on a conventional VLIW to exploit ILP. Second, this work studies compiler algorithms to extract fine-grain threads. Third, this dissertation proposes a series of systematic compiler transformations and a general code generation framework to expose hidden speculative LLP hindered by register and memory dependences in the code. These transformations collectively remove inter-iteration dependences that are caused by subsets of isolatable instructions, are unwindable, or occur infrequently.
Experimental results show that the proposed mechanisms can achieve speedups of 1.33 and 1.14 on 4-core machines by exploiting ILP and TLP respectively. The proposed transformations increase the DOALL loop coverage in applications from 27% to 61%, resulting in a speedup of 1.84 on 4-core systems.

Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/58419/1/hongtaoz_1.pd
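As a minimal sketch of one such dependence-removing transformation (the running-sum example, names, and worker count are illustrative assumptions, not taken from the dissertation): a loop-carried dependence on an accumulator is removed by privatizing a partial sum per chunk and combining the partials afterwards, which turns the loop into a DOALL. In CPython the threads will not run the arithmetic concurrently because of the GIL; the point here is the structure of the transformation, not an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(a):
    s = 0
    for x in a:          # s += x is a loop-carried dependence: not DOALL
        s += x
    return s

def doall_sum(a, workers=4):
    """Reduction transformation: each worker owns a private partial sum
    over its chunk, so the chunks are independent (DOALL); the partials
    are combined in a final sequential step."""
    chunk = max(1, (len(a) + workers - 1) // workers)
    parts = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, parts))   # independent per-chunk sums
    return sum(partials)                        # sequential combine step
```

The combine step is short and sequential, so the transformation pays off whenever the per-chunk work dominates, which is the usual profitability condition for reductions.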
Exploiting tightly-coupled cores
This is the published manuscript. It was first published by Springer in the Journal of Signal Processing Systems here: http://link.springer.com/article/10.1007%2Fs11265-014-0944-6.

The individual processors of a chip-multiprocessor
traditionally have rigid boundaries. Inter-core communication is
only possible via memory and control over a core’s resources is
localised. Specialisation necessary to meet today’s challenging
energy targets is typically provided through the provision of
a range of processor types and accelerators. An alternative
approach is to permit specialisation by tailoring the way a large
number of homogeneous cores are used. The approach here
is to relax processor boundaries, create a richer mix of inter-core
communication mechanisms and provide finer-grain control
over, and access to, the resources of each core. We evaluate one
such design, called Loki, that aims to support specialisation in
software on a homogeneous many-core architecture. We focus
on the design of a single 8-core tile, conceived as the building
block for a larger many-core system. We explore the tile’s ability
to support a range of parallelisation opportunities and detail
the control and communication mechanisms needed to exploit
each core’s resources in a flexible manner. Performance and a detailed breakdown of energy usage are provided for a range of benchmarks and configurations.

This work was supported by EPSRC grant EP/G033110/1.
Does dynamic and speculative parallelization enable advanced parallelizing and optimizing code transformations?
Thread-Level Speculation (TLS) is a dynamic and automatic parallelization strategy for handling codes that cannot be parallelized at compile time because the source code does not expose sufficient information. However, existing TLS systems are strongly limited in the kinds of parallelization they can apply to the original sequential code, and consequently they often yield poor performance. In this paper, we give the main reasons for these limitations and show that in some cases a TLS system can handle more advanced parallelizing transformations. In particular, we show that codes characterized by phases whose memory behavior can be modeled by linear functions can take advantage of a dynamic use of the polytope model.
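The "linear memory behavior" idea can be sketched as follows (the function names and fitting scheme are illustrative assumptions, not the paper's algorithm): the runtime samples the addresses touched by the first few iterations of a phase, tries to fit addr(i) = base + stride * i, and, if the fit holds, uses the model to predict the accesses of future iterations so the phase can be parallelized speculatively and validated afterwards.

```python
def fit_linear(addrs):
    """Fit sampled per-iteration addresses to addr(i) = base + stride * i.
    Returns (base, stride) if the sample is exactly linear, else None."""
    if len(addrs) < 2:
        return None
    base, stride = addrs[0], addrs[1] - addrs[0]
    if all(a == base + stride * i for i, a in enumerate(addrs)):
        return (base, stride)
    return None

def predict(model, i):
    """Address the model predicts for iteration i; speculation must still
    validate the prediction against the actual access at runtime."""
    base, stride = model
    return base + stride * i
```

Once accesses are affine functions of the iteration counter like this, polyhedral dependence tests and loop transformations become applicable at runtime, which is the opportunity the paper identifies.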
Encoding & Characterization of process models for Deep Predictive Process Monitoring.
Ever-increasing digitalization of all aspects of life is changing how most human tasks are carried out and is producing a huge wealth of information, in the form of data logs, that can be leveraged to further improve the quality of those executions.
One way of leveraging such information is to predict how the execution of such tasks will unfold until completion, so as to support managers in determining, for example, whether to intervene to prevent undesired process outcomes or how best to allocate resources. This thesis proposes an approach that uses information about the parallelism among activities for Predictive Process Monitoring tasks, representing process executions with their corresponding Instance Graphs and processing them with deep graph convolutional neural networks. In addition, to delimit the scope in which such an approach works best, a novel metric is devised that effectively measures the parallelism in a business process model. Lastly, a set of metrics is defined that describes the execution context of an activity inside a process in order to represent the activity itself. This is used both to define a querying mechanism for activities in processes and to introduce the notion of "location" as a further prediction target for Predictive Process Monitoring techniques. The proposed techniques have been experimentally evaluated on several real-world datasets, and the results are promising.
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25×, with a maximum of 4.12×, for thirteen C benchmarks from SPEC CPU2000.

Engineering and Applied Science
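The iteration-to-thread assignment can be sketched as a toy (my own illustration of the scheduling idea, not the HELIX implementation; Python events stand in for the synchronization signals that HELIX's helper threads prefetch): iterations go round-robin to worker threads, each iteration's independent portion runs concurrently, and the loop-carried segment executes in iteration order, gated by per-iteration signals.

```python
import threading

def helix_style_loop(n_iters, n_threads, parallel_part, sequential_part):
    """Assign iteration i to thread i % n_threads; serialize the
    loop-carried segment with per-iteration events ('signals')."""
    done = [threading.Event() for _ in range(n_iters + 1)]
    done[0].set()                                  # iteration 0 may proceed

    def worker(tid):
        for i in range(tid, n_iters, n_threads):   # cyclic assignment
            parallel_part(i)                       # independent work, concurrent
            done[i].wait()                         # signal from iteration i - 1
            sequential_part(i)                     # loop-carried segment, in order
            done[i + 1].set()                      # signal iteration i + 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With `sequential_part` appending i to a shared list, the list comes out in iteration order regardless of thread interleaving, which is exactly the property the signaling must preserve; the parallelism gained is whatever fraction of each iteration lies in `parallel_part`.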