Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.

Comment: Transactions on Architecture and Code Optimization (2014).
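To make the notion concrete, here is a minimal sketch (my own illustration, not code from the article) of reuse distance computation and its use to estimate fully associative LRU misses; production tools use tree-based structures instead of this quadratic list scan.

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each access: the number of *distinct* addresses
    touched since the previous access to the same address (inf on first use)."""
    stack = OrderedDict()                # LRU stack; most recently used last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # distinct addresses accessed since the last use of addr
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(float("inf"))
            stack[addr] = None
    return dists

def lru_misses(trace, cache_size):
    """Miss count in a fully associative LRU cache of cache_size lines:
    an access misses iff its reuse distance is >= cache_size."""
    return sum(d >= cache_size for d in reuse_distances(trace))
```

For the trace a, b, a, c, b a 2-line cache misses 4 times (only the second access to "a" hits), while a 3-line cache sees only the 3 cold misses, which is the sense in which one pass over the distances characterizes every cache size at once.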
Architectural and Compiler Mechanisms for Accelerating Single Thread Applications on Multicore Processors.
Multicore systems have become the dominant mainstream computing platform. One of the biggest challenges going forward is how to efficiently utilize the ever-increasing computational power provided by multicore systems. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores. However, single-thread applications realize little to no gain from multicore systems.
This work investigates architectural and compiler mechanisms to automatically accelerate single thread applications on multicore processors by efficiently exploiting three types of parallelism across multiple cores: instruction level parallelism (ILP), fine-grain thread level parallelism (TLP), and speculative loop level parallelism (LLP).
A multicore architecture called Voltron is proposed to exploit different types of parallelism. Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, several in-order cores are coalesced to emulate a wide-issue VLIW processor. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. By executing fine-grain threads in parallel, Voltron provides coarse-grained out-of-order execution capability using in-order cores. Architectural mechanisms for speculative execution of loop iterations are also supported under the decoupled mode. Voltron can dynamically switch between two modes with low overhead to exploit the best form of available parallelism.
This dissertation also investigates compiler techniques to exploit different types of parallelism on the proposed architecture. First, this work proposes compiler techniques to manage multiple instruction streams to collectively function as a single logical stream on a conventional VLIW to exploit ILP. Second, this work studies compiler algorithms to extract fine-grain threads. Third, this dissertation proposes a series of systematic compiler transformations and a general code generation framework to expose hidden speculative LLP hindered by register and memory dependences in the code. These transformations collectively remove inter-iteration dependences that are caused by subsets of isolatable instructions, are unwindable, or occur infrequently.
Experimental results show that the proposed mechanisms can achieve speedups of 1.33 and 1.14 on 4-core machines by exploiting ILP and TLP respectively. The proposed transformations increase the DOALL loop coverage in applications from 27% to 61%, resulting in a speedup of 1.84 on 4-core systems.

Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/58419/1/hongtaoz_1.pd
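As a minimal sketch of one such dependence-removing transformation (the running-sum example, names, and worker count are illustrative assumptions, not taken from the dissertation): a loop-carried dependence on an accumulator is removed by privatizing a partial sum per chunk and combining the partials afterwards, which turns the loop into a DOALL. In CPython the threads will not run the arithmetic concurrently because of the GIL; the point here is the structure of the transformation, not an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(a):
    s = 0
    for x in a:          # s += x is a loop-carried dependence: not DOALL
        s += x
    return s

def doall_sum(a, workers=4):
    """Reduction transformation: each worker owns a private partial sum
    over its chunk, so the chunks are independent (DOALL); the partials
    are combined in a final sequential step."""
    chunk = max(1, (len(a) + workers - 1) // workers)
    parts = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, parts))   # independent per-chunk sums
    return sum(partials)                        # sequential combine step
```

The combine step is short and sequential, so the transformation pays off whenever the per-chunk work dominates, which is the usual profitability condition for reductions.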
Exploiting tightly-coupled cores
This is the published manuscript. It was first published by Springer in the Journal of Signal Processing Systems here: http://link.springer.com/article/10.1007%2Fs11265-014-0944-6.

The individual processors of a chip-multiprocessor
traditionally have rigid boundaries. Inter-core communication is
only possible via memory and control over a core’s resources is
localised. Specialisation necessary to meet today’s challenging
energy targets is typically provided through the provision of
a range of processor types and accelerators. An alternative
approach is to permit specialisation by tailoring the way a large
number of homogeneous cores are used. The approach here
is to relax processor boundaries, create a richer mix of inter-core
communication mechanisms and provide finer-grain control
over, and access to, the resources of each core. We evaluate one
such design, called Loki, that aims to support specialisation in
software on a homogeneous many-core architecture. We focus
on the design of a single 8-core tile, conceived as the building
block for a larger many-core system. We explore the tile’s ability
to support a range of parallelisation opportunities and detail
the control and communication mechanisms needed to exploit
each core’s resources in a flexible manner. Performance and a detailed breakdown of energy usage are provided for a range of benchmarks and configurations.

This work was supported by EPSRC grant EP/G033110/1.
Does dynamic and speculative parallelization enable advanced parallelizing and optimizing code transformations?
Thread-Level Speculation (TLS) is a dynamic and automatic parallelization strategy for handling codes that cannot be parallelized at compile time because the source code does not expose sufficient information. However, existing TLS systems are strongly limited in the kinds of parallelization they can apply to the original sequential code, and consequently they often yield poor performance. In this paper, we give the main reasons for these limitations and show that in some cases a TLS system can handle more advanced parallelizing transformations. In particular, we show that codes characterized by phases whose memory behavior can be modeled by linear functions can take advantage of a dynamic use of the polytope model.
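The "linear memory behavior" idea can be sketched as follows (the function names and fitting scheme are illustrative assumptions, not the paper's algorithm): the runtime samples the addresses touched by the first few iterations of a phase, tries to fit addr(i) = base + stride * i, and, if the fit holds, uses the model to predict the accesses of future iterations so the phase can be parallelized speculatively and validated afterwards.

```python
def fit_linear(addrs):
    """Fit sampled per-iteration addresses to addr(i) = base + stride * i.
    Returns (base, stride) if the sample is exactly linear, else None."""
    if len(addrs) < 2:
        return None
    base, stride = addrs[0], addrs[1] - addrs[0]
    if all(a == base + stride * i for i, a in enumerate(addrs)):
        return (base, stride)
    return None

def predict(model, i):
    """Address the model predicts for iteration i; speculation must still
    validate the prediction against the actual access at runtime."""
    base, stride = model
    return base + stride * i
```

Once accesses are affine functions of the iteration counter like this, polyhedral dependence tests and loop transformations become applicable at runtime, which is the opportunity the paper identifies.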
Encoding & Characterization of process models for Deep Predictive Process Monitoring.
Ever-increasing digitalization of all aspects of life is changing how most human tasks are carried out and is producing a huge wealth of information, in the form of data logs, that can be leveraged to further improve the quality of those executions.
One way of leveraging such information is to predict how the execution of such tasks will unfold until completion, so as to support managers in determining, for example, whether to intervene to prevent undesired process outcomes or how best to allocate resources. This thesis proposes an approach that uses information about the parallelism among activities for Predictive Process Monitoring tasks, representing process executions with their corresponding Instance Graphs and processing them with deep graph convolutional neural networks. In addition, to delimit the scope in which such an approach works best, a novel metric is devised that effectively measures the parallelism in a business process model. Lastly, a set of metrics is defined that describes the execution context of an activity inside a process in order to represent the activity itself. This is used both to define a querying mechanism for activities in processes and to introduce the notion of "location" as a further prediction target for Predictive Process Monitoring techniques. The proposed techniques have been experimentally evaluated on several real-world datasets, and the results are promising.
HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25×, with a maximum of 4.12×, for thirteen C benchmarks from SPEC CPU2000.

Engineering and Applied Science
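The iteration-to-thread assignment can be sketched as a toy (my own illustration of the scheduling idea, not the HELIX implementation; Python events stand in for the synchronization signals that HELIX's helper threads prefetch): iterations go round-robin to worker threads, each iteration's independent portion runs concurrently, and the loop-carried segment executes in iteration order, gated by per-iteration signals.

```python
import threading

def helix_style_loop(n_iters, n_threads, parallel_part, sequential_part):
    """Assign iteration i to thread i % n_threads; serialize the
    loop-carried segment with per-iteration events ('signals')."""
    done = [threading.Event() for _ in range(n_iters + 1)]
    done[0].set()                                  # iteration 0 may proceed

    def worker(tid):
        for i in range(tid, n_iters, n_threads):   # cyclic assignment
            parallel_part(i)                       # independent work, concurrent
            done[i].wait()                         # signal from iteration i - 1
            sequential_part(i)                     # loop-carried segment, in order
            done[i + 1].set()                      # signal iteration i + 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With `sequential_part` appending i to a shared list, the list comes out in iteration order regardless of thread interleaving, which is exactly the property the signaling must preserve; the parallelism gained is whatever fraction of each iteration lies in `parallel_part`.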