10,249 research outputs found
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.Comment: Transaction on Architecture and Code Optimization (2014
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451
Jointly Optimal Routing and Caching for Arbitrary Network Topologies
We study a problem of fundamental importance to ICNs, namely, minimizing
routing costs by jointly optimizing caching and routing decisions over an
arbitrary network topology. We consider both source routing and hop-by-hop
routing settings. The respective offline problems are NP-hard. Nevertheless, we
show that there exist polynomial time approximation algorithms producing
solutions within a constant approximation from the optimal. We also produce
distributed, adaptive algorithms with the same approximation guarantees. We
simulate our adaptive algorithms over a broad array of different topologies.
Our algorithms reduce routing costs by several orders of magnitude compared to
prior art, including algorithms optimizing caching under fixed routing.Comment: This is the extended version of the paper "Jointly Optimal Routing
and Caching for Arbitrary Network Topologies", appearing in the 4th ACM
Conference on Information-Centric Networking (ICN 2017), Berlin, Sep. 26-28,
201
GPU acceleration of brain image proccessing
Durante los últimos años se ha venido demostrando el alto poder computacional
que ofrecen las GPUs a la hora de resolver determinados problemas.
Al mismo tiempo, existen campos en los que no es posible beneficiarse completamente
de las mejoras conseguidas por los investigadores, debido principalmente
a que los tiempos de ejecución de las aplicaciones llegan a ser extremadamente
largos. Este es por ejemplo el caso del registro de imágenes en medicina.
A pesar de que se han conseguido aceleraciones sobre el registro de imágenes,
su uso en la práctica clÃnica es aún limitado. Entre otras cosas, esto se debe
al rendimiento conseguido.
Por lo tanto se plantea como objetivo de este proyecto, conseguir mejorar los
tiempos de ejecución de una aplicación dedicada al resgitro de imágenes en medicina,
con el fin de ayudar a aliviar este problema
Software trace cache
We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version
- …