Near-optimal loop tiling by means of cache miss equations and genetic algorithms
The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, a code transformation aimed at reducing capacity misses. This paper presents a novel systematic approach to near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful search of the solution space based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel. Peer reviewed. Postprint (published version).
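As a rough illustration of the transformation the abstract refers to, the sketch below tiles a matrix multiply so that each block of the operands is reused before the loops move on. The function names and the `tile` parameter are hypothetical; choosing the tile size is exactly the problem the paper addresses with cache miss equations and a genetic algorithm, and that search is not reproduced here. In pure Python the sketch shows only the changed iteration order, not a measurable cache effect.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops blocked by `tile` (assumed value)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile,
    # so the working set between reuses is bounded by the tile size.
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions return the same matrix for any tile size of at least 1; only the order in which elements are visited differs.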
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
Nodal domains of the equilateral triangle billiard
We characterise the eigenfunctions of an equilateral triangle billiard in
terms of its nodal domains. The number of nodal domains has a quadratic form in
terms of the quantum numbers, with a non-trivial number-theoretic factor. The
patterns of the eigenfunctions follow a group-theoretic connection in a way
that makes them predictable as one goes from one state to another. Extensive
numerical investigations bring out the distribution functions of the mode
number and signed areas. The statistics of the boundary intersections is also
treated analytically. Finally, the distribution functions of the nodal loop
count and the nodal counting function are shown to contain information about
the classical periodic orbits using the semiclassical trace formula. We believe
that the results belong generically to non-separable systems, thus extending
the previous works, which are concentrated on separable and chaotic systems. Comment: 26 pages, 13 figures
Refactoring intermediately executed code to reduce cache capacity misses
The growing memory wall requires that more attention be given to the data cache behavior of programs. In this paper, attention is given to capacity misses, i.e. the misses that occur because the cache size is smaller than the data footprint between the use and the reuse of the same data. The data footprint is measured with the reuse distance metric, by counting the distinct memory locations accessed between use and reuse. For reuse distances larger than the cache size, the associated code needs to be refactored in a way that reduces the reuse distance to below the cache size so that the capacity misses are eliminated. In a number of simple loops, the reuse distance can be calculated analytically. However, in most cases profiling is needed to pinpoint the areas where the program needs to be transformed for better data locality. This is achieved by the reuse distance visualizer, RDVIS, which shows the intermediately executed code for critical data reuses. In addition, another tool, SLO, annotates the source program with suggestions for locality optimization. Both tools have been used to analyze and to refactor a number of SPEC2000 benchmark programs with very positive results.
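The reuse distance metric described above can be sketched on a small address trace. This is a minimal illustration, not the RDVIS or SLO implementation: the function names are hypothetical, the trace is a plain list of location labels, and the miss rule assumes a fully associative LRU cache measured in distinct locations.

```python
def reuse_distances(trace):
    """For each access, count the distinct locations touched since the
    previous access to the same location (None for a first use)."""
    last_seen = {}
    distances = []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:pos]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = pos
    return distances


def capacity_misses(trace, cache_lines):
    """Count reuses whose distance does not fit in the cache.
    Assumption: fully associative LRU, so a reuse misses when at least
    `cache_lines` distinct other locations were accessed in between."""
    return sum(
        1 for d in reuse_distances(trace)
        if d is not None and d >= cache_lines
    )
```

For the trace `a b c a b b`, the reuse of `a` has distance 2 (`b` and `c` lie in between), so it misses in a 2-line cache under the stated assumption and hits in a 3-line cache. This direct scan is quadratic in the trace length and is meant only to make the definition concrete.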
Domino tilings and the six-vertex model at its free fermion point
At the free-fermion point, the six-vertex model with domain wall boundary
conditions (DWBC) can be related to the Aztec diamond, a domino tiling problem.
We study the mapping on the level of complete statistics for general domains
and boundary conditions. This is obtained by associating to both models a set
of non-intersecting lines in the Lindström-Gessel-Viennot (LGV) scheme. One of
the consequences for DWBC is that the boundaries of the ordered phases are
described by the Airy process in the thermodynamic limit. Comment: 14 pages, 8 figures
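For readers unfamiliar with the Aztec diamond mentioned above, the following sketch builds the region for a small order n and counts its domino tilings by exhaustive search. It does not implement the LGV construction or the mapping to the six-vertex model; the function names are hypothetical, and the search is practical only for very small n.

```python
from functools import lru_cache


def aztec_cells(n):
    """Unit cells of the Aztec diamond of order n on a 2n x 2n grid.
    A cell is kept when its centre (x, y) satisfies |x| + |y| <= n;
    coordinates are doubled to stay in integer arithmetic."""
    return frozenset(
        (r, c)
        for r in range(2 * n)
        for c in range(2 * n)
        if abs(2 * r - 2 * n + 1) + abs(2 * c - 2 * n + 1) <= 2 * n
    )


def count_domino_tilings(cells):
    """Count domino tilings of a set of cells by backtracking."""

    @lru_cache(maxsize=None)
    def go(remaining):
        if not remaining:
            return 1
        # The first cell in row-major order can only pair right or down.
        r, c = min(remaining)
        total = 0
        for other in ((r, c + 1), (r + 1, c)):
            if other in remaining:
                total += go(remaining - {(r, c), other})
        return total

    return go(frozenset(cells))
```

For orders 1, 2 and 3 the count agrees with the known closed form 2^(n(n+1)/2), giving 2, 8 and 64 tilings.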
Open boundary Quantum Knizhnik-Zamolodchikov equation and the weighted enumeration of Plane Partitions with symmetries
We propose new conjectures relating sum rules for the polynomial solution of
the qKZ equation with open (reflecting) boundaries as a function of the quantum
parameter and the -enumeration of Plane Partitions with specific
symmetries, with . We also find a conjectural relation à la
Razumov-Stroganov between the limit of the qKZ solution and refined
numbers of Totally Symmetric Self Complementary Plane Partitions. Comment: 27 pages, uses lanlmac, epsf and hyperbasics, minor revision