Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms
The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates to kernel running time in real
hardware, and that our evolutionary strategy allows us to find candidates with
favorable simulated cache properties in four out of the eight real-world
applications under consideration in a small number of generations. Finally, we
demonstrate that the array layouts found using our evolutionary method perform
well not only in simulated environments but that they can effect significant
performance gains -- up to a factor of ten in extreme cases -- in real hardware.
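The family of layouts the abstract describes can be illustrated with a minimal sketch. The classic Morton layout strictly alternates the bits of each coordinate; a "Morton-like" generalization lets any fixed sequence over the dimensions define a layout, which is what makes the design space large enough to need evolutionary search. The function names and the `dim_order` encoding below are illustrative, not taken from the paper.

```python
# Classic 2-D Morton (Z-order) index: interleave the bits of (row, col).
def morton_index_2d(row: int, col: int, bits: int = 16) -> int:
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
        idx |= ((col >> b) & 1) << (2 * b)      # col bits go to even positions
    return idx

# Generalization: drop the strict alternation. Any fixed sequence over the
# dimensions (here 0 = row, 1 = col) defines a layout; such a sequence is a
# natural candidate for a chromosome in a genetic algorithm.
def generalized_index(coords, dim_order):
    shifts = [0] * len(coords)  # bits of each dimension consumed so far
    idx, pos = 0, 0
    for d in dim_order:
        idx |= ((coords[d] >> shifts[d]) & 1) << pos
        shifts[d] += 1
        pos += 1
    return idx
```

Strict alternation `[1, 0, 1, 0, ...]` recovers the Morton layout, while `[1] * 16 + [0] * 16` places all column bits low and all row bits high, recovering row-major order for power-of-two extents; intermediate sequences give the "widely varying performance characteristics" the abstract mentions.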
Format Abstraction for Sparse Tensor Algebra Compilers
This paper shows how to build a sparse tensor algebra compiler that is
agnostic to tensor formats (data layouts). We develop an interface that
describes formats in terms of their capabilities and properties, and show how
to build a modular code generator where new formats can be added as plugins. We
then describe six implementations of the interface that compose to form the
dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants
thereof. With these implementations at hand, our code generator can generate
code to compute any tensor algebra expression on any combination of the
aforementioned formats.
To demonstrate our technique, we have implemented it in the taco tensor
algebra compiler. Our modular code generator design makes it simple to add
support for new tensor formats, and the performance of the generated code is
competitive with hand-optimized implementations. Furthermore, by extending taco
to support a wider range of formats specialized for different application and
data characteristics, we can improve end-user application performance. For
example, if input data is provided in the COO format, our technique allows
computing a single matrix-vector multiplication directly with the data in COO,
which is up to 3.6× faster than first converting the data to CSR.
Comment: Presented at OOPSLA 201
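The COO matrix-vector product the abstract alludes to can be sketched directly: one update per stored nonzero, with no conversion pass. The arrays below are the standard three-array COO representation; this is a hand-written illustration, not taco's generated code or API.

```python
# y = A @ x computed directly on coordinate-format (COO) data.
def spmv_coo(rows, cols, vals, x, nrows):
    y = [0.0] * nrows
    for r, c, v in zip(rows, cols, vals):  # one update per stored nonzero
        y[r] += v * x[c]
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in COO form:
rows = [0, 0, 1]
cols = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]
print(spmv_coo(rows, cols, vals, x, 2))  # -> [3.0, 3.0]
```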
The Tensor Algebra Compiler
Tensor and linear algebra is pervasive in data analytics and the physical sciences. Often the tensors, matrices or even vectors are sparse. Computing expressions involving a mix of sparse and dense tensors, matrices and vectors requires writing kernels for every operation and combination of formats of interest. The number of possibilities is infinite, which makes it impossible to write library code for all. This problem cries out for a compiler approach. This paper presents a new technique that compiles compound tensor algebra expressions combined with descriptions of tensor formats into efficient loops. The technique is evaluated in a prototype compiler called taco, demonstrating competitive performance to best-in-class hand-written codes for tensor and matrix operations.
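The kernel explosion the abstract describes can be made concrete with one such hand-written kernel. Below is a sparse matrix-dense vector product for CSR; swapping the sparse operand to CSC or COO, or making the vector sparse too, would each require a structurally different loop nest, which is exactly what a compiler can generate instead. The `pos`/`crd` naming follows common sparse-compiler convention but is otherwise illustrative.

```python
# y = A @ x for A stored in CSR: pos holds row pointers, crd column indices.
def spmv_csr(pos, crd, vals, x):
    y = [0.0] * (len(pos) - 1)
    for i in range(len(pos) - 1):            # one segment of nonzeros per row
        for p in range(pos[i], pos[i + 1]):
            y[i] += vals[p] * x[crd[p]]
    return y
```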
A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures
Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting scheduling parameter values is a difficult and time-consuming task, since the values depend on one another; this is why they are usually found by search methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem, and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by theoretically shrinking the search space, while the major scheduling sub-problems are addressed together, as one problem, according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), yielding high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
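A minimal sketch of the kind of tiled matrix-matrix multiply such a methodology parameterizes is shown below. The single tile size `T` stands in for the scheduling parameters (tile sizes per level, number of tiling levels) that the paper selects analytically rather than by empirical search; real implementations tile multiple loops at multiple levels.

```python
# Tiled MMM: C = A @ B for square n x n matrices, tile size T.
def mmm_tiled(A, B, n, T):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):           # iterate over T x T tiles so that
        for kk in range(0, n, T):       # blocks of A and B stay resident
            for jj in range(0, n, T):   # in cache while they are reused
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Choosing `T` so that three tiles fit in a given cache level (accounting for associativity) is the kind of hardware-driven parameter selection the abstract describes.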
Computational simulations and realizations of three-dimensional metamaterials with various extraordinary properties.
In this study, computational analysis and realization of three-dimensional metamaterial structures that induce negative and zero permittivity and/or permeability values in their host environment, as well as plasmonic nanoparticles that are used to design metamaterials at optical frequencies are presented. All these electromagnetic problems are challenging since effective material properties become negative/zero, while numerical solvers are commonly developed for ordinary positive parameters. In real life, three-dimensional metamaterial structures, involving split-ring resonators (SRR), thin wires, and similar subwavelength elements, are designed to exhibit single negativity (imaginary refractive index) and double negativity (negative refractive index) behaviors. However, metamaterial elements have small details with respect to wavelength and they operate when they resonate. Then, their numerical models lead to large matrix equations that are also ill-conditioned, making their solutions extremely difficult, if not impossible. If performed accurately, homogenization simplifies the analysis of metamaterials, while new challenges arise due to extreme parameters. For example, a combination of zero-index (ZI) and near-zero-index (NZI) materials with ordinary media (metals, free space, etc.) results in a high-contrast problem, and numerical instabilities occur particularly due to huge values of wavelength. Similar difficulties arise when considering the plasmonic effects of metals at optical frequencies since they must be modeled as penetrable bodies with negative real permittivity, leading to imaginary index values. Different surface-integral-equation (SIE) formulations and broadband multilevel fast multipole algorithm (MLFMA) implementations are extensively tested for accurate and efficient numerical solutions of ZI, NZI, imaginary-index, and negative-index materials. 
In addition to their computational simulations, metamaterial designs are fabricated with a low-cost inkjet-printing setup, which is based on using conventional printers that are modified and loaded with silver-based inks. Measurements demonstrate the feasibility of fabricating very low-cost three-dimensional metamaterials using simple inkjet printing.
Thesis (M.S.) -- Graduate School of Natural and Applied Sciences. Electrical and Electronics Engineering
Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
This paper shows how to generate code that efficiently converts sparse
tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL,
and many others. We decompose sparse tensor conversion into three logical
phases: coordinate remapping, analysis, and assembly. We then develop a
language that precisely describes how different formats group together and
order a tensor's nonzeros in memory. This lets a compiler emit code that
performs complex remappings of nonzeros when converting between formats. We
also develop a query language that can extract statistics about sparse tensors,
and we show how to emit efficient analysis code that computes such queries.
Finally, we define an abstract interface that captures how data structures for
storing a tensor can be efficiently assembled given specific statistics about
the tensor. Disparate formats can implement this common interface, thus letting
a compiler emit optimized sparse tensor conversion code for arbitrary
combinations of many formats without hard-coding for any specific combination.
Our evaluation shows that the technique generates sparse tensor conversion
routines with performance between 1.00× and 2.01× that of hand-optimized
versions in SPARSKIT and Intel MKL, two popular sparse linear algebra
libraries. And by emitting code that avoids materializing temporaries, which
both libraries need for many combinations of source and target formats, our
technique outperforms those libraries by 1.78× to 4.01× for CSC/COO to
DIA/ELL conversion.
Comment: Presented at PLDI 202
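The three logical phases the abstract names can be sketched for the concrete case of COO to CSR, where no coordinate remapping is needed because both formats order nonzeros by row. This is a hand-written illustration of the phase structure, not the paper's generated code.

```python
# COO -> CSR conversion, split into the analysis and assembly phases.
def coo_to_csr(rows, cols, vals, nrows):
    # Analysis: count nonzeros per row (a statistics "query" on the source).
    counts = [0] * nrows
    for r in rows:
        counts[r] += 1
    # Assembly, step 1: prefix-sum the counts into CSR row pointers.
    pos = [0] * (nrows + 1)
    for i in range(nrows):
        pos[i + 1] = pos[i] + counts[i]
    # Assembly, step 2: scatter column indices and values into place.
    crd = [0] * len(vals)
    out = [0.0] * len(vals)
    nxt = pos[:nrows]  # next free slot per row
    for r, c, v in zip(rows, cols, vals):
        p = nxt[r]
        crd[p] = c
        out[p] = v
        nxt[r] += 1
    return pos, crd, out
```

A conversion to DIA or ELL would add a remapping phase (reordering nonzeros by diagonal or padding rows to equal length), which is where the paper's format-description language earns its keep.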