    Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms

    The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but that they can effect significant performance gains -- up to a factor of ten in extreme cases -- on real hardware.
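
    To make the idea of a family of Morton-like layouts concrete, the following is a minimal sketch, not the paper's chromosomal representation. It assumes a layout can be described by a hypothetical `pattern` tuple that names, for each bit of the linear offset (least significant first), the dimension that supplies it. Under that assumption, row-major, column-major, and Morton order are three points in the same design space.

```python
def layout_index(coords, pattern):
    """Map coordinates to a linear offset by interleaving their bits.

    pattern[k] names the dimension that supplies output bit k (LSB first).
    Each dimension's bits are consumed from least to most significant.
    This encoding is an illustrative assumption, not the paper's own.
    """
    consumed = [0] * len(coords)  # bits already taken from each dimension
    index = 0
    for out_bit, dim in enumerate(pattern):
        bit = (coords[dim] >> consumed[dim]) & 1
        index |= bit << out_bit
        consumed[dim] += 1
    return index


# Patterns for a 4 x 4 array (two bits per dimension); dimension 0 is x, 1 is y.
ROW_MAJOR = (0, 0, 1, 1)     # x bits low, y bits high: offset = y * 4 + x
COLUMN_MAJOR = (1, 1, 0, 0)  # y bits low, x bits high: offset = x * 4 + y
MORTON = (0, 1, 0, 1)        # x and y bits alternate (Z-order curve)


def grid(pattern, size=4):
    """Return the offsets of every (x, y) cell, one list per row y."""
    return [[layout_index((x, y), pattern) for x in range(size)]
            for y in range(size)]


if __name__ == "__main__":
    for name, pattern in [("row-major", ROW_MAJOR),
                          ("column-major", COLUMN_MAJOR),
                          ("Morton", MORTON)]:
        print(name)
        for row in grid(pattern):
            print("  ", row)
```

    Any other ordering of the same multiset of dimension labels, such as `(0, 1, 1, 0)`, yields another valid layout, which hints at why the search space grows quickly with array size and dimensionality.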

    Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton Layouts for Arrays

    The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but that they can effect significant performance gains -- up to a factor of ten in extreme cases -- on real hardware.

    Format Abstraction for Sparse Tensor Algebra Compilers

    This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats. To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6× faster than by first converting the data to CSR. Comment: Presented at OOPSLA 201
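
    The COO example at the end of the abstract can be illustrated with a small hand-written sketch. This is not code generated by taco; the function names and the list-based storage are assumptions chosen for readability. It contrasts multiplying directly from COO with first building CSR, which needs extra passes over the nonzeros before any arithmetic happens.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """Compute y = A @ x directly from COO triples (any order)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO to CSR: count per row, prefix-sum, then scatter."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:                      # pass 1: nonzeros per row
        row_ptr[r + 1] += 1
    for i in range(n_rows):             # prefix sum gives row start offsets
        row_ptr[i + 1] += row_ptr[i]
    next_slot = row_ptr[:-1]            # slicing copies; next free slot per row
    col_idx = [0] * len(vals)
    csr_vals = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):   # pass 2: scatter into place
        slot = next_slot[r]
        col_idx[slot] = c
        csr_vals[slot] = v
        next_slot[r] += 1
    return row_ptr, col_idx, csr_vals


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x from CSR arrays."""
    return [sum((vals[k] * x[col_idx[k]]
                 for k in range(row_ptr[i], row_ptr[i + 1])), 0.0)
            for i in range(len(row_ptr) - 1)]


if __name__ == "__main__":
    # 3 x 3 matrix with nonzeros given in arbitrary order.
    rows, cols, vals = [0, 2, 0, 1], [0, 1, 2, 1], [1.0, 3.0, 2.0, 4.0]
    x = [1.0, 2.0, 3.0]
    print(spmv_coo(3, rows, cols, vals, x))
    print(spmv_csr(*coo_to_csr(3, rows, cols, vals), x))
```

    For a single multiplication, the direct COO loop touches each nonzero once, whereas the CSR route touches each nonzero three times; the conversion only pays off when the CSR form is reused.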

    The Tensor Algebra Compiler

    Tensor and linear algebra is pervasive in data analytics and the physical sciences. Often the tensors, matrices or even vectors are sparse. Computing expressions involving a mix of sparse and dense tensors, matrices and vectors requires writing kernels for every operation and combination of formats of interest. The number of possibilities is infinite, which makes it impossible to write library code for all. This problem cries out for a compiler approach. This paper presents a new technique that compiles compound tensor algebra expressions combined with descriptions of tensor formats into efficient loops. The technique is evaluated in a prototype compiler called taco, demonstrating competitive performance to best-in-class hand-written codes for tensor and matrix operations.

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. This methodology applies to a wide range of CPU and GPU architectures.
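
    As a reminder of what a tile-size parameter controls, here is a minimal single-level loop-tiling sketch. It is not the methodology from the paper: the `tile` argument is a hypothetical free parameter, whereas the paper derives such values from cache sizes, associativities, and input size.

```python
def matmul_tiled(A, B, tile=2):
    """Multiply A (n x k) by B (k x m) using one level of loop tiling.

    The three outer loops walk over tile origins; the three inner loops
    stay inside one tile so the touched parts of A, B and C are reused
    while they are (ideally) still resident in cache.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                # min(...) handles edge tiles when tile does not divide the size
                for i in range(ii, min(ii + tile, n)):
                    for p in range(kk, min(kk + tile, k)):
                        a = A[i][p]
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
    print(matmul_tiled(A, B, tile=2))
```

    The result is independent of `tile`; only the memory access order changes, which is why tile size, loop order, and the number of tiling levels interact and are hard to tune in isolation.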

    Computational simulations and realizations of three-dimensional metamaterials with various exotic properties.

    In this study, computational analysis and realization of three-dimensional metamaterial structures that induce negative and zero permittivity and/or permeability values in their host environment, as well as plasmonic nanoparticles that are used to design metamaterials at optical frequencies are presented. All these electromagnetic problems are challenging since effective material properties become negative/zero, while numerical solvers are commonly developed for ordinary positive parameters. In real life, three-dimensional metamaterial structures, involving split-ring resonators (SRR), thin wires, and similar subwavelength elements, are designed to exhibit single negativity (imaginary refractive index) and double negativity (negative refractive index) behaviors. However, metamaterial elements have small details with respect to wavelength and they operate when they resonate. Then, their numerical models lead to large matrix equations that are also ill-conditioned, making their solutions extremely difficult, if not impossible. If performed accurately, homogenization simplifies the analysis of metamaterials, while new challenges arise due to extreme parameters. For example, a combination of zero-index (ZI) and near-zero-index (NZI) materials with ordinary media (metals, free space, etc.) results in a high-contrast problem, and numerical instabilities occur particularly due to huge values of wavelength. Similar difficulties arise when considering the plasmonic effects of metals at optical frequencies since they must be modeled as penetrable bodies with negative real permittivity, leading to imaginary index values. Different surface-integral-equation (SIE) formulations and broadband multilevel fast multipole algorithm (MLFMA) implementations are extensively tested for accurate and efficient numerical solutions of ZI, NZI, imaginary-index, and negative-index materials. 
    In addition to their computational simulations, metamaterial designs are fabricated with a low-cost inkjet-printing setup, which is based on using conventional printers that are modified and loaded with silver-based inks. Measurements demonstrate the feasibility of fabricating very low-cost three-dimensional metamaterials using simple inkjet printing. Thesis (M.S.) -- Graduate School of Natural and Applied Sciences. Electrical and Electronics Engineering

    Automatic Generation of Efficient Sparse Tensor Format Conversion Routines

    This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL, and many others. We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor's nonzeros in memory. This lets a compiler emit code that performs complex remappings of nonzeros when converting between formats. We also develop a query language that can extract statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of many formats without hard-coding for any specific combination. Our evaluation shows that the technique generates sparse tensor conversion routines with performance between 1.00 and 2.01× that of hand-optimized versions in SPARSKIT and Intel MKL, two popular sparse linear algebra libraries. And by emitting code that avoids materializing temporaries, which both libraries need for many combinations of source and target formats, our technique outperforms those libraries by 1.78 to 4.01× for CSC/COO to DIA/ELL conversion. Comment: Presented at PLDI 202
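
    The three phases named in the abstract can be loosely illustrated with a hand-written COO-to-ELL conversion. This is only a sketch under assumptions: the function name, the list-of-lists storage, and the `pad_col` padding convention are hypothetical and do not reflect the paper's languages or generated code.

```python
def coo_to_ell(n_rows, rows, cols, vals, pad_col=0):
    """Convert COO triples to a padded ELL layout (one fixed-width row each)."""
    # Analysis: a statistic about the tensor, here the maximum number of
    # nonzeros in any row, determines how wide the ELL arrays must be.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1
    width = max(counts, default=0)

    # Assembly: allocate the target arrays using the statistic above.
    ell_cols = [[pad_col] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]

    # Coordinate remapping: each nonzero (r, c) gains a slot coordinate k,
    # its position among the nonzeros of row r, and is stored at (r, k).
    filled = [0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        k = filled[r]
        ell_cols[r][k] = c
        ell_vals[r][k] = v
        filled[r] += 1
    return ell_cols, ell_vals


if __name__ == "__main__":
    rows, cols, vals = [0, 2, 0, 1], [0, 1, 2, 1], [1.0, 3.0, 2.0, 4.0]
    ell_cols, ell_vals = coo_to_ell(3, rows, cols, vals)
    print(ell_cols)
    print(ell_vals)
```

    Note that the nonzeros are written straight into the ELL arrays without building an intermediate CSR copy, which is the kind of temporary the abstract says the generated code avoids.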