297 research outputs found

    Optimizing construction of scheduled data flow graph for on-line testability

    The objective of this work is to develop a new methodology for behavioural synthesis using a synthesis flow better suited to the scheduling of independent calculations and non-concurrent online testing. The traditional behavioural synthesis process can be defined as the compilation of an algorithmic specification into an architecture composed of a data path and a controller. This synthesis flow generally involves scheduling, resource allocation, generation of the data path, and controller synthesis. Experiments showed that optimization started at the high-level synthesis stage improves the performance of the result, yet current tools offer synthesis optimizations only from the RTL level. This justifies the development of an optimization methodology that takes effect from the behavioural specification and accompanies the synthesis process through its various stages. In this paper we propose the use of algebraic properties (commutativity, associativity and distributivity) to transform readable mathematical formulas of algorithmic specifications into mathematical formulas that are evaluated efficiently. This will effectively reduce the execution time of scheduling calculations and increase the possibilities of testability.
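
    As a hedged illustration of how distributivity can turn a readable formula into one that is cheaper to evaluate, the sketch below compares a term-by-term polynomial evaluation with Horner's rule. Horner's rule is a standard textbook rewrite, not necessarily the transformation used in the paper; the function names `eval_naive` and `eval_horner` are hypothetical.

```python
def eval_naive(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) term by term.

    Returns (value, number of multiplications). Assumes coeffs is non-empty.
    """
    total, mults = 0, 0
    for i, c in enumerate(coeffs):
        term = c
        for _ in range(i):  # build c * x**i with i multiplications
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_horner(coeffs, x):
    """Evaluate the same polynomial after factoring x out repeatedly
    (distributivity): c0 + x*(c1 + x*(c2 + ...)).

    Returns (value, number of multiplications). Assumes coeffs is non-empty.
    """
    total, mults = coeffs[-1], 0
    for c in reversed(coeffs[:-1]):
        total = total * x + c
        mults += 1
    return total, mults
```

    Both functions compute the same value; the factored form needs one multiplication per degree, while the term-by-term form needs a number that grows quadratically with the degree.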

    Fast, Sparse Matrix Factorization and Matrix Algebra via Random Sampling for Integral Equation Formulations in Electromagnetics

    Many systems designed by electrical & computer engineers rely on electromagnetic (EM) signals to transmit, receive, and extract either information or energy. In many cases, these systems are large and complex. Their accurate, cost-effective design requires high-fidelity computer modeling of the underlying EM field/material interaction problem in order to find a design with acceptable system performance. This modeling is accomplished by projecting the governing Maxwell equations onto finite-dimensional subspaces, which results in a large matrix equation representation (Zx = b) of the EM problem. In the case of integral equation-based formulations of EM problems, the M-by-N system matrix, Z, is generally dense. For this reason, when treating large problems, it is necessary to use compression methods to store and manipulate Z. One such sparse representation is provided by so-called H^2 matrices. At low-to-moderate frequencies, H^2 matrices provide a controllably accurate data-sparse representation of Z. The scale at which problems in EM are considered "large" is continuously being redefined to be larger. This growth of problem scale is happening not only in EM, but across all other sub-fields of computational science as well. The pursuit of increasingly large problems is unwavering in all these sub-fields, and this drive has long outpaced the rate of advancements in processing and storage capabilities in computing. This has caused computational science communities to now face the computational limitations of standard linear algebraic methods that have been relied upon for decades to run quickly and efficiently on modern computing hardware. This common set of algorithms can only produce reliable results quickly and efficiently for small to mid-sized matrices that fit into the memory of the host computer. Therefore, the drive to pursue larger problems has even begun to outpace the reasonable capabilities of these common numerical algorithms; the deterministic numerical linear algebra algorithms that have gotten matrix computation this far have proven to be inadequate for many problems of current interest. This has computational science communities focusing on improvements in their mathematical and software approaches in order to push further advancement. Randomized numerical linear algebra (RandNLA) is an emerging area that both academia and industry believe to be a strong candidate to assist in overcoming the limitations faced when solving massive and computationally expensive problems. This thesis presents results of recent work that uses a random sampling method (RSM) to implement algebraic operations involving multiple H^2 matrices. Significantly, this work is done in a manner that is non-invasive to an existing H^2 code base for filling and factoring H^2 matrices. The work presented thus expands the existing code's capabilities with minimal impact on existing (and well-tested) applications. In addition to this work with randomized H^2 algebra, improvements in sparse factorization methods for the compressed H^2 data structure will also be presented. The reported developments in filling and factoring H^2 data structures assist in, and allow for, the further pursuit of large and complex problems in computational EM (CEM) within simulation code bases that utilize the H^2 data structure.

    Compiler optimization and ordering effects on VLIW code compression

    Code size has always been an important issue for all embedded applications as well as larger systems. Code compression techniques have been devised as a way of battling bloated code; however, the impact of VLIW compiler methods and outputs on these compression schemes has not been thoroughly investigated. This paper describes the application of single- and multiple-instruction dictionary methods for code compression to decrease overall code size for the TI TMS320C6xxx DSP family. The compression scheme is applied to benchmarks taken from the Mediabench benchmark suite built with differing compiler optimization parameters. In the single-instruction encoding scheme, it was found that compression ratios were not a useful indicator of the best overall code size: the best results (smallest overall code size) were obtained when the compression scheme was applied to size-optimized code. In the multiple-instruction encoding scheme, changing parallel instruction order was found to only slightly improve compression in unoptimized code, and it did not affect compression when applied to builds already optimized for size.
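
    The general idea of single-instruction dictionary compression can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's encoding: instruction words are modelled as plain integers, the dictionary holds the most frequent words, and the token layout and the names `build_dictionary`, `compress` and `decompress` are hypothetical.

```python
from collections import Counter


def build_dictionary(words, size):
    """Return the `size` most frequent instruction words."""
    return [w for w, _ in Counter(words).most_common(size)]


def compress(words, dictionary):
    """Replace each word found in the dictionary with a short index token;
    words not in the dictionary are kept as raw tokens."""
    index = {w: i for i, w in enumerate(dictionary)}
    return [("dict", index[w]) if w in index else ("raw", w) for w in words]


def decompress(tokens, dictionary):
    """Invert `compress` by looking dictionary indices back up."""
    return [dictionary[v] if kind == "dict" else v for kind, v in tokens]
```

    In a real scheme the dictionary itself must also be stored, which is one reason a good compression ratio alone does not guarantee the smallest overall code size.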

    A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization

    We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, e.g., finite element methods and boundary element methods. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of structured matrix-vector product, structured factorization and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.

    Numerical aerodynamic simulation facility

    Critical to the advancement of computational aerodynamics capability is the ability to simulate flows about three-dimensional configurations that contain both compressible and viscous effects, including turbulence and flow separation at high Reynolds numbers. Analyses were conducted of two solution techniques for solving the Reynolds averaged Navier-Stokes equations describing the mean motion of a turbulent flow with certain terms involving the transport of turbulent momentum and energy modeled by auxiliary equations. The first solution technique is an implicit approximate factorization finite-difference scheme applied to three-dimensional flows that avoids the restrictive stability conditions when small grid spacing is used. The approximate factorization reduces the solution process to a sequence of three one-dimensional problems with easily inverted matrices. The second technique is a hybrid explicit/implicit finite-difference scheme which is also factored and applied to three-dimensional flows. Both methods are applicable to problems with highly distorted grids and a variety of boundary conditions and turbulence models

    Format Abstraction for Sparse Tensor Algebra Compilers

    This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats. To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6× faster than by first converting the data to CSR. Comment: Presented at OOPSLA 201
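
    To make the COO example concrete, the sketch below multiplies a matrix stored as coordinate triples by a dense vector without converting to CSR first. It is a hand-written illustration, not code generated by taco; the function name `coo_matvec` and its argument layout are assumptions.

```python
def coo_matvec(n_rows, rows, cols, vals, x):
    """Compute y = A @ x where A is given in COO form.

    rows[k], cols[k], vals[k] describe the k-th stored nonzero of A.
    Duplicate coordinates are summed, as is conventional for COO.
    """
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y
```

    For a single product, each stored entry is visited exactly once, so any conversion pass beforehand is pure overhead; conversion only pays off when the converted matrix is reused many times.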

    Area Efficient DST Architectures for HEVC

    This work analyses the actual throughput of the Discrete Sine Transform (DST) stage in a realistic HEVC encoder, which executes the rate-distortion optimization algorithm to achieve high compression quality. Then, a low complexity DST factorization, where all the integer multiplications are substituted with add-and-shift operations, is exploited to design an efficient 1D-DST core. The proposed 1D-DST core is employed to derive two area efficient architectures, namely Folded and Full-parallel, for computing the 4×4 2D-DST in HEVC. Finally, the proposed 2D-DST architectures are synthesized on a 90-nm standard cell technology to support the actual target throughput required to encode 4K UHD @30fps video sequences, showing better area efficiency with respect to existing DST architectures for HEVC.
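
    Replacing an integer multiplication by a constant with shifts and additions can be sketched as below. The constants and their decompositions are chosen only for illustration and are not claimed to match the factorization in the paper; the function names are hypothetical.

```python
def times_29(x):
    """29 = 32 - 4 + 1, so 29*x = (x << 5) - (x << 2) + x."""
    return (x << 5) - (x << 2) + x


def times_55(x):
    """55 = 64 - 8 - 1, so 55*x = (x << 6) - (x << 3) - x."""
    return (x << 6) - (x << 3) - x


def times_74(x):
    """74 = 64 + 8 + 2, so 74*x = (x << 6) + (x << 3) + (x << 1)."""
    return (x << 6) + (x << 3) + (x << 1)


def times_84(x):
    """84 = 64 + 16 + 4, so 84*x = (x << 6) + (x << 4) + (x << 2)."""
    return (x << 6) + (x << 4) + (x << 2)
```

    In hardware, each shift is just wiring, so the cost of such a multiplier reduces to a few adders or subtractors.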