208 research outputs found

    Computational dynamics for robotics systems using a non-strict computational approach

    A Non-Strict computational approach for real-time robotics control computations is proposed. In contrast to the traditional approach to scheduling such computations, based strictly on task dependence relations, the proposed approach relaxes precedence constraints and scheduling is guided instead by the relative sensitivity of the outputs with respect to the various paths in the task graph. An example of the computation of the Inverse Dynamics of a simple inverted pendulum is used to demonstrate the reduction in effective computational latency through use of the Non-Strict approach. A speedup of 5 has been obtained when the processes of the task graph are scheduled to reduce the latency along the crucial path of the computation. While error is introduced by the relaxation of precedence constraints, the Non-Strict approach has a smaller error than the conventional Strict approach for a wide range of input conditions.

    On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

    Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to derive lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution.

    Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

    Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. 
It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. Comment: Transactions on Architecture and Code Optimization (2014)
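The baseline technique this abstract builds on, reuse distance analysis, can be illustrated with a minimal sketch. It is not the paper's tool: the function names `reuse_distances` and `lru_misses` and the toy trace are assumptions made for illustration, and the quadratic-time stack scan is chosen for clarity rather than speed. It shows how one pass over an address trace yields miss counts for a fully associative LRU cache of any size.

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Return the LRU stack (reuse) distance of every access in `trace`.

    The distance is the number of distinct addresses touched since the
    previous access to the same address; first-time accesses get None.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # Count the distinct addresses used after the last use of addr.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)
            stack[addr] = True
    return distances

def lru_misses(distances, cache_size):
    """Misses in a fully associative LRU cache holding `cache_size` lines."""
    return sum(1 for d in distances if d is None or d >= cache_size)

# Hypothetical toy trace of symbolic addresses.
trace = ["a", "b", "c", "a", "b", "d", "a"]
dists = reuse_distances(trace)
for size in (2, 3):
    print(size, lru_misses(dists, size))
```

The distances describe only the execution order that produced the trace, which is the limitation the abstract points out: reordering the operations changes the trace and therefore the distances.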

    High-performance sparse matrix-vector multiplication on GPUs for structured grid computations

    In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations using uniform grids for discretization. With uniform grids, the resulting linear system A x = b has a matrix A that is sparse with a very regular structure. The specific focus of this paper is on sparse matrices that have a block structure due to the large number of unknowns at each grid point. Sparse matrix storage formats such as Compressed Sparse Row (CSR) and Diagonal format (DIA) are not the most effective for such matrices. In this work, we present a new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids. Unlike other formats such as the Diagonal storage format (DIA), we specifically optimize for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix. We develop efficient sparse matrix-vector multiplication for structured grid computations on GPU architectures using CUDA.
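For readers unfamiliar with the two baseline formats named in this abstract, the following is a minimal sequential sketch of CSR and DIA sparse matrix-vector multiplication. It does not implement the paper's new format or any GPU kernel. The function names and the 4x4 tridiagonal example are assumptions, as is the DIA layout used here (entry `i` of the diagonal with offset `k` holds `A[i][i + k]`, padded with zeros where that falls outside a square matrix).

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A stored in Compressed Sparse Row form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]
    return y

def spmv_dia(offsets, data, x):
    """y = A @ x with square A stored by diagonals (assumed layout above)."""
    n = len(x)
    y = [0.0] * n
    for k, diag in zip(offsets, data):
        for i in range(n):
            j = i + k
            if 0 <= j < n:  # skip padded slots outside the matrix
                y[i] += diag[i] * x[j]
    return y

# Tridiagonal example matrix:
# [ 2 -1  0  0]
# [-1  2 -1  0]
# [ 0 -1  2 -1]
# [ 0  0 -1  2]
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

offsets = [-1, 0, 1]
data = [
    [0.0, -1.0, -1.0, -1.0],  # sub-diagonal; first slot is padding
    [2.0, 2.0, 2.0, 2.0],     # main diagonal
    [-1.0, -1.0, -1.0, 0.0],  # super-diagonal; last slot is padding
]

x = [1.0, 2.0, 3.0, 4.0]
y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_dia = spmv_dia(offsets, data, x)
print(y_csr, y_dia)
```

DIA needs no column indices, but every stored diagonal occupies a full-length array. When a diagonal of the block-structured matrices described above is only partly populated, the unused slots become the explicit zeros the abstract mentions.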

    TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

    Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 3.14X speedup over cuDNN, 1.45X speedup over TVM, and 4.57X over the original models using cuDNN with up to 0.05% accuracy loss. Comment: 12 pages, 8 figures, 3 tables, accepted by PPoPP '2
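The gap between FLOPs reduction and inference time is easier to appreciate with a rough operation count. The sketch below assumes the common Tucker-2 layout for a convolution layer (a 1x1 convolution down to rank `r_in`, a k x k core convolution, then a 1x1 convolution back up), stride 1, and equal input and output spatial size. The abstract does not confirm that the paper uses exactly this layout. The function names and layer dimensions are hypothetical.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a dense k x k convolution on an h x w output."""
    return h * w * c_in * c_out * k * k

def tucker2_macs(h, w, c_in, c_out, k, r_in, r_out):
    """Multiply-accumulates of the assumed three-stage Tucker-2 layer."""
    reduce_stage = h * w * c_in * r_in        # 1x1 conv: c_in -> r_in
    core_stage = h * w * r_in * r_out * k * k  # k x k conv: r_in -> r_out
    expand_stage = h * w * r_out * c_out       # 1x1 conv: r_out -> c_out
    return reduce_stage + core_stage + expand_stage

# Hypothetical layer: 14x14 feature map, 256 -> 256 channels, 3x3 kernel.
dense = conv_macs(14, 14, 256, 256, 3)
compact = tucker2_macs(14, 14, 256, 256, 3, r_in=64, r_out=64)
print(dense, compact, dense / compact)
```

A count like this captures arithmetic only. It says nothing about how well three small kernels use the GPU, which is why the abstract argues for choosing ranks by measured inference time instead.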

    MicroRNA-210-mediated proliferation, survival, and angiogenesis promote cardiac repair post myocardial infarction in rodents

    The final publication is available at Springer via http://dx.doi.org/10.1007/s00109-017-1591-8. An innovative approach for cardiac regeneration following injury is to induce endogenous cardiomyocyte (CM) cell cycle re-entry. In the present study, CMs from adult rat hearts were isolated and transfected with cel-miR-67 (control) and rno-miR-210. A significant increase in CM proliferation and mono-nucleation was observed in the miR-210 group, in addition to a reduction in CM size, multi-nucleation, and cell death. When compared to control, β-catenin and Bcl-2 were upregulated while APC (adenomatous polyposis coli), p16, and caspase-3 were downregulated in the miR-210 group. In silico analysis predicted the cell cycle inhibitor APC as a direct target of miR-210 in rodents. Moreover, compared to control, a significant increase in CM survival and proliferation was observed with siRNA-mediated inhibition of APC. Furthermore, miR-210 overexpressing C57BL/6 mice (210-TG) were used for a short-term ischemia/reperfusion study, revealing smaller cell size, increased mono-nucleation, decreased multi-nucleation, and increased CM proliferation in 210-TG hearts in contrast to wild-type (NTG). Likewise, myocardial infarction (MI) was created in adult mice, echocardiography was performed, and the hearts were harvested for immunohistochemistry and molecular studies. Compared to NTG, 210-TG hearts showed a significant increase in CM proliferation, reduced apoptosis, upregulated angiogenesis, reduced infarct size, and overall improvement in cardiac function following MI. β-catenin, Bcl-2, and VEGF (vascular endothelial growth factor) were upregulated while APC, p16, and caspase-3 were downregulated in 210-TG hearts. Overall, constitutive overexpression of miR-210 rescues heart function following cardiac injury in adult mice via promoting CM proliferation, cell survival, and angiogenesis.

    The Promises of Hybrid Hexagonal/Classical Tiling for GPU

    Time-tiling is necessary for efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil's spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundant computations, or increased control flow overhead incompatible with efficient GPU execution. We explore a new path to maximize the effectiveness of time-tiling on iterative stencil computations. Our approach is particularly well suited for GPUs. It does not require any redundant computations, it favors coalesced global-memory access and data reuse in shared-memory/cache, avoids thread divergence, and extracts a high degree of parallelism. We introduce hybrid hexagonal tiling, combining hexagonal tile shapes along the time (sequential) dimension and one spatial dimension, with classical tiling for other spatial dimensions. A hexagonal tile shape simultaneously enables parallel tile execution and reuse along the time dimension. Experimental results demonstrate significant performance improvements over existing stencil compilers.

    Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis

    The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems. Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
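The "sequence of four tensor contractions" can be made concrete with a small pure-Python sketch. It ignores permutation symmetry, fusion, tiling and parallel distribution, which are the actual subject of the paper. The function names, the dictionary-based tensor storage and the 2x2 integer test data are all assumptions for illustration. It only checks that four successive contractions with the transformation matrix `C` give the same result as the direct eight-index sum.

```python
from itertools import product

def transform_direct(A, C, n):
    """B[a,b,c,d] = sum over i,j,k,l of C[a][i]*C[b][j]*C[c][k]*C[d][l]*A[i,j,k,l]."""
    idx = range(n)
    B = {}
    for a, b, c, d in product(idx, repeat=4):
        B[a, b, c, d] = sum(
            C[a][i] * C[b][j] * C[c][k] * C[d][l] * A[i, j, k, l]
            for i, j, k, l in product(idx, repeat=4)
        )
    return B

def contract_first(T, C, n):
    """out[j,k,l,a] = sum_i C[a][i] * T[i,j,k,l] (new index rotated to the end)."""
    idx = range(n)
    out = {}
    for j, k, l, a in product(idx, repeat=4):
        out[j, k, l, a] = sum(C[a][i] * T[i, j, k, l] for i in idx)
    return out

def transform_stepwise(A, C, n):
    """Apply four single-index contractions; indices end up in order a,b,c,d."""
    T = A
    for _ in range(4):
        T = contract_first(T, C, n)
    return T

# Hypothetical integer data so the two results can be compared exactly.
n = 2
C = [[1, 2], [3, -1]]
A = {(i, j, k, l): 1 + i + 2 * j + 3 * k + 5 * l
     for i, j, k, l in product(range(n), repeat=4)}

B_direct = transform_direct(A, C, n)
B_steps = transform_stepwise(A, C, n)
print(B_direct == B_steps)
```

The direct form performs on the order of n^8 multiply-adds, while each of the four steps performs on the order of n^5. The price is three full-size intermediate tensors, which is the memory pressure that fusion and tiling are meant to relieve.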

    Myonuclear accretion is a determinant of exercise-induced remodeling in skeletal muscle.

    Skeletal muscle adapts to external stimuli such as increased work. Muscle progenitors (MPs) control muscle repair after severe damage, but the role of MP fusion and associated myonuclear accretion during exercise is unclear. While we previously demonstrated that MP fusion is required for growth using a supra-physiological model (Goh and Millay, 2017), questions remained about the need for myonuclear accrual during muscle adaptation in a physiological setting. Here, we developed an 8-week high-intensity interval training (HIIT) protocol and assessed the importance of MP fusion. In 8-month-old mice, HIIT led to progressive myonuclear accretion throughout the protocol, and functional muscle hypertrophy. Abrogation of MP fusion at the onset of HIIT resulted in exercise intolerance and fibrosis. In contrast, ablation of MP fusion 4 weeks into HIIT preserved exercise tolerance but attenuated hypertrophy. We conclude that myonuclear accretion is required for different facets of exercise-induced adaptive responses, impacting both muscle repair and hypertrophic growth.