Computational dynamics for robotics systems using a non-strict computational approach
A Non-Strict computational approach for real-time robotics control computations is proposed. In contrast to the traditional approach to scheduling such computations, which is based strictly on task dependence relations, the proposed approach relaxes precedence constraints, and scheduling is instead guided by the relative sensitivity of the outputs with respect to the various paths in the task graph. An example, the computation of the inverse dynamics of a simple inverted pendulum, is used to demonstrate the reduction in effective computational latency achieved by the Non-Strict approach. A speedup of 5 is obtained when the processes of the task graph are scheduled to reduce the latency along the critical path of the computation. While error is introduced by the relaxation of precedence constraints, the Non-Strict approach exhibits smaller error than the conventional Strict approach over a wide range of input conditions.
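The inverse dynamics computation used as the example above can be illustrated for a simple pendulum. The sketch below is not the paper's task-graph formulation; the function name and parameter values are illustrative, and it simply evaluates the torque required to realize a desired trajectory point:

```python
import math

def pendulum_inverse_dynamics(theta, theta_dot, theta_ddot,
                              m=1.0, l=0.5, b=0.1, g=9.81):
    """Torque needed to realize (theta, theta_dot, theta_ddot) for a
    point mass m on a massless rod of length l with viscous friction b.
    tau = m*l^2 * theta_ddot + b * theta_dot + m*g*l * sin(theta)"""
    inertia = m * l * l                      # moment of inertia about pivot
    gravity = m * g * l * math.sin(theta)    # gravity torque
    friction = b * theta_dot                 # viscous friction torque
    return inertia * theta_ddot + friction + gravity

# torque to hold the pendulum horizontal (theta = pi/2) at rest
tau = pendulum_inverse_dynamics(math.pi / 2, 0.0, 0.0)
```

In the paper's setting, the terms of such an expression form the nodes of a task graph, and the Non-Strict approach schedules them by output sensitivity rather than strict precedence.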
On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution
Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to derive lower bounds on the data movement complexity of the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (which are connected through some interconnection network) and a multi-level shared cache hierarchy for the processors within a node. We also develop new techniques for the lower-bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms to develop lower bounds on data movement for their parallel execution.
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. Comment: Transactions on Architecture and Code Optimization (2014).
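The reuse distance analysis that this work builds on can be sketched in a few lines. The following is a minimal, unoptimized illustration (the function names are ours, not the article's): the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, and a fully associative LRU cache of C lines misses exactly on accesses whose reuse distance is at least C.

```python
def reuse_distances(trace):
    """Reuse distance of each access in an address trace: the number of
    distinct addresses touched since the last access to the same address
    (infinity for cold accesses). Simple O(N*M) LRU-stack simulation."""
    stack = []                       # LRU stack, most recent at the end
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - 1 - pos)   # distinct addrs above it
            stack.pop(pos)
        else:
            dists.append(float('inf'))           # first touch
        stack.append(addr)                       # promote to most recent
    return dists

def lru_misses(trace, cache_size):
    """Miss count in a fully associative LRU cache of `cache_size` lines:
    exactly the accesses with reuse distance >= cache_size."""
    return sum(1 for d in reuse_distances(trace) if d >= cache_size)
```

One pass over the trace yields the full reuse-distance histogram, from which the miss count for any cache size can be read off, which is the property the article starts from before going beyond a single execution order.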
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations discretized on uniform grids. With uniform grids, the resulting linear system Ax = b has a matrix A that is sparse with a very regular structure. The specific focus of this paper is on sparse matrices that have a block structure due to the large number of unknowns at each grid point. Sparse matrix storage formats such as Compressed Sparse Row (CSR) and Diagonal format (DIA) are not the most effective for such matrices. In this work, we present a new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids. Unlike other formats such as DIA, we specifically optimize for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix. We develop efficient sparse matrix-vector multiplication for structured grid computations on GPU architectures using CUDA.
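For reference, the DIA format that the paper improves upon can be sketched as follows. This is a minimal NumPy illustration of DIA-format SpMV (not the paper's new format): each stored diagonal is a padded row of a dense array, and it is exactly this zero padding that inflates storage when each grid node carries many degrees of freedom.

```python
import numpy as np

def dia_spmv(offsets, data, x):
    """y = A @ x for A stored in DIA format.
    data[k, i] holds A[i, i + offsets[k]]; slots falling outside the
    matrix are zero padding (the overhead DIA pays at high DOF)."""
    n = len(x)
    y = np.zeros(n)
    for off, diag in zip(offsets, data):
        # rows i for which column i + off is inside the matrix
        i = np.arange(max(0, -off), min(n, n - off))
        y[i] += diag[i] * x[i + off]
    return y
```

A block-structured stencil matrix with b unknowns per node has bandwidth on the order of b times the stencil width, so the number of stored "diagonals" and their padding grow quickly, which motivates the blocked format proposed in the paper.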
TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition
Tucker decomposition is one of the state-of-the-art (SOTA) CNN model compression techniques. However, unlike the reduction in FLOPs, we observe very limited inference-time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation of five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 3.14X speedup over cuDNN, 1.45X speedup over TVM, and 4.57X speedup over the original models using cuDNN, with up to 0.05% accuracy loss. Comment: 12 pages, 8 figures, 3 tables; accepted by PPoPP '23.
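The kernel factorization underlying Tucker-format convolution can be sketched briefly. In a common Tucker-2 setup (a standard formulation, not necessarily the exact one used in this paper; names and ranks below are illustrative), a conv kernel W[t, s, kh, kw] factors into a small spatial core and two channel factor matrices, so the layer runs as a 1x1 conv, a small core conv, and another 1x1 conv:

```python
import numpy as np

def tucker2_reconstruct(core, U_out, U_in):
    """Rebuild W[t, s, kh, kw] from a Tucker-2 factorization:
        core[r_t, r_s, kh, kw]  -- small spatial core
        U_out[t, r_t]           -- output-channel factor
        U_in[s, r_s]            -- input-channel factor
    W[t,s,kh,kw] = sum_{r,q} U_out[t,r] * U_in[s,q] * core[r,q,kh,kw].
    FLOPs drop when r_t << t and r_s << s."""
    return np.einsum('tr,sq,rqhw->tshw', U_out, U_in, core)
```

The paper's point is that this FLOPs reduction does not translate to wall-clock speedup under stock cuDNN, motivating the custom kernel and rank/time co-design.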
MicroRNA-210-mediated proliferation, survival, and angiogenesis promote cardiac repair post myocardial infarction in rodents
The final publication is available at Springer via http://dx.doi.org/10.1007/s00109-017-1591-8. An innovative approach for cardiac regeneration following injury is to induce endogenous cardiomyocyte (CM) cell cycle re-entry. In the present study, CMs from adult rat hearts were isolated and transfected with cel-miR-67 (control) and rno-miR-210. A significant increase in CM proliferation and mono-nucleation was observed in the miR-210 group, in addition to a reduction in CM size, multi-nucleation, and cell death. Compared to control, β-catenin and Bcl-2 were upregulated while APC (adenomatous polyposis coli), p16, and caspase-3 were downregulated in the miR-210 group. In silico analysis predicted the cell cycle inhibitor APC as a direct target of miR-210 in rodents. Moreover, compared to control, a significant increase in CM survival and proliferation was observed with siRNA-mediated inhibition of APC. Furthermore, miR-210-overexpressing C57BL/6 mice (210-TG) were used for a short-term ischemia/reperfusion study, revealing smaller cell size, increased mono-nucleation, decreased multi-nucleation, and increased CM proliferation in 210-TG hearts in contrast to wild-type (NTG) hearts. Likewise, myocardial infarction (MI) was created in adult mice, echocardiography was performed, and the hearts were harvested for immunohistochemistry and molecular studies. Compared to NTG, 210-TG hearts showed a significant increase in CM proliferation, reduced apoptosis, upregulated angiogenesis, reduced infarct size, and overall improvement in cardiac function following MI. β-catenin, Bcl-2, and VEGF (vascular endothelial growth factor) were upregulated while APC, p16, and caspase-3 were downregulated in 210-TG hearts. Overall, constitutive overexpression of miR-210 rescues heart function following cardiac injury in adult mice by promoting CM proliferation, cell survival, and angiogenesis.
The Promises of Hybrid Hexagonal/Classical Tiling for GPU
Time-tiling is necessary for the efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil's spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundant computations, or increased control flow overhead incompatible with efficient GPU execution. We explore a new path to maximize the effectiveness of time-tiling on iterative stencil computations. Our approach is particularly well suited for GPUs: it does not require any redundant computation, it favors coalesced global-memory access and data reuse in shared memory/cache, it avoids thread divergence, and it extracts a high degree of parallelism. We introduce hybrid hexagonal tiling, combining hexagonal tile shapes along the time (sequential) dimension and one spatial dimension with classical tiling for the other spatial dimensions. A hexagonal tile shape simultaneously enables parallel tile execution and reuse along the time dimension. Experimental results demonstrate significant performance improvements over existing stencil compilers.
Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis
The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause the intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems. Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion, tiling, and data/computation distribution across a parallel system makes it challenging to develop an optimized parallel implementation of the four-index integral transform. We develop a novel approach to address this problem, using lower-bounds modeling of data movement complexity. We establish relationships between the available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning, the consequent identification of effective choices, and a characterization of optimality criteria. This work has resulted in a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
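The "sequence of four contractions" structure is easy to see in a minimal NumPy sketch (illustrative only; it ignores the permutation symmetry, distribution, and fusion issues that the paper actually addresses). Transforming each index in turn costs O(N^5) instead of the O(N^8) of the naive eightfold loop:

```python
import numpy as np

def four_index_transform(A, C):
    """B[p,q,r,s] = sum_{i,j,k,l} C[i,p] C[j,q] C[k,r] C[l,s] A[i,j,k,l],
    computed as four successive one-index contractions, each contracting
    the 4-D tensor with the 2-D transformation matrix C."""
    T1 = np.einsum('ijkl,ip->pjkl', A, C)    # transform first index
    T2 = np.einsum('pjkl,jq->pqkl', T1, C)   # second
    T3 = np.einsum('pqkl,kr->pqrl', T2, C)   # third
    return np.einsum('pqrl,ls->pqrs', T3, C) # fourth
```

The intermediates T1..T3 are what lose permutation symmetry and grow larger than the final tensor, which is the memory pressure the paper's fusion and tiling analysis targets.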
Myonuclear accretion is a determinant of exercise-induced remodeling in skeletal muscle.
Skeletal muscle adapts to external stimuli such as increased work. Muscle progenitors (MPs) control muscle repair after severe damage, but the roles of MP fusion and the associated myonuclear accretion during exercise are unclear. While we previously demonstrated that MP fusion is required for growth using a supra-physiological model (Goh and Millay, 2017), questions remained about the need for myonuclear accrual during muscle adaptation in a physiological setting. Here, we developed an 8-week high-intensity interval training (HIIT) protocol and assessed the importance of MP fusion. In 8-month-old mice, HIIT led to progressive myonuclear accretion throughout the protocol and functional muscle hypertrophy. Abrogation of MP fusion at the onset of HIIT resulted in exercise intolerance and fibrosis. In contrast, ablation of MP fusion 4 weeks into HIIT preserved exercise tolerance but attenuated hypertrophy. We conclude that myonuclear accretion is required for different facets of exercise-induced adaptive responses, impacting both muscle repair and hypertrophic growth.