Dynamic Analyses of Result Quality in Energy-Aware Approximate Programs
Thesis (Ph.D.)--University of Washington, 2014. Energy efficiency is a key concern in the design of modern computer systems. One promising approach to energy-efficient computation, approximate computing, trades off output precision for energy efficiency. However, this tradeoff can have unexpected effects on computation quality. This thesis presents dynamic analysis tools to study, debug, and monitor the quality and energy efficiency of approximate computations. We propose three styles of tools: prototyping tools that allow developers to experiment with approximation in their applications, offline tools that instrument code to determine the key sources of error, and online tools that monitor the quality of deployed applications in real time. Our prototyping tool is based on an extension to the functional language OCaml. We add approximation constructs to the language, an approximation simulator to the runtime, and profiling and auto-tuning tools for studying and experimenting with energy-quality tradeoffs. We also present two offline debugging tools and three online monitoring tools. The first offline tool identifies correlations between output quality and the total number of executions of, and errors in, individual approximate operations. The second tracks the number of approximate operations that flow into a particular value. Our online tools comprise three low-cost approaches to dynamic quality monitoring. They are designed to monitor quality in deployed applications without spending more energy than is saved by approximation. Online monitors can be used to perform real-time adjustments to energy usage in order to meet specific quality goals. We present prototype implementations of all of these tools and describe their usage with several applications.
Our prototyping, profiling, and auto-tuning tools let us experiment with approximation strategies and identify new ones; our offline tools succeed in providing new insights into the effects of approximation on output quality; and our monitors succeed in controlling output quality while still maintaining significant energy-efficiency gains.
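To make the online-monitoring idea concrete, the sketch below re-executes a small random sample of inputs precisely, compares them against the approximate results, and nudges an abstract approximation knob to hold observed error near a quality target. The function and parameter names (`monitor`, `level`, `sample_rate`) and the control policy are illustrative assumptions of ours, not the actual tools from the thesis:

```python
import random

def monitor(exact_fn, approx_fn, inputs, target_error=0.05,
            sample_rate=0.1, level=0.5):
    """Sampled online quality monitor (illustrative sketch).

    Occasionally recompute an input exactly, compare against the
    approximate result, and lower the approximation level when observed
    error exceeds the quality target -- so the monitor itself spends only
    a small fraction of the energy that approximation saves.
    """
    for x in inputs:
        y_approx = approx_fn(x, level)
        if random.random() < sample_rate:          # sparse exact re-execution
            y_exact = exact_fn(x)
            err = abs(y_exact - y_approx) / max(abs(y_exact), 1e-12)
            if err > target_error:
                level = max(0.0, level - 0.1)      # back off: be more precise
            else:
                level = min(1.0, level + 0.01)     # cautiously re-approximate
        yield y_approx, level
```

The asymmetric step sizes (fast back-off, slow ramp-up) are one simple way to meet a quality goal without oscillating wildly; the thesis's monitors are more sophisticated.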
Applying the Vector Radix Method to Multidimensional, Multiprocessor, Out-of-Core Fast Fourier Transforms
We describe an efficient algorithm for calculating Fast Fourier Transforms on matrices of arbitrarily high dimension using the vector-radix method when the problem size is out-of-core (i.e., when the size of the data set is larger than the total available memory of the system). The algorithm takes advantage of multiple processors when they are present, but it is also efficient on single-processor systems. Our work is an extension of work done by Lauren Baptist in [Bapt99], which applied the vector-radix method to 2-dimensional out-of-core matrices. To determine the effectiveness of the algorithm, we present empirical results as well as an analysis of the I/O, communication, and computational complexity. We perform the empirical tests on a DEC 2100 server and on a cluster of Pentium-based Linux workstations. We compare our results with the traditional dimensional method of calculating multidimensional FFTs, and show that as the number of dimensions increases, the vector-radix-based algorithm becomes increasingly effective relative to the dimensional method. In order to calculate the complexity of the algorithm, it was necessary to develop a method for analyzing the interprocessor communication costs of the BMMC data-permutation algorithm (presented in [CSW98]) used by our FFT algorithms. We present this analysis method and show how it was derived.
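For reference, the "dimensional" (row-column) method that the vector-radix algorithm is compared against can be sketched in a few lines: apply a 1-D FFT along each axis in turn. This in-memory NumPy version deliberately ignores the out-of-core and multiprocessor aspects that are the paper's actual contribution:

```python
import numpy as np

def dimensional_fft(a):
    """Multidimensional FFT by the dimensional (row-column) method.

    One 1-D FFT pass per axis; out-of-core, each pass is a full sweep over
    the data set, which is why reducing the number of passes (as the
    vector-radix method does) matters when the data does not fit in memory.
    """
    out = a.astype(complex)
    for axis in range(a.ndim):
        out = np.fft.fft(out, axis=axis)
    return out
```

The result matches a direct multidimensional FFT; the interesting differences are in operation counts and, for out-of-core data, in how many times the disk must be traversed.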
Preventing Format-String Attacks via Automatic and Efficient Dynamic Checking
We propose preventing format-string attacks with a combination of static dataflow analysis and dynamic white-lists of safe address ranges. The dynamic nature of our white-lists provides the flexibility necessary to encode a very precise security policy: namely, that %n-specifiers in printf-style functions should modify a memory location x only if the programmer explicitly passes a pointer to x. Our static dataflow analysis and source transformations let us automatically maintain and check the white-list without any programmer effort; they merely need to change the Makefile. Our analysis also detects pointers passed to vprintf-style functions through (possibly multiple layers of) wrapper functions. Our results establish that our approach provides better protection than previous work and incurs little performance overhead.
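The white-list policy can be sketched at a high level. This Python stand-in uses object identity in place of C pointer addresses, ignores `%%` and length modifiers, and its function and parameter names are our own illustrative assumptions, not the instrumentation the paper generates:

```python
import re

def safe_printf(fmt, args, whitelist):
    """Illustrative dynamic white-list check for %n conversions.

    'whitelist' stands in for the set of addresses the programmer
    explicitly passed pointers to; a %n whose matching argument is not on
    the white-list is rejected before any write can occur. (A Python
    sketch of a policy the real system enforces on C printf-family calls.)
    """
    # Simplified conversion-specifier grammar: flags, width, precision,
    # conversion character. Real printf grammar is richer.
    specs = re.findall(r'%[-+ #0]*\d*(?:\.\d+)?[diouxXeEfgGcsn]', fmt)
    for spec, arg in zip(specs, args):
        if spec.endswith('n') and id(arg) not in whitelist:
            raise ValueError("%n target is not on the white-list")
    return True
```

The key property mirrored here is that the check is dynamic: the white-list's contents depend on what pointers actually flow into the call, which is what lets the policy be precise without programmer annotations.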
Type Safety and Erasure Proofs for "A Type System for Coordinated Data Structures"
We prove the Type Safety and Erasure Theorems presented in Section 4 of Ringenburg and Grossman's paper "A Type System for Coordinated Data Structures" [1]. We also remind the reader of the syntax, semantics, and typing rules for the coordinated list language described in Section 3 of the same paper. We refer the reader to the original paper for a detailed presentation of the coordinated data structure type system.
1 The Language
Figures 1, 2, and 3 present, respectively, the syntax, semantics, and typing rules for our coordinated list language. We implicitly assume Δ and Γ do not have repeated elements. For example, Δ, α:κ is ill-formed if α ∈ Dom(Δ). To avoid conflicts, we can systematically rename constructs with binding occurrences. We therefore treat Δ and Γ as partial functions. All explicit occurrences of α and x in the grammar are binding (except when they constitute the entire type or expression, of course). Substitution is defined as usual.
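The convention of systematically renaming binding occurrences is the usual capture-avoiding substitution. A minimal sketch over a generic lambda-term syntax (our own illustrative encoding, not the paper's coordinated list language) looks like:

```python
# Terms: ('var', x) | ('app', e1, e2) | ('lam', x, body)

def free_vars(e):
    """Free variables of a term; the bound variable of a lambda is excluded."""
    if e[0] == 'var':
        return {e[1]}
    if e[0] == 'app':
        return free_vars(e[1]) | free_vars(e[2])
    return free_vars(e[2]) - {e[1]}

def subst(e, x, v):
    """Capture-avoiding substitution e[v/x].

    When descending under a binder whose variable occurs free in v, rename
    the binder to a fresh name first -- the 'systematic renaming' that makes
    treating contexts as partial functions sound.
    """
    if e[0] == 'var':
        return v if e[1] == x else e
    if e[0] == 'app':
        return ('app', subst(e[1], x, v), subst(e[2], x, v))
    y, body = e[1], e[2]
    if y == x:
        return e                       # x is shadowed; substitution stops here
    if y in free_vars(v):              # would capture a free variable of v
        avoid = free_vars(v) | free_vars(body) | {x}
        fresh = y
        while fresh in avoid:
            fresh += "'"               # prime the name until it is fresh
        body = subst(body, y, ('var', fresh))
        y = fresh
    return ('lam', y, subst(body, x, v))
```

For example, substituting y for x under a binder named y forces the binder to be renamed, so the free y is not captured.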
A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark
We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1 TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster-computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1 TB dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly matrix access, and vectorized computation using SIMD units. We report these results and their implications for the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.
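The randomized CX idea can be sketched in-memory with NumPy: approximate the top-k right singular vectors via a randomized projection, turn them into column leverage scores, sample actual columns of A by those scores, and fit the rest by least squares. This is an illustrative single-node stand-in for the distributed Spark and C implementations, and the parameter names are our own:

```python
import numpy as np

def randomized_cx(A, k, c, seed=0):
    """Randomized CX factorization sketch: A ~= C @ X, where C holds
    c actual columns of A sampled by approximate leverage scores."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    sketch = k + 10                                 # oversampled sketch size
    # Randomized range finder for A's row space (i.e. its right singular
    # subspace), then a small SVD to orient the basis.
    G = rng.standard_normal((m, sketch))
    Q, _ = np.linalg.qr(A.T @ G)                    # n x sketch orthonormal
    B = A @ Q                                       # m x sketch projection
    _, _, Wt = np.linalg.svd(B, full_matrices=False)
    V = Q @ Wt.T                                    # approx right sing. vectors
    lev = np.sum(V[:, :k] ** 2, axis=1)             # column leverage scores
    p = lev / lev.sum()
    cols = rng.choice(n, size=c, replace=False, p=p)
    C = A[:, cols]                                  # sampled actual columns
    X = np.linalg.lstsq(C, A, rcond=None)[0]        # least-squares fit
    return C, X, cols
```

Because C consists of actual data columns, the factors stay interpretable (e.g., as informative ions in the MSI dataset), which is the point of CX over a plain truncated SVD.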
CosmoFlow: Using deep learning to learn the universe at scale
Deep learning is a promising tool for determining the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading for many element-wise operations, to improve training performance on Intel® Xeon Phi™ processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These enhancements enable us to process large 3D dark matter distributions and predict the cosmological parameters Ω_M, σ_8, and n_s with unprecedented accuracy.
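The 3D convolution and pooling primitives at CosmoFlow's core can be sketched naively in NumPy. This is a clarity-first stand-in for the tuned TensorFlow kernels; the names are illustrative, and no claim is made about the actual network architecture:

```python
import numpy as np

def conv3d(vol, kern):
    """Naive valid-mode 3D convolution (cross-correlation) over a single
    volume -- the primitive whose tuned implementation dominates training
    cost on 3D dark matter distributions."""
    D, H, W = vol.shape
    d, h, w = kern.shape
    out = np.empty((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(vol[i:i+d, j:j+h, k:k+w] * kern)
    return out

def maxpool3d(vol, s=2):
    """s x s x s max pooling via a strided reshape; each dimension of the
    volume must be divisible by s."""
    D, H, W = vol.shape
    return vol.reshape(D // s, s, H // s, s, W // s, s).max(axis=(1, 3, 5))
```

Production kernels fuse, block, and vectorize these loops for the Xeon Phi's wide SIMD units; the arithmetic, however, is exactly this.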