27 research outputs found

Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Author: J. Sugerman
K. Fatahalian
P. Hanrahan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2004

Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.

CiteSeerX

Crossref

Self-refining games using player analytics

Author: Fatahalian K
Humberston B
Kase B
O'Brien JF
Stanton M
Treuille A
Publication venue: eScholarship, University of California
Publication date: 01/01/2014

Data-driven simulation demands good training data drawn from a vast space of possible simulations. While fully sampling these large spaces is infeasible, we observe that in practical applications, such as gameplay, users explore only a vanishingly small subset of the dynamical state space. In this paper we present a sampling approach that takes advantage of this observation by concentrating precomputation around the states that users are most likely to encounter. We demonstrate our technique in a prototype self-refining game whose dynamics improve with play, ultimately providing realistically rendered, rich fluid dynamics in real time on a mobile device. Our results show that our analytics-driven training approach yields lower model error and fewer visual artifacts than a heuristic training strategy.

Crossref

eScholarship - University of California

Fusion: Abstractions for Multicore/Manycore Heterogenous Parallel Programming Using GPUs

Author: C. Dubach
D. Blythe
J. Nickolls
J.D. Owens
K. Fatahalian
M.C. Herbordt
Y. Yan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Crossref

A performance and energy evaluation of many-light rendering algorithms

Author: B Johnsson
B Walter
Björn Johnsson
CA Burns
K Akeley
K Fatahalian
O Olsson
SW Keckler
Tomas Akenine-Möller
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Recently, the performance of many-light algorithms, where thousands of light sources are used to compute the lighting in a scene, has improved so much that they have reached the realm of real-time rendering. In general, the algorithm that is considered “best” is the one that is the fastest in terms of time per frame. Given that power efficiency may become or already is one of the most important optimization factors for both hardware and software vendors for graphics, we take a different route and instead measure both energy usage per frame and frame time for a number of popular many-light rendering algorithms on an Intel Iris Pro. We use Pareto frontiers for each configuration to examine the possibilities for trade-offs between rendering time and energy consumption. Furthermore, we examine the optimal algorithms at each configuration, and are able to draw generalized conclusions on when each algorithm is most efficient. We also record several other statistics on the algorithms, e.g., bandwidth, and are able to draw further conclusions with regard to energy consumption.

Lund University Publications

Crossref

LED Street Light Research Project Part II: New Findings

Author: C+C Lighting
Cynthia Limauro
Donald K Carter
Kayvon Fatahalian
Stephen Quick
Publication date: 02/04/2018

Many cities are converting their existing street lighting to Light Emitting Diode (LED) source luminaires due to anticipated energy savings of 40 to 80 percent, as compared to high intensity discharge (HID) source luminaires, and maintenance savings estimated to be 50 to 75 percent due to the longer life of LED luminaires. Addressable electronic lighting controls and sensors are now available that can transform a basic streetlight into an intelligent, smart city device with public safety and other benefits. The number of variables that civic officials must consider for any street lighting conversion project has increased as a result of the rate of technological advances in LED luminaires, control systems, and optional components.

The purpose of this report is to provide an understanding of recent industry and technology changes, address common concerns raised when using LED light sources, recommend model specifications for LED luminaires and lighting controls in the public right of way, make suggestions for improving industry norms and code changes, comment on add-on features that show promise, and discuss what to expect as technology advances and the LED lighting industry matures.

FigShare

Learning to optimize Halide with tree search and random programs

Author: Adams A
Anderson L
Baghdadi R
Durand F
Fatahalian K
Gharbi M
Johnson S
Li TM
Ma K
Ragan-Kelley J
Steiner B
Publication venue: eScholarship, University of California
Publication date: 01/07/2019

We present a new algorithm to automatically schedule Halide programs for high-performance image processing and deep learning. We significantly improve upon the performance of previous methods, which considered a limited subset of schedules. We define a parameterization of possible schedules much larger than prior methods and use a variant of beam search to search over it. The search optimizes runtime predicted by a cost model based on a combination of new derived features and machine learning. We train the cost model by generating and featurizing hundreds of thousands of random programs and schedules. We show that this approach operates effectively with or without autotuning. It produces schedules which are on average almost twice as fast as the existing Halide autoscheduler without autotuning, or more than twice as fast with, and is the first automatic scheduling algorithm to significantly outperform human experts on average.

eScholarship - University of California

Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU

Author: B. Levy
I. Buck
J. Bolz
J. Krüger
J. Mallet
K. Fatahalian
M.R. Hestenes
N. Galoppo
R. Fernando
Publication date: 01/01/2007

A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse general-purpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML).

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-INSU

HAL-Rennes 1