1,713 research outputs found

    The Iray Light Transport Simulation and Rendering System

    Full text link
    While ray tracing has become increasingly common and path tracing is well understood by now, a major challenge lies in crafting an easy-to-use and efficient system implementing these technologies. Following a purely physically-based paradigm while still allowing for artistic workflows, the Iray light transport simulation and rendering system allows for rendering complex scenes by the push of a button and thus makes accurate light transport simulation widely available. In this document we discuss the challenges and implementation choices that follow from our primary design decisions, demonstrating that such a rendering system can be made a practical, scalable, and efficient real-world application that has been adopted by various companies across many fields and is in use by many industry professionals today

    A High-Throughput Solver for Marginalized Graph Kernels on GPU

    Get PDF
    We present the design and optimization of a linear solver on General Purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current generation GPUs, our solver forms the tensor product linear system on-the-fly without storing it in memory when performing matrix-vector dot product operations in PCG. Such on-the-fly computation is accomplished by using threads in a warp to cooperatively stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. Besides, we propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format.We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p

    GraphR: Accelerating Graph Processing Using ReRAM

    Full text link
    This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suit- able for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01x (up to 132.67x) speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline system. Com- pared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is 3.67x to 10.96x more energy efficiency compared to PIM-based architecture.Comment: Accepted to HPCA 201
    corecore