3,475 research outputs found

    Online Tensor Methods for Learning Latent Variable Models

    Get PDF
    We introduce an online tensor decomposition based approach for two latent variable modeling problems namely, (1) community detection, in which we learn the latent communities that the social actors in social networks belong to, and (2) topic modeling, in which we infer hidden topics of text articles. We consider decomposition of moment tensors using stochastic gradient descent. We conduct optimization of multilinear operations in SGD and avoid directly forming the tensors, to save computational and storage costs. We present optimized algorithm in two platforms. Our GPU-based implementation exploits the parallelism of SIMD architectures to allow for maximum speed-up by a careful optimization of storage and data transfer, whereas our CPU-based implementation uses efficient sparse matrix computations and is suitable for large sparse datasets. For the community detection problem, we demonstrate accuracy and computational efficiency on Facebook, Yelp and DBLP datasets, and for the topic modeling problem, we also demonstrate good performance on the New York Times dataset. We compare our results to the state-of-the-art algorithms such as the variational method, and report a gain of accuracy and a gain of several orders of magnitude in the execution time.Comment: JMLR 201

    Pipelining the Fast Multipole Method over a Runtime System

    Get PDF
    Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012

    Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping

    Get PDF
    FLUSEPA (Registered trademark in France No. 134009261) is an advanced simulation tool which performs a large panel of aerodynamic studies. It is the unstructured finite-volume solver developed by Airbus Safran Launchers company to calculate compressible, multidimensional, unsteady, viscous and reactive flows around bodies in relative motion. The time integration in FLUSEPA is done using an explicit temporal adaptive method. The current production version of the code is based on MPI and OpenMP. This implementation leads to important synchronizations that must be reduced. To tackle this problem, we present the study of a task-based parallelization of the aerodynamic solver of FLUSEPA using the runtime system StarPU and combining up to three levels of parallelism. We validate our solution by the simulation (using a finite-volume mesh with 80 million cells) of a take-off blast wave propagation for Ariane 5 launcher.Comment: Accepted manuscript of a paper in Journal of Computational Scienc

    Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures

    Get PDF
    Heterogeneous Architectures Are Being Used Extensively To Improve System Processing Capabilities. Critical Functions Of Each Application (Kernels) Can Be Mapped To Different Computing Devices (I.E. Cpus, Gpgpus, Accelerators) To Maximize Performance. However, Best Performance Can Only Be Achieved If Kernels Are Accurately Mapped To The Right Device. Moreover, In Some Cases Those Kernels Could Be Split And Executed Over Several Devices At The Same Time To Maximize The Use Of Compute Resources On Heterogeneous Parallel Architectures. In This Paper, We Define A Static Partitioning Model Based On Profiling Information From Previous Executions. This Model Follows A Quantitative Model Approach Which Computes The Optimal Match According To User-Defined Constraints. We Test Different Scenarios To Evaluate Our Model: Single Kernel And Multi-Kernel Applications. Experimental Results Show That Our Static Partitioning Model Could Increase Performance Of Parallel Applications By Deploying Not Only Different Kernels Over Different Devices But A Single Kernel Over Multiple Devices. This Allows To Avoid Having Idle Compute Resources On Heterogeneous Platforms, As Well As Enhancing The Overall Performance. (C) 2015 Elsevier B.V. All Rights Reserved.The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement n. 609666 [24]

    SOLUTIONS FOR OPTIMIZING THE DATA PARALLEL PREFIX SUM ALGORITHM USING THE COMPUTE UNIFIED DEVICE ARCHITECTURE

    Get PDF
    In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA) that provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA.CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp
    • 

    corecore