
    Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids

    We present a global optimization approach to optical flow estimation. The approach optimizes a classical optical flow objective over the full space of mappings between discrete grids. No descriptor matching is used. The highly regular structure of the space of mappings enables optimizations that reduce the computational complexity of the algorithm's inner loop from quadratic to linear and support efficient matching of tens of thousands of nodes to tens of thousands of displacements. We show that one-shot global optimization of a classical Horn-Schunck-type objective over regular grids at a single resolution is sufficient to initialize continuous interpolation and achieve state-of-the-art performance on challenging modern benchmarks. Comment: To be presented at CVPR 2016
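
    The quadratic-to-linear reduction in the inner loop is the kind of saving a min-convolution (distance-transform) trick provides when the pairwise cost depends only on the difference between neighboring displacements. A minimal sketch of that idea for a 1D L1 penalty over K candidate displacements (illustrative only, not the paper's actual solver):

        def min_convolution_l1(cost, weight=1.0):
            """Compute out[j] = min_i(cost[i] + weight * |i - j|) in O(K) with the
            classic two-pass distance transform, instead of the naive O(K^2) loop."""
            out = list(cost)
            for j in range(1, len(out)):                 # forward pass
                out[j] = min(out[j], out[j - 1] + weight)
            for j in range(len(out) - 2, -1, -1):        # backward pass
                out[j] = min(out[j], out[j + 1] + weight)
            return out

        cost = [5.0, 1.0, 4.0, 9.0, 2.0]                 # unary costs of K displacements
        naive = [min(c + abs(i - j) for i, c in enumerate(cost)) for j in range(len(cost))]
        print(min_convolution_l1(cost) == naive)         # True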

    Speeding up neighborhood search in local Gaussian process prediction

    Recent implementations of local approximate Gaussian process models have pushed computational boundaries for non-linear, non-parametric prediction problems, particularly when deployed as emulators for computer experiments. Their flavor of spatially independent computation accommodates massive parallelization, meaning that they can handle designs two or more orders of magnitude larger than was previously possible. However, accomplishing that feat can still require massive supercomputing resources. Here we aim to ease that burden. We study how predictive variance is reduced as local designs are built up for prediction. We then observe how the exhaustive and discrete nature of an important search subroutine involved in building such local designs may be overly conservative. Rather, we suggest that searching the space radially, i.e., continuously along rays emanating from the predictive location of interest, is a far thriftier alternative. Our empirical work demonstrates that ray-based search yields predictors with accuracy comparable to exhaustive search, but in a fraction of the time, bringing a supercomputer implementation back onto the desktop. Comment: 24 pages, 5 figures, 4 tables
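
    A minimal sketch of the contrast between exhaustive and ray-based candidate search when greedily growing a local design to shrink predictive variance at a single location. It assumes a squared-exponential kernel with unit signal variance and uses made-up helper names; it is not the authors' laGP implementation.

        import numpy as np

        def sq_exp_kernel(A, B, lengthscale=0.3):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * lengthscale ** 2))

        def pred_var(X, xstar, noise=1e-6):
            """Predictive variance of a zero-mean GP at xstar given design X."""
            K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
            k = sq_exp_kernel(X, xstar[None, :])[:, 0]
            return 1.0 - k @ np.linalg.solve(K, k)

        def next_point_exhaustive(X, xstar, candidates):
            """Greedy choice: candidate whose inclusion most reduces predictive variance."""
            scores = [pred_var(np.vstack([X, c]), xstar) for c in candidates]
            return candidates[int(np.argmin(scores))]

        def next_point_ray(X, xstar, n_rays=8, n_steps=10, rmax=1.0, seed=0):
            """Same greedy choice, but restricted to points on rays emanating from xstar."""
            rng = np.random.default_rng(seed)
            dirs = rng.normal(size=(n_rays, xstar.size))
            dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
            radii = np.linspace(0.05, rmax, n_steps)
            cands = np.array([xstar + r * d for d in dirs for r in radii])
            return next_point_exhaustive(X, xstar, cands)

        rng = np.random.default_rng(1)
        X = rng.uniform(-1, 1, size=(40, 2))        # current local design
        xstar = np.zeros(2)                         # predictive location of interest
        grid = rng.uniform(-1, 1, size=(2000, 2))   # exhaustive candidate set
        print("exhaustive:", next_point_exhaustive(X, xstar, grid))
        print("ray-based :", next_point_ray(X, xstar))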

    Enabling Factor Analysis on Thousand-Subject Neuroimaging Datasets

    The scale of functional magnetic resonance imaging data is rapidly increasing as large multi-subject datasets become widely available and high-resolution scanners are adopted. The inherent low dimensionality of the information in this data has led neuroscientists to consider factor analysis methods to extract and analyze the underlying brain activity. In this work, we consider two recent multi-subject factor analysis methods: the Shared Response Model and Hierarchical Topographic Factor Analysis. We perform analytical, algorithmic, and code optimization to enable multi-node parallel implementations to scale. Single-node improvements result in 99x and 1812x speedups on these two methods, respectively, and enable the processing of larger datasets. Our distributed implementations show strong scaling of 3.3x and 5.5x, respectively, with 20 nodes on real datasets. We also demonstrate weak scaling on a synthetic dataset with 1024 subjects, on up to 1024 nodes and 32,768 cores.
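
    For context, a sketch of the baseline deterministic Shared Response Model update (alternating orthogonal-Procrustes steps) that this kind of work accelerates; the optimized, multi-node implementation described above is not shown.

        import numpy as np

        def srm(datasets, k=10, n_iter=20, seed=0):
            """Deterministic SRM sketch: per-subject orthonormal maps W_i (voxels x k)
            and a shared response S (k x time) such that X_i ~= W_i @ S."""
            rng = np.random.default_rng(seed)
            S = rng.normal(size=(k, datasets[0].shape[1]))
            Ws = [None] * len(datasets)
            for _ in range(n_iter):
                for i, X in enumerate(datasets):
                    U, _, Vt = np.linalg.svd(X @ S.T, full_matrices=False)
                    Ws[i] = U @ Vt                                  # orthonormal columns
                S = sum(W.T @ X for W, X in zip(Ws, datasets)) / len(datasets)
            return Ws, S

        # toy data: 3 "subjects", 50 voxels, 200 time points, shared rank-10 signal
        rng = np.random.default_rng(1)
        S_true = rng.normal(size=(10, 200))
        data = [rng.normal(size=(50, 10)) @ S_true + 0.1 * rng.normal(size=(50, 200))
                for _ in range(3)]
        Ws, S = srm(data)
        print([round(float(np.linalg.norm(X - W @ S) / np.linalg.norm(X)), 3)
               for X, W in zip(data, Ws)])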

    On-the-fly tracing for data-centric computing: parallelization, workflow and applications

    As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is now a well-known programming model for data-centric computing, where parallelization is replaced entirely by partitioning the computing task through data, not all programs, particularly those using statistical computing and data mining algorithms with interdependence, can be refactored in such a fashion. On the other hand, many traditional automatic parallelization methods put an emphasis on formalism and may not achieve optimal performance with the given limited computing resources. In this work we propose a cross-platform programming paradigm, called on-the-fly data tracing, to provide source-to-source transformation, where the same framework also provides workflow optimization for larger applications. Using a big-data approximation, computations related to large-scale data input are identified in the code and workflow, and a simplified core dependence graph is built based on the computational load, taking big data into account. The code can then be partitioned into sections for efficient parallelization; at the workflow level, optimization can be performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. Treating each unit in both source code and workflow as a model, this framework enables model-based parallel programming that matches the available computing resources. The dissertation presents the techniques used in model-based parallel programming, the design of the software framework for both parallelization and workflow optimization, and its implementations in multiple programming languages. The framework is then validated in two ways: i) benchmarking of parallelization speed-up using typical examples in data analysis and machine learning (e.g. naive Bayes, k-means), and ii) three real-world applications in data-centric computing: pattern detection from hurricane and storm surge simulations, road traffic flow prediction, and text mining from social media data. The applications illustrate how to build scalable workflows with the framework, along with the resulting performance enhancements.
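
    A toy sketch of the scheduling side of this idea: workflow units form a dependence graph, each annotated with an estimated data volume, and ready units are dispatched heaviest-first so that big-data (I/O-heavy) stages are prioritized. The node names and volumes are invented and this is not the dissertation's framework (Python 3.9+ for graphlib).

        from graphlib import TopologicalSorter

        # estimated input volume (GB) per workflow unit, and "runs after" dependencies
        volume = {"ingest": 120, "clean": 80, "geocode": 40, "train": 10, "report": 1}
        deps = {"clean": {"ingest"}, "geocode": {"ingest"},
                "train": {"clean", "geocode"}, "report": {"train"}}

        ts = TopologicalSorter(deps)
        ts.prepare()
        schedule = []
        while ts.is_active():
            ready = sorted(ts.get_ready(), key=lambda n: -volume[n])  # heaviest first
            schedule.extend(ready)
            for n in ready:
                ts.done(n)
        print(schedule)   # ['ingest', 'clean', 'geocode', 'train', 'report']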

    An OpenMP based Parallelization Compiler for C Applications

    Directive-driven programming models, such as OpenMP, are one solution for exploiting the potential of multi-core architectures, and enable developers to accelerate software applications by adding annotations on for-type loops and other code regions. However, manual parallelization of applications is known to be a non-trivial and time-consuming process, requiring parallel programming skills. Automatic parallelization approaches can reduce the burden on the application development side. This paper presents an OpenMP-based automatic parallelization compiler, named AutoPar-Clava, for automatic identification and annotation of loops in C code. By using static analysis, parallelizable regions are detected, and compilable OpenMP parallel code is produced from the sequential version. To reduce each thread's accesses to shared memory, each variable is assigned the proper OpenMP scope. AutoPar-Clava also supports reductions on arrays, a feature available since OpenMP 4.5. The effectiveness of AutoPar-Clava is evaluated using the Polyhedral Benchmark suite, targeting an N-core x86-based computing platform. The achieved results are very promising and compare favorably with closely related auto-parallelization compilers such as the Intel C/C++ Compiler (i.e., icc), ROSE, TRACO, and Cetus.
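
    A toy Python sketch of the kind of loop classification such a compiler performs, distinguishing trivially parallel loops from scalar reductions; it is illustrative only and not AutoPar-Clava's actual analysis, which operates on C code and emits OpenMP pragmas (assumes Python 3.9+ AST layout).

        import ast, textwrap

        def classify_loop(loop):
            """'parallel' if every write targets a subscript indexed by the loop variable,
            'reduction' if scalars are only updated via augmented assignment, else 'sequential'."""
            ivar = loop.target.id
            reductions, ok = set(), True
            for node in ast.walk(loop):
                if isinstance(node, ast.Assign):
                    for tgt in node.targets:
                        if not (isinstance(tgt, ast.Subscript)
                                and isinstance(tgt.slice, ast.Name)
                                and tgt.slice.id == ivar):
                            ok = False
                elif isinstance(node, ast.AugAssign):
                    if isinstance(node.target, ast.Name):
                        reductions.add(node.target.id)
                    else:
                        ok = False
            if not ok:
                return "sequential"
            return "reduction on " + ", ".join(sorted(reductions)) if reductions else "parallel"

        src = textwrap.dedent("""
            def saxpy(a, x, y, out, n):
                for i in range(n):
                    out[i] = a * x[i] + y[i]

            def dot(x, y, n):
                s = 0.0
                for i in range(n):
                    s += x[i] * y[i]
                return s
        """)
        for fn in ast.parse(src).body:
            loop = next(n for n in ast.walk(fn) if isinstance(n, ast.For))
            print(fn.name, "->", classify_loop(loop))   # saxpy -> parallel, dot -> reduction on s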

    Scaling and universality in the phase diagram of the 2D Blume-Capel model

    We review the pertinent features of the phase diagram of the zero-field Blume-Capel model, focusing on the aspects of transition order, finite-size scaling and universality. In particular, we employ a range of Monte Carlo simulation methods to study the 2D spin-1 Blume-Capel model on the square lattice, investigating the behavior in the vicinity of the first-order and second-order regimes of the ferromagnet-paramagnet phase boundary, respectively. To achieve high-precision results, we utilize a combination of (i) a parallel version of the multicanonical algorithm and (ii) a hybrid updating scheme combining Metropolis and generalized Wolff cluster moves. These techniques are combined to study, for the first time, the correlation length of the model, using its scaling in the regime of second-order transitions to illustrate universality through the observed identity of the limiting value of ξ/L with the exactly known result for the Ising universality class. Comment: 16 pages, 7 figures, 1 table, submitted to Eur. Phys. J. Special Topics
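
    As a baseline point of reference, a plain single-spin Metropolis sweep for the zero-field 2D spin-1 Blume-Capel Hamiltonian H = -J Σ_<ij> s_i s_j + Δ Σ_i s_i², with s_i in {-1, 0, +1}; the multicanonical and Wolff-cluster machinery used in the paper is not sketched here.

        import numpy as np

        def metropolis_sweep(s, beta, J=1.0, delta=0.0, rng=None):
            """One Metropolis sweep of the spin-1 Blume-Capel model on an L x L
            periodic square lattice, H = -J sum_<ij> s_i s_j + delta sum_i s_i^2."""
            if rng is None:
                rng = np.random.default_rng()
            L = s.shape[0]
            for _ in range(L * L):
                i, j = rng.integers(L, size=2)
                old = s[i, j]
                new = rng.choice([v for v in (-1, 0, 1) if v != old])
                nn = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
                dE = -J * (new - old) * nn + delta * (new ** 2 - old ** 2)
                if dE <= 0 or rng.random() < np.exp(-beta * dE):
                    s[i, j] = new
            return s

        L, beta, rng = 16, 0.6, np.random.default_rng(0)
        spins = rng.integers(-1, 2, size=(L, L))      # random initial spin-1 configuration
        for _ in range(200):
            metropolis_sweep(spins, beta, delta=0.5, rng=rng)
        print("magnetization per site:", spins.mean())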

    Haloes gone MAD: The Halo-Finder Comparison Project

    [abridged] We present a detailed comparison of fundamental dark matter halo properties retrieved by a substantial number of different halo finders. These codes span a wide range of techniques, including friends-of-friends (FOF), spherical-overdensity (SO) and phase-space-based algorithms. We further introduce a robust (and publicly available) suite of test scenarios that allows halo finder developers to compare the performance of their codes against those presented here. This set includes mock haloes containing various levels and distributions of substructure at a range of resolutions, as well as a cosmological simulation of the large-scale structure of the universe. All the halo finding codes tested could successfully recover the spatial location of our mock haloes. They further returned lists of particles (potentially) belonging to the object, which led to coinciding values for the maximum of the circular velocity profile and the radius at which it is reached. All the finders based in configuration space struggled to recover substructure located close to the centre of the host halo, and the radial dependence of the recovered mass varies from finder to finder. Finders based in phase space could resolve central substructure, although they had difficulties in accurately recovering its properties. Via a resolution study we found that most of the finders could not reliably recover substructure containing fewer than 30-40 particles. Here, too, the phase-space finders excelled, resolving substructure down to 10-20 particles. By comparing the halo finders using a high-resolution cosmological volume we found that they agree remarkably well on fundamental properties of astrophysical significance (e.g. mass, position, velocity, and peak of the rotation curve). Comment: 27 interesting pages, 20 beautiful figures, and 4 informative tables accepted for publication in MNRAS. The high-resolution version of the paper as well as all the test cases and analysis can be found at the web site http://popia.ft.uam.es/HaloesGoingMA
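
    A toy friends-of-friends (FOF) finder, one of the classes of codes compared above: particles closer than the linking length are joined via a union-find structure, and connected sets become groups. It uses a brute-force O(N^2) pair search and no periodic boundaries, unlike the production codes.

        import numpy as np

        def friends_of_friends(pos, linking_length):
            """Label particles by FOF group: link every pair closer than linking_length."""
            n = len(pos)
            parent = list(range(n))

            def find(a):                       # union-find with path halving
                while parent[a] != a:
                    parent[a] = parent[parent[a]]
                    a = parent[a]
                return a

            for i in range(n):
                d2 = ((pos[i + 1:] - pos[i]) ** 2).sum(axis=1)
                for j in np.nonzero(d2 < linking_length ** 2)[0] + i + 1:
                    ra, rb = find(i), find(int(j))
                    if ra != rb:
                        parent[rb] = ra
            return np.array([find(i) for i in range(n)])

        rng = np.random.default_rng(0)
        halo = rng.normal(0.5, 0.02, size=(200, 3))    # a dense mock "halo"
        field = rng.uniform(0, 1, size=(800, 3))       # uniform background particles
        labels = friends_of_friends(np.vstack([halo, field]), linking_length=0.02)
        _, counts = np.unique(labels, return_counts=True)
        print("largest recovered group:", counts.max(), "particles")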

    A Similarity Measure for GPU Kernel Subgraph Matching

    Accelerator architectures specialize in executing SIMD (single instruction, multiple data) operations in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernels' CFGs to gain insights into an application's resource requirements, based on the shape and traversal of the graph, the instruction operations executed and the registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that help end users, autotuners and compilers generate high-performing code.
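
    A much simpler stand-in for the paper's subgraph matching, just to make the "basic blocks + dynamic frequencies -> kernel similarity" pipeline concrete; the opcode names and block data below are invented, not CUDAflow output.

        from collections import Counter
        from math import sqrt

        def kernel_signature(cfg):
            """Opcode histogram of a kernel, weighted by each basic block's dynamic frequency."""
            sig = Counter()
            for block in cfg["blocks"]:
                for op in block["ops"]:
                    sig[op] += block["freq"]
            return sig

        def cosine(a, b):
            dot = sum(a[k] * b[k] for k in set(a) | set(b))
            na = sqrt(sum(v * v for v in a.values()))
            nb = sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        # two toy kernels: basic blocks with opcodes and measured execution counts
        saxpy = {"blocks": [{"ops": ["LD", "LD", "FMA", "ST"], "freq": 1_000_000},
                            {"ops": ["BRA"], "freq": 1_000_000}]}
        reduce_sum = {"blocks": [{"ops": ["LD", "FADD"], "freq": 1_000_000},
                                 {"ops": ["SHFL", "FADD"], "freq": 31_250},
                                 {"ops": ["ST", "BRA"], "freq": 31_250}]}
        print("similarity:", round(cosine(kernel_signature(saxpy),
                                          kernel_signature(reduce_sum)), 3))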