Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids
We present a global optimization approach to optical flow estimation. The
approach optimizes a classical optical flow objective over the full space of
mappings between discrete grids. No descriptor matching is used. The highly
regular structure of the space of mappings enables optimizations that reduce
the computational complexity of the algorithm's inner loop from quadratic to
linear and support efficient matching of tens of thousands of nodes to tens of
thousands of displacements. We show that one-shot global optimization of a
classical Horn-Schunck-type objective over regular grids at a single resolution
is sufficient to initialize continuous interpolation and achieve
state-of-the-art performance on challenging modern benchmarks.
Comment: To be presented at CVPR 2016
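The quadratic-to-linear reduction claimed above is characteristic of distance-transform tricks for quadratic, Horn-Schunck-style costs. As a hedged illustration only (a standard textbook routine, not code from the paper), the Felzenszwalb-Huttenlocher lower-envelope transform below computes out[q] = min_p (f[p] + (q - p)^2) over n labels in O(n) instead of O(n^2):

```python
import numpy as np

def dt1d(f):
    """Lower envelope of parabolas: out[q] = min_p f[p] + (q - p)**2, in O(n)."""
    n = len(f)
    out = np.empty(n)
    v = np.zeros(n, dtype=int)   # indices of parabolas forming the lower envelope
    z = np.empty(n + 1)          # boundaries between consecutive parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        # intersection of parabola q with the rightmost envelope parabola
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        out[q] = (q - v[k]) ** 2 + f[v[k]]
    return out

costs = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(dt1d(costs))   # each entry is the cheapest "cost + squared displacement"
```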
Speeding up neighborhood search in local Gaussian process prediction
Recent implementations of local approximate Gaussian process models have
pushed computational boundaries for non-linear, non-parametric prediction
problems, particularly when deployed as emulators for computer experiments.
Their flavor of spatially independent computation accommodates massive
parallelization, meaning that they can handle designs two or more orders of
magnitude larger than previously. However, accomplishing that feat can still
require massive supercomputing resources. Here we aim to ease that burden. We
study how predictive variance is reduced as local designs are built up for
prediction. We then observe how the exhaustive and discrete nature of an
important search subroutine involved in building such local designs may be
overly conservative. Rather, we suggest that searching the space radially,
i.e., continuously along rays emanating from the predictive location of
interest, is a far thriftier alternative. Our empirical work demonstrates that
ray-based search yields predictors with accuracy comparable to exhaustive
search, but in a fraction of the time - bringing a supercomputer implementation
back onto the desktop.
Comment: 24 pages, 5 figures, 4 tables
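To make the radial idea concrete, here is a minimal Python sketch under stated assumptions: a unit-variance squared-exponential kernel, an ALC-style reduction-in-variance criterion, and illustrative names throughout (this is not the laGP API). Rather than scoring every candidate exhaustively, it runs a continuous 1-D search for the best radius along each ray from the predictive location, then snaps the winner to the nearest actual design point:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sqexp(A, B, ls=0.3):
    """Unit-variance squared-exponential kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def var_reduction(xs, Xn, Kinv, xc):
    """ALC-style drop in predictive variance at xs from adding candidate xc."""
    g = sqexp(Xn, xc[None])                                   # n x 1
    cross = sqexp(xs[None], xc[None])[0, 0] - (sqexp(xs[None], Xn) @ Kinv @ g)[0, 0]
    vc = 1.0 - (g.T @ Kinv @ g)[0, 0] + 1e-8                  # candidate's own variance
    return cross ** 2 / vc

def next_point_by_rays(xs, Xn, X_cand, n_rays=8):
    """Search radially from xs instead of scoring every candidate in X_cand."""
    Kinv = np.linalg.inv(sqexp(Xn, Xn) + 1e-8 * np.eye(len(Xn)))
    best_val, best_x = np.inf, None
    for theta in np.linspace(0.0, 2 * np.pi, n_rays, endpoint=False):
        ray = np.array([np.cos(theta), np.sin(theta)])
        # continuous 1-D search over the radius along this ray
        res = minimize_scalar(lambda r: -var_reduction(xs, Xn, Kinv, xs + r * ray),
                              bounds=(1e-3, 1.0), method="bounded")
        if res.fun < best_val:
            best_val, best_x = res.fun, xs + res.x * ray
    # snap the continuous optimum to the nearest actual design point
    return X_cand[np.argmin(((X_cand - best_x) ** 2).sum(1))]

rng = np.random.default_rng(0)
X_cand = rng.uniform(-1, 1, size=(2000, 2))   # full design
Xn = X_cand[:6].copy()                        # small local design being grown
xs = np.array([0.25, -0.10])                  # predictive location of interest
print("next local design point:", next_point_by_rays(xs, Xn, X_cand))
```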
Enabling Factor Analysis on Thousand-Subject Neuroimaging Datasets
The scale of functional magnetic resonance image data is rapidly increasing
as large multi-subject datasets are becoming widely available and
high-resolution scanners are adopted. The inherent low-dimensionality of the
information in this data has led neuroscientists to consider factor analysis
methods to extract and analyze the underlying brain activity. In this work, we
consider two recent multi-subject factor analysis methods: the Shared Response
Model and Hierarchical Topographic Factor Analysis. We perform analytical,
algorithmic, and code optimization to enable multi-node parallel
implementations to scale. Single-node improvements result in 99x and 1812x
speedups on these two methods and enable the processing of larger datasets.
Our distributed implementations show strong scaling of 3.3x and 5.5x
respectively with 20 nodes on real datasets. We also demonstrate weak scaling
on a synthetic dataset with 1024 subjects, on up to 1024 nodes and 32,768
cores.
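One reason these workloads distribute well is visible in the structure of the updates. As a simplified sketch (the deterministic flavor of the Shared Response Model, not the paper's probabilistic implementation), each subject's orthogonal Procrustes update depends only on that subject's data and the current shared response, so the per-subject loop is the natural unit of multi-node parallelism:

```python
import numpy as np

def srm_fit(Xs, k=5, n_iter=10, seed=0):
    """Deterministic SRM sketch: find orthonormal W_i with X_i ~ W_i S.

    Xs is a list of (voxels_i x timepoints) arrays, one per subject.
    """
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((k, Xs[0].shape[1]))   # shared response
    Ws = [None] * len(Xs)
    for _ in range(n_iter):
        # Each subject's update reads only (X_i, S): trivially parallel
        # across subjects, hence across nodes in a distributed setting.
        for i, X in enumerate(Xs):
            U, _, Vt = np.linalg.svd(X @ S.T, full_matrices=False)
            Ws[i] = U @ Vt                         # orthogonal Procrustes solution
        S = np.mean([W.T @ X for W, X in zip(Ws, Xs)], axis=0)
    return Ws, S

# Three synthetic "subjects" with different voxel counts, shared time axis
rng = np.random.default_rng(1)
Xs = [rng.standard_normal((v, 200)) for v in (50, 60, 70)]
Ws, S = srm_fit(Xs)
print(S.shape, [W.shape for W in Ws])   # (5, 200) [(50, 5), (60, 5), (70, 5)]
```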
On-the-fly tracing for data-centric computing: parallelization, workflow and applications
As data-centric computing becomes the trend in science and engineering, more and more hardware systems and middleware frameworks are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is a well-known programming model for data-centric computing, in which parallelization is replaced entirely by partitioning the computing task across the data, not all programs can be refactored in this fashion, particularly those built on statistical computing and data mining algorithms with interdependent steps. On the other hand, many traditional automatic parallelization methods emphasize formalism and may not achieve optimal performance with the given limited computing resources. In this work we propose a cross-platform programming paradigm, called on-the-fly data tracing, that provides source-to-source transformation, with the same framework also providing workflow optimization for larger applications. Using a big-data approximation, computations related to large-scale data input are identified in the code and workflow, and a simplified core dependence graph is built from the computational load, taking big data into account. The code can then be partitioned into sections for efficient parallelization; at the workflow level, optimization is performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. By regarding each unit in both source code and workflow as a model, the framework enables model-based parallel programming that matches the available computing resources. The dissertation presents the techniques used in model-based parallel programming, the design of the software framework for both parallelization and workflow optimization, and its implementations in multiple programming languages. Two sets of experiments validate the framework: i) benchmarking of parallelization speed-up on typical examples in data analysis and machine learning (e.g., naive Bayes, k-means), and ii) three real-world applications in data-centric computing that illustrate its efficiency: pattern detection from hurricane and storm surge simulations, road traffic flow prediction, and text mining from social media data. The applications show how to build scalable workflows with the framework, along with the resulting performance enhancements.
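A toy sketch of the scheduling side of this idea, with entirely hypothetical units and cost estimates: once each code or workflow unit is modeled with a big-data-driven cost, a topological traversal of the core dependence graph exposes the independent sections, which can then be dispatched heaviest-first:

```python
from graphlib import TopologicalSorter

# Hypothetical core dependence graph: unit -> the units it depends on,
# with cost estimates dominated by the big-data inputs each unit touches.
deps = {"load": set(), "clean": {"load"},
        "stats": {"clean"}, "mine": {"clean"}, "report": {"stats", "mine"}}
cost = {"load": 100, "clean": 40, "stats": 60, "mine": 60, "report": 5}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())
    # everything in `ready` is mutually independent: a parallel section
    for unit in sorted(ready, key=cost.get, reverse=True):
        print(f"dispatch {unit} (estimated cost {cost[unit]})")
        ts.done(unit)
```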
An OpenMP based Parallelization Compiler for C Applications
Directive-driven programming models, such as OpenMP, are one solution for exploiting the potential of multi-core architectures; they enable developers to accelerate software applications by adding annotations to for-type loops and other code regions. However, manual parallelization of applications is known to be a non-trivial and time-consuming process requiring parallel programming skills. Automatic parallelization approaches can reduce this burden on application developers. This paper presents an OpenMP-based automatic parallelization compiler, named AutoPar-Clava, for automatic identification and annotation of loops in C code. Using static analysis, parallelizable regions are detected and a compilable OpenMP parallel version of the sequential code is produced. In order to reduce each thread's accesses to shared memory, every variable is categorized into the proper OpenMP scope. AutoPar-Clava also supports reduction on arrays, which has been available since OpenMP 4.5. The effectiveness of AutoPar-Clava is evaluated on the Polyhedral Benchmark suite, targeting an N-core x86-based computing platform. The achieved results are very promising and compare favorably with closely related auto-parallelization compilers such as the Intel C/C++ Compiler (icc), ROSE, TRACO, and Cetus.
Scaling and universality in the phase diagram of the 2D Blume-Capel model
We review the pertinent features of the phase diagram of the zero-field
Blume-Capel model, focusing on the aspects of transition order, finite-size
scaling and universality. In particular, we employ a range of Monte Carlo
simulation methods to study the 2D spin-1 Blume-Capel model on the square
lattice to investigate the behavior in the vicinity of the first-order and
second-order regimes of the ferromagnet-paramagnet phase boundary,
respectively. To achieve high-precision results, we utilize a combination of
(i) a parallel version of the multicanonical algorithm and (ii) a hybrid
updating scheme combining Metropolis and generalized Wolff cluster moves. These
techniques allow us to study, for the first time, the correlation length ξ of
the model, using its scaling in the regime of second-order transitions to
illustrate universality through the observed identity of the limiting value of
ξ/L with the exactly known result for the Ising universality class.
Comment: 16 pages, 7 figures, 1 table; submitted to Eur. Phys. J. Special Topics
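For orientation, here is a minimal sketch of only the plain Metropolis ingredient of such a study, for the spin-1 Blume-Capel Hamiltonian H = -Σ_<ij> s_i s_j + D Σ_i s_i^2 with s_i ∈ {-1, 0, +1} and J = 1 (the couplings below are illustrative; the actual study layers multicanonical weights and Wolff cluster moves on top of this):

```python
import numpy as np

def metropolis_sweep(s, beta, D, rng):
    """One Metropolis sweep of the 2D spin-1 Blume-Capel model."""
    L = s.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        new = rng.choice([-1, 0, 1])
        # nearest-neighbor sum with periodic boundaries
        nn = s[(i + 1) % L, j] + s[(i - 1) % L, j] \
           + s[i, (j + 1) % L] + s[i, (j - 1) % L]
        dE = -(new - s[i, j]) * nn + D * (new ** 2 - s[i, j] ** 2)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i, j] = new

rng = np.random.default_rng(2)
L, beta, D = 16, 0.6, 0.5          # illustrative parameters, not the paper's
spins = rng.choice([-1, 0, 1], size=(L, L))
for _ in range(200):
    metropolis_sweep(spins, beta, D, rng)
print("magnetization per site:", spins.mean())
```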
Haloes gone MAD: The Halo-Finder Comparison Project
[abridged] We present a detailed comparison of fundamental dark matter halo
properties retrieved by a substantial number of different halo finders. These
codes span a wide range of techniques including friends-of-friends (FOF),
spherical-overdensity (SO) and phase-space based algorithms. We further
introduce a robust (and publicly available) suite of test scenarios that allows
halo finder developers to compare the performance of their codes against those
presented here. This set includes mock haloes containing various levels and
distributions of substructure at a range of resolutions as well as a
cosmological simulation of the large-scale structure of the universe. All the
halo finding codes tested could successfully recover the spatial location of
our mock haloes. They further returned lists of particles (potentially)
belonging to the object that led to coinciding values for the maximum of the
circular velocity profile and the radius where it is reached. All the finders
based in configuration space struggled to recover substructure that was located
close to the centre of the host halo and the radial dependence of the mass
recovered varies from finder to finder. Those finders based in phase space
could resolve central substructure although they found difficulties in
accurately recovering its properties. Via a resolution study we found that most
of the finders could not reliably recover substructure containing fewer than
30-40 particles. However, also here the phase space finders excelled by
resolving substructure down to 10-20 particles. By comparing the halo finders
using a high resolution cosmological volume we found that they agree remarkably
well on fundamental properties of astrophysical significance (e.g. mass,
position, velocity, and peak of the rotation curve).
Comment: 27 interesting pages, 20 beautiful figures, and 4 informative tables;
accepted for publication in MNRAS. The high-resolution version of the paper,
as well as all the test cases and analysis, can be found at the web site
http://popia.ft.uam.es/HaloesGoingMA
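As a pointer to what the simplest of the compared families does, here is a toy friends-of-friends grouper (illustrative only: no periodic box, no unbinding, no minimum group size, made-up parameters); particles closer than the linking length are merged into one group via union-find:

```python
import numpy as np
from scipy.spatial import cKDTree

def fof_groups(pos, linking_length):
    """Friends-of-friends: link particles closer than linking_length, via union-find."""
    parent = np.arange(len(pos))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in cKDTree(pos).query_pairs(linking_length):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # union the two groups
    return np.array([find(i) for i in range(len(pos))])

rng = np.random.default_rng(3)
# two well-separated mock "haloes" of 200 particles each
pos = np.vstack([rng.normal(c, 0.05, size=(200, 3)) for c in (0.2, 0.8)])
labels = fof_groups(pos, linking_length=0.1)
print("groups recovered:", len(np.unique(labels)))
```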
A Similarity Measure for GPU Kernel Subgraph Matching
Accelerator architectures specialize in executing SIMD (single instruction,
multiple data) in lockstep. Because the majority of CUDA applications are
parallelized loops, control flow information can provide an in-depth
characterization of a kernel. CUDAflow is a tool that statically separates CUDA
binaries into basic block regions and dynamically measures instruction and
basic block frequencies. CUDAflow captures this information in a control flow
graph (CFG) and performs subgraph matching across various kernels' CFGs to gain
insights into an application's resource requirements, based on the shape and
traversal of the graph, the instruction operations executed, and the registers
allocated, among other information. The utility of CUDAflow is demonstrated
with SHOC and Rodinia application case studies on a variety of GPU
architectures, revealing novel thread divergence characteristics that help
end users, autotuners, and compilers generate high-performing code.
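CUDAflow's measure matches subgraphs across CFGs; as a far simpler stand-in that conveys the flavor, this sketch compares two kernels' dynamic edge-frequency profiles with cosine similarity (toy CFGs and frequencies invented for illustration):

```python
import math

# Toy dynamic profiles: CFG edge -> measured execution frequency
kernel_a = {("entry", "loop"): 1, ("loop", "loop"): 1024, ("loop", "exit"): 1}
kernel_b = {("entry", "loop"): 1, ("loop", "loop"): 512,
            ("loop", "branch"): 512, ("branch", "exit"): 1}

def profile_similarity(p, q):
    """Cosine similarity of two edge-frequency profiles over the union of edges."""
    dot = sum(p.get(e, 0) * q.get(e, 0) for e in set(p) | set(q))
    na = math.sqrt(sum(v * v for v in p.values()))
    nb = math.sqrt(sum(v * v for v in q.values()))
    return dot / (na * nb)

print(f"kernel similarity: {profile_similarity(kernel_a, kernel_b):.3f}")
```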