Internal Diffusion-Limited Aggregation: Parallel Algorithms and Complexity
The computational complexity of internal diffusion-limited aggregation (DLA)
is examined from both a theoretical and a practical point of view. We show that
for two or more dimensions, the problem of predicting the cluster from a given
set of paths is complete for the complexity class CC, the subset of P
characterized by circuits composed of comparator gates. CC-completeness is
believed to imply that, in the worst case, growing a cluster of size n requires
polynomial time in n even on a parallel computer.
A parallel relaxation algorithm is presented that uses the fact that clusters
are nearly spherical to guess the cluster from a given set of paths, and then
corrects defects in the guessed cluster through a non-local annihilation
process. The parallel running time of the relaxation algorithm for
two-dimensional internal DLA is studied by simulating it on a serial computer.
The numerical results are compatible with a running time that is either
polylogarithmic in n or a small power of n. Thus the computational resources
needed to grow large clusters are significantly less on average than the
worst-case analysis would suggest.
For a parallel machine with k processors, we show that random clusters in d
dimensions can be generated in O((n/k + log k) n^{2/d}) steps. This is a
significant speedup over explicit sequential simulation, which takes
O(n^{1+2/d}) time on average.
Finally, we show that in one dimension internal DLA can be predicted in O(log
n) parallel time, and so is in the complexity class NC.
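The "explicit sequential simulation" that the abstract above uses as its baseline can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the standard internal DLA rule (each particle starts at the origin and performs a simple random walk until it first reaches an unoccupied lattice site), and the function name `internal_dla` and its parameters are hypothetical.

```python
import random

def internal_dla(n, dim=2, seed=0):
    """Grow an internal DLA cluster of n sites by explicit sequential simulation.

    Assumption: particles are released one at a time from the origin and
    walk on the integer lattice Z^dim until they leave the current cluster.
    """
    rng = random.Random(seed)
    origin = (0,) * dim
    cluster = set()
    for _ in range(n):
        pos = origin
        # Walk until the particle first lands on an unoccupied site.
        while pos in cluster:
            axis = rng.randrange(dim)
            step = rng.choice((-1, 1))
            pos = pos[:axis] + (pos[axis] + step,) + pos[axis + 1:]
        cluster.add(pos)  # the particle settles and the cluster grows by one
    return cluster

if __name__ == "__main__":
    cluster = internal_dla(500, dim=2, seed=1)
    print(len(cluster), "sites; origin occupied:", (0, 0) in cluster)
```

Each particle adds exactly one site, so n particles always give a cluster of size n. With `dim=1` the cluster is always a contiguous interval around the origin, which gives some intuition for why the one-dimensional case is the easy one.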
Asynchronous and corrected-asynchronous numerical solutions of parabolic PDEs on MIMD multiprocessors
A major problem in achieving significant speed-up on parallel machines is the overhead involved in synchronizing the concurrent processes. Removing the synchronization constraint has the potential to speed up the computation. The authors present asynchronous (AS) and corrected-asynchronous (CA) finite difference schemes for the multi-dimensional heat equation. Although the discussion concentrates on the Euler scheme for the solution of the heat equation, it has the potential to be extended to other schemes and other parabolic partial differential equations (PDEs). These schemes are analyzed and implemented on the shared-memory, multi-user Sequent Balance machine. Numerical results for one- and two-dimensional problems are presented. It is shown experimentally that the synchronization penalty can be about 50 percent of run time: in most cases, the asynchronous scheme runs twice as fast as the parallel synchronous scheme. In general, the efficiency of the parallel schemes increases with processor load, with the time level, and with the problem dimension. The efficiency of the AS scheme may reach 90 percent or more, but it provides accurate results only for steady-state values. The CA scheme, on the other hand, is less efficient but provides more accurate results for intermediate (non-steady-state) values.
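For orientation, the synchronous explicit Euler scheme that the AS and CA variants relax can be sketched in one dimension as below. This is only the serial baseline under stated assumptions (fixed Dirichlet end values, a uniform grid, and the usual stability condition r = alpha*dt/dx^2 <= 1/2); it does not implement the asynchronous or corrected-asynchronous schemes, and the names `heat_euler_step` and `solve` are hypothetical.

```python
def heat_euler_step(u, r):
    """One explicit Euler step for u_t = alpha * u_xx on a uniform 1D grid.

    r = alpha * dt / dx**2 (assumed <= 0.5 for stability).
    The two end values are held fixed (Dirichlet boundary conditions).
    """
    new = u[:]
    for i in range(1, len(u) - 1):
        # Every interior point reads its neighbours from the SAME time level;
        # in a parallel run this is what forces a barrier between steps.
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new

def solve(u0, r, steps):
    """Apply `steps` synchronous Euler steps starting from u0."""
    u = list(u0)
    for _ in range(steps):
        u = heat_euler_step(u, r)
    return u

if __name__ == "__main__":
    # Ends held at 0 and 1: the steady state is the straight line between them.
    print(solve([0.0, 0.0, 0.0, 0.0, 1.0], r=0.25, steps=2000))
```

An asynchronous variant would let each processor update its block of points using whatever neighbour values are currently in shared memory, which is consistent with the abstract's remark that only steady-state values remain accurate.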
Runtime Optimizations for Prediction with Tree-Based Models
Tree-based models have proven to be an effective solution for web ranking as
well as other problems in diverse domains. This paper focuses on optimizing the
runtime performance of applying such models to make predictions, given an
already-trained model. Although exceedingly simple conceptually, most
implementations of tree-based models do not efficiently utilize modern
superscalar processor architectures. By laying out data structures in memory in
a more cache-conscious fashion, removing branches from the execution flow using
a technique called predication, and micro-batching predictions using a
technique called vectorization, we are able to better exploit modern processor
architectures and significantly improve the speed of tree-based models over
hard-coded if-else blocks. Our work contributes to the exploration of
architecture-conscious runtime implementations of machine learning algorithms.
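The idea of predication mentioned above, replacing a branch by arithmetic on the comparison result, can be illustrated with a small sketch. It assumes a complete binary tree of fixed depth stored in flat arrays (node i has children 2i+1 and 2i+2); the array layout and the name `predict` are assumptions made for illustration, not the paper's implementation. Python will not show the speedup, which comes from avoiding branch mispredictions in compiled code; the sketch only shows the index arithmetic.

```python
def predict(features, thresholds, leaves, depth, x):
    """Evaluate a complete binary decision tree without if/else on the data.

    features[i], thresholds[i]: split feature and threshold of internal node i.
    leaves[j]: output value of the j-th leaf, counted left to right.
    """
    i = 0
    for _ in range(depth):
        # The comparison yields 0 or 1, which selects the left or right child
        # arithmetically instead of through a conditional jump.
        i = 2 * i + 1 + int(x[features[i]] > thresholds[i])
    # Internal nodes occupy indices 0 .. 2**depth - 2; leaves come after them.
    return leaves[i - (2 ** depth - 1)]

if __name__ == "__main__":
    features = [0, 1, 1]            # depth-2 tree: 3 internal nodes
    thresholds = [0.5, 0.2, 0.8]
    leaves = [10, 20, 30, 40]
    print(predict(features, thresholds, leaves, 2, [0.9, 0.5]))
```

Because every instance follows the same fixed-length loop, several instances can be pushed through the tree together, which is what makes this layout a natural starting point for the micro-batching the abstract calls vectorization.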
Developing Efficient Discrete Simulations on Multicore and GPU Architectures
In this paper we show how to efficiently implement parallel discrete simulations on multicore and GPU architectures through a real example of an application: a cellular automata model of laser dynamics. We describe the techniques employed to build and optimize the implementations using the OpenMP and CUDA frameworks. We have evaluated the performance on two hardware platforms that represent different target market segments: a high-end platform for scientific computing, using an Intel Xeon Platinum 8259CL server with 48 cores and an NVIDIA Tesla V100 GPU, both running on the Amazon Web Services (AWS) cloud; and a consumer-oriented platform, using an Intel Core i9-9900K CPU and an NVIDIA GeForce GTX 1050 Ti GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained on both platforms, and we identify some important issues that degrade performance on them. We also found that current multicore CPUs with large core counts can deliver performance very close to that of GPUs, and even identical in some cases.
Ministerio de Economía, Industria y Competitividad, Gobierno de España (MINECO), and the Agencia Estatal de Investigación (AEI) of Spain, cofinanced by FEDER funds (EU) TIN2017-89842
Scalable and Sustainable Deep Learning via Randomized Hashing
Current deep learning architectures are growing larger in order to learn from
complex datasets. These architectures require giant matrix multiplication
operations to train millions of parameters. Conversely, there is another
growing trend to bring deep learning to low-power, embedded devices. The matrix
operations, associated with both training and testing of deep networks, are
very expensive from a computational and energy standpoint. We present a novel
hashing based technique to drastically reduce the amount of computation needed
to train and test deep networks. Our approach combines recent ideas from
adaptive dropouts and randomized hashing for maximum inner product search to
select the nodes with the highest activation efficiently. Our new algorithm for
deep learning reduces the overall computational cost of forward and
back-propagation by operating on significantly fewer (sparse) nodes. As a
consequence, our algorithm uses only 5% of the total multiplications, while
keeping on average within 1% of the accuracy of the original model. A unique
property of the proposed hashing based back-propagation is that the updates are
always sparse. Due to the sparse gradient updates, our algorithm is ideally
suited for asynchronous and parallel training leading to near linear speedup
with increasing number of cores. We demonstrate the scalability and
sustainability (energy efficiency) of our proposed algorithm via rigorous
experimental evaluations on several real datasets.
Complexity, parallel computation and statistical physics
The intuition that a long history is required for the emergence of complexity
in natural systems is formalized using the notion of depth. The depth of a
system is defined in terms of the number of parallel computational steps needed
to simulate it. Depth provides an objective, irreducible measure of history
applicable to systems of the kind studied in statistical physics. It is argued
that physical complexity cannot occur in the absence of substantial depth and
that depth is a useful proxy for physical complexity. The ideas are illustrated
for a variety of systems in statistical physics. Comment: 21 pages, 7 figures