3,042 research outputs found
Iterative methods with memory for solving systems of nonlinear equations using a second order approximation
Iterative methods for solving nonlinear equations are said to have memory when the calculation of the next iterate requires more than one previous iterate. Methods with memory usually behave very stably, in the sense that the set of initial estimates from which they converge is wide. With the right choice of parameters, iterative methods without memory can increase their order of convergence significantly, becoming schemes with memory. In this work, starting from a simple method without memory, we increase its order of convergence without adding new functional evaluations, by approximating the accelerating parameter with Newton interpolation polynomials of degree one and two. Using this technique in the multidimensional case, we extend the proposed method to systems of nonlinear equations. Numerical tests are presented to verify the theoretical results, and a study of the dynamics of the method applied to different problems shows its stability.
This research was supported by PGC2018-095896-B-C22 (MCIU/AEI/FEDER, UE), Generalitat Valenciana PROMETEO/2016/089, and FONDOCYT 2016-2017-212 (República Dominicana).
Cordero Barbero, A.; Maimó, J. G.; Torregrosa Sánchez, J. R.; Vassileva, M. P. (2019). Iterative methods with memory for solving systems of nonlinear equations using a second order approximation. Mathematics, 7(11), 1-12. https://doi.org/10.3390/math7111069
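As a concrete illustration of the degree-one case described in the abstract above, here is a minimal sketch of a Steffensen-type iteration with memory for a scalar equation, where the accelerating parameter is refreshed from the two most recent iterates. The scheme and all names are illustrative assumptions, not the authors' exact method.

```python
# Illustrative Steffensen-type iteration with memory for a scalar f(x) = 0.
# The accelerating parameter gamma is refreshed each step from the derivative
# of a degree-one Newton interpolation polynomial built on the two most recent
# iterates (i.e., a divided difference).  A generic sketch of the idea only.

def with_memory(f, x0, gamma0=0.01, tol=1e-12, max_iter=50):
    x_prev, gamma = x0, gamma0
    # first step: plain Steffensen step with the initial parameter
    w = x_prev + gamma * f(x_prev)
    dd = (f(w) - f(x_prev)) / (w - x_prev)   # divided difference f[x, w]
    x = x_prev - f(x_prev) / dd
    for _ in range(max_iter):
        if abs(f(x)) < tol:
            break
        # memory: N_1'(x) = f[x, x_prev], so take gamma = -1 / N_1'(x)
        gamma = -1.0 / ((f(x) - f(x_prev)) / (x - x_prev))
        w = x + gamma * f(x)
        dd = (f(w) - f(x)) / (w - x)
        x_prev, x = x, x - f(x) / dd
    return x

root = with_memory(lambda x: x**3 - 2.0, 1.5)
print(root)  # ~ 2**(1/3)
```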
A Fast Parallel Poisson Solver on Irregular Domains Applied to Beam Dynamic Simulations
We discuss the scalable parallel solution of the Poisson equation within a
Particle-In-Cell (PIC) code for the simulation of electron beams in particle
accelerators of irregular shape. The problem is discretized by finite
differences. Depending on the treatment of the Dirichlet boundary, the
resulting system of equations is symmetric or 'mildly' nonsymmetric positive
definite. In
all cases, the system is solved by the preconditioned conjugate gradient
algorithm with smoothed aggregation (SA) based algebraic multigrid (AMG)
preconditioning. We investigate variants of the implementation of SA-AMG that
lead to considerable improvements in the execution times. We demonstrate good
scalability of the solver on distributed-memory parallel machines with up to
2048 processors. We also compare our SA-AMG PCG solver with an FFT-based
solver that is more commonly used for applications in beam dynamics.
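The solver pipeline described above can be sketched in a few lines of serial Python using pyamg and scipy; a regular-grid Poisson matrix stands in for the paper's irregular-domain discretization, and none of the parallel machinery is represented.

```python
# Minimal serial sketch of the solver described above: conjugate gradient
# preconditioned by smoothed-aggregation AMG.  A regular-grid Poisson matrix
# is a stand-in for the paper's irregular-domain finite-difference system.
import numpy as np
import pyamg
from scipy.sparse.linalg import cg

A = pyamg.gallery.poisson((200, 200), format='csr')  # 2-D FD Laplacian
b = np.ones(A.shape[0])

ml = pyamg.smoothed_aggregation_solver(A)   # build the SA-AMG hierarchy
M = ml.aspreconditioner(cycle='V')          # one V-cycle per CG iteration

x, info = cg(A, b, M=M)
print(info, np.linalg.norm(b - A @ x))      # 0 means converged
```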
Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era
The key challenge to improving performance in the age of Dark Silicon is how
to leverage transistors when they cannot all be used at the same time. In
modern SoCs, these transistors are often used to create specialized
accelerators, which improve energy efficiency for some applications by 10-1000x.
While this might seem like the magic bullet we need, for most CPU applications
more energy is dissipated in the memory system than in the processor: these
large gains in efficiency are only possible if the DRAM and memory hierarchy
are mostly idle. We refer to this desirable state as Dark Memory, and it only
occurs for applications with an extreme form of locality.
To make these tradeoffs concrete, we introduce Pareto curves in the energy/op
and mm^2/(ops/s) metric space for compute units, accelerators, and on-chip
memory/interconnect. These Pareto curves allow us to solve the power-,
performance-, and area-constrained optimization problem to determine which
accelerators should be used, and how to set their design parameters to optimize
the system. This analysis shows that memory accesses create a floor to the
achievable energy-per-op. Thus high performance requires Dark Memory, which in
turn requires co-design of the algorithm for parallelism and locality, with the
hardware.
Comment: 8 pages, to appear in IEEE Design and Test Journal
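To make the metric space concrete, here is a toy Pareto-frontier filter over hypothetical design points; all numbers are invented for illustration and are not from the paper.

```python
# Toy illustration of the Pareto analysis described above: keep only design
# points not dominated in both energy/op and mm^2/(ops/s).  All numbers are
# hypothetical; the paper derives real curves for compute units,
# accelerators, and the memory system.
designs = {
    # name: (energy per op [pJ], area per throughput [mm^2 per (op/s)])
    "cpu":       (100.0, 1e-9),
    "simd":      (20.0,  3e-10),
    "accel_a":   (2.0,   5e-10),
    "accel_b":   (1.5,   2e-9),
    "bad_point": (50.0,  4e-9),   # dominated: worse in both metrics
}

def pareto(points):
    """Return points for which no other point is <= in both metrics
    and strictly < in at least one."""
    front = {}
    for name, (e, a) in points.items():
        dominated = any(
            (e2 <= e and a2 <= a) and (e2 < e or a2 < a)
            for n2, (e2, a2) in points.items() if n2 != name
        )
        if not dominated:
            front[name] = (e, a)
    return front

print(pareto(designs))   # the energy/area tradeoff curve
```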
Parallel implementation of electronic structure eigensolver using a partitioned folded spectrum method
A parallel implementation of an eigensolver designed for electronic structure
calculations is presented. The method is applicable to computational tasks that
solve a sequence of eigenvalue problems where the solution for a particular
iteration is similar but not identical to the solution from the previous
iteration. Such problems occur frequently when performing electronic structure
calculations in which the eigenvectors are solutions to the Kohn-Sham
equations. The eigenvectors are represented in some type of basis, but the
problem sizes are normally too large for direct diagonalization in that basis.
Instead, a subspace diagonalization procedure is employed in which matrix
elements of the Hamiltonian operator are generated and the eigenvalues and
eigenvectors of the resulting reduced matrix are obtained using a standard
eigensolver from a package such as LAPACK or ScaLAPACK. While this method works
well and is widely used, the standard eigensolvers scale poorly on massively
parallel computer systems for the matrix sizes typical of electronic structure
calculations. We present a partitioned folded spectrum method (PFSM) that
takes into account the iterative nature of the problem and performs well on
massively parallel systems. Test results for a
range of problems are presented that demonstrate an equivalent level of
accuracy when compared to the standard eigensolvers, while also executing up to
an order of magnitude faster. Unlike O(N) methods, the technique works equally
well for metals and systems with unoccupied orbitals as for insulators and
semiconductors. Timing and accuracy results are presented for a range of
systems, including a 512-atom diamond cell, a cluster of 13 C60 molecules,
bulk copper, a 216-atom silicon cell with a vacancy (using 40 unoccupied
states/atom), and a 4000-atom aluminum supercell.
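The folded spectrum idea at the heart of PFSM can be illustrated with a small dense example: eigenpairs of H nearest a chosen shift sigma become the smallest eigenpairs of (H - sigma*I)^2. The partitioning and parallel aspects of the paper are not shown.

```python
# Minimal dense illustration of the folded-spectrum idea behind PFSM:
# eigenvalues of H nearest a shift sigma become the *smallest* eigenvalues
# of the folded operator (H - sigma*I)^2, so a solver targeting the low end
# of the spectrum can be aimed at any interior region.
import numpy as np

rng = np.random.default_rng(0)
n = 200
H = rng.standard_normal((n, n))
H = 0.5 * (H + H.T)                       # symmetric "Hamiltonian"

sigma = 1.0                               # target region of the spectrum
F = (H - sigma * np.eye(n)) @ (H - sigma * np.eye(n))

w_folded, v = np.linalg.eigh(F)
# the leading columns of v span eigenvectors of H with eigenvalues closest
# to sigma; recover those eigenvalues via Rayleigh quotients
approx = np.array([vi @ H @ vi for vi in v[:, :5].T])

w_true = np.linalg.eigh(H)[0]
nearest = w_true[np.argsort(np.abs(w_true - sigma))[:5]]
print(np.sort(approx), np.sort(nearest))  # should agree closely
```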
Analytical Cost Metrics: Days of Future Past
As we move towards the exascale era, the new architectures must be capable of
running the massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
Architectures are constantly evolving, making current performance-optimization
strategies less applicable and requiring new strategies to be invented. The
solution is for new architectures, new programming models, and applications
to move forward together. Doing this is, however, extremely hard. There are too
many design choices in too many dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
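A minimal example of the kind of analytical cost model proposed in point (i), in the spirit of a roofline bound; the machine parameters are placeholders, not measurements.

```python
# A tiny example of the analytical cost models advocated above: a
# roofline-style execution-time estimate from operation counts and data
# traffic.  The machine parameters are placeholders, not measurements.
def exec_time(flops, bytes_moved, peak_flops=1e12, bandwidth=1e11):
    """Time is bounded below by both compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# dense n x n matrix-vector product: ~2n^2 flops, ~8n^2 bytes (fp64 matrix)
n = 10_000
t = exec_time(flops=2 * n**2, bytes_moved=8 * n**2)
print(f"{t*1e3:.2f} ms (memory-bound: arithmetic intensity = 0.25 flop/byte)")
```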
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suitable for graph
processing because: 1) the algorithms are iterative and can inherently
tolerate imprecision; 2) both probability calculations (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers
(e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if
a vertex program of a graph algorithm can be expressed as sparse matrix-vector
multiplication (SpMV), it can be efficiently performed by a ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain from performing parallel operations outweighs the waste due to sparsity.
The experimental results show that GRAPHR achieves a 16.01x (up to 132.67x)
speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline
system. Compared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes
4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is
3.67x to 10.96x more energy efficient, compared to a PIM-based architecture.
Comment: Accepted to HPCA 2018
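The key insight lends itself to a short software sketch: PageRank expressed as iterated SpMV, the operation a GE would evaluate in analog. Plain scipy stands in for the crossbar; the graph and damping factor are illustrative.

```python
# Software sketch of the key GRAPHR insight: a vertex program such as
# PageRank is just an iterated sparse matrix-vector product (SpMV), the
# operation a ReRAM crossbar evaluates in analog.  Plain scipy stands in
# for the crossbar here; the tiny graph is illustrative.
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (1, 2), (2, 0), (2, 1), (3, 2)]   # directed (src, dst)
n = 4
srcs, dsts = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (dsts, srcs)), shape=(n, n))
out_deg = np.asarray(A.sum(axis=0)).ravel()
M = A.multiply(1.0 / out_deg)            # column-stochastic transition matrix

d, r = 0.85, np.full(n, 1.0 / n)
for _ in range(50):
    r = (1 - d) / n + d * (M @ r)        # one SpMV per iteration
print(r / r.sum())                        # PageRank scores
```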
Particle-in-Cell Laser-Plasma Simulation on Xeon Phi Coprocessors
This paper concerns development of a high-performance implementation of the
Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors.
We discuss the suitability of the method for the Xeon Phi architecture and
present our experience of porting and optimizing the existing parallel
Particle-in-Cell code PICADOR. Direct porting with no code modification yields
performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem
with 50 particles per cell. We demonstrate the step-by-step application of
optimization techniques such as improving data locality, enhancing
parallelization efficiency, and vectorization, which together lead to a 3.75x
speedup on CPU and 7.5x on Xeon Phi. The optimized version achieves 18.8 ns per
particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on
an Intel Xeon Phi 5110P. On a real problem of laser ion acceleration in targets
with a surface grating, which requires a large number of macroparticles per
cell, the speedup of Xeon Phi compared to CPU is 1.6x.
Comment: 16 pages, 3 figures
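For orientation, the per-particle update whose cost is quoted above (in ns per particle) looks roughly like the following vectorized leapfrog push; fields, units, and the simplified rotation step are toy assumptions, not PICADOR's actual kernel.

```python
# Rough vectorized sketch of the per-particle work measured above
# (ns per particle update): a simplified leapfrog push of charged particles
# in given E and B fields.  Fields, units, and constants are toy values;
# the real kernel also interpolates fields and scatters currents to the grid.
import numpy as np

n = 100_000
rng = np.random.default_rng(1)
x = rng.random((n, 3))                   # positions
v = np.zeros((n, 3))                     # velocities
E = np.array([0.0, 0.0, 1.0])            # uniform electric field (toy)
B = np.array([0.0, 1.0, 0.0])            # uniform magnetic field (toy)
qm, dt = 1.0, 1e-3                       # charge/mass ratio, time step

for _ in range(10):
    # half electric kick, simplified magnetic rotation, half kick, drift
    v += 0.5 * qm * dt * E
    v += qm * dt * np.cross(v, B)
    v += 0.5 * qm * dt * E
    x += dt * v

print(x.mean(axis=0))
```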
Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms
When a computational task tolerates a relaxation of its specification or when
an algorithm tolerates the effects of noise in its execution, hardware,
programming languages, and system software can trade deviations from correct
behavior for lower resource usage. We present, for the first time, a synthesis
of research results on computing systems that only make as many errors as their
users can tolerate, from across the disciplines of computer aided design of
circuits, digital system design, computer architecture, programming languages,
operating systems, and information theory.
Rather than over-provisioning resources at each layer to avoid errors, it can
be more efficient to exploit the masking of errors occurring at one layer,
which prevents them from propagating to a higher layer. We survey tradeoffs for
individual layers of computing systems from the circuit level to the operating
system level and illustrate the potential benefits of end-to-end approaches
using two illustrative examples. To tie together the survey, we present a
consistent formalization of terminology, across the layers, which does not
significantly deviate from the terminology traditionally used by research
communities in their layer of focus.
Comment: 35 pages
AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference
The intrinsic error tolerance of neural networks (NNs) makes approximate
computing a promising technique to improve the energy efficiency of NN
inference. Conventional approximate computing focuses on balancing the
efficiency-accuracy trade-off for existing pre-trained networks, which can lead
to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented
training framework to facilitate approximate computing for NN inference.
Specifically, AxTrain leverages the synergy between two orthogonal methods:
one actively searches for a distribution of network parameters with high error
tolerance, and the other passively learns resilient weights by numerically
incorporating the noise distributions of the approximate hardware into the
forward pass during the training phase. Experimental results from
various datasets with near-threshold computing and approximation multiplication
strategies demonstrate AxTrain's ability to obtain resilient neural network
parameters and improvements in system energy efficiency.
Comment: In International Symposium on Low Power Electronics and Design
(ISLPED) 2018
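The "passive" half of the approach, learning resilient weights by exposing the network to hardware noise in the forward pass, can be sketched as follows; the multiplicative-Gaussian noise model and its magnitude are assumptions for illustration, not AxTrain's actual hardware model.

```python
# Sketch of the "passive" method described above: inject the approximate
# hardware's noise into the forward pass so training learns resilient
# weights.  The multiplicative-Gaussian model is an illustrative assumption.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    def __init__(self, in_f, out_f, rel_sigma=0.05):
        super().__init__(in_f, out_f)
        self.rel_sigma = rel_sigma       # assumed relative hardware noise

    def forward(self, x):
        w = self.weight
        if self.training:                # inject noise only while training
            w = w * (1 + self.rel_sigma * torch.randn_like(w))
        return nn.functional.linear(x, w, self.bias)

layer = NoisyLinear(128, 10)
out = layer(torch.randn(32, 128))        # noisy forward during training
print(out.shape)
```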
Porting of the DBCSR library for Sparse Matrix-Matrix Multiplications to Intel Xeon Phi systems
Multiplication of two sparse matrices is a key operation in the simulation of
the electronic structure of systems containing thousands of atoms and
electrons. The highly optimized sparse linear algebra library DBCSR
(Distributed Block Compressed Sparse Row) has been specifically designed to
efficiently perform such sparse matrix-matrix multiplications. This library is
the basic building block for linear scaling electronic structure theory and low
scaling correlated methods in CP2K. It is parallelized using MPI and OpenMP,
and can exploit GPU accelerators by means of CUDA. We describe a performance
comparison of DBCSR on systems with Intel Xeon Phi Knights Landing (KNL)
processors, with respect to systems with Intel Xeon CPUs (including systems
with GPUs). We find that DBCSR on Cray XC40 KNL-based systems is 11%-14%
slower than on a hybrid Cray XC50 with Nvidia P100 cards at the same number of
nodes. When compared to a Cray XC40 system equipped with dual-socket Intel Xeon
CPUs, the KNL is up to 24% faster.
Comment: Submitted to the ParCo2017 conference, Bologna, Italy, 12-15 September
2017
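The core operation, multiplication of two block-sparse matrices, can be mimicked with scipy's BSR format; this is only a serial stand-in for DBCSR's distributed block-CSR structure, with none of the MPI, OpenMP, or CUDA layers.

```python
# Sketch of the core DBCSR operation: multiplication of two sparse matrices
# stored in a *blocked* compressed-row format.  scipy's BSR type stands in
# for DBCSR's distributed block-CSR structure; the parallel layers of the
# library are not represented.
from scipy.sparse import random as sprand, bsr_matrix

n, blk = 512, 8
A = bsr_matrix(sprand(n, n, density=0.05, random_state=0), blocksize=(blk, blk))
B = bsr_matrix(sprand(n, n, density=0.05, random_state=1), blocksize=(blk, blk))

C = A @ B                                # blocked sparse matrix-matrix product
print(type(C), C.nnz)
```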