3,042 research outputs found
Iterative methods with memory for solving systems of nonlinear equations using a second order approximation
Iterative methods for solving nonlinear equations are said to have memory when the calculation of the next iterate requires more than one previous iterate. Methods with memory usually behave very stably, in the sense that the set of initial estimates from which they converge is wide. With the right choice of parameters, iterative methods without memory can increase their order of convergence significantly, becoming schemes with memory. In this work, starting from a simple method without memory, we increase its order of convergence without adding new functional evaluations, by approximating the accelerating parameter with Newton interpolation polynomials of degree one and two. Using this technique in the multidimensional case, we extend the proposed method to systems of nonlinear equations. Numerical tests are presented to verify the theoretical results, and a study of the dynamics of the method applied to different problems shows its stability.
This research was supported by PGC2018-095896-B-C22 (MCIU/AEI/FEDER, UE), Generalitat Valenciana PROMETEO/2016/089, and FONDOCYT 2016-2017-212 (República Dominicana).
Cordero Barbero, A.; Maimó, J. G.; Torregrosa Sánchez, J. R.; Vassileva, M. P. (2019). Iterative methods with memory for solving systems of nonlinear equations using a second order approximation. Mathematics, 7(11), 1-12. https://doi.org/10.3390/math7111069
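As a concrete illustration of the degree-one case described in the abstract above, here is a minimal sketch of a Steffensen-type iteration with memory for a scalar equation, where the accelerating parameter is refreshed from the two most recent iterates. The scheme and all names are illustrative assumptions, not the authors' exact method.

```python
# Illustrative Steffensen-type iteration with memory for a scalar f(x) = 0.
# The accelerating parameter gamma is refreshed each step from the derivative
# of a degree-one Newton interpolation polynomial built on the two most recent
# iterates (i.e., a divided difference).  A generic sketch of the idea only.

def with_memory(f, x0, gamma0=0.01, tol=1e-12, max_iter=50):
    x_prev, gamma = x0, gamma0
    # first step: plain Steffensen step with the initial parameter
    w = x_prev + gamma * f(x_prev)
    dd = (f(w) - f(x_prev)) / (w - x_prev)   # divided difference f[x, w]
    x = x_prev - f(x_prev) / dd
    for _ in range(max_iter):
        if abs(f(x)) < tol:
            break
        # memory: N_1'(x) = f[x, x_prev], so take gamma = -1 / N_1'(x)
        gamma = -1.0 / ((f(x) - f(x_prev)) / (x - x_prev))
        w = x + gamma * f(x)
        dd = (f(w) - f(x)) / (w - x)
        x_prev, x = x, x - f(x) / dd
    return x

root = with_memory(lambda x: x**3 - 2.0, 1.5)
print(root)  # ~ 2**(1/3)
```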
A Fast Parallel Poisson Solver on Irregular Domains Applied to Beam Dynamic Simulations
We discuss the scalable parallel solution of the Poisson equation within a
Particle-In-Cell (PIC) code for the simulation of electron beams in particle
accelerators of irregular shape. The problem is discretized by finite
differences. Depending on the treatment of the Dirichlet boundary, the
resulting system of equations is symmetric or 'mildly' nonsymmetric positive
definite. In
all cases, the system is solved by the preconditioned conjugate gradient
algorithm with smoothed aggregation (SA) based algebraic multigrid (AMG)
preconditioning. We investigate variants of the implementation of SA-AMG that
lead to considerable improvements in the execution times. We demonstrate good
scalability of the solver on distributed-memory parallel machines with up to
2048 processors. We also compare our SA-AMG PCG solver with an FFT-based
solver that is more commonly used for applications in beam dynamics.
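The solver pipeline described above can be sketched in a few lines of serial Python using pyamg and scipy; a regular-grid Poisson matrix stands in for the paper's irregular-domain discretization, and none of the parallel machinery is represented.

```python
# Minimal serial sketch of the solver described above: conjugate gradient
# preconditioned by smoothed-aggregation AMG.  A regular-grid Poisson matrix
# is a stand-in for the paper's irregular-domain finite-difference system.
import numpy as np
import pyamg
from scipy.sparse.linalg import cg

A = pyamg.gallery.poisson((200, 200), format='csr')  # 2-D FD Laplacian
b = np.ones(A.shape[0])

ml = pyamg.smoothed_aggregation_solver(A)   # build the SA-AMG hierarchy
M = ml.aspreconditioner(cycle='V')          # one V-cycle per CG iteration

x, info = cg(A, b, M=M)
print(info, np.linalg.norm(b - A @ x))      # 0 means converged
```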
Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era
The key challenge to improving performance in the age of Dark Silicon is how
to leverage transistors when they cannot all be used at the same time. In
modern SoCs, these transistors are often used to create specialized
accelerators, which improve energy efficiency for some applications by 10-1000x.
While this might seem like the magic bullet we need, for most CPU applications
more energy is dissipated in the memory system than in the processor: these
large gains in efficiency are only possible if the DRAM and memory hierarchy
are mostly idle. We refer to this desirable state as Dark Memory, and it only
occurs for applications with an extreme form of locality.
To make these tradeoffs concrete, we introduce Pareto curves in the energy/op
and mm^2/(ops/s) metric space for compute units, accelerators, and on-chip
memory/interconnect. These Pareto curves allow us to solve the power-,
performance-, and area-constrained optimization problem to determine which
accelerators should be used, and how to set their design parameters to optimize
the system. This analysis shows that memory accesses create a floor to the
achievable energy-per-op. Thus high performance requires Dark Memory, which in
turn requires co-design of the algorithm for parallelism and locality, with the
hardware.
Comment: 8 pages, to appear in IEEE Design and Test Journal
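To make the metric space concrete, here is a toy Pareto-frontier filter over hypothetical design points; all numbers are invented for illustration and are not from the paper.

```python
# Toy illustration of the Pareto analysis described above: keep only design
# points not dominated in both energy/op and mm^2/(ops/s).  All numbers are
# hypothetical; the paper derives real curves for compute units,
# accelerators, and the memory system.
designs = {
    # name: (energy per op [pJ], area per throughput [mm^2 per (op/s)])
    "cpu":       (100.0, 1e-9),
    "simd":      (20.0,  3e-10),
    "accel_a":   (2.0,   5e-10),
    "accel_b":   (1.5,   2e-9),
    "bad_point": (50.0,  4e-9),   # dominated: worse in both metrics
}

def pareto(points):
    """Return points for which no other point is <= in both metrics
    and strictly < in at least one."""
    front = {}
    for name, (e, a) in points.items():
        dominated = any(
            (e2 <= e and a2 <= a) and (e2 < e or a2 < a)
            for n2, (e2, a2) in points.items() if n2 != name
        )
        if not dominated:
            front[name] = (e, a)
    return front

print(pareto(designs))   # the energy/area tradeoff curve
```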
Parallel implementation of electronic structure eigensolver using a partitioned folded spectrum method
A parallel implementation of an eigensolver designed for electronic structure
calculations is presented. The method is applicable to computational tasks that
solve a sequence of eigenvalue problems where the solution for a particular
iteration is similar but not identical to the solution from the previous
iteration. Such problems occur frequently when performing electronic structure
calculations in which the eigenvectors are solutions to the Kohn-Sham
equations. The eigenvectors are represented in some type of basis, but the
problem sizes are normally too large for direct diagonalization in that basis.
Instead, a subspace diagonalization procedure is employed in which matrix
elements of the Hamiltonian operator are generated and the eigenvalues and
eigenvectors of the resulting reduced matrix are obtained using a standard
eigensolver from a package such as LAPACK or ScaLAPACK. While this method works
well and is widely used, the standard eigensolvers scale poorly on massively
parallel computer systems for the matrix sizes typical of electronic structure
calculations. We present a partitioned folded spectrum method (PFSM) that
takes into account the iterative nature of the problem and performs well on
massively parallel systems. Test results for a
range of problems are presented that demonstrate an equivalent level of
accuracy when compared to the standard eigensolvers, while also executing up to
an order of magnitude faster. Unlike O(N) methods, the technique works equally
well for metals and systems with unoccupied orbitals as for insulators and
semiconductors. Timing and accuracy results are presented for a range of
systems, including a 512-atom diamond cell, a cluster of 13 C60 molecules,
bulk copper, a 216-atom silicon cell with a vacancy (using 40 unoccupied
states/atom), and a 4000-atom aluminum supercell.
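The folded spectrum idea at the heart of PFSM can be illustrated with a small dense example: eigenpairs of H nearest a chosen shift sigma become the smallest eigenpairs of (H - sigma*I)^2. The partitioning and parallel aspects of the paper are not shown.

```python
# Minimal dense illustration of the folded-spectrum idea behind PFSM:
# eigenvalues of H nearest a shift sigma become the *smallest* eigenvalues
# of the folded operator (H - sigma*I)^2, so a solver targeting the low end
# of the spectrum can be aimed at any interior region.
import numpy as np

rng = np.random.default_rng(0)
n = 200
H = rng.standard_normal((n, n))
H = 0.5 * (H + H.T)                       # symmetric "Hamiltonian"

sigma = 1.0                               # target region of the spectrum
F = (H - sigma * np.eye(n)) @ (H - sigma * np.eye(n))

w_folded, v = np.linalg.eigh(F)
# the leading columns of v span eigenvectors of H with eigenvalues closest
# to sigma; recover those eigenvalues via Rayleigh quotients
approx = np.array([vi @ H @ vi for vi in v[:, :5].T])

w_true = np.linalg.eigh(H)[0]
nearest = w_true[np.argsort(np.abs(w_true - sigma))[:5]]
print(np.sort(approx), np.sort(nearest))  # should agree closely
```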
Analytical Cost Metrics: Days of Future Past
As we move towards the exascale era, the new architectures must be capable of
running the massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
Architectures are constantly evolving, making current performance-optimization
strategies less applicable and requiring new strategies to be invented. The
solution is for new architectures, new programming models, and applications
to move forward together. Doing this is, however, extremely hard. There are too
many design choices in too many dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
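A minimal example of the kind of analytical cost model proposed in point (i), in the spirit of a roofline bound; the machine parameters are placeholders, not measurements.

```python
# A tiny example of the analytical cost models advocated above: a
# roofline-style execution-time estimate from operation counts and data
# traffic.  The machine parameters are placeholders, not measurements.
def exec_time(flops, bytes_moved, peak_flops=1e12, bandwidth=1e11):
    """Time is bounded below by both compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# dense n x n matrix-vector product: ~2n^2 flops, ~8n^2 bytes (fp64 matrix)
n = 10_000
t = exec_time(flops=2 * n**2, bytes_moved=8 * n**2)
print(f"{t*1e3:.2f} ms (memory-bound: arithmetic intensity = 0.25 flop/byte)")
```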
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suitable for graph
processing because: 1) the algorithms are iterative and can inherently
tolerate imprecision; 2) both probability calculations (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers
(e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if
a vertex program of a graph algorithm can be expressed as sparse matrix-vector
multiplication (SpMV), it can be efficiently performed by a ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain from performing parallel operations outweighs the waste due to sparsity.
The experimental results show that GRAPHR achieves a 16.01x (up to 132.67x)
speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline
system. Compared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes
4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is
3.67x to 10.96x more energy efficient, compared to a PIM-based architecture.
Comment: Accepted to HPCA 2018
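The key insight lends itself to a short software sketch: PageRank expressed as iterated SpMV, the operation a GE would evaluate in analog. Plain scipy stands in for the crossbar; the graph and damping factor are illustrative.

```python
# Software sketch of the key GRAPHR insight: a vertex program such as
# PageRank is just an iterated sparse matrix-vector product (SpMV), the
# operation a ReRAM crossbar evaluates in analog.  Plain scipy stands in
# for the crossbar here; the tiny graph is illustrative.
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (1, 2), (2, 0), (2, 1), (3, 2)]   # directed (src, dst)
n = 4
srcs, dsts = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (dsts, srcs)), shape=(n, n))
out_deg = np.asarray(A.sum(axis=0)).ravel()
M = A.multiply(1.0 / out_deg)            # column-stochastic transition matrix

d, r = 0.85, np.full(n, 1.0 / n)
for _ in range(50):
    r = (1 - d) / n + d * (M @ r)        # one SpMV per iteration
print(r / r.sum())                        # PageRank scores
```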
Particle-in-Cell Laser-Plasma Simulation on Xeon Phi Coprocessors
This paper concerns development of a high-performance implementation of the
Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors.
We discuss the suitability of the method for the Xeon Phi architecture and
present our experience of porting and optimizing the existing parallel
Particle-in-Cell code PICADOR. Direct porting with no code modification yields
performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem
with 50 particles per cell. We demonstrate the step-by-step application of
optimization techniques such as improving data locality, enhancing
parallelization efficiency, and vectorization, which together lead to a 3.75x
speedup on CPU and 7.5x on Xeon Phi. The optimized version achieves 18.8 ns per
particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on
an Intel Xeon Phi 5110P. On a real problem of laser ion acceleration in targets
with a surface grating, which requires a large number of macroparticles per
cell, the speedup of Xeon Phi compared to CPU is 1.6x.
Comment: 16 pages, 3 figures
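For orientation, the per-particle update whose cost is quoted above (in ns per particle) looks roughly like the following vectorized leapfrog push; fields, units, and the simplified rotation step are toy assumptions, not PICADOR's actual kernel.

```python
# Rough vectorized sketch of the per-particle work measured above
# (ns per particle update): a simplified leapfrog push of charged particles
# in given E and B fields.  Fields, units, and constants are toy values;
# the real kernel also interpolates fields and scatters currents to the grid.
import numpy as np

n = 100_000
rng = np.random.default_rng(1)
x = rng.random((n, 3))                   # positions
v = np.zeros((n, 3))                     # velocities
E = np.array([0.0, 0.0, 1.0])            # uniform electric field (toy)
B = np.array([0.0, 1.0, 0.0])            # uniform magnetic field (toy)
qm, dt = 1.0, 1e-3                       # charge/mass ratio, time step

for _ in range(10):
    # half electric kick, simplified magnetic rotation, half kick, drift
    v += 0.5 * qm * dt * E
    v += qm * dt * np.cross(v, B)
    v += 0.5 * qm * dt * E
    x += dt * v

print(x.mean(axis=0))
```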
Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms
When a computational task tolerates a relaxation of its specification or when
an algorithm tolerates the effects of noise in its execution, hardware,
programming languages, and system software can trade deviations from correct
behavior for lower resource usage. We present, for the first time, a synthesis
of research results on computing systems that only make as many errors as their
users can tolerate, from across the disciplines of computer aided design of
circuits, digital system design, computer architecture, programming languages,
operating systems, and information theory.
Rather than over-provisioning resources at each layer to avoid errors, it can
be more efficient to exploit the masking of errors occurring at one layer,
which prevents them from propagating to a higher layer. We survey tradeoffs for
individual layers of computing systems from the circuit level to the operating
system level and illustrate the potential benefits of end-to-end approaches
using two illustrative examples. To tie together the survey, we present a
consistent formalization of terminology, across the layers, which does not
significantly deviate from the terminology traditionally used by research
communities in their layer of focus.
Comment: 35 pages
AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference
The intrinsic error tolerance of neural networks (NNs) makes approximate
computing a promising technique to improve the energy efficiency of NN
inference. Conventional approximate computing focuses on balancing the
efficiency-accuracy trade-off for existing pre-trained networks, which can lead
to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented
training framework to facilitate approximate computing for NN inference.
Specifically, AxTrain leverages the synergy between two orthogonal methods:
one actively searches for a distribution of network parameters with high error
tolerance, and the other passively learns resilient weights by numerically
incorporating the noise distributions of the approximate hardware into the
forward pass during the training phase. Experimental results from
various datasets with near-threshold computing and approximation multiplication
strategies demonstrate AxTrain's ability to obtain resilient neural network
parameters and improvements in system energy efficiency.
Comment: In International Symposium on Low Power Electronics and Design
(ISLPED) 2018
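The "passive" half of the approach, learning resilient weights by exposing the network to hardware noise in the forward pass, can be sketched as follows; the multiplicative-Gaussian noise model and its magnitude are assumptions for illustration, not AxTrain's actual hardware model.

```python
# Sketch of the "passive" method described above: inject the approximate
# hardware's noise into the forward pass so training learns resilient
# weights.  The multiplicative-Gaussian model is an illustrative assumption.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    def __init__(self, in_f, out_f, rel_sigma=0.05):
        super().__init__(in_f, out_f)
        self.rel_sigma = rel_sigma       # assumed relative hardware noise

    def forward(self, x):
        w = self.weight
        if self.training:                # inject noise only while training
            w = w * (1 + self.rel_sigma * torch.randn_like(w))
        return nn.functional.linear(x, w, self.bias)

layer = NoisyLinear(128, 10)
out = layer(torch.randn(32, 128))        # noisy forward during training
print(out.shape)
```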
Porting of the DBCSR library for Sparse Matrix-Matrix Multiplications to Intel Xeon Phi systems
Multiplication of two sparse matrices is a key operation in the simulation of
the electronic structure of systems containing thousands of atoms and
electrons. The highly optimized sparse linear algebra library DBCSR
(Distributed Block Compressed Sparse Row) has been specifically designed to
efficiently perform such sparse matrix-matrix multiplications. This library is
the basic building block for linear scaling electronic structure theory and low
scaling correlated methods in CP2K. It is parallelized using MPI and OpenMP,
and can exploit GPU accelerators by means of CUDA. We describe a performance
comparison of DBCSR on systems with Intel Xeon Phi Knights Landing (KNL)
processors, with respect to systems with Intel Xeon CPUs (including systems
with GPUs). We find that DBCSR on Cray XC40 KNL-based systems is 11%-14%
slower than on a hybrid Cray XC50 with Nvidia P100 cards at the same number of
nodes. When compared to a Cray XC40 system equipped with dual-socket Intel Xeon
CPUs, the KNL is up to 24% faster.
Comment: Submitted to the ParCo2017 conference, Bologna, Italy, 12-15 September
2017
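The core operation, multiplication of two block-sparse matrices, can be mimicked with scipy's BSR format; this is only a serial stand-in for DBCSR's distributed block-CSR structure, with none of the MPI, OpenMP, or CUDA layers.

```python
# Sketch of the core DBCSR operation: multiplication of two sparse matrices
# stored in a *blocked* compressed-row format.  scipy's BSR type stands in
# for DBCSR's distributed block-CSR structure; the parallel layers of the
# library are not represented.
from scipy.sparse import random as sprand, bsr_matrix

n, blk = 512, 8
A = bsr_matrix(sprand(n, n, density=0.05, random_state=0), blocksize=(blk, blk))
B = bsr_matrix(sprand(n, n, density=0.05, random_state=1), blocksize=(blk, blk))

C = A @ B                                # blocked sparse matrix-matrix product
print(type(C), C.nnz)
```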