
    Iterative methods with memory for solving systems of nonlinear equations using a second order approximation

    [EN] Iterative methods for solving nonlinear equations are said to have memory when computing the next iterate requires more than one previous iterate. Methods with memory usually behave very stably, in the sense that the set of initial estimates from which they converge is wide. With the right choice of parameters, iterative methods without memory can increase their order of convergence significantly, becoming schemes with memory. In this work, starting from a simple method without memory, we increase its order of convergence without adding new functional evaluations by approximating the accelerating parameter with Newton interpolation polynomials of degree one and two. Extending this technique to the multidimensional case, we generalize the proposed method to systems of nonlinear equations. Numerical tests are presented to verify the theoretical results, and a study of the dynamics of the method is carried out on different problems to show its stability. This research was supported by PGC2018-095896-B-C22 (MCIU/AEI/FEDER, UE), Generalitat Valenciana PROMETEO/2016/089, and FONDOCYT 2016-2017-212, República Dominicana. Cordero Barbero, A.; Maimó, J. G.; Torregrosa Sánchez, J. R.; Vassileva, M. P. (2019). Iterative methods with memory for solving systems of nonlinear equations using a second order approximation. Mathematics, 7(11), 1-12. https://doi.org/10.3390/math7111069
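
    As a rough illustration of the idea, the sketch below is a scalar Steffensen-type iteration whose accelerating parameter is re-estimated from the two most recent iterates via a degree-one Newton (secant) interpolation. It is not the authors' exact scheme (which also uses degree-two interpolation and extends to systems); the function, starting point, and tolerances are invented for the example.

```python
def steffensen_with_memory(f, x0, gamma0=0.01, tol=1e-12, max_iter=50):
    """Scalar Steffensen-type method whose accelerating parameter gamma is
    re-estimated each step from the previous iterate (a 'method with memory')."""
    x_old, fx_old = x0, f(x0)
    gamma = gamma0
    # First step uses the initial guess for gamma.
    x = x_old - gamma * fx_old**2 / (f(x_old + gamma * fx_old) - fx_old)
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        # Degree-one Newton interpolation through the last two iterates
        # approximates f'(x); gamma = -1/f'(x) accelerates the scheme.
        gamma = -(x - x_old) / (fx - fx_old)
        w = x + gamma * fx                              # auxiliary point
        x_old, fx_old = x, fx
        x = x - gamma * fx**2 / (f(w) - fx)             # Steffensen-type step
    return x

# Hypothetical test problem: root of x^3 + 4x^2 - 10 near 1.365.
print(steffensen_with_memory(lambda x: x**3 + 4 * x**2 - 10, 1.5))
```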

    A Fast Parallel Poisson Solver on Irregular Domains Applied to Beam Dynamic Simulations

    We discuss the scalable parallel solution of the Poisson equation within a Particle-In-Cell (PIC) code for the simulation of electron beams in particle accelerators of irregular shape. The problem is discretized by finite differences. Depending on the treatment of the Dirichlet boundary, the resulting system of equations is symmetric or `mildly' nonsymmetric positive definite. In all cases, the system is solved by the preconditioned conjugate gradient algorithm with smoothed aggregation (SA) based algebraic multigrid (AMG) preconditioning. We investigate variants of the implementation of SA-AMG that lead to considerable improvements in execution times. We demonstrate good scalability of the solver on distributed-memory parallel machines with up to 2048 processors. We also compare our SA-AMG PCG solver with an FFT-based solver that is more commonly used for applications in beam dynamics.
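
    As a single-node illustration of the solver stack described above (not the paper's distributed implementation), the sketch below sets up a finite-difference Poisson matrix on a regular grid and solves it with conjugate gradients preconditioned by smoothed-aggregation AMG. The PyAMG/SciPy calls, grid size, and right-hand side are assumptions made for the example.

```python
import numpy as np
import pyamg
from scipy.sparse.linalg import cg

# 5-point finite-difference Poisson matrix on a 200x200 grid (a regular-domain
# stand-in for the irregular beam-pipe geometries treated in the paper).
A = pyamg.gallery.poisson((200, 200), format='csr')
b = np.ones(A.shape[0])

# Smoothed-aggregation algebraic multigrid hierarchy, used as a preconditioner
# for the conjugate gradient iteration (SA-AMG PCG).
ml = pyamg.smoothed_aggregation_solver(A)
M = ml.aspreconditioner(cycle='V')

residuals = []
x, info = cg(A, b, M=M, atol=1e-8,
             callback=lambda xk: residuals.append(np.linalg.norm(b - A @ xk)))
print(f"converged: {info == 0}, CG iterations: {len(residuals)}")
```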

    Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era

    The key challenge to improving performance in the age of Dark Silicon is how to leverage transistors when they cannot all be used at the same time. In modern SoCs, these transistors are often used to create specialized accelerators, which improve energy efficiency for some applications by 10-1000X. While this might seem like the magic bullet we need, for most CPU applications more energy is dissipated in the memory system than in the processor: these large gains in efficiency are only possible if the DRAM and memory hierarchy are mostly idle. We refer to this desirable state as Dark Memory, and it only occurs for applications with an extreme form of locality. To present our findings, we introduce Pareto curves in the energy/op and mm^2/(ops/s) metric space for compute units, accelerators, and on-chip memory/interconnect. These Pareto curves allow us to solve the power-, performance-, and area-constrained optimization problem of determining which accelerators should be used and how to set their design parameters to optimize the system. This analysis shows that memory accesses create a floor to the achievable energy-per-op. Thus, high performance requires Dark Memory, which in turn requires co-designing the algorithm for parallelism and locality with the hardware. Comment: 8 pages. To appear in IEEE Design and Test Journal.
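
    The sketch below is a toy version of the Pareto-based selection the abstract describes: candidate compute units are points in an (energy/op, mm^2 per Gop/s) space, dominated points are pruned, and the design that sustains the highest throughput under fixed area and power budgets is chosen. All design points, numbers, and names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Design:
    name: str
    energy_per_op_pj: float    # pJ per operation
    area_per_gops_mm2: float   # mm^2 per Gop/s of peak throughput

candidates = [
    Design("low-area config",   120.0, 0.05),
    Design("balanced config",    60.0, 0.12),
    Design("low-energy config",  25.0, 0.28),
    Design("dominated config",  130.0, 0.20),   # worse in both metrics, pruned
]

def pareto(points):
    """Keep designs that are not dominated in both metrics by another design."""
    return [p for p in points
            if not any(q is not p and
                       q.energy_per_op_pj <= p.energy_per_op_pj and
                       q.area_per_gops_mm2 <= p.area_per_gops_mm2
                       for q in points)]

def best_under_budget(points, area_mm2, power_w):
    """Pick the Pareto-optimal design with the highest sustainable Gop/s."""
    best = None
    for p in pareto(points):
        gops_area  = area_mm2 / p.area_per_gops_mm2        # area-limited rate
        gops_power = power_w * 1e3 / p.energy_per_op_pj    # power-limited rate
        gops = min(gops_area, gops_power)                  # whichever cap binds
        if best is None or gops > best[1]:
            best = (p.name, gops)
    return best

print(best_under_budget(candidates, area_mm2=10.0, power_w=2.0))
```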

    Parallel implementation of electronic structure eigensolver using a partitioned folded spectrum method

    A parallel implementation of an eigensolver designed for electronic structure calculations is presented. The method is applicable to computational tasks that solve a sequence of eigenvalue problems where the solution for a particular iteration is similar but not identical to the solution from the previous iteration. Such problems occur frequently when performing electronic structure calculations in which the eigenvectors are solutions to the Kohn-Sham equations. The eigenvectors are represented in some type of basis, but the problem sizes are normally too large for direct diagonalization in that basis. Instead, a subspace diagonalization procedure is employed in which matrix elements of the Hamiltonian operator are generated and the eigenvalues and eigenvectors of the resulting reduced matrix are obtained using a standard eigensolver from a package such as LAPACK or SCALAPACK. While this method works well and is widely used, the standard eigensolvers scale poorly on massively parallel computer systems for the matrix sizes typical of electronic structure calculations. We present a new method that utilizes a partitioned folded spectrum method (PFSM) that takes into account the iterative nature of the problem and performs well on massively parallel systems. Test results for a range of problems are presented that demonstrate an equivalent level of accuracy when compared to the standard eigensolvers, while also executing up to an order of magnitude faster. Unlike O(N) methods, the technique works equally well for metals and systems with unoccupied orbitals as for insulators and semiconductors. Timing and accuracy results are presented for a range of systems, including a 512-atom diamond cell, a cluster of 13 C60 molecules, bulk copper, a 216-atom silicon cell with a vacancy, using 40 unoccupied states/atom, and a 4000-atom aluminum supercell.
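
    The dense NumPy sketch below illustrates only the folded-spectrum principle on a small random symmetric matrix: the lowest eigenvectors of (H - sigma*I)^2 are the eigenvectors of H whose eigenvalues lie nearest the reference sigma, so different references can be assigned to different workers. This is a schematic, not the paper's PFSM, which treats the large subspace matrices iteratively and in parallel; the matrix, references, and sizes here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
H = rng.standard_normal((n, n))
H = (H + H.T) / 2                      # symmetric stand-in for the reduced Hamiltonian

def folded_spectrum_window(H, sigma, k):
    """Return the k eigenpairs of H whose eigenvalues are closest to sigma,
    obtained through the folded operator (H - sigma*I)^2."""
    shifted = H - sigma * np.eye(len(H))
    w, V = np.linalg.eigh(shifted @ shifted)    # ascending folded eigenvalues
    X = V[:, :k]                                # k vectors nearest the reference
    evals = np.array([x @ H @ x for x in X.T])  # Rayleigh quotients unfold the values
    order = np.argsort(evals)
    return evals[order], X[:, order]

# One spectral window per reference energy; in the parallel method each window
# would be handled by a separate group of processors.
for sigma in (-10.0, 0.0, 10.0):
    evals, _ = folded_spectrum_window(H, sigma, k=4)
    print(f"sigma = {sigma:+5.1f} -> eigenvalues {np.round(evals, 3)}")
```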

    Analytical Cost Metrics : Days of Future Past

    As we move towards the exascale era, new architectures must be capable of running massive computational problems efficiently. Scientists and researchers are continuously investing in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, and image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge we face in computing systems research is: "how to solve massive-scale computational problems in the most time/power/energy efficient manner?" Architectures are constantly evolving, making current performance-optimization strategies less applicable and requiring new strategies to be invented. The solution is for new architectures, new programming models, and applications to go forward together. Doing this is, however, extremely hard. There are too many design choices in too many dimensions. We propose the following strategy to solve the problem: (i) Models - develop accurate analytical models (e.g. execution time, energy, silicon area) to predict the cost of executing a given program, and (ii) Complete System Design - simultaneously optimize all the cost models for the programs (computational problems) to obtain the most time/area/power/energy efficient solution. Such an optimization problem evokes the notion of co-design.
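
    As a toy instance of the two-part strategy above (analytical cost models plus joint optimization), the sketch below uses a roofline-style time model and a linear energy model to choose, among candidate tilings of a matrix multiplication, the one that minimizes predicted energy while meeting a deadline. Every constant and the DRAM-traffic formula are invented for illustration.

```python
peak_flops = 1e12      # FLOP/s           (assumed machine parameters)
bandwidth  = 100e9     # bytes/s of DRAM bandwidth
e_flop     = 10e-12    # J per FLOP
e_byte     = 100e-12   # J per byte moved to/from DRAM

def costs(n, tile):
    """Predicted time and energy for an n x n matrix multiplication
    blocked with the given tile size (toy roofline-style model)."""
    flops = 2 * n**3
    dram_bytes = 8 * (2 * n**3 / tile + n**2)   # larger tiles -> more reuse, less traffic
    time   = max(flops / peak_flops, dram_bytes / bandwidth)
    energy = flops * e_flop + dram_bytes * e_byte
    return time, energy

n, deadline = 4096, 0.5                          # problem size and time budget (s)
feasible = [(tile,) + costs(n, tile) for tile in (8, 16, 32, 64, 128)
            if costs(n, tile)[0] <= deadline]
tile, time, energy = min(feasible, key=lambda t: t[2])   # cheapest feasible tiling
print(f"tile={tile}: predicted time {time:.3f} s, energy {energy:.2f} J")
```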

    GraphR: Accelerating Graph Processing Using ReRAM

    This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be efficiently performed by a ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations overshadows the waste due to sparsity. The experimental results show that GRAPHR achieves a 16.01x (up to 132.67x) speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is 3.67x to 10.96x more energy efficient, compared to a PIM-based architecture. Comment: Accepted to HPCA 2018.
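
    A small software illustration of the SpMV formulation GRAPHR relies on: one PageRank iteration over a toy graph is expressed as a sparse matrix-vector product, which is the operation the ReRAM crossbars evaluate in analog. SciPy stands in for the hardware here, and the graph and constants are invented.

```python
import numpy as np
import scipy.sparse as sp

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]     # (src, dst) pairs of a toy graph
n = 4
src, dst = np.array(edges).T
outdeg = np.bincount(src, minlength=n)

# M[i, j] = 1/outdeg(j) for every edge j -> i, so one PageRank step is an SpMV.
M = sp.csr_matrix((1.0 / outdeg[src], (dst, src)), shape=(n, n))

d = 0.85                                              # damping factor
r = np.full(n, 1.0 / n)
for _ in range(50):
    r = (1 - d) / n + d * (M @ r)                     # the SpMV mapped onto crossbars
print(np.round(r, 4))
```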

    Particle-in-Cell Laser-Plasma Simulation on Xeon Phi Coprocessors

    This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for the Xeon Phi architecture and present our experience of porting and optimizing the existing parallel Particle-in-Cell code PICADOR. Direct porting with no code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate the step-by-step application of optimization techniques such as improving data locality, enhancing parallelization efficiency, and vectorization, which lead to a 3.75x speedup on the CPU and 7.5x on Xeon Phi. The optimized version achieves 18.8 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. On a real problem of laser ion acceleration in targets with a surface grating, which requires a large number of macroparticles per cell, the speedup of Xeon Phi compared to the CPU is 1.6x. Comment: 16 pages, 3 figures.
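
    The NumPy fragment below mirrors, in a schematic way, the data-layout idea behind such optimizations: particle coordinates and velocities kept in separate contiguous arrays (structure of arrays), so the particle push becomes unit-stride, vectorizable loops. PICADOR itself is a C++ code handling grid fields, current deposition, and MPI; the field values and time step here are invented.

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(1)

# Structure-of-arrays layout: each component is contiguous in memory.
x,  y,  z  = (rng.random(n) for _ in range(3))
vx, vy, vz = (rng.standard_normal(n) * 0.01 for _ in range(3))

dt, qm = 1e-3, 1.0
ex, ey, ez = 0.0, 0.0, 1.0        # uniform E field stands in for interpolated grid fields

# Particle push (no magnetic field in this toy version): each line is a
# unit-stride loop over all particles, which SIMD hardware handles well.
vx += qm * ex * dt
vy += qm * ey * dt
vz += qm * ez * dt
x  += vx * dt
y  += vy * dt
z  += vz * dt
print(x[:3], vz[:3])
```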

    Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms

    When a computational task tolerates a relaxation of its specification or when an algorithm tolerates the effects of noise in its execution, hardware, programming languages, and system software can trade deviations from correct behavior for lower resource usage. We present, for the first time, a synthesis of research results on computing systems that only make as many errors as their users can tolerate, from across the disciplines of computer-aided design of circuits, digital system design, computer architecture, programming languages, operating systems, and information theory. Rather than over-provisioning resources at each layer to avoid errors, it can be more efficient to exploit the masking of errors occurring at one layer, which can prevent them from propagating to a higher layer. We survey tradeoffs for individual layers of computing systems from the circuit level to the operating system level and illustrate the potential benefits of end-to-end approaches using two illustrative examples. To tie together the survey, we present a consistent formalization of terminology across the layers which does not significantly deviate from the terminology traditionally used by research communities in their layer of focus. Comment: 35 pages.
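
    A minimal, generic example of the accuracy-for-resources trade the survey covers (not taken from the paper itself): storing data in float16 rather than float64 cuts memory footprint and traffic by 4x at the cost of a small, often tolerable, error in the result. The array size and dtype choice are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.random(1_000_000)               # float64 reference data

approx = data.astype(np.float16)           # 4x less storage and memory traffic
exact_sum  = data.sum()
approx_sum = approx.sum(dtype=np.float32)  # accumulate in float32 to avoid overflow

rel_err = abs(approx_sum - exact_sum) / exact_sum
print(f"bytes {data.nbytes} -> {approx.nbytes}, relative error {rel_err:.2e}")
```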

    AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference

    The intrinsic error tolerance of neural networks (NNs) makes approximate computing a promising technique for improving the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods---one actively searches for a network parameter distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximate multiplication strategies demonstrate AxTrain's ability to obtain resilient neural network parameters and to improve system energy efficiency. Comment: In International Symposium on Low Power Electronics and Design (ISLPED) 2018.
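
    The fragment below is a stripped-down, hypothetical version of the 'passive' ingredient described above: multiplicative noise standing in for the approximate hardware is injected into the weights during the forward pass of training, so the learned parameters tolerate it. The model is just a linear regression, the noise model and constants are invented, and the gradient ignores the noise scaling (a straight-through simplification); AxTrain itself targets real networks and hardware noise models.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((512, 16))
y = X @ rng.standard_normal(16)                           # synthetic regression targets

w, lr, noise_std = np.zeros(16), 0.05, 0.05
for _ in range(500):
    noise = 1 + noise_std * rng.standard_normal(w.shape)  # hardware noise model
    pred = X @ (w * noise)                                # forward pass through noisy weights
    grad = X.T @ (pred - y) / len(X)                      # straight-through gradient estimate
    w -= lr * grad

# Evaluate with clean weights: the noise-trained parameters still fit well.
print("relative error:", np.linalg.norm(X @ w - y) / np.linalg.norm(y))
```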

    Porting of the DBCSR library for Sparse Matrix-Matrix Multiplications to Intel Xeon Phi systems

    Multiplication of two sparse matrices is a key operation in the simulation of the electronic structure of systems containing thousands of atoms and electrons. The highly optimized sparse linear algebra library DBCSR (Distributed Block Compressed Sparse Row) has been specifically designed to efficiently perform such sparse matrix-matrix multiplications. This library is the basic building block for linear-scaling electronic structure theory and low-scaling correlated methods in CP2K. It is parallelized using MPI and OpenMP, and can exploit GPU accelerators by means of CUDA. We describe a performance comparison of DBCSR on systems with Intel Xeon Phi Knights Landing (KNL) processors with respect to systems with Intel Xeon CPUs (including systems with GPUs). We find that DBCSR on Cray XC40 KNL-based systems is 11%-14% slower than on a hybrid Cray XC50 with Nvidia P100 cards at the same number of nodes. When compared to a Cray XC40 system equipped with dual-socket Intel Xeon CPUs, the KNL is up to 24% faster. Comment: Submitted to the ParCo2017 conference, Bologna, Italy, 12-15 September 2017.
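
    As a single-node, serial analogue of the operation DBCSR distributes (this is not DBCSR's own API), the sketch below builds two block-sparse matrices in SciPy's BSR format and multiplies them; block size, density, and matrix dimensions are invented.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(4)
n, block = 2048, 16
nb = n // block                                   # number of block rows/columns

def random_block_sparse(density):
    """BSR matrix in which roughly `density` of the block x block tiles are dense."""
    mask = rng.random((nb, nb)) < density
    rows, cols = np.nonzero(mask)                 # occupied blocks, row-major order
    data = rng.standard_normal((len(rows), block, block))
    indptr = np.concatenate(([0], np.cumsum(np.bincount(rows, minlength=nb))))
    return sp.bsr_matrix((data, cols, indptr), shape=(n, n))

A = random_block_sparse(0.05)
B = random_block_sparse(0.05)
C = A @ B            # block-sparse matrix-matrix multiplication, done serially here
print(type(C).__name__, f"{C.nnz} stored values")
```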