71 research outputs found

    Fast parallel algorithms for Graeffe's root squaring technique

    This paper presents two parallel algorithms for the solution of a polynomial equation of degree n, where n can be very large. The algorithms are based on Graeffe's root squaring technique implemented on two different systolic architectures, built around a mesh of trees and multitrees, respectively. Each of these algorithms requires O(log n) time using O(n²) processors.
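
    The abstract gives no code, so the following is a minimal sequential NumPy sketch of one Graeffe root-squaring step, the basic operation the paper parallelizes: from p(x) it forms the polynomial whose roots are the squares of the roots of p, using q(x^2) = (-1)^n p(x)p(-x). The systolic mesh-of-trees and multitree mappings described in the paper are not reproduced here.

        import numpy as np

        def graeffe_step(coeffs):
            """One Graeffe root-squaring step.

            coeffs: polynomial coefficients, highest degree first.
            Returns coefficients of a polynomial whose roots are the
            squares of the roots of the input polynomial.
            """
            n = len(coeffs) - 1                        # degree of p
            signs = (-1.0) ** np.arange(n, -1, -1)     # sign flips giving p(-x)
            prod = np.polymul(coeffs, coeffs * signs)  # p(x) * p(-x), degree 2n
            return prod[::2] * (-1.0) ** n             # keep even powers: q(y), y = x^2

        # Example: roots of x^2 - 3x + 2 are 1 and 2, so q should have roots 1 and 4.
        p = np.array([1.0, -3.0, 2.0])
        q = graeffe_step(p)
        print(np.roots(q))          # approximately [4., 1.]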

    Performance Improvements of Common Sparse Numerical Linear Algebra Computations

    Manufacturers of computer hardware continue to sustain an unprecedented pace of progress in the computing speed of their products, partially due to increased clock rates but also because of ever more complicated chip designs. With new processor families appearing every few years, it is increasingly hard to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies, in an iterative code, generalizations of known concepts from related disciplines. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining the ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that seem to have gained an appreciable level of popularity and are fairly representative of a wider range of processor types that are available on the market now or in the near future. The new factorization technique formally introduced in the early chapters is later proven to be quite competitive with state-of-the-art software currently available. Although not totally superior in all cases (as probably no single approach could be), the new factorization algorithm exhibits a few promising features. In addition, an all-embracing optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory results on the tested computing platforms in terms of performance improvement. The same set of test matrices is used to enable an easy comparison between both investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to rather more involved schemes dependent on hard-to-predict progress in theoretical and algorithmic research.
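
    The abstract does not name its factorization or iterative methods, so as a loose illustration of the kind of comparison it describes (a direct, factorization-based solve timed against an iterative solve on a sparse test matrix), here is a minimal SciPy sketch. The 2-D Laplacian test matrix, the spsolve-based direct solve and the conjugate gradient iteration are stand-ins chosen for brevity, not the dissertation's methods.

        import time
        import numpy as np
        import scipy.sparse as sp
        import scipy.sparse.linalg as spla

        def time_direct_vs_iterative(A, b):
            """Time a factorization-based solve against an iterative CG solve."""
            t0 = time.perf_counter()
            x_direct = spla.spsolve(A.tocsc(), b)      # direct sparse solve (LU factorization)
            t_direct = time.perf_counter() - t0

            t0 = time.perf_counter()
            x_iter, info = spla.cg(A.tocsr(), b)       # iterative conjugate gradient solve
            t_iter = time.perf_counter() - t0

            return t_direct, t_iter, np.linalg.norm(x_direct - x_iter)

        # Stand-in "test matrix": a 2-D Laplacian on a 100 x 100 grid.
        n = 100
        T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
        A = sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))
        b = np.ones(A.shape[0])
        print(time_direct_vs_iterative(A, b))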

    High Throughput Lattice-based Signatures on GPUs: Comparing Falcon and Mitaka

    The US National Institute of Standards and Technology initiated a standardization process for post-quantum cryptography in 2017, with the aim of selecting key encapsulation mechanisms and signature schemes that can withstand the threat from emerging quantum computers. In 2022, Falcon was selected as one of the standard signature schemes, eventually attracting effort to optimize the implementation of Falcon on various hardware architectures for practical applications. Recently, Mitaka was proposed as an alternative to Falcon, allowing parallel execution of most of its operations. These recent advancements motivate us to develop high throughput implementations of Falcon and Mitaka signature schemes on Graphics Processing Units (GPUs), a massively parallel architecture widely available on cloud service platforms. In this paper, we propose the first parallel implementation of Falcon on various GPUs. An iterative version of the sampling process in Falcon, which is also the most time-consuming Falcon operation, was developed. This allows us to implement Falcon signature generation without relying on expensive recursive function calls on GPUs. In addition, we propose a parallel random samples generation approach to accelerate the performance of Mitaka on GPUs. We evaluate our implementation techniques on state-of-the-art GPU architectures (RTX 3080, A100, T4 and V100). Experimental results show that our Falcon-512 implementation achieves 58,595 signatures/second and 2,721,562 verifications/second on an A100 GPU, which is 20.03× and 29.51× faster than the highly optimized AVX2 implementation on CPU. Our Mitaka implementation achieves 161,985 signatures/second and 1,421,046 verifications/second on the same GPU. Due to the adoption of a parallelizable sampling process, Mitaka signature generation enjoys ≈2–20× higher throughput than Falcon on various GPUs. The high throughput signature generation and verification achieved by this work can be very useful in various emerging applications, including the Internet of Things.
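
    As a quick consistency check on the numbers quoted above, dividing the reported A100 Falcon-512 throughputs by the stated speedup factors recovers the implied AVX2 CPU baseline; the snippet below only restates the abstract's figures.

        # Figures taken directly from the abstract (A100 GPU, Falcon-512).
        gpu_sign_per_s = 58_595
        gpu_verify_per_s = 2_721_562
        sign_speedup = 20.03          # vs. the optimized AVX2 CPU implementation
        verify_speedup = 29.51

        cpu_sign_per_s = gpu_sign_per_s / sign_speedup        # ~2,925 signatures/second implied
        cpu_verify_per_s = gpu_verify_per_s / verify_speedup  # ~92,225 verifications/second implied
        print(f"implied CPU baseline: {cpu_sign_per_s:,.0f} sig/s, {cpu_verify_per_s:,.0f} ver/s")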

    KfK-SUPRENUM Seminar, 19-20 October 1989. Conference report (Tagungsbericht)


    A New Method for Efficient Parallel Solution of Large Linear Systems on a SIMD Processor.

    This dissertation proposes a new technique for efficient parallel solution of very large linear systems of equations on a SIMD processor. The model problem used to investigate both the efficiency and applicability of the technique had a regular structure with semi-bandwidth β, and resulted from the approximation of a second-order, two-dimensional elliptic equation on a regular domain under Dirichlet and periodic boundary conditions. With only slight modifications, chiefly to properly account for the mathematical effects of varying bandwidths, the technique can be extended to encompass the solution of any regular, banded system. The computational platform used was the MasPar MP-X (model 1208B), a massively parallel processor hostnamed hurricane and housed in the Concurrent Computing Laboratory of the Physics/Astronomy department, Louisiana State University. The maximum bandwidth that caused the problem's size to fit the nyproc × nxproc machine array exactly was determined. This as well as smaller sizes were used in four experiments to evaluate the efficiency of the new technique. Four benchmark algorithms, two direct (Gauss elimination (GE) and orthogonal factorization) and two iterative (symmetric over-relaxation (SOR) with ω = 2, and the conjugate gradient method (CG)), were used to test the efficiency of the new approach based upon three evaluation metrics: the deviation of computed results from the exact solution, measured as average absolute error; the CPU times; and the megaflop rates of execution. All the benchmarks, except GE, were implemented in parallel. In all evaluation categories, the new approach outperformed the benchmarks, and very much so when N ≫ p, p being the number of processors and N the problem size. At the maximum system size, the new method was about 2.19 times more accurate and about 1.7 times faster than the benchmarks. But when the system size was a lot smaller than the machine's size, the new approach's performance deteriorated precipitously, and, in fact, in this circumstance its performance was worse than that of GE, the serial code. Hence, this technique is recommended for the solution of linear systems with regular structures on array processors when the problem's size is large in relation to the processor's size.
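
    Of the benchmarks named above, the conjugate gradient method is compact enough to sketch; the plain sequential NumPy version below is only a reference implementation and does not reflect the dissertation's data-parallel MasPar mapping or the new technique itself.

        import numpy as np

        def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
            """Sequential conjugate gradient for a symmetric positive-definite A."""
            n = len(b)
            max_iter = max_iter or n
            x = np.zeros(n)
            r = b - A @ x                    # initial residual
            p = r.copy()                     # initial search direction
            rs_old = r @ r
            for _ in range(max_iter):
                Ap = A @ p
                alpha = rs_old / (p @ Ap)    # step length
                x += alpha * p
                r -= alpha * Ap
                rs_new = r @ r
                if np.sqrt(rs_new) < tol:
                    break
                p = r + (rs_new / rs_old) * p
                rs_old = rs_new
            return x

        # Example: small SPD system; exact solution is [1/11, 7/11].
        A = np.array([[4.0, 1.0], [1.0, 3.0]])
        b = np.array([1.0, 2.0])
        print(conjugate_gradient(A, b))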

    Computational methods and software systems for dynamics and control of large space structures

    Two key areas of crucial importance to the computer-based simulation of large space structures are discussed. The first area involves multibody dynamics (MBD) of flexible space structures, with applications directed to deployment, construction, and maneuvering. The second area deals with advanced software systems, with emphasis on parallel processing. The latest research thrust in the second area involves massively parallel computers.

    The design and implementation of a purely digital stereo-photogrammetric system on the IBM 3090 multi-user mainframe computer

    This thesis is concerned with an investigation into the possibilities of implementing various aspects of a purely digital stereo-photogrammetric (DSP) system on the IBM 3090 150E mainframe multi-user computer. The main aspects discussed within the context of this thesis are:
    i) Mathematical modelling of the process of formation of digital images in the space and frequency domains.
    ii) Experiments on improving the pictorial quality of digital aerial photos using Inverse and Wiener filters.
    iii) Devising and implementing an approach for the automatic sub-pixel measurement of cross-type fiducial marks for the inner orientation, using the Gradient operator and image modelling least squares (IML) approach.
    iv) Devising and implementing a method for the digital rectification of overlapping aerial photos and the formation of the stereo-model.
    v) Design and implementation of a digital stereo-photogrammetric (DSP) system and the generation of a DTM using visual measurement.
    vi) Investigating the feasibility of stereo-viewing of binary images and the possibility of performing measurements on such images.
    vii) Implementing a method for the automatic generation of a DTM using a one-dimensional image correlation along epipolar lines and experimentally optimizing the size of the correlation window (a rough sketch of this step follows the list).
    viii) Assessment of the accuracy of the DTM data generated both by the DSP system and by the automatic correlation method.
    ix) Vectorization of the rectification and correlation programs to achieve higher speed-up factors in the computational process.
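
    A minimal NumPy sketch of the one-dimensional correlation in step vii: for a pixel on an epipolar line of the left image, search along the corresponding line of the right image for the window with the highest normalized cross-correlation. The window half-size, search range and synthetic test data are illustrative assumptions, not the values optimized in the thesis.

        import numpy as np

        def disparity_1d(left_line, right_line, x, half_win=5, max_disp=30):
            """Disparity at pixel x of an epipolar line via 1-D normalized cross-correlation."""
            template = left_line[x - half_win : x + half_win + 1].astype(float)
            template -= template.mean()
            best_d, best_score = 0, -np.inf
            for d in range(max_disp + 1):                    # search along the right line
                lo, hi = x - d - half_win, x - d + half_win + 1
                if lo < 0 or hi > len(right_line):
                    continue
                window = right_line[lo:hi].astype(float)
                window -= window.mean()
                denom = np.linalg.norm(template) * np.linalg.norm(window)
                if denom == 0.0:
                    continue
                score = float(template @ window) / denom     # normalized cross-correlation
                if score > best_score:
                    best_d, best_score = d, score
            return best_d, best_score

        # Synthetic check: the left line is the right line shifted by 7 pixels.
        rng = np.random.default_rng(0)
        right = rng.random(200)
        left = np.roll(right, 7)
        print(disparity_1d(left, right, x=100))   # expected disparity 7, score close to 1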

    A design methodology for portable software on parallel computers

    This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently, so to keep software running on the current high-performance hardware, a developer almost continually faces yet another expensive software transportation effort. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured the performance of a portion of this subsystem on the Intel iPSC/2 parallel computer. These results are provided in section four. Our future work is summarized in section five, our acknowledgements are stated in section six, and references for published papers associated with NAG-1-995 are provided in section seven.

    The Eigenvalue Spectrum of the Fermion Matrix in Lattice Higgs Systems

    In the first part of this thesis we consider the performance of various block algorithms for the inversion of large sparse matrices. By computing the eigenvalue spectra of the matrices under consideration we are able to directly relate the performance of the algorithms to the difficulty of the calculation. We find that the block Lanczos algorithm is superior to all others considered for the inversion of the Kogut-Susskind fermion matrix. Furthermore, we investigate the performance of the block Lanczos algorithm on matrices constructed to have specific eigenvalue spectra. From this study we are able to make quantitative predictive statements about the number of iterations that the algorithm will take to converge, given the form of the eigenvalue spectrum of the matrix whose inversion is attempted. The rest of this thesis is concerned with lattice Higgs systems. Specifically, we study a model where staggered fermions are coupled to Ising spins via an on-site Yukawa term with coupling constant y. This is a very simple model that seems to embody most of the relevant phenomena observed in more complicated systems. Most importantly, there are two symmetric regions, PM1 and PM2; in the PM2 region the renormalised fermion mass mf is non-zero for large y despite the scalar field having zero expectation value. We study the model in the quenched approximation and, by examining the distribution of the eigenvalues of the fermion matrix M in the complex plane, we qualitatively explain the features of the model as being due to the transition of eigenvalues from the imaginary to the real axis via the origin as y is increased. An approximate method for calculating mf from the value of a fermion condensate is developed, and we reproduce the values for mf obtained by other authors who calculate it using the standard method involving the fermion propagator. However, our method has the advantage that it is applicable on very small volumes where the propagator definition breaks down. We investigate the behaviour in the quenched infinite volume limit by evaluating the low-lying eigenvalues of the matrix A+M. We show that the small eigenvalues observed in the spectrum of M at intermediate y on finite lattices imply that there is a finite density of zero modes in the infinite volume limit. By performing dynamical simulations on a small lattice we determine the phase diagram of the model and demonstrate the validity of mean field calculations of the phase boundaries. From calculations of mf we identify the PM1 and PM2 phases. It is shown that the inclusion of fermion dynamics eliminates the small eigenvalues of M present in the quenched model, and as y is increased the eigenvalues now transfer from the real to the imaginary axis via a path avoiding the origin. It is only by using the block Lanczos algorithm that simulations in certain regions of the phase plane are feasible, and only by our method of considering a fermion condensate can we calculate mf on such a small volume.
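
    The abstract does not spell out the block Lanczos algorithm; as background, here is a single-vector Lanczos sketch in NumPy that builds the tridiagonal matrix whose eigenvalues (Ritz values) approximate the extreme eigenvalues of a real symmetric matrix. The block variant used in the thesis propagates several vectors at once and handles the complex fermion matrix; neither refinement, nor any reorthogonalization, is included in this sketch.

        import numpy as np

        def lanczos_ritz_values(A, num_iter=60, seed=0):
            """Single-vector Lanczos: return Ritz values of a real symmetric matrix A."""
            n = A.shape[0]
            rng = np.random.default_rng(seed)
            q = rng.standard_normal(n)
            q /= np.linalg.norm(q)
            q_prev = np.zeros(n)
            beta = 0.0
            alphas, betas = [], []
            for _ in range(min(num_iter, n)):
                w = A @ q - beta * q_prev        # three-term recurrence
                alpha = q @ w
                w -= alpha * q
                beta = np.linalg.norm(w)
                alphas.append(alpha)
                betas.append(beta)
                if beta < 1e-12:                 # invariant subspace found
                    break
                q_prev, q = q, w / beta
            T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
            return np.linalg.eigvalsh(T)

        # Example: the extreme Ritz values track the extreme eigenvalues of A.
        rng = np.random.default_rng(1)
        M = rng.standard_normal((200, 200))
        A = (M + M.T) / 2.0
        ritz = lanczos_ritz_values(A)
        print(ritz.min(), ritz.max())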

    A generalized vortex lattice method for subsonic and supersonic flow applications

    If the discrete vortex lattice is considered as an approximation to the surface-distributed vorticity, then the concept of the generalized principal part of an integral yields a residual term to the vorticity-induced velocity field. The proper incorporation of this term into the velocity field generated by the discrete vortex lines renders the present vortex lattice method valid for supersonic flow. Special techniques for simulating nonzero-thickness lifting surfaces and fusiform bodies with vortex lattice elements are included. Thickness effects of wing-like components are simulated by a double (biplanar) vortex lattice layer, and fusiform bodies are represented by a vortex grid arranged on a series of concentric cylindrical surfaces. The analysis of sideslip effects by the subject method is described. Numerical considerations peculiar to the application of these techniques are also discussed. The method has been implemented in a digital computer code. A user's manual is included along with a complete FORTRAN compilation, an executed case, and conversion programs for transforming input for the NASA wave drag program.
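
    The supersonic generalized-principal-part correction described above is not reproduced here; the sketch below shows only the classical building block shared by vortex lattice methods, the incompressible Biot-Savart induced velocity of a straight vortex segment, evaluated at an illustrative point.

        import numpy as np

        def segment_induced_velocity(p, a, b, gamma=1.0, eps=1e-10):
            """Velocity induced at point p by a straight vortex segment a->b of circulation gamma."""
            r1, r2, r0 = p - a, p - b, b - a
            cross = np.cross(r1, r2)
            cross_sq = cross @ cross
            if cross_sq < eps:                       # p lies (nearly) on the segment's line
                return np.zeros(3)
            k = gamma / (4.0 * np.pi * cross_sq)
            return k * cross * (r0 @ (r1 / np.linalg.norm(r1) - r2 / np.linalg.norm(r2)))

        # Example: unit-strength segment along the x-axis, point one unit above its midpoint.
        a = np.array([0.0, 0.0, 0.0])
        b = np.array([1.0, 0.0, 0.0])
        p = np.array([0.5, 0.0, 1.0])
        print(segment_induced_velocity(p, a, b))     # induced velocity points in -y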