
    Parallel Algorithms for Summing Floating-Point Numbers

    The problem of exactly summing n floating-point numbers is a fundamental problem with many applications in large-scale simulations and computational geometry. Unfortunately, due to the round-off error in standard floating-point operations, this problem becomes very challenging. Moreover, all existing solutions rely on sequential algorithms, which cannot scale to the huge datasets that need to be processed. In this paper, we provide several efficient parallel algorithms for summing n floating-point numbers, so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms in the PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency. (Conference version appears in SPAA 201)
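
    The paper's parallel algorithms are not reproduced in this summary. As an illustration of the kind of sequential building block such exact-summation methods rest on, here is a minimal C++ sketch of Knuth's TwoSum error-free transformation used in a cascaded (compensated) sum; the function names and test values are illustrative only, and the code must be compiled with strict IEEE semantics (no -ffast-math), since reassociation would destroy the error term.

        #include <cstdio>
        #include <vector>

        // Knuth's TwoSum: s = fl(a + b) and e is the exact rounding error,
        // so that a + b == s + e holds exactly (requires strict IEEE arithmetic).
        static void two_sum(double a, double b, double &s, double &e) {
            s = a + b;
            double bv = s - a;                  // the part of b actually absorbed into s
            e = (a - (s - bv)) + (b - bv);      // what was lost to rounding
        }

        // Cascaded summation: keep a running sum plus the sum of all rounding
        // errors; adding the two at the end refines the result.
        double cascaded_sum(const std::vector<double> &x) {
            double s = 0.0, err = 0.0;
            for (double v : x) {
                double e;
                two_sum(s, v, s, e);
                err += e;                       // the errors are tiny, a plain sum suffices here
            }
            return s + err;
        }

        int main() {
            std::vector<double> x = {1e16, 1.0, -1e16, 1.0};   // exact sum is 2
            std::printf("naive: %g  corrected: %g\n",
                        ((1e16 + 1.0) - 1e16) + 1.0, cascaded_sum(x));
        }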

    Exploration architecturale de l'accumulateur de Kulisch

    Sums of products computed in floating-point arithmetic accumulate rounding errors that can degrade the accuracy of the result. To address this, Kulisch proposed using an internal accumulator large enough to cover the full range of floating-point exponents, so that additions never need to be rounded. This architecture never made it into mainstream processors, as it was considered too slow and/or too expensive in resources. It can nevertheless be attractive on FPGAs, for two reasons. First, a non-standard floating-point format smaller than 32 bits can be used; the low precision of the format is compensated by the exactness of the accumulation, in an architecture whose size remains reasonable. Second, the addition of floating-point numbers in such an accumulator is associative, which, in a high-level synthesis flow, enables optimizations that are forbidden with standard floating point. This work therefore compares several implementations of the Kulisch accumulator, two of which are novel. These architectures are implemented in a fully configurable C++ generator producing code compatible with Vivado HLS. Comparisons on Xilinx Kintex 7 FPGAs show an improvement over Kulisch's solution in both area and speed. A comparison with classical floating-point implementations also shows interesting trade-offs

    Design-space exploration for the Kulisch accumulator

    Floating-point sums and dot products accumulate rounding errors that may render the result very inaccurate. To address this, Kulisch proposed to use an internal accumulator large enough to cover the full exponent range of floating-point. With it, sums and dot products become exact operations. This idea failed to materialize in general-purpose processors, as it was considered too slow and/or too expensive in terms of resources. It may however be an interesting option in reconfigurable computing, where a designer may use smaller, more resource-efficient floating-point formats, knowing that sums and dot products will be exact. Another motivation of this work is that these exact operations, contrary to classical floating-point ones, are associative, which enables better compiler optimizations. This work therefore compares, in the context of modern FPGAs, several implementations of the Kulisch accumulator: three proposed by Kulisch and two novel ones. These architectures are implemented in a VivadoHLS-compliant C++ generator that is fully customizable. Comparisons targeting Xilinx's Kintex 7 FPGAs show improvement over Kulisch's proposal in both area and speed. In single precision, compared with a naive use of classical operators, the proposed accumulator runs at a similar frequency and consumes 10x more resources, but reduces the overall latency of a large dot product by 25x while vastly improving accuracy
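
    Neither the generator nor the hardware architectures are reproduced here; the following is a plain-software C++ sketch of the underlying idea for binary32 inputs: a wide two's-complement fixed-point register covering the whole float exponent range, into which every value is added exactly. The class name, the limb layout and the 320-bit width are my own illustrative choices (277 bits is the minimum span for binary32 sums), and the long limb-wise carry chain written as a loop below is, broadly speaking, the part that real implementations pipeline or segment.

        #include <array>
        #include <cmath>
        #include <cstdint>
        #include <cstdio>
        #include <cstring>

        // Behavioural model of a Kulisch-style accumulator for binary32 sums.
        struct KulischAcc {
            static constexpr int LIMBS = 5;          // 5 x 64 = 320 bits of fixed point
            std::array<uint64_t, LIMBS> acc{};       // two's complement, little-endian limbs

            void add(float x) {
                uint32_t bits;
                std::memcpy(&bits, &x, sizeof bits);
                uint32_t frac = bits & 0x7FFFFFu;
                int exp = (bits >> 23) & 0xFF;
                if (exp == 0xFF) return;             // sketch: ignore inf/NaN inputs
                uint64_t mant = exp ? (frac | 0x800000u) : frac;   // implicit leading bit
                int lsb = exp ? exp - 1 : 0;         // register bit 0 has weight 2^-149

                std::array<uint64_t, LIMBS> v{};     // the addend as a full-width value
                unsigned __int128 chunk = (unsigned __int128)mant << (lsb % 64);
                v[lsb / 64] = (uint64_t)chunk;
                v[lsb / 64 + 1] = (uint64_t)(chunk >> 64);   // mantissa may straddle limbs
                if (bits >> 31) {                    // negative input: negate the addend
                    unsigned carry = 1;
                    for (auto &w : v) {
                        unsigned __int128 t = (unsigned __int128)(~w) + carry;
                        w = (uint64_t)t;
                        carry = (unsigned)(t >> 64);
                    }
                }
                unsigned carry = 0;                  // limb-wise addition: this carry chain
                for (int i = 0; i < LIMBS; ++i) {    // is the critical path that a hardware
                    unsigned __int128 t =            // version must shorten or delay
                        (unsigned __int128)acc[i] + v[i] + carry;
                    acc[i] = (uint64_t)t;
                    carry = (unsigned)(t >> 64);
                }
            }

            double to_double() const {               // approximate readout, demo only
                std::array<uint64_t, LIMBS> t = acc;
                bool neg = t[LIMBS - 1] >> 63;
                if (neg) {                           // make positive, remember the sign
                    unsigned carry = 1;
                    for (auto &w : t) {
                        unsigned __int128 s = (unsigned __int128)(~w) + carry;
                        w = (uint64_t)s;
                        carry = (unsigned)(s >> 64);
                    }
                }
                long double r = 0.0L;
                for (int i = LIMBS - 1; i >= 0; --i)
                    r = r * 18446744073709551616.0L + (long double)t[i];   // 2^64 per limb
                r = std::ldexp(r, -149);
                return neg ? -(double)r : (double)r;
            }
        };

        int main() {
            KulischAcc a;
            for (int i = 0; i < 1000; ++i) { a.add(1e10f); a.add(3.25f); a.add(-1e10f); }
            std::printf("sum = %g\n", a.to_double());   // 3250; a float loop would lose the 3.25s
        }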

    Numerics of Discrete Element Simulations in Milli-g Environments: Challenges and Solutions

    JAXA scheduled a sample return mission to the Martian moon Phobos, which is expected to launch in 2024. This mission features a small rover, jointly developed by DLR and CNES, which will scout the landing site of the sampling spacecraft. One of the main challenges the rover will have to face during the mission is the largely unknown regolith surface of Phobos. Previous exploration missions, like NASA's Mars Exploration Rover missions, showed that a rover getting stuck in loose regolith poses a huge threat to the success of planetary exploration missions. To prevent this, DEM simulations are used to optimize the rover's wheel geometry for Phobos' surface. The DLR framework for such simulations is partsival, a collision-based particle and many-body simulation tool for GPUs. Since partsival was programmed with simulations in Lunar, Martian, or Earth environments in mind, the framework has to be adapted for milli-g environments. Without any adaptations, the simulation results show severe scattering and physically unrealistic behavior. These issues can be traced back to numerical problems introduced by the microgravity. The low gravity requires very slow movement of the rover wheel, resulting in a very long simulation duration and therefore the necessity to compute billions of time steps, which strongly favors error propagation. Furthermore, while simulations in Earth or Lunar environments can be conducted in single precision, milli-g environments turn out to require double-precision computations. This is because gravity influences many (but not all) parameters of partsival's physics model, changing them by orders of magnitude and causing loss-of-significance errors. partsival can be adapted to milli-g environments by adjusting the macroscopic soil stiffness to allow for larger time steps while at the same time increasing the numerical precision. Thus, deterministic results can be computed while maintaining fast, efficient computation
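
    The precision argument can be illustrated outside the framework itself: under milli-g gravity, the per-step velocity increment g*dt is many orders of magnitude smaller than the stored state, and in single precision the update can be rounded away completely. The sketch below is a generic illustration, not partsival code; the gravity value, time step and step count are arbitrary stand-ins.

        #include <cstdio>

        // Loss of significance in a milli-g explicit-Euler velocity update:
        // the increment g*dt is far below one ulp of the stored velocity in
        // single precision, so it is silently dropped at every step.
        template <typename Real>
        Real integrate_velocity(Real v0, Real g, Real dt, long steps) {
            Real v = v0;
            for (long i = 0; i < steps; ++i)
                v += g * dt;                         // one velocity component, constant gravity
            return v;
        }

        int main() {
            const double g  = 9.81 * 6e-4;           // ~0.6 milli-g, roughly Phobos-scale (illustrative)
            const double dt = 1e-6;                  // small DEM time step, s
            const long   steps = 10'000'000;         // real runs need billions of steps

            float  vf = integrate_velocity<float >(1.0f, (float)g, (float)dt, steps);
            double vd = integrate_velocity<double>(1.0,         g,        dt, steps);

            std::printf("float : %.9g m/s\n",  vf);  // stays at 1.0: the increment vanished
            std::printf("double: %.17g m/s\n", vd);  // ~1.0589: the change is retained
        }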

    Optimisations arithmétiques et synthèse de haut niveau

    High-level synthesis (HLS) tools offer increased productivity for FPGA programming. However, due to their relatively young nature, they still lack many arithmetic optimizations. This thesis proposes safe arithmetic optimizations that should always be applied. Some of these optimizations are simple operator specializations that follow the C semantics. Others require lifting the semantics embedded in high-level input languages, inherited from software programming, to obtain an improved accuracy/cost/performance ratio. To demonstrate this claim, the sum-of-products of floating-point numbers is used as a case study. The sum is performed in a fixed-point format tailored to the application, according to the context in which the operator is instantiated. In some cases, there is not enough information about the input data to tailor the fixed-point accumulator. The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range. This thesis explores different strategies for implementing such a large accumulator, including new ones. The use of a 2's complement representation instead of sign+magnitude is shown to save resources and to reduce the accumulation loop delay. Based on a tapered precision scheme and an exact accumulator, the posit number system claims to be a candidate to replace the IEEE floating-point format. A thorough analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators. Their cost remains much higher than that of their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows one code base to be deployed on multiple tools. This library implements a strongly typed, custom-size integer type alongside a set of optimized custom operators.
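
    The compatibility layer itself is not described beyond this summary; as a hint of what a strongly typed, custom-size integer type can look like, here is a minimal stand-alone C++ sketch. The class name and its two operations are my own illustrative choices, not the thesis library's API, and real HLS types (such as Vivado's arbitrary-precision integers) are considerably richer.

        #include <cstdint>
        #include <cstdio>

        // A strongly typed, custom-width unsigned integer: the width is part of
        // the type, values are kept reduced modulo 2^W, and mixing widths without
        // an explicit conversion does not compile.
        template <unsigned W>
        class ap_uint_sketch {
            static_assert(W >= 1 && W <= 63, "sketch limited to 63 bits");
            uint64_t v;                                          // always masked to W bits
            static constexpr uint64_t mask = (1ull << W) - 1;
        public:
            constexpr explicit ap_uint_sketch(uint64_t x = 0) : v(x & mask) {}
            constexpr uint64_t value() const { return v; }

            // Same-width addition wraps modulo 2^W, mimicking hardware registers.
            friend constexpr ap_uint_sketch operator+(ap_uint_sketch a, ap_uint_sketch b) {
                return ap_uint_sketch(a.v + b.v);
            }
            // A full-width (carry-preserving) addition must be requested explicitly.
            template <unsigned W2>
            constexpr ap_uint_sketch<(W > W2 ? W : W2) + 1> add_full(ap_uint_sketch<W2> b) const {
                return ap_uint_sketch<(W > W2 ? W : W2) + 1>(v + b.value());
            }
        };

        int main() {
            ap_uint_sketch<12> a(4000), b(200);
            auto wrapped = a + b;             // 4200 mod 4096 = 104, still a 12-bit value
            auto full    = a.add_full(b);     // 13-bit result, 4200, no bits lost
            std::printf("wrapped=%llu full=%llu\n",
                        (unsigned long long)wrapped.value(), (unsigned long long)full.value());
            // ap_uint_sketch<13> c = a + b;  // does not compile: widths are part of the type
        }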

    Accuracy, Cost and Performance Trade-Offs for Streaming Set-Wise Floating Point Accumulation on FPGAs

    The set-wise summation operation is perhaps one of the most fundamental and widely used operations in scientific applications. In these applications, maintaining the accuracy of the summation is also important, as floating-point operations have inherent errors associated with them. Designing floating-point accumulators presents a unique set of challenges: double-precision addition is usually deeply pipelined, and without special micro-architectural or data scheduling techniques, a data hazard exists on the accumulated value. There have been several efforts to design floating-point accumulators and accurate summation architectures using different algorithms on FPGAs, but these problems have been dealt with separately. In this dissertation, we present a general-purpose reduction circuit architecture which addresses the issues of data hazards and accuracy in set-wise floating-point summation. The reduction circuit architecture is parameterizable and can be scaled according to the depth of the adder pipeline, and the dynamic scheduling logic it uses keeps its resource requirements low. We also study various methods to improve the accuracy of summation of floating-point numbers. We have implemented four designs, with the reduction circuit architecture serving as their common framework. Two of the designs, AEC and AECSA, are based on compensated summation, while the two designs called EPRC80 and EPRC128 implement set-wise floating-point accumulation in extended precision. We present and compare the accuracy and cost (operating frequency and resource requirements) trade-offs associated with these designs. On the basis of our experiments, we find that these designs achieve significantly better accuracy. Three of the designs (AEC, EPRC80 and EPRC128) operate at around 180 MHz on a Xilinx Virtex 5 FPGA, which is comparable to the baseline reduction circuit, while AECSA operates at a 28% lower frequency. The increase in resource requirements ranges from 41% to 320%. We conclude that accuracy can be achieved at the expense of more resources while the operating frequency can be maintained
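
    None of the four hardware designs is detailed in this summary; the plain C++ sketch below only illustrates, in software, the two ideas they combine: keeping several independent partial sums so that a deeply pipelined adder never stalls on the running total, and accumulating in a wider format than the inputs. The pipeline depth of 8 and the use of long double as the extended precision are stand-in choices, not the dissertation's parameters.

        #include <array>
        #include <cstdio>
        #include <vector>

        // Software analogue of a streaming set-wise accumulator wrapped around a
        // deeply pipelined adder: one partial sum per pipeline slot removes the
        // loop-carried dependence; the partials are folded once the set ends.
        constexpr int LATENCY = 8;                          // stand-in for the adder pipeline depth

        double set_wise_sum(const std::vector<float> &stream) {
            std::array<long double, LATENCY> partial{};     // extended-precision partials
            for (std::size_t i = 0; i < stream.size(); ++i)
                partial[i % LATENCY] += stream[i];          // independent dependence chains
            long double total = 0.0L;                       // final reduction of the partials
            for (long double p : partial)
                total += p;
            return (double)total;
        }

        int main() {
            std::vector<float> v(1 << 20, 0.1f);            // 2^20 copies of 0.1f
            std::printf("sum = %.10g\n", set_wise_sum(v));  // ~104857.6: 0.1f is slightly above 0.1
        }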

    Efficient Domain Partitioning for Stencil-based Parallel Operators

    Partial Differential Equations (PDEs) are used ubiquitously in modelling natural phenomena. It is generally not possible to obtain an analytical solution, and hence they are commonly discretized using schemes such as the Finite Difference Method (FDM) and the Finite Element Method (FEM), converting the continuous PDE to a discrete system of sparse algebraic equations. The solution of this system can be approximated using iterative methods, which are better suited to many sparse systems than direct methods. In this thesis we use the FDM to discretize linear, second order, Elliptic PDEs and consider parallel implementations of standard iterative solvers. The dominant paradigm in this field is distributed-memory parallelism, which requires the FDM grid to be partitioned across the available computational cores. The orthodox approach to domain partitioning aims to minimize only the communication volume and achieve perfect load balance on each core. In this work, we re-examine and challenge this traditional method of domain partitioning and show that for well load-balanced problems, minimizing only the communication volume is insufficient for obtaining optimal domain partitions. To this end we create a high-level, quasi-cache-aware mathematical model that quantifies cache misses at the sub-domain level and minimizes them to obtain families of high-performing domain decompositions. To our knowledge this is the first work that optimizes domain partitioning by analyzing cache misses, establishing a relationship between cache misses and domain partitioning. To place our model in its true context, we identify and qualitatively examine multiple other factors, such as the Least Recently Used policy, Cache Line Utilization and Vectorization, that influence the choice of optimal sub-domain dimensions. Since the convergence rate of point iterative methods, such as Jacobi, on uniform meshes is not acceptable at high mesh resolution, we extend the model to Parallel Geometric Multigrid (GMG). GMG is a multilevel, iterative, optimal algorithm for numerically solving Elliptic PDEs. Adaptive Mesh Refinement (AMR) is another multilevel technique that allows local refinement of a global mesh based on parameters such as error estimates or geometric importance. We study a massively parallel, multiphysics, multi-resolution AMR framework called BoxLib, and implement and discuss our model on single-level and adaptively refined meshes, respectively. We conclude that “close to 2-D” partitions are optimal for stencil-based codes on structured 3-D domains and that it is necessary to optimize for both cache misses and communication volume. We advise that, in light of the evolving hardware-software ecosystem, there is an imperative need to re-examine conventional domain partitioning strategies
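
    The dissertation's cost model is not given in this summary; the toy comparison below only illustrates why communication volume alone cannot discriminate between sub-domain shapes of equal load. It pairs the halo size of a box (the per-iteration communication volume) with a deliberately crude cache-line count for one 7-point stencil sweep; both formulas, and the assumptions of 8-byte values, 64-byte lines, one ghost layer and a partially used line at each row end, are my own simplifications, not the thesis model.

        #include <cstdio>

        // Toy cost model for choosing sub-domain dimensions of a 7-point stencil sweep.
        struct Shape { double nx, ny, nz; };         // nx is the unit-stride direction

        // Faces of the box: cells exchanged with neighbours each iteration.
        double halo_cells(Shape s) {
            return 2.0 * (s.nx * s.ny + s.ny * s.nz + s.nx * s.nz);
        }

        // Cache lines streamed per sweep: one row per (y,z) pair, including the
        // ghost layer, with a partially used line charged at each row end.
        double line_estimate(Shape s) {
            double lines_per_row = (s.nx + 2.0) / 8.0 + 1.0;
            return (s.ny + 2.0) * (s.nz + 2.0) * lines_per_row;
        }

        int main() {
            // Three sub-domains with the same load: 32^3 cells per core.
            Shape shapes[] = { {32, 32, 32}, {128, 16, 16}, {8, 64, 64} };
            for (Shape s : shapes)
                std::printf("%4.0f x %3.0f x %3.0f   halo = %6.0f cells   est. lines = %6.0f\n",
                            s.nx, s.ny, s.nz, halo_cells(s), line_estimate(s));
            // The cube minimizes communication volume, but the box elongated along the
            // unit-stride direction touches fewer cache lines: the two criteria pull in
            // different directions, which is the tension the thesis quantifies rigorously.
        }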