144 research outputs found

    Accelerated deconvolution of radio interferometric images using orthogonal matching pursuit and graphics hardware

    Deconvolution of native radio interferometric images constitutes a major computational component of the radio astronomy imaging process. An efficient and robust deconvolution operation is essential for reconstruction of the true sky signal from measured correlator data. Traditionally, radio astronomers have mostly used the CLEAN algorithm, and variants thereof. However, the techniques of compressed sensing provide a mathematically rigorous framework within which deconvolution of radio interferometric images can be implemented. We present an accelerated implementation of the orthogonal matching pursuit (OMP) algorithm (a compressed sensing method) that makes use of graphics processing unit (GPU) hardware, and show significant accuracy improvements over the standard CLEAN. In particular, we show that OMP correctly identifies more sources than CLEAN, identifying up to 82% of the sources in 100 test images, while CLEAN only identifies up to 61% of the sources. In addition, the residual after source extraction is 2.7 times lower for OMP than for CLEAN. Furthermore, the GPU implementation of OMP performs around 23 times faster than a 4-core CPU
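    To make the greedy selection step concrete, below is a minimal NumPy sketch of a generic OMP loop applied to a linear model y = A x, where the columns of A would play the role of PSF-shifted point-source responses. It is an illustration only: the dictionary construction, stopping criteria and GPU offload used in the work above are not shown, and all names are hypothetical.

```python
import numpy as np

def omp(A, y, n_sources, tol=1e-6):
    """Minimal orthogonal matching pursuit sketch.

    A : (m, n) dictionary matrix (e.g. columns could be the PSF shifted
        to each candidate pixel position); y : (m,) dirty-image vector.
    Returns a coefficient vector with at most n_sources non-zero entries.
    """
    x = np.zeros(A.shape[1])
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(n_sources):
        # Greedy step: pick the column most correlated with the residual.
        k = int(np.argmax(np.abs(A.T @ residual)))
        if k not in support:
            support.append(k)
        # Least-squares fit on the selected columns (the "orthogonal" part).
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coeffs
    return x
```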

    Accelerating radio transient detection using the Bispectrum algorithm and GPGPU

    Modern radio interferometers such as those in the Square Kilometre Array (SKA) project are powerful tools for discovering completely new classes of astronomical phenomena. Among these phenomena are radio transients: bursts of electromagnetic radiation whose study is an exciting area of research, since localizing pulsars (transient emitters) allows physicists to test and formulate theories on strong gravitational forces. Current methods for detecting transients require an image of the sky to be produced at every time step; as interferometers grow and provide ever larger data sets, the computational demands of producing these images become infeasible. Law and Bower (2012) formulated a different approach using a closure quantity known as the "bispectrum": the product of visibilities around a closed loop of antennas. The proposed algorithm has been shown to be easily parallelized and well suited to graphics processing units (GPUs). Recent advances in many-core technologies such as GPUs have delivered significant performance gains for many scientific applications, but a GPU implementation of the bispectrum algorithm had yet to be explored. In this thesis, we present a number of modified implementations of the bispectrum algorithm, exploiting both instruction-level and data-level parallelism. First, a multi-threaded CPU version is developed in C++ using OpenMP and then compared to a GPU version developed using the Compute Unified Device Architecture (CUDA). To verify the validity of the implementations, they were first run on simulated data created with MeqTrees, a tool for simulating transients developed by the SKA; thereafter, data from the Karl G. Jansky Very Large Array (JVLA) containing the B0355+54 pulsar was used to test them on real data. This research concludes that the bispectrum algorithm is well suited to both CPU and GPU implementations: the multi-threaded 4-core CPU implementation achieved a 3.2x speed-up over a single-threaded implementation, and the GPU implementation on a GTX 670 achieved about a 20x speed-up over the multi-threaded CPU implementation. These results show that the bispectrum algorithm opens the door to a series of efficient transient surveys suitable for modern data-intensive radio interferometers
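    As a point of reference, the closure quantity itself is simple to compute: for every antenna triple (i, j, k) the bispectrum is the product of the visibilities around the closed loop, V_ij · V_jk · V_ki. Below is a minimal sketch, assuming a dense per-baseline visibility array for a single time and frequency sample; the detection statistic, thresholding and GPU mapping from the thesis are not reproduced, and the names are illustrative.

```python
import numpy as np
from itertools import combinations

def mean_bispectrum(vis, n_ant):
    """Mean bispectrum over all closed antenna triples for one sample.

    vis : complex (n_ant, n_ant) array with vis[i, j] the visibility on
          baseline (i, j); only entries with i < j are assumed filled.
    """
    acc = 0.0 + 0.0j
    triples = list(combinations(range(n_ant), 3))
    for i, j, k in triples:
        # Closure product around the loop i -> j -> k -> i
        # (V_ki is the conjugate of V_ik).
        acc += vis[i, j] * vis[j, k] * np.conj(vis[i, k])
    return acc / len(triples)
```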

    An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

    In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore's law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading hardware control complexity for hundreds to thousands of parallel simple processing elements, operating at a clock speed low enough to allow the efficiency gains of near-threshold-voltage operation. Performance is therefore dependent on exploiting a degree of fine-grained parallelism currently found only in GPGPUs, but in a manner that is less restrictive in the range of application domains. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling, which pushes the burden of control primarily onto the software, and specifically the compiler, rather than onto the programmer or an application-specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core already exists; this work implements a many-core architecture, Fynbos, to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow. Comparing theoretical peak performance and real performance in a case-study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement tactics that mitigate the costs of static scheduling, owing to a lack of support for such tactics in the compiler
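    For readers unfamiliar with static scheduling, the toy sketch below shows the basic idea of fixing, at compile time, which operation runs on which processing element in which time slot. It assumes unit-latency operations and a topologically ordered operation list, and is in no way the actual Fynbos compiler or architecture; every name here is hypothetical.

```python
def static_schedule(ops, deps, n_pe):
    """Toy static list scheduler: assign each op to a (time slot, PE).

    ops  : operation names, assumed topologically ordered.
    deps : dict mapping an op to the ops it depends on.
    n_pe : number of parallel processing elements per time slot.
    """
    slot_of = {}
    schedule = []  # schedule[t] is the list of ops issued in slot t
    for op in ops:
        # Earliest slot after all producers have issued.
        earliest = 1 + max((slot_of[d] for d in deps.get(op, [])), default=-1)
        t = earliest
        while True:
            while len(schedule) <= t:
                schedule.append([])
            if len(schedule[t]) < n_pe:  # free PE in this slot?
                schedule[t].append(op)
                slot_of[op] = t
                break
            t += 1
    return schedule

# Example: a small dependency chain scheduled onto 2 PEs.
print(static_schedule(["a", "b", "c", "d"], {"c": ["a", "b"], "d": ["c"]}, 2))
```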

    Parallel computing, benchmarking and ATLAS software on ARM

    This thesis explores the use of the ARM architecture in high-energy physics computing. ARM processors are predominantly found in smartphones and mobile tablets. Results from benchmarks performed on the armv7l architecture are presented; these provide qualitative data as well as confirmation that specialized high-energy physics software does run on ARM. The thesis presents the first-ever port of the ATLAS software stack to the ARM architecture, together with the issues that ensued, and introduces a new framework, ANA, which facilitates the compilation of the ATLAS software stack on ARM

    Accelerated coplanar facet radio synthesis imaging

    Imaging in radio astronomy entails the Fourier inversion of the relation between the sampled spatial coherence of an electromagnetic field and the intensity of its emitting source. This inversion is normally computed by performing a convolutional resampling step and applying the Inverse Fast Fourier Transform, because this leads to computational savings. Unfortunately, the resulting planar approximation of the sky is only valid over small regions. When imaging over wider fields of view, and in particular when using telescope arrays with long non-East-West components, significant distortions are introduced into the computed image. We propose a coplanar faceting algorithm, where the sky is split up into many smaller images. Each of these narrow-field images is further corrected using a phase-correcting technique known as w-projection. This eliminates the projection error along the edges of the facets and ensures approximate coplanarity. The combination of faceting and w-projection alleviates the memory constraints of previous w-projection implementations. We compared the scaling performance of both single- and double-precision resampled images in an optimized multi-threaded CPU implementation and in a GPU implementation that uses a memory-access-limiting work-distribution strategy. We found that such a w-faceting approach scales slightly better than a traditional w-projection approach on GPUs. We also found that double-precision resampling on GPUs is about 71% slower than its single-precision counterpart, making double-precision resampling on GPUs less power-efficient than CPU-based double-precision resampling. Lastly, we have seen that employing only single precision in the resampling summations produces significant error in continuum images for a MeerKAT-sized array over long observations, especially when employing the large convolution filters necessary to create large images
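    As background to the precision discussion above, the sketch below illustrates the convolutional resampling (gridding) summation that precedes the inverse FFT. It is a deliberately simplified, hypothetical example: w-terms, facet phase rotations, visibility weighting and the GPU work-distribution strategy from the abstract are all omitted, and every name is illustrative.

```python
import numpy as np

def grid_visibilities(u, v, vis, n, cell, kernel, half_support):
    """Toy convolutional resampling of visibilities onto an n x n grid.

    u, v   : baseline coordinates in wavelengths.
    vis    : complex visibilities.
    cell   : uv-cell size in wavelengths.
    kernel : callable giving the separable filter value at a fractional
             grid offset, assumed non-zero only within half_support cells.
    """
    grid = np.zeros((n, n), dtype=np.complex128)
    for uu, vv, V in zip(u, v, vis):
        gu, gv = uu / cell + n // 2, vv / cell + n // 2  # continuous grid coords
        for dv in range(-half_support, half_support + 1):
            for du in range(-half_support, half_support + 1):
                iu, iv = int(round(gu)) + du, int(round(gv)) + dv
                if 0 <= iu < n and 0 <= iv < n:
                    # This accumulation is the summation whose single- vs
                    # double-precision behaviour the abstract discusses.
                    grid[iv, iu] += kernel(gu - iu) * kernel(gv - iv) * V
    return grid

# The (unnormalised) dirty image would then be
# np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid))).real
```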

    Combining static and dynamic approaches to model the performance of HPC loops

    The complexity of CPUs has increased considerably since their beginnings, introducing mechanisms such as register renaming, out-of-order execution, vectorization, prefetchers and multi-core environments to keep performance rising with each product generation. However, so has the difficulty in making proper use of all these mechanisms, or even in evaluating whether a program makes good use of a machine, whether users' needs match a CPU's design, or, for CPU architects, how each feature really affects customers. This thesis focuses on increasing the observability of potential bottlenecks in HPC computational loops and of how they relate to each other in modern microarchitectures. We first introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) to obtain detailed performance metrics on small codelets in various execution scenarios. We then present PAMDA, a performance analysis methodology leveraging elements obtained from codelet analysis to detect potential performance problems in HPC applications and help resolve them. A work extending the Cape linear model to better cover Sandy Bridge and give it more flexibility for HW/SW codesign purposes is also described; it is directly used in VP3, a tool evaluating the performance gains that vectorizing loops could provide. Finally, we describe UFS, an approach combining static analysis and cycle-accurate simulation to very quickly estimate a loop's execution time while accounting for out-of-order limitations in modern CPUs
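    The sketch below illustrates, in a very reduced form, the kind of static-plus-dynamic reasoning described above: a static per-resource cycle bound is compared with a measured per-iteration cycle count to suggest a likely limiter. The resource names and numbers are purely illustrative assumptions and do not come from CQA, DECAN or Cape.

```python
def bottleneck_report(static_cycles, measured_cycles):
    """Toy combination of static and dynamic loop analysis.

    static_cycles   : dict mapping a resource name (e.g. "FP ports",
                      "load ports", "front-end") to the per-iteration cycle
                      cost predicted if that resource alone were the limit.
    measured_cycles : per-iteration cycles measured at run time.
    """
    static_bound = max(static_cycles.values())
    limiter = max(static_cycles, key=static_cycles.get)
    return {
        "predicted_limiter": limiter,
        "static_bound_cycles": static_bound,
        "measured_cycles": measured_cycles,
        # A ratio well above 1 hints at effects the static model misses
        # (e.g. out-of-order or memory limitations).
        "dynamic_over_static": measured_cycles / static_bound,
    }

print(bottleneck_report({"FP ports": 4.0, "load ports": 6.0, "front-end": 3.0},
                        measured_cycles=9.5))
```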

    Simulation of hydrodynamics and sediment transport patterns in Delaware Bay

    This research seeks to increase understanding of the hydrodynamic processes influencing salinity intrusion and sediment transport patterns by simulating the complex flows in the Delaware Estuary. For this purpose, a three-dimensional numerical model is developed for the tidal portion of the Delaware Estuary using the UnTRIM hydrodynamic kernel. The model extends from Trenton, NJ, south past the inlet at Cape May, NJ, and incorporates a large portion of the continental shelf. The simulation efforts are focused on summer 2003. A variable, harmonically decomposed water-level boundary condition built from three diurnal (K1, Q1, O1) and four semidiurnal (K2, S2, N2, M2) constituents is used to regenerate the observed tidal signals in the bay. The effect of forcing by the Chesapeake Bay through the Chesapeake-Delaware canal is also modeled, and the major forcings such as inflow and wind are used to better reproduce the observed characteristics. Various turbulence closure models are compared for use in the Delaware Estuary to best represent the salinity intrusion patterns. In particular, seven different turbulence closures, five of which are two-equation closure models, are used for comparison. Four of these models are implemented in the UnTRIM hydrodynamic code using the Generic Length Scale (GLS) approach, which mimics the models through its parameter combinations; the original Mellor-Yamada level 2.5 code is used as the fifth. The water levels are compared with data available from National Oceanic and Atmospheric Administration observation stations, and harmonic analyses of the observations and simulations are performed. All turbulence models perform similarly in representing the tidal conditions. Salinity time-series data are available at the Ship John Shoal Light Station for the 62-day simulation period; in addition, a survey performed by the University of Delaware along the main shipping channel in June 2003 is available. Simulations with different turbulence closures yielded substantially different results. Among the seven closures compared, the k-ε parameterization of GLS is found to best represent the observed salinity characteristics. The k-ε model is used in the sediment transport modeling, and the model results are compared to the available sediment data from a survey performed in spring 2003. The location of the turbidity maximum is accurately identified by the k-ε model. Ph.D., Civil, Architectural & Environmental Engineering -- Drexel University, 200
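    The harmonically decomposed boundary condition mentioned above amounts to summing a handful of tidal constituents, eta(t) = sum_i A_i cos(omega_i t - phi_i). A small sketch with the constituents named in the abstract follows; the angular speeds are the standard astronomical values, but the amplitudes and phases are placeholders, not values from the Delaware Bay model.

```python
import numpy as np

# name: (angular speed in deg/hour, amplitude in m, phase in deg).
# Amplitudes and phases are hypothetical; a real run would take them from
# a harmonic analysis of the open-boundary station.
CONSTITUENTS = {
    "M2": (28.9841042, 0.75, 10.0),
    "S2": (30.0000000, 0.15, 35.0),
    "N2": (28.4397295, 0.17, 355.0),
    "K2": (30.0821373, 0.05, 30.0),
    "K1": (15.0410686, 0.10, 180.0),
    "O1": (13.9430356, 0.08, 170.0),
    "Q1": (13.3986609, 0.02, 160.0),
}

def boundary_water_level(t_hours):
    """Harmonic water-level boundary: eta(t) = sum_i A_i cos(w_i t - phi_i)."""
    t = np.asarray(t_hours, dtype=float)
    eta = np.zeros_like(t)
    for speed, amp, phase in CONSTITUENTS.values():
        eta += amp * np.cos(np.radians(speed * t - phase))
    return eta

t = np.arange(0.0, 25.0, 0.5)  # one day, half-hourly
print(boundary_water_level(t)[:4])
```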

    Accurate EMC engineering on realistic platforms using an integral equation domain decomposition approach

    This article investigates the efficiency, accuracy and versatility of a surface integral equation (SIE) multisolver scheme to address very complex and large-scale radiation problems including multiple scale features, in the context of realistic electromagnetic compatibility (EMC)/electromagnetic interference (EMI) studies. The tear-and-interconnect domain decomposition (DD) method is applied to properly decompose the problem into multiple subdomains according to their material, geometrical, and scale properties, while different materials and arbitrarily shaped connections between them can be combined by using the so-called multiregion vector basis functions. The SIE-DD approach has been widely reported in the literature, mainly applied to scattering problems or small radiation problems. Complementarily, in this article the focus is placed on realistic radiation problems, involving tens of antennas and sensors and including multiscale ingredients and multiple materials. Such problems are very demanding in terms of both convergence and computational resources. Through two realistic case studies, the proposed SIE-DD approach is shown to be a powerful electromagnetic modeling tool providing the accurate and fast solutions that are indispensable for rigorously accomplishing real-life EMC/EMI studies. Agencia Estatal de Investigación | Ref. TEC2017-85376-C2-1-R; Agencia Estatal de Investigación | Ref. TEC2017-85376-C2-2-
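    To convey the general flavour of solving a problem subdomain by subdomain, here is a toy block-iteration sketch on a partitioned dense linear system. It is only a generic illustration of domain-decomposition-style iteration under assumed well-conditioned, diagonally dominant blocks; it is not the tear-and-interconnect SIE-DD formulation or the multiregion basis functions used in the article, and all names are hypothetical.

```python
import numpy as np

def block_gauss_seidel(A, b, blocks, n_iter=50):
    """Toy subdomain-by-subdomain sweep on a partitioned dense system A x = b.

    blocks : list of index arrays, one per "subdomain" of unknowns.
    Convergence is only assumed here; it depends on the block structure of A.
    """
    x = np.zeros(b.shape, dtype=A.dtype)
    for _ in range(n_iter):
        for idx in blocks:
            # Coupling of this subdomain to the current state of the others.
            coupling = A[idx] @ x - A[np.ix_(idx, idx)] @ x[idx]
            # Local solve of the subdomain block with that coupling as excitation.
            x[idx] = np.linalg.solve(A[np.ix_(idx, idx)], b[idx] - coupling)
    return x
```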