662 research outputs found

Accelerating cross-correlation with GPUs

Author: Maděra Karel
Publication venue: Univerzita Karlova, Matematicko-fyzikální fakulta
Publication date: 01/01/2022

Cross-correlation is a commonly used tool in signal processing, with applications in pattern recognition, particle physics, electron tomography, and many other areas. For many of these applications, it is often the limiting factor on system performance due to its computational complexity. In this thesis, we analyze the cross-correlation algorithm and its optimization and parallelization possibilities. We then implement several optimizations of the definition-based algorithm, mainly focused on parallelization using the graphics processing unit (GPU). Even though the definition-based algorithm provides many possibilities for parallelization, the implementation needs to solve several problems, such as the algorithm's low arithmetic intensity. Furthermore, the problems differ between computation types, which include cross-correlating a pair of inputs, one input with many other inputs, or many inputs with many other inputs. Lastly, we compare the optimizations of the definition-based algorithm with the asymptotically faster and commonly used algorithm based on the Fast Fourier Transform (FFT). Depending on the total size of the data, we achieve parity between the two algorithms for matrix sizes ranging from 60x60 to 150x150, allowing performance improvements for systems using matrices smaller...

Department of Distributed and Dependable Systems, Faculty of Mathematics and Physics
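
For orientation, the definition-based algorithm that the abstract refers to can be sketched as below. This is a minimal pure-Python illustration and not the thesis's GPU implementation; the function name `cross_correlate_2d` and the shift and indexing convention are assumptions made for this example.

```python
def cross_correlate_2d(a, b):
    """Full 2D cross-correlation computed directly from the definition.

    Assumed convention (hypothetical, chosen for this sketch):
    out[dy + hb - 1][dx + wb - 1] = sum over (y, x) of a[y + dy][x + dx] * b[y][x],
    where out-of-range elements of `a` are treated as zero.
    """
    ha, wa = len(a), len(a[0])
    hb, wb = len(b), len(b[0])
    out = [[0] * (wa + wb - 1) for _ in range(ha + hb - 1)]
    # Each output element (one shift) is independent of the others,
    # which is what makes the definition-based form easy to parallelize.
    for dy in range(-(hb - 1), ha):
        for dx in range(-(wb - 1), wa):
            total = 0
            for y in range(hb):
                ay = y + dy
                if not 0 <= ay < ha:
                    continue
                for x in range(wb):
                    ax = x + dx
                    if 0 <= ax < wa:
                        # One multiply-add per pair of loaded values.
                        total += a[ay][ax] * b[y][x]
            out[dy + hb - 1][dx + wb - 1] = total
    return out


a = [[1, 2], [3, 4]]
b = [[1, 0], [0, 1]]
result = cross_correlate_2d(a, b)
```

The inner loop performs one multiply-add for every two values it reads, which illustrates the low arithmetic intensity the abstract mentions. The work grows with the product of the two input sizes, which is why the FFT-based algorithm is asymptotically faster.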

CU Digital Repository

Multidimensional Range Queries on Modern Hardware

Author: Leser Ulf
Schäfer Patrick
Sprenger Stefan
Publication date: 14/05/2018

Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As for most index structures, the usefulness of this approach depends on the selectivity of the queries, and conventional wisdom held that a simple scan beats MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom is largely based on evaluations that are almost two decades old, performed on data held on disk, applying IO-optimized data structures, and using single-core systems. The question is whether this rule of thumb still holds when multidimensional range queries (MDRQ) are performed on modern architectures with large main memories holding all data, multi-core CPUs, and data-parallel instruction sets. In this paper, we study whether and how much modern hardware influences the performance ratio between index structures and scans for MDRQ. To this end, we conservatively adapted three popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit features of modern servers and compared their performance to different flavors of parallel scans using multiple (synthetic and real-world) analytical workloads over multiple (synthetic and real-world) datasets of varying size, dimensionality, and skew. We find that all approaches benefit considerably from using main memory and parallelization, yet to varying degrees. Our evaluation indicates that, on current machines, scanning should be favored over parallel versions of classical MDIS even for very selective queries.
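
As a point of reference, the scan baseline discussed in the abstract can be sketched as a single pass that tests every point against the query box. This is a minimal sequential illustration, not the paper's parallel or SIMD implementation; the function name `range_scan` and the inclusive-bounds convention are assumptions made for this example.

```python
def range_scan(points, lower, upper):
    """Return the indices of all points inside the axis-aligned query box.

    `lower` and `upper` hold one inclusive bound per dimension
    (an assumed convention for this sketch).
    """
    matches = []
    for index, point in enumerate(points):
        # A point qualifies only if every coordinate lies within its bounds.
        if all(lo <= value <= hi for lo, value, hi in zip(lower, point, upper)):
            matches.append(index)
    return matches


points = [(1, 5), (3, 3), (6, 2), (4, 4)]
hits = range_scan(points, lower=(2, 2), upper=(5, 5))
```

A scan touches every point regardless of selectivity, but its access pattern is sequential and each point is tested independently, so the work can be split evenly across cores.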

arXiv.org e-Print Archive

Crossref

Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor

Author: Bailey D
Ganapathi A
IBM Blue Gene Team
Kamil S
Nguyen A
Peng L
Sosa C and International Business Machines Corporation
Williams S
Publication venue: 'SAGE Publications'
Publication date: 17/01/2012

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the Central Processing Unit (CPU). We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM Blue Gene/P supercomputer's PowerPC 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7x speedup over the best previously published results
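
To make the term concrete, the reference semantics of a 27-point stencil can be sketched as below. This is plain Python for illustration only and is unrelated to the paper's synthesized PowerPC 450 kernels; the function name `stencil27`, the uniform `weight`, and the choice to copy boundary cells unchanged are assumptions made for this example.

```python
def stencil27(grid, weight=1.0 / 27.0):
    """Apply a uniform-weight 27-point stencil to the interior of a 3D grid.

    `grid` is a nested list indexed as grid[z][y][x]. Boundary cells are
    copied unchanged (an assumed boundary treatment for this sketch).
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[grid[z][y][x] for x in range(nx)] for y in range(ny)] for z in range(nz)]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                total = 0.0
                # Visit the cell itself and its 26 neighbours (3 x 3 x 3 cube).
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            total += grid[z + dz][y + dy][x + dx]
                out[z][y][x] = weight * total
    return out


cube = [[[2.0] * 3 for _ in range(3)] for _ in range(3)]
smoothed = stencil27(cube, weight=1.0)
```

Each output cell reads 27 inputs in a regular pattern, which is the streaming memory access behavior the abstract describes.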

arXiv.org e-Print Archive

Crossref

Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime

Author: Chao Mei
Chris Harrison
Eric J. Bohm
Gengbin Zheng
James C. Phillips
Laxmikant V. Kale
Yanhua Sun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011

A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.

CiteSeerX

Crossref

On I/O Performance and Cost Efficiency of Cloud Storage: A Client's Perspective

Author: Hou Binbing
Publication venue: LSU Digital Commons
Publication date: 04/11/2019

Cloud storage has gained increasing popularity in the past few years. In cloud storage, data are stored in the service provider’s data centers; users access data via the network and pay fees based on service usage. For such a new storage model, our prior wisdom and optimization schemes for conventional storage may be neither valid nor applicable. In this dissertation, we focus on understanding and optimizing the I/O performance and cost efficiency of cloud storage from a client’s perspective. We first conduct a comprehensive study to gain insight into the I/O performance behaviors of cloud storage from the client side. Through extensive experiments, we have obtained several critical findings and useful implications for system optimization. We then design a client cache framework, called Pacaca, to further improve end-to-end performance of cloud storage. Pacaca seamlessly integrates parallelized prefetching and cost-aware caching by utilizing the parallelism potential and object correlations of cloud storage. In addition to improving system performance, we have also made efforts to reduce the monetary cost of using cloud storage services by proposing a latency- and cost-aware client caching scheme, called GDS-LC, which can achieve two optimization goals for using cloud storage services: low access latency and low monetary cost. Our experimental results show that our proposed client-side solutions significantly outperform traditional methods. Our study contributes to inspiring the community to reconsider system optimization methods in the cloud environment, especially for the purpose of integrating cloud storage into the current storage stack as a primary storage layer.

Louisiana State University

Structured Parallel Programming Using Trees (木を用いた構造化並列プログラミング)

Author: Shigeyuki Sato
佐藤重幸
Publication date: 02/09/2016

High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and is a key to parallel programming, general graphs are understandably difficult. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki’s work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspects. Specifically, we have dealt with two issues. First, we implemented loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use tree skeletons with no burden. Unfortunately, the practicality of tree skeletons has not improved. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen’s high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan’s formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We have therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.

電気通信大学 (The University of Electro-Communications), 201

Creative Repository of Electro-Communications

Locality-Sensitive and Re-use Promoting Personalized PageRank Computations

Author: Jung Hyun Kim
Sapino Maria Luisa
Selç
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Institutional Research Information System University of Turin

Energy-Aware High Performance Computing

Author: Dolz Manuel F.
Heuveline Vincent
I. Malossi A. Cristiano
Ludwig Thomas
Quintana-Orti Enrique S.
Reza Heidari M.
Wlotzka Martin
Publication venue: 'IntechOpen'
Publication date: 22/03/2017

High performance computing centres consume substantial amounts of energy to power large-scale supercomputers and the necessary building and cooling infrastructure. Recently, considerable performance gains resulted predominantly from developments in multi-core, many-core and accelerator technology. Computing centres rapidly adopted this hardware to serve the increasing demand for computational power. However, further performance increases in large-scale computing systems are limited by the aggregate energy budget required to operate them. Power consumption has become a major cost factor for computing centres. Furthermore, energy consumption results in carbon dioxide emissions, a hazard for the environment and public health; and heat, which reduces the reliability and lifetime of hardware components. Energy efficiency is therefore crucial in high performance computing

IntechOpen