1,195 research outputs found
Regular and almost universal hashing: an efficient implementation
Random hashing can provide guarantees regarding the performance of data
structures such as hash tables---even in an adversarial setting. Many existing
families of hash functions are universal: given two data objects, the
probability that they have the same hash value is low given that we pick hash
functions at random. However, universality fails to ensure that all hash
functions are well behaved. We further require regularity: when picking data
objects at random they should have a low probability of having the same hash
value, for any fixed hash function. We present the efficient implementation of
a family of non-cryptographic hash functions (PM+) offering good running times,
good memory usage as well as distinguishing theoretical guarantees: almost
universality and component-wise regularity. On a variety of platforms, our
implementations are comparable to the state of the art in performance. On
recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit
outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization
through SIMD instructions (e.g., AVX2) and optimizations for superscalar
execution.Comment: accepted for publication in Software: Practice and Experience in
September 201
Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks
This paper presents an in-depth analysis of Intel's Haswell microarchitecture
for streaming loop kernels. Among the new features examined is the dual-ring
Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements
as new and improved execution units, as well as improvements throughout the
memory hierarchy. The Execution-Cache-Memory diagnostic performance model is
used together with a generic set of microbenchmarks to quantify the efficiency
of the microarchitecture. The set of microbenchmarks is chosen such that it can
serve as a blueprint for other streaming loop kernels.Comment: arXiv admin note: substantial text overlap with arXiv:1509.0311
Software-Based Side Channel Attacks and the Future of Hardened Microarchitecture
Side channel attack vectors found in microarchitecture of computing devices expose systems to potentially system-level breaches. This thesis consists of a comprehensive report on current exploits of this nature, describing their fundamental basis and usage, paving the way to further research into hardware mitigations that may be utilized to combat these and future vulnerabilities. It will discuss several modern software-based side channel attacks, describing the mechanisms they utilize to gain access to privileged information. Attack vectors will be exemplified, along with applicability to various architectures utilized in modern computing. Finally, discussion of how future architectural changes must successfully harden chips against attacks of this type will occur, ending with a reinforced call for development of these integral architectural revisions to resolve the threat
Using AVX2 Instruction Set to Increase Performance of High Performance Computing Code
In this paper we discuss new Intel instruction extensions - Intel Advance Vector Extensions 2 (AVX2) and what these bring to high performance computing (HPC). To illustrate this new systems utilizing AVX2 are evaluated to demonstrate how to effectively exploit AVX2 for HPC types of the code and expose the situation when AVX2 might not be the most effective way to increase performance
GenArchBench: Porting and Optimizing a Genomics Benchmark Suite to Arm-based HPC Processors
Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the second position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm irruption in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data has placed a significant bottleneck on the computational side. While the majority of genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines, let alone exploit the advantages of the newly introduced Scalable Vector Extensions (SVE). This thesis presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected a set of computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. The porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. All in all, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Moreover, our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically. In this work, we present the optimizations implemented in each kernel and a detailed performance evaluation and comparison of their performance on four different architectures (i.e., A64FX, Graviton3, Intel Xeon Platinum, and AMD EPYC). Additionally, as proof of the impact of this work, we study the performance improvement in a production-ready genomics pipeline using the GenArchBench optimized kernels
- …