34 research outputs found
Minotaur: A SIMD-Oriented Synthesizing Superoptimizer
Minotaur is a superoptimizer for LLVM's intermediate representation that
focuses on integer SIMD instructions, both portable and specific to x86-64. We
created it to attack problems in finding missing peephole optimizations for
SIMD instructions-this is challenging because there are many such instructions
and they can be semantically complex. Minotaur runs a hybrid synthesis
algorithm where instructions are enumerated concretely, but literal constants
are generated by the solver. We use Alive2 as a verification engine; to do this
we modified it to support synthesis and also to support a large subset of
Intel's vector instruction sets (SSE, AVX, AVX2, and AVX-512). Minotaur finds
many profitable optimizations that are missing from LLVM. It achieves limited
speedups on the integer parts of SPEC CPU2017, around 1.3%, and it speeds up
the test suite for the libYUV library by 2.2%, on average, and by 1.64x
maximum, when targeting an Intel Cascade Lake processor
Merkkijonohaun tekniikat ja algoritmit luonnollisen tekstin nopeaan hakuun
Merkkijonohaut ovat tietojenkäsittelytieteen yksi klassisista aiheista. Merkkijonohakua on tutkittu tietojenkäsittelytieteen alusta lähtien, ja ne ovat perustana hakukoneille, jotka ovat nykyiselle yhteiskunnalle tärkeitä. Merkkijonohaussa teksti on merkkijono, josta hakusanoja etsitään. Tarkastelen luonnollisen kielen merkkijonohakuja, jotka löytävät kaikki hakuosumat tarkalla merkkijonohaulla, eli hakuosumaa ei tule yhdenkään merkin erosta hakusanassa. Työssäni tutkin merkkijonohaun tekniikoita ja esitän nopeimpia hakualgoritmeja, kun tekstin esikäsittely sallitaan ja kun sitä ei sallita. Tutkimukseni jakautuu siis kolmeen osaan: merkkijonohaun tekniikat, nopeimmat hakualgoritmit ilman tekstin esikäsittelyä ja nopeimmat hakualgoritmit tekstin esikäsittelyllä.
Selvitän työssäni merkkijonohaun tekniikoita tarkastelemalla naiivia algoritmia, Boyer-Moore-algoritmia, bittirinnakkaisuutta käyttäviä algoritmeja, suodattavia algoritmeja ja suffiksitaulukon hakualgoritmeja. Otan näistä tarkempaan tarkasteluun naiivin algoritmin ja Boyer-Mooren. Osoitan, miten vertailujärjestyksen muutoksella naiivi algoritmi saadaan varsin tehokkaaksi, ja selitän, miten Boyer-Mooren käänteisestä vertailujärjestyksestä saadaan huonon merkin ja hyvän suffiksin sääntö havaintojen perusteella. Työssäni saan tekniikoiksi vertailujärjestyksen muuttaminen, huonon merkin sääntö, hyvän suffiksin sääntö, hakusanan esikäsittely, parhaan hakualgoritmin valitseminen, suodatus, bittirinnakkaisuus ja suffiksitietorakenne.
Selvitän nopeimmat merkkijonohakualgoritmit kirjallisuudesta. Saan työssäni selville, että EPSMA (Exact Packed String Matching AVX2) ja SSEFA (Streaming SIMD Extensions Filter AVX2) ovat nopeimpia merkkijonohakualgoritmeja, jotka eivät esikäsittele tekstiä mutta voivat esikäsitellä hakusanaa. Yhteistä näissä molemmissa algoritmeissa on se, että ne käyttävät suodatusta ja bittirinnakkaisuutta.
Kun sallitaan tekstin esikäsittely, voidaan käyttää suffiksitietorakennetta, jolla pystyy poistamaan merkkijonohaun aikavaatimuksen riippuvuuden tekstin pituudesta. Valitsen suffiksitietorakenteista suffiksitaulukon, ja esitän kirjallisuudesta useita tapoja hakea merkkijonoja suffiksitaulukolla. Nopeimman merkkijonohaun voi toteuttaa Burrows-Wheeler-muunnoksella, jonka avulla merkkijonohaut onnistuvat hakusanan pituuden suhteen lineaarisessa ajassa
State-of-the-art in Smith-Waterman Protein Database Search on HPC Platforms
Searching biological sequence database is a common and repeated task in bioinformatics and molecular biology. The Smith–Waterman algorithm is the most accurate method for this kind of search. Unfortunately, this algorithm is computationally demanding and the situation gets worse due to the exponential growth of biological data in the last years. For that reason, the scientific community has made great efforts to accelerate Smith–Waterman biological database searches in a wide variety of hardware platforms. We give a survey of the state-of-the-art in Smith–Waterman protein database search, focusing on four hardware architectures: central processing units, graphics processing units, field programmable gate arrays and Xeon Phi coprocessors. After briefly describing each hardware platform, we analyse temporal evolution, contributions, limitations and experimental work and the results of each implementation. Additionally, as energy efficiency is becoming more important every day, we also survey performance/power consumption works. Finally, we give our view on the future of Smith–Waterman protein searches considering next generations of hardware architectures and its upcoming technologies.Instituto de Investigación en InformáticaUniversidad Complutense de Madri
Recommended from our members
Efficient analysis and storage of large-scale genomic data
The impending advent of population-scaled sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using both hardware-accelerated vectorization, parallel and distributed
algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population-count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. In order to enable the use of this new operator and the canonical population count on any target machine I developed a unified low-level library using CPU dispatching to select the optimal method contingent on the available
instruction set architecture and the given input size at run-time. As a proof-of-principle application, I apply the positional population-count operator to computing quality control-related summary statistics for terabyte-scaled sequencing readsets with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersection using these operators and applied this framework to efficiently compute genome-wide linkage-disequilibrium in datasets with up to 67 million samples resulting in up to >60-fold improvements in speed for dense genotypic vectors and up to >250,000-fold savings in memory and >100,000-fold improvement in speed for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data and describe graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets and specialized algorithms for the genotype component of such datasets with >10,000-fold savings in memory compared to the current interchange format.Wellcome Trus
GenArchBench: Porting and Optimizing a Genomics Benchmark Suite to Arm-based HPC Processors
Arm usage has substantially grown in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the second position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm irruption in HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data has placed a significant bottleneck on the computational side. While the majority of genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines, let alone exploit the advantages of the newly introduced Scalable Vector Extensions (SVE). This thesis presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected a set of computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. The porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. All in all, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Moreover, our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically. In this work, we present the optimizations implemented in each kernel and a detailed performance evaluation and comparison of their performance on four different architectures (i.e., A64FX, Graviton3, Intel Xeon Platinum, and AMD EPYC). Additionally, as proof of the impact of this work, we study the performance improvement in a production-ready genomics pipeline using the GenArchBench optimized kernels
Scientific Programming and Computer Architecture
A variety of programming models relevant to scientists explained, with an emphasis on how programming constructs map to parts of the computer.What makes computer programs fast or slow? To answer this question, we have to get behind the abstractions of programming languages and look at how a computer really works. This book examines and explains a variety of scientific programming models (programming models relevant to scientists) with an emphasis on how programming constructs map to different parts of the computer's architecture. Two themes emerge: program speed and program modularity. Throughout this book, the premise is to "get under the hood," and the discussion is tied to specific programs. The book digs into linkers, compilers, operating systems, and computer architecture to understand how the different parts of the computer interact with programs. It begins with a review of C/C++ and explanations of how libraries, linkers, and Makefiles work. Programming models covered include Pthreads, OpenMP, MPI, TCP/IP, and CUDA.The emphasis on how computers work leads the reader into computer architecture and occasionally into the operating system kernel. The operating system studied is Linux, the preferred platform for scientific computing. Linux is also open source, which allows users to peer into its inner workings. A brief appendix provides a useful table of machines used to time programs. The book's website (https://github.com/divakarvi/bk-spca) has all the programs described in the book as well as a link to the html text
Recommended from our members
Analytical Query Execution Optimized for all Layers of Modern Hardware
Analytical database queries are at the core of business intelligence and decision support. To analyze the vast amounts of data available today, query execution needs to be orders of magnitude faster. Hardware advances have made a profound impact on database design and implementation. The large main memory capacity allows queries to execute exclusively in memory and shifts the bottleneck from disk access to memory bandwidth. In the new setting, to optimize query performance, databases must be aware of an unprecedented multitude of complicated hardware features. This thesis focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution for all layers of modern hardware. The hardware layers include the network across multiple machines, main memory and the NUMA interconnection across multiple processors, the multiple levels of caches across multiple processor cores, and the execution pipeline within each core. For the network layer, we introduce a distributed join algorithm that minimizes the network traffic. For the memory hierarchy, we describe partitioning variants aware to the dynamics of the CPU caches and the NUMA interconnection. To improve the memory access rate of linear scans, we optimize lightweight compression variants and evaluate their trade-offs. To accelerate query execution within the core pipeline, we introduce advanced SIMD vectorization techniques generalizable across multiple operators. We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features of many-core CPUs. In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance analytical query execution
Resiliency Mechanisms for In-Memory Column Stores
The key objective of database systems is to reliably manage data, while high query throughput and low query latency are core requirements. To date, database research activities mostly concentrated on the second part. However, due to the constant shrinking of transistor feature sizes, integrated circuits become more and more unreliable and transient hardware errors in the form of multi-bit flips become more and more prominent. In a more recent study (2013), in a large high-performance cluster with around 8500 nodes, a failure rate of 40 FIT per DRAM device was measured. For their system, this means that every 10 hours there occurs a single- or multi-bit flip, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being exploited actively through the RowHammer attack. It was shown that memory cells are more prone to bit flips than logic gates and several surveys found multi-bit flip events in main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business related data and query intermediate results are kept solely in fast main memory, such systems are in great danger to deliver corrupt results to their users. Hardware techniques can not be scaled to compensate the exponentially increasing error rates. In other domains, there is an increasing interest in software-based solutions to this problem, but these proposed methods come along with huge runtime and/or storage overheads. These are unacceptable for in-memory data management systems.
In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, suitable to the requirements of in-memory data management systems. The most important requirement is effectiveness of the codes to detect bit flips. We meet this goal through AN codes, which exhibit better and adaptable error detection capabilities than those found in today's hardware. The second most important goal is efficiency in terms of coding latency. We meet this by introducing a fundamental performance improvements to AN codes, and by vectorizing both chosen codes' operations. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the remaining data management system and the user can stay oblivious of any error detection. This includes both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing, with only very little impact on query latency. AN coding allows to recode intermediate results with virtually no performance penalty. We support our claims by providing exhaustive runtime and throughput measurements throughout the whole thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as in-memory data management systems. Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.:1 INTRODUCTION
1.1 Contributions of this Thesis
1.2 Outline
2 PROBLEM DESCRIPTION AND RELATED WORK
2.1 Reliable Data Management on Reliable Hardware
2.2 The Shift Towards Unreliable Hardware
2.3 Hardware-Based Mitigation of Bit Flips
2.4 Data Management System Requirements
2.5 Software-Based Techniques For Handling Bit Flips
2.5.1 Operating System-Level Techniques
2.5.2 Compiler-Level Techniques
2.5.3 Application-Level Techniques
2.6 Summary and Conclusions
3 ANALYSIS OF CODING TECHNIQUES
3.1 Selection of Error Codes
3.1.1 Hamming Coding
3.1.2 XOR Checksums
3.1.3 AN Coding
3.1.4 Summary and Conclusions
3.2 Probabilities of Silent Data Corruption
3.2.1 Probabilities of Hamming Codes
3.2.2 Probabilities of XOR Checksums
3.2.3 Probabilities of AN Codes
3.2.4 Concrete Error Models
3.2.5 Summary and Conclusions
3.3 Throughput Considerations
3.3.1 Test Systems Descriptions
3.3.2 Vectorizing Hamming Coding
3.3.3 Vectorizing XOR Checksums
3.3.4 Vectorizing AN Coding
3.3.5 Summary and Conclusions
3.4 Comparison of Error Codes
3.4.1 Effectiveness
3.4.2 Efficiency
3.4.3 Runtime Adaptability
3.5 Performance Optimizations for AN Coding
3.5.1 The Modular Multiplicative Inverse
3.5.2 Faster Softening
3.5.3 Faster Error Detection
3.5.4 Comparison to Original AN Coding
3.5.5 The Multiplicative Inverse Anomaly
3.6 Summary
4 BIT FLIP DETECTING STORAGE
4.1 Column Store Architecture
4.1.1 Logical Data Types
4.1.2 Storage Model
4.1.3 Data Representation
4.1.4 Data Layout
4.1.5 Tree Index Structures
4.1.6 Summary
4.2 Hardened Data Storage
4.2.1 Hardened Physical Data Types
4.2.2 Hardened Lightweight Compression
4.2.3 Hardened Data Layout
4.2.4 UDI Operations
4.2.5 Summary and Conclusions
4.3 Hardened Tree Index Structures
4.3.1 B-Tree Verification Techniques
4.3.2 Justification For Further Techniques
4.3.3 The Error Detecting B-Tree
4.4 Summary
5 BIT FLIP DETECTING QUERY PROCESSING
5.1 Column Store Query Processing
5.2 Bit Flip Detection Opportunities
5.2.1 Early Onetime Detection
5.2.2 Late Onetime Detection
5.2.3 Continuous Detection
5.2.4 Miscellaneous Processing Aspects
5.2.5 Summary and Conclusions
5.3 Hardened Intermediate Results
5.3.1 Materialization of Hardened Intermediates
5.3.2 Hardened Bitmaps
5.4 Summary
6 END-TO-END EVALUATION
6.1 Prototype Implementation
6.1.1 AHEAD Architecture
6.1.2 Diversity of Physical Operators
6.1.3 One Concrete Operator Realization
6.1.4 Summary and Conclusions
6.2 Performance of Individual Operators
6.2.1 Selection on One Predicate
6.2.2 Selection on Two Predicates
6.2.3 Join Operators
6.2.4 Grouping and Aggregation
6.2.5 Delta Operator
6.2.6 Summary and Conclusions
6.3 Star Schema Benchmark Queries
6.3.1 Query Runtimes
6.3.2 Improvements Through Vectorization
6.3.3 Storage Overhead
6.3.4 Summary and Conclusions
6.4 Error Detecting B-Tree
6.4.1 Single Key Lookup
6.4.2 Key Value-Pair Insertion
6.5 Summary
7 SUMMARY AND CONCLUSIONS
7.1 Future Work
A APPENDIX
A.1 List of Golden As
A.2 More on Hamming Coding
A.2.1 Code examples
A.2.2 Vectorization
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
LIST OF LISTINGS
LIST OF ACRONYMS
LIST OF SYMBOLS
LIST OF DEFINITION