3 research outputs found

    An evaluation of current SIMD programming models for C++

    SIMD extensions were added to microprocessors in the mid-1990s to speed up data-parallel code through vectorization. Unfortunately, the SIMD programming model has barely evolved, and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to writing efficient and portable SIMD code have been proposed. In this work, we evaluate current programming models for the C++ language that claim to simplify SIMD programming while maintaining high performance. The proposals were assessed by implementing two kernels: one standard floating-point benchmark and one real-world integer-based application, both highly data parallel. Results show that the proposed solutions perform well for the floating-point kernel, achieving close to the maximum possible speed-up. For the real-world application, the programming models exhibit significant performance gaps due to data type issues, missing template support and other problems discussed in this paper.

    Octo-Tiger: Binary star systems with HPX on Nvidia P100

    Stellar mergers between two suns are a significant field of study since they can lead to astrophysical phenomena such as type Ia supernovae. Octo-Tiger simulates merging stars by computing self-gravitating astrophysical fluids. By relying on the high-level library HPX for parallelization and Vc for vectorization, Octo-Tiger combines high performance with ease of development. For accurate simulations, Octo-Tiger requires massive computational resources. To improve hardware utilization, we introduce a stencil-based approach for computing the gravitational field using the fast multipole method. This approach was tailored for machines with wide vector units like Intel's Knights Landing or modern GPUs. Our implementation targets AVX512-enabled processors and is backward compatible with older vector extensions (AVX2, AVX, SSE). We further extended our approach to make use of available NVIDIA GPUs as coprocessors. We developed a tasking system that processes critical compute kernels on the GPU or the processor, depending on their utilization. Using the stencil-based fast multipole method, we gain a consistent speedup on all platforms over the classical interaction-list-based implementation. On an Intel Xeon Phi 7210, we achieve a speedup of 1.9x. On a heterogeneous node with an Intel Xeon E5-2690 v3, we can obtain a speedup of 1.46x by adding an NVIDIA P100 GPU.

    Efficient Processing of Range Queries in Main Memory

    Database systems employ index structures as a means to accelerate search queries. Over the last years, the research community has proposed many different in-memory approaches that, unlike disk-based systems, optimize for cache misses rather than disk I/O and make use of the increased parallel capabilities of modern CPUs. However, these techniques mainly focus on single-key lookups and neglect equally important range queries. Range queries are a ubiquitous operator in data management, commonly used in numerous domains such as genomic analysis, sensor networks, or online analytical processing. The main goal of this dissertation is thus to improve the capabilities of main-memory database systems with regard to executing range queries. To this end, we first propose a cache-optimized, updateable main-memory index structure, the cache-sensitive skip list, which targets the execution of range queries on single database columns. Second, we study the performance of multidimensional range queries on modern hardware, where data are stored in main memory and processors support SIMD instructions and multi-threading. We re-evaluate a previous rule of thumb suggesting that, on disk-based systems, scans outperform index structures for selectivities of approximately 15-20% or more. To increase the practical relevance of our analysis, we also contribute a novel benchmark consisting of several realistic multidimensional range queries applied to real-world genomic data. Third, based on the outcomes of our experimental analysis, we devise a novel, fast and space-efficient main-memory index structure, the BB-Tree, which supports multidimensional range and point queries and provides a parallel search operator that leverages the multi-threading capabilities of modern CPUs.