
    Analytical Query Processing Using Heterogeneous SIMD Instruction Sets

    Numerous applications gather increasing amounts of data, which have to be managed and queried. Different hardware developments help to meet this challenge. The growing capacity of main memory enables database systems to keep all their data in memory. Additionally, the hardware landscape is becoming more diverse. A plethora of homogeneous and heterogeneous co-processors is available, where heterogeneity refers not only to different computing power but also to different instruction set architectures. For instance, modern Intel¼ CPUs offer different instruction sets supporting the Single Instruction Multiple Data (SIMD) paradigm, e.g. SSE, AVX, and AVX-512. Database systems have started to exploit SIMD to increase performance. However, this is still a challenging task, because existing algorithms were mainly developed for scalar processing and because there is a huge variety of different instruction sets, which were never standardized and have no unified interface. Porting a system to another hardware architecture therefore requires completely rewriting the source code, even if those architectures are not fundamentally different and are designed by the same company. Moreover, operations on large registers, which are the core principle of SIMD processing, behave counter-intuitively in several cases. This is especially true for analytical query processing, where different memory access patterns and data dependencies caused by data compression challenge the limits of the SIMD principle. Finally, there are physical constraints on the use of such instructions, which affect CPU frequency scaling and are further influenced by the use of multiple cores. This is because the supply power of a CPU is limited, such that not all transistors can be powered at the same time. Hence, there is a complex relationship between performance and power, and therefore also between performance and energy consumption. This thesis addresses the specific challenges introduced by the application of SIMD in general and by the heterogeneity of SIMD ISAs in particular. Hence, the goal of this thesis is to exploit the potential of heterogeneous SIMD ISAs for increasing both the performance and the energy efficiency of analytical query processing.
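To make the ISA heterogeneity concrete, here is a minimal sketch (our own illustration, not code from the thesis) of the same column-wise addition written for scalar execution, 128-bit SSE, and 256-bit AVX; each variant needs different register types and intrinsic names, which is exactly the porting problem described above.

```cpp
// Hypothetical illustration: one logical operation, three x86 SIMD widths.
// Compile the AVX variant with -mavx (and the SSE variant with at least -msse).
#include <immintrin.h>
#include <cstddef>

void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {                        // 128-bit registers: 4 floats
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];            // scalar tail
}

void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                        // 256-bit registers: 8 floats
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];            // scalar tail
}
```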

    Efficient query processing for scalable web search

    Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists, as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query trade-offs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
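As a concrete illustration of the DAAT strategy reviewed in the survey, the sketch below (our own simplified example, not code from the survey) merges per-term posting lists sorted by document identifier while maintaining a top-k min-heap; dynamic pruning algorithms such as WAND and BMW speed up this loop by skipping documents whose score upper bound cannot enter the heap.

```cpp
// Minimal DAAT top-k scoring sketch. Postings are (docid, term weight) pairs
// sorted by docid; a document's score is the sum of its matching term weights.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Posting { uint32_t doc; float weight; };
using PostingList = std::vector<Posting>;

std::vector<std::pair<float, uint32_t>> daat_topk(const std::vector<PostingList>& lists,
                                                  std::size_t k) {
    std::vector<std::size_t> cur(lists.size(), 0);                 // one cursor per list
    // Min-heap of (score, doc) holding the current top-k candidates.
    std::priority_queue<std::pair<float, uint32_t>,
                        std::vector<std::pair<float, uint32_t>>,
                        std::greater<>> heap;
    while (true) {
        // Pick the smallest docid among all cursors: the next document to score.
        uint32_t next = std::numeric_limits<uint32_t>::max();
        for (std::size_t t = 0; t < lists.size(); ++t)
            if (cur[t] < lists[t].size()) next = std::min(next, lists[t][cur[t]].doc);
        if (next == std::numeric_limits<uint32_t>::max()) break;   // all lists exhausted
        // Accumulate the score and advance every cursor that matched this document.
        float score = 0.0f;
        for (std::size_t t = 0; t < lists.size(); ++t)
            if (cur[t] < lists[t].size() && lists[t][cur[t]].doc == next) {
                score += lists[t][cur[t]].weight;
                ++cur[t];
            }
        // WAND/BMW would skip `next` when the sum of per-term upper bounds cannot
        // beat heap.top(); this exhaustive sketch always scores it.
        if (heap.size() < k) heap.push({score, next});
        else if (score > heap.top().first) { heap.pop(); heap.push({score, next}); }
    }
    std::vector<std::pair<float, uint32_t>> out;
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); } // ascending score
    return out;
}
```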

    Efficient Processing of Range Queries in Main Memory

    Database systems employ index structures as a means to accelerate search queries. Over the last years, the research community has proposed many different in-memory approaches that optimize cache misses instead of disk I/O, as opposed to disk-based systems, and make use of the grown parallel capabilities of modern CPUs. However, these techniques mainly focus on single-key lookups and neglect the equally important range queries. Range queries are a ubiquitous operator in data management, commonly used in numerous domains such as genomic analysis, sensor networks, or online analytical processing. The main goal of this dissertation is thus to improve the capabilities of main-memory database systems with regard to executing range queries. To this end, we first propose a cache-optimized, updateable main-memory index structure, the cache-sensitive skip list, which targets the execution of range queries on single database columns. Second, we study the performance of multidimensional range queries on modern hardware, where data are stored in main memory and processors support SIMD instructions and multi-threading. We re-evaluate a previous rule of thumb suggesting that, on disk-based systems, scans outperform index structures for selectivities of approximately 15-20% or more. To increase the practical relevance of our analysis, we also contribute a novel benchmark consisting of several realistic multidimensional range queries applied to real-world genomic data. Third, based on the outcomes of our experimental analysis, we devise a novel, fast and space-efficient main-memory index structure, the BB-Tree, which supports multidimensional range and point queries and provides a parallel search operator that leverages the multi-threading capabilities of modern CPUs.
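For readers unfamiliar with single-column range queries, the following sketch (a generic baseline, not the cache-sensitive skip list or the BB-Tree from the thesis) answers a range predicate over a sorted in-memory column with two binary searches and returns the qualifying position range.

```cpp
// Range query on a sorted in-memory column: return the positions of all values
// in [lo, hi]. Index structures such as skip lists or trees aim to answer the
// same predicate with better cache behaviour and support for updates.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t> range_query(const std::vector<int64_t>& column,
                                                int64_t lo, int64_t hi) {
    // column must be sorted ascending; the result is the half-open position
    // range [first, last) of values v with lo <= v <= hi.
    auto first = std::lower_bound(column.begin(), column.end(), lo);
    auto last  = std::upper_bound(column.begin(), column.end(), hi);
    return {static_cast<std::size_t>(first - column.begin()),
            static_cast<std::size_t>(last - column.begin())};
}
```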

    Managing tail latency in large scale information retrieval systems

    As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high-percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience, such as "how many queries take more than 200ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
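Tail latency is typically reported as a high percentile of the per-query response-time distribution; the helper below (a generic illustration, not code from the dissertation) computes such a percentile from a sample of observed latencies, e.g. the 95th or 99th.

```cpp
// Compute a latency percentile (e.g. p95, p99) from observed per-query times,
// using the nearest-rank definition: the smallest sample value such that at
// least p percent of the sample is less than or equal to it.
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <vector>

double latency_percentile(std::vector<double> latencies_ms, double p) {
    if (latencies_ms.empty() || p <= 0.0 || p > 100.0)
        throw std::invalid_argument("need a non-empty sample and 0 < p <= 100");
    std::sort(latencies_ms.begin(), latencies_ms.end());
    // Nearest-rank index: ceil(p/100 * n), converted to a 0-based offset.
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * latencies_ms.size()));
    return latencies_ms[rank - 1];
}

// Example: latency_percentile(times, 99.0) answers "what response time do 99%
// of queries beat?", the kind of question the dissertation argues we should ask.
```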

    NATSA: A Near-Data Processing Accelerator for Time Series Analysis

    Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic intensity, but it typically operates on large amounts of time series data. In current computing systems, this data needs to be moved between the off-chip memory units and the on-chip computation units to compute the matrix profile. This causes a major performance bottleneck, as data movement is extremely costly in terms of both execution time and energy. In this work, we present NATSA, the first Near-Data Processing accelerator for time series analysis. The key idea is to exploit modern 3D-stacked High Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile computation near memory, where time series data resides. NATSA provides three key benefits: 1) quickly computing the matrix profile for a wide range of applications by building specialized energy-efficient floating-point arithmetic processing units close to HBM, 2) improving the energy efficiency and execution time by reducing the need for data movement over slow and energy-hungry buses between the computation units and the memory units, and 3) analyzing time series data at scale by exploiting low-latency, high-bandwidth, and energy-efficient memory access provided by HBM. Our experimental evaluation shows that NATSA improves performance by up to 14.2x (9.9x on average) and reduces energy by up to 27.2x (19.4x on average) over the state-of-the-art multi-core implementation. NATSA also improves performance by 6.3x and reduces energy by 10.2x over a general-purpose NDP platform with 64 in-order cores. Comment: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020).
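To ground the terminology, the sketch below shows a naive, quadratic-time matrix profile computation (our own illustration; NATSA accelerates a far more efficient formulation in hardware): for every length-m subsequence it records the z-normalized Euclidean distance to its nearest non-trivially-matching subsequence.

```cpp
// Naive matrix profile: for each length-m subsequence, the z-normalized
// Euclidean distance to its nearest non-trivial neighbour. O(n^2 * m) time;
// assumes ts.size() >= m and no constant-valued windows (stdev > 0).
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

std::vector<double> matrix_profile(const std::vector<double>& ts, std::size_t m) {
    if (ts.size() < m) return {};
    const std::size_t n = ts.size() - m + 1;                // number of subsequences
    std::vector<double> mean(n), stdev(n);
    for (std::size_t i = 0; i < n; ++i) {                   // per-window statistics
        double s = 0.0, sq = 0.0;
        for (std::size_t k = 0; k < m; ++k) { s += ts[i + k]; sq += ts[i + k] * ts[i + k]; }
        mean[i]  = s / m;
        stdev[i] = std::sqrt(sq / m - mean[i] * mean[i]);
    }
    std::vector<double> profile(n, std::numeric_limits<double>::infinity());
    const std::size_t excl = m / 4;                         // trivial-match exclusion zone
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + excl + 1; j < n; ++j) {
            double d2 = 0.0;                                // z-normalized distance
            for (std::size_t k = 0; k < m; ++k) {
                double a = (ts[i + k] - mean[i]) / stdev[i];
                double b = (ts[j + k] - mean[j]) / stdev[j];
                d2 += (a - b) * (a - b);
            }
            double d = std::sqrt(d2);
            if (d < profile[i]) profile[i] = d;
            if (d < profile[j]) profile[j] = d;
        }
    }
    return profile;
}
```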

    Tigris: Architecture and Algorithms for 3D Perception in Point Clouds

    Machine perception applications are increasingly moving toward manipulating and processing 3D point clouds. This paper focuses on point cloud registration, a key primitive of 3D data processing widely used in high-level tasks such as odometry, simultaneous localization and mapping, and 3D reconstruction. As these applications are routinely deployed in energy-constrained environments, real-time and energy-efficient point cloud registration is critical. We present Tigris, an algorithm-architecture co-designed system specialized for point cloud registration. Through an extensive exploration of the registration pipeline design space, we find that, while different design points make vastly different trade-offs between accuracy and performance, KD-tree search is a common performance bottleneck, and thus is an ideal candidate for architectural specialization. While KD-tree search is inherently sequential, we propose an acceleration-amenable data structure and search algorithm that exposes different forms of parallelism of KD-tree search in the context of point cloud registration. The co-designed accelerator systematically exploits the parallelism while incorporating a set of architectural techniques that further improve the accelerator efficiency. Overall, Tigris achieves a 77.2× speedup and a 7.4× power reduction in KD-tree search over an RTX 2080 Ti GPU, which translates to a 41.7% registration performance improvement and a 3.0× power reduction. Comment: Published at MICRO-52 (52nd IEEE/ACM International Symposium on Microarchitecture); Tiancheng Xu and Boyuan Tian are co-primary authors.
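Since KD-tree search is the bottleneck Tigris targets, the sketch below shows a textbook sequential KD-tree build and nearest-neighbour query (not the accelerator's parallel formulation); the branch-and-prune recursion is what makes the search inherently sequential.

```cpp
// Minimal 3-D KD-tree: recursive median-split build plus a branch-and-prune
// nearest-neighbour query. A sequential baseline for illustration only.
#include <algorithm>
#include <array>
#include <limits>
#include <memory>
#include <vector>

using Point = std::array<float, 3>;

struct Node {
    Point p;
    int axis;
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> build(std::vector<Point>& pts, int lo, int hi, int depth = 0) {
    if (lo >= hi) return nullptr;
    int axis = depth % 3, mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    auto node = std::make_unique<Node>();
    node->p = pts[mid];
    node->axis = axis;
    node->left = build(pts, lo, mid, depth + 1);
    node->right = build(pts, mid + 1, hi, depth + 1);
    return node;
}

// Recursively search the subtree rooted at n, tracking the best squared distance.
void nearest(const Node* n, const Point& q, const Point*& best, float& best_d2) {
    if (!n) return;
    float d2 = 0.0f;
    for (int k = 0; k < 3; ++k) { float d = n->p[k] - q[k]; d2 += d * d; }
    if (d2 < best_d2) { best_d2 = d2; best = &n->p; }
    float diff = q[n->axis] - n->p[n->axis];
    const Node* first  = diff < 0 ? n->left.get()  : n->right.get();
    const Node* second = diff < 0 ? n->right.get() : n->left.get();
    nearest(first, q, best, best_d2);            // descend toward the query first
    if (diff * diff < best_d2)                   // prune the far side when possible
        nearest(second, q, best, best_d2);
}

// Usage: auto root = build(points, 0, static_cast<int>(points.size()));
//        const Point* best = nullptr; float d2 = std::numeric_limits<float>::max();
//        nearest(root.get(), query, best, d2);
```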

    A Study on Efficient Application Mapping onto Parallel Computing Accelerators (äžŠćˆ—èšˆçź—ă‚ąă‚Żă‚»ăƒ©ăƒŹăƒŒă‚żăžăźćŠč率的ăȘケプăƒȘă‚±ăƒŒă‚·ăƒ§ăƒłăƒžăƒƒăƒ”ăƒłă‚°ă«é–ąă™ă‚‹ç ”ç©¶)

    Doctoral dissertation, Nagasaki University (長掎性歊). Degree report number: Doctor (Engineering) Kou No. 3. Degree conferred: March 20, 2014 (Heisei 26). Doctorate awarded by coursework.

    Efficient Algorithms for Coastal Geographic Problems

    The increasing performance of computers has made it possible to algorithmically solve problems for which manual and possibly inaccurate methods were previously used. Nevertheless, one must still pay attention to the performance of an algorithm if huge datasets are used or if the problem is computationally difficult. Two geographic problems are studied in the articles included in this thesis. In the first problem the goal is to determine distances from points, called study points, to shorelines in predefined directions. Together with other information, mainly related to wind, these distances can be used to estimate wave exposure in different areas. In the second problem the input consists of a set of sites where water quality observations have been made and of the results of the measurements at the different sites. The goal is to select a subset of the observational sites in such a manner that water quality is still measured with sufficient accuracy when monitoring at the other sites is stopped to reduce economic cost. Most of the thesis concentrates on the first problem, known as the fetch length problem. The main challenge is that the two-dimensional map is represented as a set of polygons with millions of vertices in total, and the distances may also be computed for millions of study points in several directions. Efficient algorithms are developed for the problem, one of them approximate and the others exact except for rounding errors. The solutions also differ in that three of them are targeted for serial operation or for a small number of CPU cores, whereas one, together with its further developments, is suitable also for parallel machines such as GPUs. In the water quality problem, the given set of sites has a large number of possible subsets. In addition, the task involves time-consuming operations such as linear regression, which further limits how many subsets can be examined. The solution therefore relies on heuristics, which do not necessarily produce an optimal result.
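The geometric core of the fetch length problem is finding the nearest shoreline along a fixed direction; the sketch below (our own simplified illustration, not one of the thesis algorithms) intersects a ray cast from a study point with the shoreline's edge segments and returns the smallest positive hit distance.

```cpp
// Fetch length along one direction: cast a ray from the study point and return
// the distance to the closest intersected shoreline segment (infinity if none).
// Simplified illustration; assumes the shoreline is given as a list of segments.
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

struct Vec2 { double x, y; };

static double cross(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

double fetch_length(Vec2 p, double direction_rad,
                    const std::vector<std::pair<Vec2, Vec2>>& shoreline) {
    Vec2 d{std::cos(direction_rad), std::sin(direction_rad)};   // unit ray direction
    double best = std::numeric_limits<double>::infinity();
    for (const auto& seg : shoreline) {
        Vec2 e{seg.second.x - seg.first.x, seg.second.y - seg.first.y};
        Vec2 f{seg.first.x - p.x, seg.first.y - p.y};
        double denom = cross(d, e);
        if (std::abs(denom) < 1e-12) continue;                   // ray parallel to segment
        double t = cross(f, e) / denom;                          // distance along the ray
        double u = cross(f, d) / denom;                          // position on the segment
        if (t >= 0.0 && u >= 0.0 && u <= 1.0) best = std::min(best, t);
    }
    return best;                                                 // in map units, since |d| = 1
}
```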

    Exploiting multiple levels of parallelism of Convergent Cross Mapping

    Identifying causal relationships between variables remains an essential problem across various scientific fields. Such identification is particularly important but challenging in complex systems, such as those involving human behaviour, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embeddings of time series, convergent cross mapping (CCM) serves as an important method for addressing this problem. While powerful, CCM is computationally costly; moreover, CCM results are highly sensitive to several parameter values. Current best practice involves performing a systematic search over a range of parameter values, but this results in a high computational burden, which raises barriers to practical use. In light of both such challenges and the growing size of commonly encountered datasets from complex systems, inferring causality with confidence using CCM in a reasonable time becomes a major challenge. In this thesis, I investigate the performance associated with a variety of parallel techniques (CUDA, Thrust, OpenMP, MPI, Spark, etc.) to accelerate convergent cross mapping. The performance of each method was collected and compared across multiple experiments to further evaluate potential bottlenecks. Moreover, the work deployed and tested combinations of these techniques to more thoroughly exploit available computation resources. The results obtained from these experiments indicate that GPUs can only accelerate the CCM algorithm under certain circumstances; otherwise, the overhead of data transfer and communication can become the limiting bottleneck. On the other hand, in cluster computing, the MPI/OpenMP framework outperforms the Spark framework by more than one order of magnitude in terms of processing speed and provides more consistent performance for distributed computing. This also reflects the large size of the output from the CCM algorithm. However, Spark offers better cluster infrastructure management, ease of software engineering, and more ready handling of other aspects, such as node failure and data replication. Furthermore, combinations of GPU and cluster frameworks were deployed and compared on GPU/CPU clusters. A clear speedup can be achieved in the Spark framework, while extra time cost is incurred in the MPI/OpenMP framework; the underlying reason is that the code complexity imposed by GPU utilization cannot be readily offset in the MPI/OpenMP framework. Overall, the experimental results on parallelized solutions demonstrate a capacity for over an order of magnitude performance improvement when compared with the widely used current library rEDM. Such economies in computation time can speed learning and robust identification of causal drivers in complex systems. I conclude that these parallel techniques can achieve significant improvements, although the performance gain varies among techniques and frameworks. While the use of GPUs can accelerate the application, there are still constraints that must be taken into consideration, especially with regard to the input data scale; without proper usage, GPU use can even slow down the whole execution. Convergent cross mapping achieves its maximum speedup with the MPI/OpenMP framework, which is well suited to computation-intensive algorithms. By contrast, the Spark framework with integrated GPU accelerators still offers a lower execution cost compared to the pure Spark version, and is mainly suited to data-intensive problems.
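The state space reconstruction underlying CCM is a delay embedding; the sketch below (a generic illustration under the usual Takens-style definitions, not code from the thesis) builds the lagged vectors and computes the pairwise distance matrix that dominates the runtime, with the outer loop parallelized by an OpenMP directive as one of the shared-memory approaches the thesis evaluates.

```cpp
// Delay embedding for CCM: each reconstructed point is
// (x[t], x[t-tau], ..., x[t-(E-1)*tau]). The pairwise distances between these
// points are the expensive kernel that the parallel frameworks accelerate.
// Compile with -fopenmp to enable the parallel loop.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<std::vector<double>> delay_embed(const std::vector<double>& x,
                                             int E, int tau) {
    std::vector<std::vector<double>> M;
    for (std::size_t t = static_cast<std::size_t>((E - 1) * tau); t < x.size(); ++t) {
        std::vector<double> v(E);
        for (int k = 0; k < E; ++k) v[k] = x[t - k * tau];   // lagged coordinates
        M.push_back(std::move(v));
    }
    return M;
}

std::vector<std::vector<double>> pairwise_distances(
        const std::vector<std::vector<double>>& M) {
    const std::size_t n = M.size();
    std::vector<std::vector<double>> D(n, std::vector<double>(n, 0.0));
    #pragma omp parallel for schedule(static)                // shared-memory parallelism
    for (long long i = 0; i < static_cast<long long>(n); ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < M[i].size(); ++k) {
                double d = M[i][k] - M[j][k];
                s += d * d;
            }
            D[i][j] = std::sqrt(s);                          // Euclidean distance
        }
    }
    return D;
}
```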
    • 
