
    FPGA-based Query Acceleration for Non-relational Databases

    Database management systems are an integral part of today’s everyday life. Trends like smart applications, the Internet of Things, and business and social networks require applications to deal efficiently with data in various data models close to the underlying domain. Therefore, non-relational database systems provide a wide variety of database models, such as graphs and documents. However, current non-relational database systems face performance challenges due to the end of Dennard scaling and, consequently, of CPU performance scaling. Meanwhile, FPGAs have gained traction as accelerators for data management. Our goal is to tackle the performance challenges of non-relational database systems with FPGA acceleration and, at the same time, to address the design challenges of FPGA acceleration itself. We therefore split this thesis into two main lines of work: graph processing and flexible data processing. Because of the lack of benchmarking practices for graph processing accelerators, we propose GraphSim, which reproduces the runtimes of such accelerators based on a memory access model of each approach. Through this simulation environment, we identify three performance-critical accelerator properties: asynchronous graph processing, a compressed graph data structure, and multi-channel memory. Since these properties have not previously been combined in one system, we propose GraphScale, the first scalable, asynchronous graph processing accelerator working on a compressed graph; it outperforms all state-of-the-art graph processing accelerators. Focusing on accelerator flexibility, we propose PipeJSON, the first FPGA-based JSON parser for arbitrary JSON documents. PipeJSON parses at line speed, outperforming the fastest vectorized parsers for CPUs. Lastly, we propose GraphMatch, a subgraph query processing accelerator that outperforms state-of-the-art CPU systems for subgraph query processing and can flexibly switch queries at runtime within a matter of clock cycles.
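
    As a rough illustration of two of the properties identified above, the sketch below (not taken from GraphSim or GraphScale) builds a compressed sparse row (CSR) graph and runs an asynchronous single-source shortest-path sweep in plain Python, where updated distances become visible within the same iteration; the graph, weights, and algorithm choice are toy assumptions for the example.

```python
# Hedged sketch: a compressed (CSR) graph data structure plus asynchronous
# processing, where a vertex processed later in the same sweep already sees
# the distances updated earlier in that sweep. Toy data for illustration only.

def to_csr(num_vertices, edges):
    """Build a CSR representation: per-vertex offsets + flat neighbor array."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    neighbors = [0] * len(edges)
    cursor = offsets[:]                      # next free slot per source vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

def async_sssp(num_vertices, offsets, neighbors, weights, source):
    """Bellman-Ford-style SSSP; updates are applied in place (asynchronous)."""
    INF = float("inf")
    dist = [INF] * num_vertices
    dist[source] = 0
    changed = True
    while changed:                           # one sweep per iteration
        changed = False
        for v in range(num_vertices):
            if dist[v] == INF:
                continue
            for i in range(offsets[v], offsets[v + 1]):
                u, w = neighbors[i], weights[i]
                if dist[v] + w < dist[u]:
                    dist[u] = dist[v] + w    # visible immediately within the sweep
                    changed = True
    return dist

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]     # weights aligned with edge order
offsets, neighbors = to_csr(4, edges)
print(async_sssp(4, offsets, neighbors, [1.0, 4.0, 1.0, 1.0], source=0))
```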

    Distributed XML Query Processing

    While centralized query processing over collections of XML data stored at a single site is a well-understood problem, centralized query evaluation techniques are inherently limited in their scalability when presented with large collections (or a single, large document) and heavy query workloads. In the context of relational query processing, similar scalability challenges have been overcome by partitioning data collections, distributing them across the sites of a distributed system, and then evaluating queries in a distributed fashion, usually in a way that ensures locality between (sub-)queries and their relevant data. This thesis presents a suite of query evaluation techniques for XML data that follow a similar approach to address the scalability problems encountered by XML query evaluation. Due to the significant differences in data and query models between relational and XML query processing, it is not possible to directly apply distributed query evaluation techniques designed for relational data to the XML scenario. Instead, new distributed query evaluation techniques need to be developed. Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query processing is proposed. Based on a data partitioning model that supports both horizontal and vertical fragmentation steps (or any combination of the two), XML collections are fragmented and distributed across the sites of a distributed system. Then, a suite of distributed query evaluation strategies is proposed. These query evaluation techniques ensure locality between each fragment of the collection and the parts of the query corresponding to the data in this fragment. Special attention is paid to scalability and query performance, which is achieved by ensuring a high degree of parallelism during distributed query evaluation and by avoiding access to irrelevant portions of the data. For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides several alternative approaches for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is necessary to predict and compare the expected performance of each of these alternatives. In this work, this is accomplished through a query optimization technique based on a distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is fragmented to the demands of the query workload evaluated over this collection. To evaluate the performance impact of the distributed query evaluation techniques proposed in this thesis, the techniques were implemented within a production-quality XML database system. Based on this implementation, a thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation techniques introduced here lead to significant improvements in query performance and scalability both when compared to centralized techniques and when compared to existing distributed query evaluation techniques.
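
    To make the fragmentation model concrete, here is a minimal sketch, assuming a toy collection and made-up element names ('order', 'customer', 'items'), of what horizontal fragmentation (whole documents spread across sites) and vertical fragmentation (subtrees cut out and replaced by proxies) could look like; it illustrates the general idea, not the partitioning model defined in the thesis.

```python
# Hedged sketch: horizontal and vertical fragmentation of a tiny XML collection.
import xml.etree.ElementTree as ET

docs = [
    "<order id='1'><customer>Ann</customer><items><item>pen</item></items></order>",
    "<order id='2'><customer>Bob</customer><items><item>ink</item></items></order>",
    "<order id='3'><customer>Cal</customer><items><item>pad</item></items></order>",
]

def horizontal_fragment(xml_docs, num_sites):
    """Assign whole documents to sites, here simply by the root 'id' attribute."""
    sites = [[] for _ in range(num_sites)]
    for doc in xml_docs:
        root = ET.fromstring(doc)
        sites[int(root.get("id")) % num_sites].append(doc)
    return sites

def vertical_fragment(xml_doc, split_paths):
    """Cut out the subtrees named in split_paths into separate fragments,
    leaving a 'proxy' element behind so the fragments can be re-joined."""
    root = ET.fromstring(xml_doc)
    fragments = {}
    for path in split_paths:                 # single-step paths for simplicity
        child = root.find(path)
        if child is not None:
            fragments[path] = ET.tostring(child, encoding="unicode")
            root.remove(child)
            root.append(ET.Element("proxy", ref=path))
    fragments["root"] = ET.tostring(root, encoding="unicode")
    return fragments

print(horizontal_fragment(docs, num_sites=2))
print(vertical_fragment(docs[0], split_paths=["items"]))
```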

    Evaluation and performance of reading from big data formats

    The emergence of new application profiles has caused a steep surge in the volume of data generated nowadays. Data heterogeneity is a modern trend, as unstructured types of data, such as videos and images, and semi-structured types, such as JSON and XML files, are becoming increasingly widespread. Consequently, new challenges arise related to analyzing and extracting important insights from huge bodies of information. The field of big data analytics has been developed to address these issues. Performance plays a key role in analytical scenarios, as it empowers applications to generate value in a more efficient and less time-consuming way. In this context, files are used to persist large quantities of information, which can be accessed later by analytic queries. Text files have the advantage of providing an easier interaction with the end user, whereas binary files offer structures that enhance data access. Among the latter, Apache ORC and Apache Parquet are formats with characteristics such as column-oriented organization and data compression, which are used to achieve better query performance. The objective of this project is to assess the usage of such files by SAP Vora, a distributed database management system, in order to identify processing techniques used in big data analytics scenarios and apply them to improve the performance of queries executed over CSV files in Vora. Two techniques were employed to achieve this goal: file pruning, which allows Vora’s relational engine to skip files containing no information relevant to the query, and block pruning, which, while processing a file, disregards individual blocks that contain no data targeted by the query. Results demonstrate that these modifications enhance the efficiency of analytical workloads executed over CSV files in Vora, thus narrowing the performance gap between queries over this format and those targeting formats tailored for big data scenarios, such as Apache Parquet and Apache ORC. The project was developed during an internship at SAP in Walldorf, Germany.
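
    The following sketch, with invented file names and min/max metadata, illustrates how file pruning and block pruning can work in principle; it is not SAP Vora code, just a minimal model of skipping files and blocks whose value ranges cannot match the query predicate.

```python
# Hedged sketch: file pruning and block pruning driven by toy min/max metadata.

# Per-file metadata: min/max of a key column, plus per-block min/max ranges.
files = {
    "sales_2019.csv": {"min": 100, "max": 499,
                       "blocks": [(100, 199), (200, 299), (300, 499)]},
    "sales_2020.csv": {"min": 500, "max": 899,
                       "blocks": [(500, 599), (600, 749), (750, 899)]},
}

def prune(files_meta, lo, hi):
    """Return, per relevant file, the indexes of blocks whose [min, max] range
    can overlap the query range [lo, hi]; everything else is skipped."""
    plan = {}
    for name, meta in files_meta.items():
        if meta["max"] < lo or meta["min"] > hi:        # file pruning
            continue
        blocks = [i for i, (bmin, bmax) in enumerate(meta["blocks"])
                  if not (bmax < lo or bmin > hi)]      # block pruning
        if blocks:
            plan[name] = blocks
    return plan

# Query: WHERE key BETWEEN 620 AND 700 -> only sales_2020.csv, block 1 is read.
print(prune(files, 620, 700))
```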

    Integrating Skips and Bitvectors for List Intersection

    This thesis examines space-time optimizations of in-memory search engines. Search engines can answer queries quickly, but this is accomplished using significant resources in the form of multiple machines running concurrently. Improving the performance of search engines means reducing the resource costs, such as hardware, energy, and cooling. These saved resources can then be used to improve the effectiveness of the search engine or provide additional value to the system. We improve the space-time performance for search engines in the context of in-memory conjunctive intersection of ordered document identifier lists. We show that reordering of document identifiers can produce dense regions in these lists, where bitvectors can be used to improve the efficiency of conjunctive list intersection. Since the process of list intersection is a fundamental building block and a major performance bottleneck for search engines, this work will be important for all search engine researchers and developers. Our results are presented in three stages. First, we show how to combine multiple existing techniques for list intersection to improve space-time performance. We combine bitvectors for large lists with skips over compressed values for the other lists. When the skips are large and overlaid on the compressed lists, space-time performance is superior to existing techniques, such as using skips or bitvectors separately. Second, we show that grouping documents by size and ordering by URL within groups combines the skewed clustering that results from document size ordering with the tight clustering that results from URL ordering. We propose a new semi-bitvector data structure that encodes the front of a list, including groups with large documents, as a bitvector and the rest of the list as skips over compressed values. This combination produces significant space-time performance gains on top of the gains from the first stage. Third, we show how partitioning by document size into separate indexes can also produce high-density regions that can be exploited by bitvectors, resulting in benefits similar to grouping by document size within one index. This partitioning technique requires no modification of the intersection algorithms, and it is therefore broadly applicable. We further show that any of our partitioning approaches can be combined with semi-bitvectors and grouping within each partition to effectively exploit skewed clustering and tight clustering in our dataset. A hierarchy of partitioning approaches may be required to exploit clustering in very large document collections.
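
    As a small illustration of the building blocks discussed above, the sketch below (toy data, not the thesis implementation) intersects sorted document-identifier lists using binary-search skips for the smaller lists and a bitvector membership test for the largest one.

```python
# Hedged sketch: conjunctive intersection with skips plus a bitvector filter.
from bisect import bisect_left

def to_bitvector(doc_ids, universe):
    bits = bytearray(universe)         # one byte per doc id keeps the sketch simple
    for d in doc_ids:
        bits[d] = 1
    return bits

def intersect_skip(a, b):
    """Intersect two sorted lists; the longer one is probed with binary-search
    skips instead of being scanned linearly."""
    if len(a) > len(b):
        a, b = b, a
    out, pos = [], 0
    for d in a:
        pos = bisect_left(b, d, pos)   # skip ahead in the longer list
        if pos < len(b) and b[pos] == d:
            out.append(d)
    return out

def intersect_bitvector(candidates, bits):
    """Final filter against the largest list: one O(1) membership test per id."""
    return [d for d in candidates if bits[d]]

small  = [3, 9, 27, 50, 80]
medium = [1, 3, 9, 10, 27, 42, 50, 51, 80, 93]
large  = to_bitvector(range(0, 100, 3), universe=100)   # every third doc id

print(intersect_bitvector(intersect_skip(small, medium), large))  # -> [3, 9, 27]
```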

    A DNN Accelerator and Load-Balancing Techniques Tailored for Accelerating Memory-Intensive Operations

    Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems), August 2022. Advisor: 안정호 (Jung Ho Ahn).
    Deep neural networks (DNNs) are used in various fields, such as image classification, natural language processing, and speech recognition, based on recognition accuracy that approximates that of humans. Due to the continuous development of DNNs, many accelerators have been introduced to process convolution (CONV) and general matrix multiplication (GEMM) operations, which account for the greatest share of computational demand. However, as accelerator research has concentrated on compute-intensive operations, memory-intensive operations now account for a much larger share of execution time than they did in the past. In convolutional neural network (CNN) inference, recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE) to reduce the computational costs of CONV. However, existing area-efficient CNN accelerators are sub-optimal for these latest CNN models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse that can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also strongly depends on the nearby CONV layers, making effective pipelining a daunting task. Therefore, DW-CONV and SE account for only 10% of all operations but become memory-bandwidth bound, consuming more than 60% of the processing time on systolic-array-based accelerators. During transformer training, the execution times of memory-intensive operations such as softmax, layer normalization, GeLU, and the context and attention layers have increased because conventional accelerators have improved their computational performance dramatically. In addition, with the latest trend toward increasing sequence lengths, the softmax, context, and attention layers have much more of an influence, as their data sizes grow quadratically with sequence length. Thus, these layers take up to 80% of the execution time. In this thesis, we propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We suggest a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers to meet the high memory bandwidth requirement. We augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV and thereby achieving maximum utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6× and reduces energy consumption by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2 with only a 9% area overhead compared to the baseline. Then, we propose load-balancing techniques that partition the multiple processing element tiles inside a DNN accelerator into clusters for transformer training acceleration. Traffic shaping alleviates temporal fluctuations in DRAM bandwidth by handling the processing element tiles within a cluster synchronously while running different clusters asynchronously. Resource sharing reduces the execution time of compute-intensive operations by simultaneously executing the matrix units and vector units of all clusters when compute- and memory-intensive operations run on different clusters. Our evaluation shows that traffic shaping and resource sharing improve performance by up to 1.27× for BERT-Large training.
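
    The back-of-envelope sketch below illustrates why the memory-bound share grows with sequence length: the dense GEMMs scale roughly with L·d², while the softmax/attention-score data scales with L²; all hardware figures and constant factors are assumed placeholders, not numbers from the thesis.

```python
# Hedged sketch: rough time split of one transformer layer between the matrix
# unit (compute-bound GEMMs) and memory-bound streaming ops, as a function of
# sequence length L. All constants below are simplified assumptions.

def layer_time_split(L, d=1024, heads=16, bytes_per_elem=2,
                     peak_flops=100e12,   # matrix-unit throughput, FLOP/s (assumed)
                     mem_bw=1e12):        # DRAM bandwidth, B/s (assumed)
    # Compute-bound part: QKV/output projections + FFN (~24*L*d^2 FLOPs)
    # plus the two attention GEMMs (~4*L^2*d FLOPs).
    matrix_time = (24 * L * d**2 + 4 * L**2 * d) / peak_flops
    # Memory-bound part: softmax streams the L x L score matrix per head a few
    # times; layer norm / GeLU stream L x d activations a few times.
    mem_bytes = 3 * heads * L**2 * bytes_per_elem + 8 * L * d * bytes_per_elem
    mem_time = mem_bytes / mem_bw
    return matrix_time, mem_time, mem_time / (matrix_time + mem_time)

for L in (512, 2048, 8192):
    m, v, share = layer_time_split(L)
    print(f"L={L:5d}  matrix={m*1e3:7.2f}ms  memory-bound={v*1e3:7.2f}ms  "
          f"share={share:.0%}")
```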

    Exploring a striped XML world

    EXtensible Markup Language (XML) was designed as a markup language for structuring, storing and transporting data on the World Wide Web. The focus of XML is on data content; arbitrary markup is used to describe data. This versatile, self-describing data representation has established XML as the universal data format and the de facto standard for information exchange on the Web. This has gradually given rise to the need for efficient storage and querying of large XML repositories. To that end, we propose a new model for building a native XML store which is based on a generalisation of vertical decomposition. Nodes of a document satisfying the same label path are extracted and stored together in a single container, a Stripe. Stripes make use of a labelling scheme that allows us to maintain full structural information. Over this new representation, we introduce various evaluation techniques, which allow us to handle a large fragment of XPath 2.0. We also focus on the optimisation opportunities that arise from our decomposition model during every query evaluation phase. During query validation, we present an input minimisation process that exploits the proposed model to identify, in terms of Stripes, only the input that is relevant to the given query. We also define query equivalence rules for query rewriting over our proposed model. Finally, during query optimisation, we examine whether and under which circumstances certain evaluation algorithms can be replaced by others with lower I/O and/or CPU cost. We propose three storage schemes under our general decomposition technique. The schemes differ in the compression method applied to the structural part of the XML document. The first storage scheme imposes no compression. The second exploits structural regularities of the document to minimise storage and, thus, I/O cost during query evaluation. The third performs structure-agnostic compression of the document structure, which minimises storage regardless of the actual XML structure. We experiment on XML repositories of varying size, recursion and structural regularity. We consider query input size, execution plan size and query response time as metrics for our experimental results. We process query workloads by applying each of the proposed optimisations in isolation and then all of their combinations. In addition, we apply the same execution pipeline to all proposed storage schemes. As a reference for our proposed query evaluation pipeline, we use the current state-of-the-art system for XML query processing. Our results demonstrate that:
    • Our proposed data model provides the infrastructure for efficiently selecting the parts of the document that are relevant to a given query.
    • The application of query rewriting, combined with input minimisation, reduces query input size as well as the number of physical operators used. In addition, when evaluation algorithms are specialised to the decomposition method, query response time is further reduced.
    • Query evaluation performance is largely affected by the storage schemes, which are closely related to the structural properties of the data. The achieved compression ratio greatly affects storage size and, therefore, query response times.
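
    A minimal sketch of the striping idea follows, assuming a toy document: nodes are grouped by their root-to-node label path and tagged with a (pre-order, subtree-size, level) label, one common labelling scheme used here purely for illustration rather than the exact scheme of the thesis.

```python
# Hedged sketch: group document nodes by label path ("stripes") and attach a
# (pre-order, size, level) label that preserves structural relationships.
import xml.etree.ElementTree as ET
from collections import defaultdict

def stripe(xml_text):
    root = ET.fromstring(xml_text)
    stripes = defaultdict(list)          # label path -> [(pre, size, level), ...]
    counter = [0]

    def visit(node, path, level):
        pre = counter[0]                 # pre-order position of this node
        counter[0] += 1
        for child in node:
            visit(child, path + "/" + child.tag, level + 1)
        size = counter[0] - pre - 1      # number of descendants
        stripes[path].append((pre, size, level))

    visit(root, "/" + root.tag, 0)
    return stripes

doc = "<a><b><c/><c/></b><b><c/></b></a>"
for path, nodes in stripe(doc).items():
    print(path, nodes)
```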

    A Search Engine Architecture Based on Collection Selection

    In this thesis, we present a distributed architecture for a Web search engine based on the concept of collection selection. We introduce a novel approach to partitioning the collection of documents, able to greatly improve the effectiveness of standard collection selection techniques (CORI), and a new selection function outperforming the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents at the same time. Incidentally, our partitioning strategy is able to identify documents that can be safely moved out of the main index (into a supplemental index) with a minimal loss in result accuracy. In our tests, we could move 50% of the collection to the supplemental index with a minimal loss in recall. By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant servers are polled, thereby reducing the average computing cost of solving a query. We introduce a novel strategy that uses the instantaneous load at each server to drive query routing. Also, we describe a new approach to caching, able to incrementally improve the quality of the stored results. Our caching strategy is effective both in reducing computing load and in improving result quality. By combining these innovations, we can achieve extremely high precision with a reduced load compared to full query broadcasting. Our system can cover 65% of the results offered by a centralized reference index (competitive recall at 5) with a computing load of only 15.6%, i.e., a peak of 156 queries out of a shifting window of 1000 queries. This is about 1/4 of the peak load reached when broadcasting queries. With a slightly higher load (24.6%), the system can cover 78% of the reference results. Overall, the proposed architecture presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in the face of a dramatic reduction in computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries and more documents.
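
    To illustrate the query-vector idea, the sketch below builds QV representations from a toy query log and routes an incoming query to the most similar partition using a simple cosine score; the log, the fixed partitioning, and the selection function are stand-ins, not the CORI-based pipeline or the co-clustering strategy used in the thesis.

```python
# Hedged sketch: query-vector (QV) document model and collection selection.
from collections import Counter, defaultdict
from math import sqrt

query_log = [                      # (query terms, retrieved doc id)
    (("cheap", "flights"), "d1"), (("cheap", "flights"), "d2"),
    (("python", "tutorial"), "d3"), (("python", "pandas"), "d3"),
    (("flights", "rome"), "d2"),
]

# Query-vectors: each document is described by the queries that retrieved it.
qv = defaultdict(Counter)
for terms, doc in query_log:
    qv[doc].update(terms)

# Assume documents were already clustered into partitions (fixed here by hand).
partitions = {"server_A": ["d1", "d2"], "server_B": ["d3"]}
partition_profiles = {s: sum((qv[d] for d in docs), Counter())
                      for s, docs in partitions.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_servers(query_terms, k=1):
    """Collection selection: poll only the k most promising partitions."""
    q = Counter(query_terms)
    ranked = sorted(partition_profiles,
                    key=lambda s: cosine(q, partition_profiles[s]), reverse=True)
    return ranked[:k]

print(select_servers(("cheap", "rome")))     # -> ['server_A']
```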

    Storage Format Selection and Optimization for Materialized Intermediate Results in Data-Intensive Flows

    Modern organizations produce and collect large volumes of data that need to be processed repeatedly and quickly to gain business insights. For such processing, Data-intensive Flows (DIFs) are typically deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), thus wasting computational resources and increasing the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which, if done properly, reduces cost and saves computational resources. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost. In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can handle multiple and conflicting quality metrics. Next, we study the behavior of the DIF operators that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach that decides the storage layout for materialized results based on the subsequent operation types. Although the heuristic rules improve the cost in general, they do not consider the amount of data read when making the choice, which can lead to a wrong decision. Thus, we design a cost model that is capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate result under different storage layouts and chooses the one with the minimum cost. The results show that storage layouts help to reduce the loading time of materialized results and, overall, improve the performance of DIFs. The thesis also focuses on the optimization of the configurable parameters of hybrid layouts. We propose ATUN-HL (Auto TUNing Hybrid Layouts), which, based on the same cost model and given the workload and the characteristics of the data, finds the optimal values for the configurable parameters of hybrid layouts (i.e., Parquet). Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model helps to devise an approach for fine-tuning the parallelism by deciding the number of tasks and machines to process the data. Thus, the cost model proposed in this thesis enables choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs.
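
    A minimal sketch of such a cost model is shown below, assuming simplified formulas and made-up data characteristics: it estimates the bytes read under horizontal, vertical, and hybrid layouts and picks the cheapest. It is an illustration of the idea, not the thesis cost model.

```python
# Hedged sketch: pick a storage layout for a materialized intermediate result
# from rough I/O estimates based on data and workload characteristics.

def io_cost(layout, rows, cols, col_width, read_cols, selectivity,
            row_group=100_000):
    """Rough bytes read for a scan that projects `read_cols` of `cols` columns
    and keeps a `selectivity` fraction of rows."""
    full = rows * cols * col_width
    if layout == "horizontal":           # row store: reads whole rows
        return full
    if layout == "vertical":             # column store: only projected columns
        return full * (read_cols / cols)
    if layout == "hybrid":               # e.g. Parquet: projected columns, but
        # whole row groups; assumes selected rows cluster into few groups
        groups = max(1, round(rows * selectivity / row_group))
        return groups * row_group * read_cols * col_width
    raise ValueError(layout)

def choose_layout(**workload):
    costs = {l: io_cost(l, **workload)
             for l in ("horizontal", "vertical", "hybrid")}
    return min(costs, key=costs.get), costs

# A wide intermediate result read by an operator that projects 3 of 50 columns.
print(choose_layout(rows=10_000_000, cols=50, col_width=8,
                    read_cols=3, selectivity=0.02))
```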
