67 research outputs found
Elastic Dataflow Processing on the Cloud
Clouds have become an attractive platform for the large-scale processing of
modern applications on Big Data, especially due to the concept of elasticity,
which characterizes them: resources can be leased on demand and used for as
much time as needed, offering the ability to create virtual infrastructures
that change dynamically over time. Such applications often require processing
of complex queries that are expressed in a high-level language and are
typically transformed into data processing flows (dataflows). A logical
question that arises is whether elasticity affects dataflow execution and, if so, in
what way. It seems reasonable that execution is faster when more resources
are used, but the monetary cost is higher. This gives rise to the concept of
eco-elasticity, an additional kind of elasticity that comes from economics and
captures the trade-off between the response time of the system and the amount
of money paid for it, as influenced by the amount of resources used.
In this thesis, we approach the elasticity of clouds in a unified way that
combines both the traditional notion and eco-elasticity. This unified
elasticity concept is essential for the development of auto-tuned systems in
cloud environments. First, we demonstrate that eco-elasticity exists in several
common tasks that appear in practice and that it can be discovered using a simple,
yet highly scalable and efficient algorithm. Next, we present two cases of
auto-tuned algorithms that use the unified model of elasticity in order to
adapt to the query workload: 1) processing analytical queries in the form of
tree execution plans in order to maximize profit and 2) automated index
management taking into account compute and storage resources. Finally, we
describe EXAREME, a system for elastic data processing on the cloud that has
been used and extended in this work. The system offers declarative languages
that are based on SQL with user-defined functions (UDFs) extended with
parallelism primitives. EXAREME exploits both elasticities of clouds by
dynamically allocating and deallocating compute resources in order to adapt to
the query workload.
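The core trade-off behind eco-elasticity described above can be illustrated with a minimal sketch: pick the number of machines that balances estimated runtime against monetary cost. This is only an illustration under assumed prices, an Amdahl-style speedup curve, and ad-hoc weights; it is not the thesis's actual model or EXAREME's optimizer.

    # Illustrative eco-elasticity trade-off: choose a VM count that balances
    # estimated dataflow runtime against monetary cost. All numbers here
    # (speedup model, price, weights) are hypothetical placeholders.

    def runtime_hours(n_vms, serial_fraction=0.1, base_hours=10.0):
        # Amdahl-style estimate: only the parallel part scales with n_vms.
        return base_hours * (serial_fraction + (1 - serial_fraction) / n_vms)

    def monetary_cost(n_vms, hours, price_per_vm_hour=0.20):
        return n_vms * hours * price_per_vm_hour

    def best_allocation(max_vms=32, time_weight=1.0, cost_weight=5.0):
        # Score each allocation by a weighted sum of runtime and money;
        # the weights encode how much speed is worth relative to cost.
        best = None
        for n in range(1, max_vms + 1):
            t = runtime_hours(n)
            c = monetary_cost(n, t)
            score = time_weight * t + cost_weight * c
            if best is None or score < best[0]:
                best = (score, n, t, c)
        return best

    if __name__ == "__main__":
        _, n, t, c = best_allocation()
        print(f"use {n} VMs: ~{t:.2f} h, ~${c:.2f}")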
Complex Query Operators on Modern Parallel Architectures
Identifying interesting objects in a large data collection is a fundamental problem for multi-criteria decision making applications. In Relational Database Management Systems (RDBMS), the most popular complex query operators used to solve this type of problem are the Top-K selection operator and the Skyline operator. Top-K selection is tasked with retrieving the k highest-ranking tuples from a given relation, as determined by a user-defined aggregation function. Skyline selection retrieves those tuples of a given relation whose attributes offer (Pareto-)optimal trade-offs. Efficient Top-K query processing entails minimizing tuple evaluations by utilizing elaborate processing schemes combined with sophisticated data structures that enable early termination. Skyline query evaluation involves processing strategies geared towards early termination and pruning of incomparable tuples. The rapid increase in memory capacity and decreasing costs have been the main drivers behind the development of main-memory database systems. Although migrating query processing in-memory has created many opportunities to improve query latency, attaining such improvements has been very challenging due to the growing gap between processor and main-memory speeds. Addressing this limitation has been made easier by the rapid proliferation of multi-core and many-core architectures. However, their utilization in real systems has been hindered by the lack of suitable parallel algorithms that focus on algorithmic efficiency. In this thesis, we study in depth the Top-K and Skyline selection operators in the context of emerging parallel architectures. Our ultimate goal is to provide practical guidelines for developing work-efficient algorithms suitable for parallel main-memory processing. We concentrate on multi-core (CPU), many-core (GPU), and processing-in-memory (PIM) architectures, developing solutions optimized for high throughput and low latency. The first part of this thesis focuses on Top-K selection, presenting the specifics of early-termination algorithms that we developed for parallel architectures and various types of accelerators (i.e., GPU, PIM). The second part of this thesis concentrates on Skyline selection and the development of a massively parallel, load-balanced algorithm for PIM architectures. Our work consolidates performance results across different parallel architectures using synthetic and real data on variable query parameters and distributions for both of the aforementioned problems. The experimental results demonstrate several orders of magnitude better throughput and query latency, thus validating the effectiveness of our proposed solutions for the Top-K and Skyline selection operators.
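As a quick illustration of the two operators above, here is a deliberately naive sketch of Top-K and skyline selection (no early termination, pruning, or parallelism; the sample relation, scoring function, and "larger is better" attribute semantics are assumptions made only for this example):

    import heapq

    def top_k(rows, k, score):
        # Retrieve the k highest-ranking tuples under a user-defined score.
        return heapq.nlargest(k, rows, key=score)

    def dominates(a, b):
        # a dominates b if it is at least as good on every attribute and
        # strictly better on at least one (assuming larger values are better).
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def skyline(rows):
        # Naive O(n^2) skyline: keep tuples that no other tuple dominates.
        return [r for r in rows
                if not any(dominates(o, r) for o in rows if o is not r)]

    rows = [(3, 9), (5, 5), (9, 2), (4, 4), (8, 8)]
    print(top_k(rows, 2, score=lambda r: r[0] + r[1]))  # [(8, 8), (3, 9)]
    print(skyline(rows))                                # [(3, 9), (9, 2), (8, 8)]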
Join Query Enhancement Processing (JQPro) with Big RDF Data on a Distributed System Using a Hashing-Merge Join Technique
Semantic web technologies have emerged across different fields of study in the last few years, and their data are still growing rapidly. In particular, increased data storage and publishing capabilities in standard open web formats have made the technology much more successful, so the data have become readable by humans and can be processed by computers. The demand for complex RDF queries is becoming significant with the increasing number of RDF triples, and such complex queries occasionally produce many common subexpressions. It is therefore extremely challenging to reduce the number of RDF queries and the transmission time for a vast amount of related RDF data. Moreover, recent literature shows that join query processing over Big RDF data introduces many problems with respect to execution time and throughput. Hash-based encoding yields poor execution times: it takes a long time to load and hence does not load all graphs. This is because the Resource Description Framework (RDF) collects and analyses large data in swarms, and so has to deal with the inherent challenge of efficient swarm storage. Effective storage and retrieval of large amounts of potentially schema-less data has also proven exceedingly difficult for RDF data storage; for instance, it is particularly difficult to handle semantic and SPARQL query languages, as well as huge and complex graph patterns. To address this problem, a Join Query Processing Model (JQPro) is introduced for Big RDF data. The objectives of this research are to: (i) formulate plan-generator algorithms for join query processing on the basis of previous research; (ii) develop an enhanced Join Query Processing (JQPro) model based on SPARQL and Hadoop MapReduce that uses a hashing-merge join technique to process Big RDF data; and (iii) evaluate and compare the performance of the JQPro model with existing models in terms of execution time, throughput, and CPU utilization. Throughput was employed to measure the units of information that a system can process in a given time frame, and CPU utilization was used as an important resource measure in big join query processing, particularly during the map and reduce phases. Furthermore, the hash-join and sort-merge algorithms were used to generate the join query processing, chosen for their capacity to allow more datasets to be joined. Both inputs were sorted on the join attributes and the sorted relations were merged, so the join column grouped the datasets with the same value. The sort-merge join algorithm sorts the datasets on the joining attribute and then searches for matching tuples by merging the two datasets. A processing framework for RDF queries was then introduced and benchmarks were used for performance evaluation. Finally, standard statistical analysis was conducted to validate and compare the performance of the JQPro model with current models. The synthetic benchmarks Lehigh University Benchmark (LUBM) and Waterloo SPARQL Diversity Test Suite (WatDiv) v06 were used for measurement. The experiment was carried out on three datasets ranging from 10 million to 1 billion RDF triples, produced by the WatDiv data generator with scale factors of 10, 100, and 1000, respectively. A selective dataset for each experimental query was also used for RDF processing with the LUBM benchmark at sizes of 500, 1000, and 2000 million triples.
The results revealed a strong correlation between execution time and throughput, with a strength of 99.9% as confirmed by the Pearson correlation coefficient. Furthermore, the findings show that the JQPro solution was comparable to gStore, RDF-3X, RDFox, and PARJ, and the performance improvement was 87.77% in terms of execution time. CPU utilization was significantly increased by extensive map and reduce computation. It is therefore inferred that the JQPro solution is timely and innovative, as it provides efficient execution time and CPU utilization, allowing users to perform better queries for Big RDF data processing in a seamless manner.
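The sort-merge join step that the abstract above relies on can be sketched compactly as follows; this is a generic single-machine version under an assumed triple layout and join attribute, not the JQPro Hadoop/MapReduce implementation:

    def sort_merge_join(left, right, key_left, key_right):
        # Sort both inputs on the join attribute, then merge them,
        # pairing up runs of tuples that share the same key.
        left = sorted(left, key=key_left)
        right = sorted(right, key=key_right)
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            kl, kr = key_left(left[i]), key_right(right[j])
            if kl < kr:
                i += 1
            elif kl > kr:
                j += 1
            else:
                # Emit the cross product of the two runs with equal keys.
                j_start = j
                while i < len(left) and key_left(left[i]) == kl:
                    j = j_start
                    while j < len(right) and key_right(right[j]) == kl:
                        out.append((left[i], right[j]))
                        j += 1
                    i += 1
        return out

    # Example: join RDF-style triples on a shared subject.
    knows = [("alice", "knows", "bob"), ("carol", "knows", "dave")]
    ages = [("alice", "age", 30), ("carol", "age", 41)]
    print(sort_merge_join(knows, ages,
                          key_left=lambda t: t[0], key_right=lambda t: t[0]))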
Towards High-Performance Big Data Processing Systems
The amount of generated and stored data has been growing rapidly; it is estimated that 2.5 quintillion bytes of data are generated every day, and 90% of the data in the world today has been created in the last two years. How to solve these big data issues has become a hot topic in both industry and academia. Due to the complexity of big data platforms, we stratify them into four layers: the storage layer, the resource management layer, the computing layer, and the methodology layer. This dissertation proposes brand-new approaches to address the performance of big data platforms like Hadoop and Spark on these four layers. We first present an improved HDFS design called SMARTH, which optimizes the storage layer. It utilizes asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it thinks will yield the highest throughput for the client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH is able to improve data transfer throughput by 27-245% in a heterogeneous virtual cluster on Amazon EC2. Secondly, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs on the resource management layer. It is completely backward compatible with Hadoop and imposes negligible overhead. Our experiments on the Microsoft Azure public cloud show that MRapid can improve performance by up to 88% compared to the original Hadoop. Thirdly, we introduce an efficient 3-level sampling performance model, called Hedgehog, and focus on the relationship between resources and performance. This design is a brand-new white-box model for Spark, which is more complex and challenging than Hadoop. In our tool, we employ a Java bytecode manipulation and analysis framework called ASM to reduce the profiling overhead dramatically. Fourthly, on the computing layer, we optimize the current implementation of SGD in Spark's MLlib by reusing data partitions multiple times within a single iteration to find better candidate weights in a more efficient way. Whether to use multiple local iterations within each partition is decided dynamically by the 68-95-99.7 rule. We also design a variant of the momentum algorithm to optimize the step size in every iteration. This method uses a new adaptive rule that decreases the step size whenever neighboring gradients show differing directions of significance. Experiments show that our adaptive algorithm is more efficient and can be 7 times faster than the original MLlib SGD. Finally, on the application layer, we present a scalable and distributed geographic information system, called Dart, based on Hadoop and HBase. Dart provides a hybrid table schema to store spatial data in HBase so that the Reduce process can be omitted for operations like calculating the mean center and the median center. It employs reasonable pre-splitting and hash techniques to avoid data imbalance and hot-region problems.
It also supports massive spatial data analysis like K-Nearest Neighbors (KNN) and Geometric Median Distribution. In our experiments, we evaluate the performance of Dart by processing 160 GB of Twitter data on an Amazon EC2 cluster. The experimental results show that Dart is very scalable and efficient.
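The adaptive step-size idea mentioned for the MLlib SGD optimization (shrink the step whenever neighboring gradients point in differing directions) can be illustrated on a toy one-dimensional problem; the decay rule, constants, and objective below are assumptions for illustration, not the dissertation's Spark implementation:

    import random

    def grad(w, point):
        # Gradient of a toy squared loss (w - point)^2 / 2 at a single sample.
        return w - point

    def adaptive_sgd(points, w=0.0, step=0.5, decay=0.7, epochs=5):
        # Shrink the step size whenever consecutive stochastic gradients flip
        # sign, a simple stand-in for "neighboring gradients of differing
        # directions of significance".
        prev_g = 0.0
        for _ in range(epochs):
            random.shuffle(points)
            for p in points:
                g = grad(w, p)
                if g * prev_g < 0:
                    step *= decay   # direction changed: take smaller steps
                w -= step * g
                prev_g = g
        return w

    data = [1.8, 2.1, 2.0, 1.9, 2.2]
    print(adaptive_sgd(data))  # converges towards the sample mean (~2.0)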
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public and private cloud datacenters spanning millions of servers, acting as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes the scaling bottleneck due to the limits of the electricity delivered by the nearby utility. It is both critical and challenging to improve multi-resource efficiency for global datacenters.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks.
We surveyed datacenter resource scheduling work from the last 15 years. Most previous works are designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop. They are not suitable for modern long-running workloads such as microservices and Spark-, Flink-, Pregel-, Storm-, or TensorFlow-like systems. It is urgent to develop new, effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters.
In this dissertation, we are the first to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which pushes us to pursue and explore more intelligent scheduling techniques backed by adequate predictive knowledge. We specify what intelligent scheduling is, what abilities are necessary to achieve it, and how to leverage it to turn NP-hard online scheduling problems into tractable offline scheduling problems.
We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O) while strictly guaranteeing service-level agreements (SLAs) for long-running workloads.
Finally, we introduced a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to an average of 50%. This goes far beyond scheduling alone, involving the evolution of IDC design, networking, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness through an analysis of the newest Alibaba public cluster trace from 2017. We are the first to reveal, through data, the global view of scenarios, challenges, and status in Alibaba's large-scale global datacenters, including big promotion events such as Double 11.
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary for pursuing maximal multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
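A minimal sketch of the kind of QoS (interference)-aware placement decision described in this abstract is given below; the node model, scoring rule, and interference threshold are hypothetical and do not reflect the dissertation's or Alibaba's actual scheduler:

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        cpu_free: float      # fraction of CPU still available
        mem_free: float      # fraction of memory still available
        interference: float  # predicted slowdown for co-located latency-critical tenants (0..1)

    def place(task_cpu, task_mem, nodes, max_interference=0.3):
        # Keep nodes that fit the request and stay under the interference
        # budget, then prefer the lowest predicted interference, breaking
        # ties by leaving the most free capacity.
        feasible = [n for n in nodes
                    if n.cpu_free >= task_cpu
                    and n.mem_free >= task_mem
                    and n.interference <= max_interference]
        if not feasible:
            return None  # defer the task or trigger auto-scaling instead
        return min(feasible, key=lambda n: (n.interference, -(n.cpu_free + n.mem_free)))

    nodes = [Node("a", 0.6, 0.5, 0.10), Node("b", 0.9, 0.8, 0.25), Node("c", 0.4, 0.9, 0.05)]
    print(place(task_cpu=0.3, task_mem=0.4, nodes=nodes).name)  # prints "c"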
Advances in Knowledge Discovery and Data Mining, Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II
- …