22 research outputs found

    Importance of data distribution on Hive-based systems for query performance: An experimental study

    Get PDF
    SQL-on-Hadoop systems have been gaining popularity in recent years. A popular example, and the pioneer of this class of systems, is Apache Hive. Hive sits at the top of the big data stack as the application layer. Besides the application layer, the Hadoop ecosystem is composed of three main layers: storage, resource management and the processing engine. Demand from industry has led to the development of new, more efficient components for each layer, and as the ecosystem evolved over time, Hive adopted different execution engines as well. Understanding the strengths of these components is essential to exploit the full performance of the Hadoop ecosystem, and recent works in the literature therefore study the importance of each layer separately. To the best of our knowledge, the present work is the first to focus on the performance of the combination of the storage layer and the execution engine. We compare Hive's query performance using three execution engines (MR, Tez and Spark) on skewed and well-balanced data distributions across the full TPC-H benchmark. Our results show the importance of data distribution at the storage layer for the overall job performance of SQL-on-Hadoop systems: an even distribution empirically improves performance by up to 48% compared to a skewed distribution. Moreover, the study identifies particular SQL query cases that a given processing engine handles exceptionally well.
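
    As an illustration of how such an engine comparison might be scripted, the sketch below times one query under each of Hive's execution engines via beeline. The JDBC URL and query file name are placeholders, not artifacts of the study.

```python
#!/usr/bin/env python3
"""Sketch: time one TPC-H query under Hive's three execution engines.

Assumes a HiveServer2 endpoint reachable via beeline and a local copy of
the TPC-H query files; the JDBC URL and file name below are placeholders.
"""
import subprocess
import time

JDBC_URL = "jdbc:hive2://localhost:10000/tpch"   # placeholder endpoint
QUERY_FILE = "tpch_q1.sql"                       # placeholder TPC-H query file

def run_with_engine(engine: str) -> float:
    """Run the query with hive.execution.engine set to mr, tez or spark."""
    start = time.time()
    subprocess.run(
        ["beeline", "-u", JDBC_URL,
         "--hiveconf", f"hive.execution.engine={engine}",
         "-f", QUERY_FILE],
        check=True,
    )
    return time.time() - start

if __name__ == "__main__":
    for engine in ("mr", "tez", "spark"):
        print(f"{engine}: {run_with_engine(engine):.1f} s")
```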

    GRAPH-BASED WORD SENSE DISAMBIGUATION FOR CLINICAL ABBREVIATIONS USING APACHE SPARK

    Get PDF
    Identifying the correct sense of an ambiguous word is one of the major challenges for language processing in all domains. Word Sense Disambiguation is the task of identifying the correct sense of an ambiguous word by referencing its surrounding context. Like narrative documents, clinical documents suffer from ambiguity issues that hinder automatic extraction of the correct sense. In this project, we propose a graph-based solution, based on an algorithm originally implemented by Osmar R. Zaine et al., for word sense disambiguation focused specifically on clinical text. The algorithm uses the UMLS Metathesaurus as its source of knowledge. As an enhancement to the existing implementation, this project uses Apache Spark, a big data technology, for cluster-based distributed processing and performance optimization.
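
    The project builds its disambiguation graph from the UMLS Metathesaurus; the PySpark sketch below only illustrates how scoring of ambiguous abbreviations can be distributed over a corpus of notes. The scoring function, sense inventory and input path are simplified placeholders, not the project's actual algorithm or data.

```python
"""Sketch: distributing word sense disambiguation over clinical notes with PySpark."""
from pyspark.sql import SparkSession

def disambiguate(note: str, abbreviation: str, senses: list) -> str:
    # Placeholder scorer: pick the candidate sense whose words overlap most
    # with the note's context (the project derives this score from a
    # UMLS-based graph instead).
    return max(senses, key=lambda s: sum(w in note.lower() for w in s.split()))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("clinical-wsd").getOrCreate()
    senses = ["magnetic resonance", "mitral regurgitation"]  # toy inventory for "MR"
    notes = spark.sparkContext.textFile("hdfs:///clinical/notes/*.txt")  # placeholder path
    resolved = notes.map(lambda note: (note[:40], disambiguate(note, "MR", senses)))
    for snippet, sense in resolved.take(5):
        print(snippet, "->", sense)
```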

    Performance Evaluation of Job Scheduling and Resource Allocation in Apache Spark

    Get PDF
    Advancements in data acquisition techniques and devices are revolutionizing the way image data are collected, managed and processed. Devices such as time-lapse cameras and multispectral cameras generate large amounts of image data daily, so there is a clear need for many organizations and researchers to handle large volumes of image data efficiently. At the same time, big data processing on distributed systems such as Apache Spark has been gaining popularity in recent years. Apache Spark is a widely used in-memory framework for distributed processing of large datasets on a cluster of inexpensive computers. This thesis proposes using Spark for distributed processing of large amounts of image data in a time-efficient manner. However, to share cluster resources efficiently, the multiple image processing applications submitted to the cluster must be appropriately scheduled by Spark cluster managers to take advantage of all the compute power and storage capacity of the cluster. Spark can run on three cluster managers (Standalone, Mesos and YARN) and provides several configuration parameters that control how resources are allocated and scheduled. Using default settings for these parameters is not enough to share cluster resources efficiently between multiple applications running concurrently. This leads to performance issues and resource underutilization, because cluster administrators and users do not know which Spark cluster manager is the right fit for their applications, or how the scheduling behaviour and parameter settings of these cluster managers affect the performance of their applications in terms of resource utilization and response times. This thesis parallelizes a set of heterogeneous image processing applications, including Image Registration, Flower Counter and Image Clustering, and presents extensive comparisons and analyses of running these applications on a large server and on a Spark cluster using three different cluster managers for resource allocation: Standalone, Apache Mesos and Hadoop YARN. In addition, the thesis examines the two job scheduling and resource allocation modes available in Spark, static and dynamic allocation, and explores the configuration options in both modes that control speculative execution of tasks, resource sizes and the number of parallel tasks per job, explaining their impact on image processing applications. The thesis aims to show that using optimal values for these parameters reduces job makespan, maximizes cluster utilization, and ensures each application is allocated a fair share of cluster resources in a timely manner.
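
    For reference, these are the standard Spark properties behind the knobs discussed above (dynamic allocation, speculative execution, executor sizing, parallelism). The values are illustrative defaults for a hypothetical YARN deployment, not the settings the thesis recommends.

```python
"""Sketch: scheduling and resource-allocation settings applied to a SparkSession."""
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("image-processing")
    .master("yarn")                                       # or spark:// (Standalone), mesos://
    .config("spark.dynamicAllocation.enabled", "true")    # dynamic resource allocation
    .config("spark.shuffle.service.enabled", "true")      # external shuffle service for dynamic allocation
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "16")
    .config("spark.executor.cores", "4")                  # resource size per executor
    .config("spark.executor.memory", "8g")
    .config("spark.speculation", "true")                  # re-launch straggler tasks
    .config("spark.default.parallelism", "128")           # default number of parallel tasks
    .getOrCreate()
)
```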

    Automatic physical layer tuning of MapReduce-based query processing engines

    Get PDF
    Advisor: Eduardo Cunha de Almeida. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 29/06/2020. Includes references: p. 98-109. Area of concentration: Computer Science.
    Abstract: The increasing need to process large amounts of semi-structured and unstructured data has led to the development of specialized processing engines like MapReduce. MapReduce is a programming model designed to process large-scale semi-structured data in a distributed and parallel fashion. SQL-on-Hadoop systems are SQL-like interfaces built on top of MapReduce processing engines to query semi-structured data at large scale. However, the number of computing nodes, the number of systems in the software stack, and the controlling mechanisms provided by MapReduce engines increase the complexity and the operational costs of maintaining a large SQL-on-Hadoop cluster. Increasing the performance of such engines is a key goal that can be achieved by delegating the right amount of physical resources to their jobs. Yet, regular users and even expert administrators struggle to understand and tune MapReduce jobs to achieve good performance. This skill gap has given rise to a successful line of research on automatically tuning MapReduce parameters, which has produced several tuning advisors.
Yet, the problem of automatically tuning SQL-on-Hadoop queries remains largely unexplored, as the current approach of applying MapReduce tuning advisors directly to SQL-on-Hadoop queries entails a number of problems. For instance, the Hive SQL-on-Hadoop engine compiles HiveQL queries into a workflow of MapReduce jobs, and it would be straightforward to assume that HiveQL queries would benefit from tuning the underlying Hadoop processing engine. However, this assumption does not hold when existing tuning advisors are naively applied to HiveQL queries, due to the design choices of Hive, Hadoop and the tuning advisors. This thesis addresses the question of how to properly tune SQL-on-Hadoop queries. By "properly" we mean that the generation of tuning setups has to consider several characteristics that are only present in jobs generated by SQL-on-Hadoop systems: (i) at the level of individual queries, all MapReduce jobs that constitute a query plan are executed with identical configuration settings; (ii) although profiling and search heuristics are performed on a per-job basis to generate tuning setups, only one tuning setup is applied to the query and the remaining setups are simply discarded; (iii) Hadoop tuning advisors treat MapReduce functions as black boxes and make simplifying modeling assumptions that may hold for classical MapReduce jobs (Sort, Grep) but are not true for SQL-like queries such as HiveQL, whose jobs contain multiple relational algebra operators like joins and aggregations. We extended the Hive query processor to tune SQL-on-Hadoop queries. This extension comprises an approach called non-uniform tuning that gives SQL-on-Hadoop systems fine-grained control over query configuration, where each MapReduce job receives a specialized tuning setup. We present a conceptual model, called the code signature, that uses static information available before execution to match jobs with similar resource consumption patterns. We also present a tuning cache that stores tuning setups generated by third-party tuning advisors and recycles them between jobs that have similar resource consumption. Together, the extension works as a single solution for automatic tuning of SQL-on-Hadoop queries. To validate our solution, we conducted an experimental study focused on Hive over Hadoop because (i) Hive is a good representative of native SQL-on-Hadoop systems (as System R was for relational database systems); (ii) both Hive and Hadoop are highly popular for analytical processing; and (iii) Hadoop parameter tuning has been studied extensively in recent years. To populate the tuning cache, we employed Starfish, the first cost-based optimizer for finding (near-)optimal configuration parameter settings and the only tuning advisor publicly available for academic research purposes. In our experiments, queries optimized with our tuning approach consistently achieved speedups of up to 25%, in contrast with the current approach, which degraded performance on several occasions. Specifically, the current tuning approach can cause variations in execution time between -171% and 27% relative to the default configuration. Most importantly, our tuning method leads to considerably better resource utilization, decreasing CPU usage and memory paging by up to 40% and reducing the total amount of data written to disk by 5×. Our tuning approach includes a tuning cache used to avoid re-profiling similar jobs. 
The tuning cache reduced profiling runs by 50% for the TPC-H workload, even enabling partial upfront tuning of ad-hoc queries before execution. Keywords: Physical-layer tuning. MapReduce query processing. SQL-on-Hadoop.
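
    A minimal sketch of the non-uniform tuning idea described above: a cache keyed by a code signature built from static, pre-execution job information, so that tuning setups produced by an external advisor (Starfish in the thesis) can be recycled across similar jobs. The signature features and configuration keys here are illustrative, not the thesis's actual model.

```python
"""Sketch: tuning cache keyed by a code signature for non-uniform tuning."""
from typing import Dict, Tuple

# A code signature built from static, pre-execution information about a job
# (e.g. the relational operators in its plan and an input-size class).
Signature = Tuple[str, ...]

class TuningCache:
    def __init__(self) -> None:
        self._cache: Dict[Signature, Dict[str, str]] = {}

    def signature(self, operators: list, input_size_class: str) -> Signature:
        return tuple(sorted(operators)) + (input_size_class,)

    def get_or_profile(self, sig: Signature, profile_job) -> Dict[str, str]:
        # Reuse a setup generated for a similar job instead of re-profiling it.
        if sig not in self._cache:
            self._cache[sig] = profile_job()   # delegate to the external tuning advisor
        return self._cache[sig]

# Usage: each MapReduce job in a HiveQL plan gets its own setup (non-uniform tuning).
cache = TuningCache()
sig = cache.signature(["TableScan", "Join", "GroupBy"], "large")
setup = cache.get_or_profile(sig, lambda: {"mapreduce.map.memory.mb": "2048"})
print(setup)
```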

    Adaptive Big Data Pipeline

    Get PDF
    Over the past three decades, data has evolved from being a simple software by-product to one of a company's most important assets, used to understand customers and foresee trends. Deep learning has demonstrated that large volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for processing pipelines, and the speed at which engineers can reliably take such pipelines into production, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate the creation of data pipelines. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. The system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
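
    The abstract does not publish the pipeline interface, so the sketch below is only a guess at the kind of declarative specification such a platform might capture: sources, transformations, destination and schedule, with a simple lineage trail recorded as steps run. All names are hypothetical.

```python
"""Sketch: a hypothetical declarative pipeline specification with lineage tracking."""
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PipelineSpec:
    sources: List[str]                   # e.g. object-store URIs
    transformations: List[Callable]      # ordered transformation steps
    destination: str                     # sink table or path
    schedule: str = "0 2 * * *"          # cron-style execution schedule
    lineage: List[str] = field(default_factory=list)

def run(spec: PipelineSpec, records: List[dict]) -> List[dict]:
    """Apply the transformations in order, recording a simple lineage trail."""
    for step in spec.transformations:
        records = [step(r) for r in records]
        spec.lineage.append(step.__name__)
    return records

def normalize_amount(record: dict) -> dict:
    return {**record, "amount": float(record["amount"])}

spec = PipelineSpec(
    sources=["s3://raw/events/"],        # hypothetical source URI
    transformations=[normalize_amount],
    destination="warehouse.events",      # hypothetical destination table
)
print(run(spec, [{"amount": "3.5"}]), spec.lineage)
```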

    Data warehousing technologies for large-scale and right-time data

    Get PDF

    Model-Based Time Series Management at Scale

    Get PDF

    Visualization of large amounts of multidimensional multivariate business-oriented data

    Get PDF
    Many large businesses store large amounts of business-oriented data in data warehouses. These data warehouses contain fact tables, whose rows represent business events such as an individual sale or delivery. This data contains multiple dimensions (independent variables that are categorical) and very often also multiple measures (dependent variables that are usually continuous), which makes it complex for casual business users to analyze and visualize. We propose two techniques, GPLOM and VisReduce, that respectively handle the visualization front-end for complex datasets and the back-end processing necessary to visualize large datasets. Scatterplot matrices (SPLOMs), parallel coordinates, and glyphs can all be used to visualize the multiple measures in multidimensional multivariate data. However, these techniques are not well suited to visualizing many dimensions. To visualize multiple dimensions, "hierarchical axes" that "stack dimensions" have been used in systems like Polaris and Tableau, but this approach does not scale well beyond a small number of dimensions. Emerson et al. (2013) extend the matrix paradigm of the SPLOM to simultaneously visualize several categorical and continuous variables, displaying many kinds of charts in the matrix depending on the kinds of variables involved. We propose a variant of their technique, called the Generalized Plot Matrix (GPLOM). The GPLOM restricts Emerson et al. (2013)'s technique to only three kinds of charts (scatterplots for pairs of continuous variables, heatmaps for pairs of categorical variables, and bar charts for pairs of a categorical and a continuous variable), in an effort to make it easier for casual business users to understand. At the same time, the GPLOM extends Emerson et al. (2013)'s work by demonstrating interactive techniques suited to the matrix of charts. We discuss the visual design and interactive features of our GPLOM prototype, including a textual search feature allowing users to quickly locate values or variables by name. We also present a user study comparing Tableau with our GPLOM prototype, which found that GPLOM is significantly faster in certain cases and not significantly slower in others. The performance and responsiveness of visual analytics systems for exploratory data analysis of large datasets has been a long-standing problem, which GPLOM also encounters. We propose a method called VisReduce that incrementally computes visualizations in a distributed fashion by combining a modified MapReduce-style algorithm with a compressed columnar data store, resulting in significant improvements in performance and responsiveness for constructing commonly encountered information visualizations, e.g., bar charts, scatterplots, heat maps, cartograms and parallel coordinate plots. We compare our method with one that queries three readily available database and data-warehouse systems — PostgreSQL, Cloudera Impala and the MapReduce-based Apache Hive — to build visualizations. We show that VisReduce's end-to-end approach allows for greater speed and guaranteed end-user responsiveness, even in the face of large, long-running queries.
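
    As a rough illustration of the incremental, MapReduce-style aggregation behind VisReduce, the sketch below computes the partial aggregates for a bar chart per data partition and merges them. It is a simplified single-process stand-in, not the system's actual distributed, columnar implementation.

```python
"""Sketch: MapReduce-style incremental aggregation for a bar chart."""
from collections import Counter
from typing import Iterable, List

def map_partition(rows: Iterable[dict], dimension: str) -> Counter:
    """Per-partition partial aggregate: row counts per category."""
    return Counter(row[dimension] for row in rows)

def reduce_partials(partials: List[Counter]) -> Counter:
    """Merge partial aggregates; can be applied incrementally as partitions finish."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

partitions = [
    [{"region": "EU"}, {"region": "NA"}],
    [{"region": "EU"}, {"region": "APAC"}],
]
partials = [map_partition(p, "region") for p in partitions]
print(reduce_partials(partials))   # bar heights per region
```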

    Design of a reference architecture for an IoT sensor network

    Get PDF

    Electricity use profiling and forecasting at microgrid level

    Get PDF
    The aim of this thesis is to create a flexible and easily customized tool, applicable in microgrids, to carry out electricity use profiling and load forecasting. This modular tool is called Divinus, and its architecture consists of several interconnected, well-defined components, each interacting directly with the others. The first three structural pillars of the platform are its database, where all the information is stored; the Django framework, in which the source code lives; and the website, where all the results are displayed. The next set of components is functional rather than structural: the collection of data to be saved in the database, the use profiling performed on the collected data, and the load forecasting that draws on the use-profiling data. Using Self-Organizing Maps, which are competitive networks that provide a topological mapping of the input data, we perform use profiling on data collected from 2010 to 2017 at the Psachna campus of the Technological Educational Institute of Sterea Ellada. Once the use profiling is complete and the data are placed in clusters based on their characteristics, the forecasting process can begin. Forecasting is performed with machine learning, specifically the k-nearest-neighbours algorithm. In the tests carried out so far, Divinus shows high accuracy and low mean errors: for forecasts of the next five days, the next month and the next year, the average error does not exceed 5%, 12% and 16% respectively. Therefore, at its current stage Divinus is a promising tool that can likely be used for both short-term and medium-term forecasts.
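
    A minimal sketch of the forecasting step, using scikit-learn's k-nearest-neighbours regressor on synthetic load features. Divinus first clusters consumption profiles with a Self-Organizing Map, which is omitted here, and the feature layout and data are purely illustrative.

```python
"""Sketch: k-nearest-neighbours load forecasting on synthetic features."""
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy features: [hour of day, day of week, previous-day load at same hour]
X = rng.uniform([0, 0, 0], [23, 6, 100], size=(500, 3))
y = 20 + 0.5 * X[:, 2] + 5 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 2, 500)

model = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(model.predict([[18, 2, 60.0]]))   # forecast load for 18:00 on a Wednesday
```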