12 research outputs found

    Benchmarking BigSQL Systems

    We live in the era of BigData. We now have BigData systems that are able to manage data in volumes of hundreds of terabytes and petabytes. These systems handle data sizes that are too large for traditional database systems. Some of these BigData systems now provide SQL syntax for interacting with their stores. These systems, referred to as BigSQL systems, possess certain features that make them unique in how they store and manage data. A study of the performance and characteristics of these BigSQL systems is necessary in order to understand them better. This thesis provides such a study. We run standardized benchmark experiments against selected BigSQL systems and analyze their performance based on the results. The outcome is a better understanding of the features and behavior of the selected BigSQL systems.
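    The abstract does not specify the benchmark harness used, so the sketch below is only a rough illustration of what a standardized query-timing loop looks like in Python. It uses sqlite3 purely as a stand-in backend so the example runs anywhere; against a real BigSQL engine the same loop would drive that engine's DB-API driver (e.g. PyHive for Hive), and the lineitem table here is a hypothetical toy stand-in for a TPC-style schema.

```python
import sqlite3
import statistics
import time

# sqlite3 stands in for a BigSQL engine's DB-API driver so the sketch is
# self-contained; the timing loop itself is engine-agnostic.
QUERIES = {
    "count": "SELECT COUNT(*) FROM lineitem",
    "agg": "SELECT status, SUM(price) FROM lineitem GROUP BY status",
}

def benchmark(conn, sql, runs=5):
    """Run one query several times and return per-run wall-clock latencies."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # drain the full result set
        latencies.append(time.perf_counter() - start)
    return latencies

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (status TEXT, price REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [("A", 1.5), ("B", 2.0), ("A", 3.25)] * 1000)

for name, sql in QUERIES.items():
    lats = benchmark(conn, sql)
    print(f"{name}: median {statistics.median(lats) * 1000:.2f} ms "
          f"over {len(lats)} runs")
```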

    Data Analytics and Machine Learning to Enhance the Operational Visibility and Situation Awareness of Smart Grid High Penetration Photovoltaic Systems

    Electric utilities have limited operational visibility and situation awareness over grid-tied distributed photovoltaic (PV) systems. This poses a risk to grid stability when PV penetration on a given feeder exceeds 60% of its peak or minimum daytime load. Third-party service providers offer only real-time monitoring, not accurate insight into system performance or production forecasts. PV systems also increase the attack surface of distribution networks, since they are not under the direct supervision and control of the utility's security analysts. Six key objectives were successfully achieved to enhance PV operational visibility and situation awareness: (1) conceptual cybersecurity frameworks for PV situation awareness at the device, communications, applications, and cognitive levels; (2) a unique combinatorial approach using LASSO-Elastic Net regularization and a multilayer perceptron for PV generation forecasting; (3) applying a fixed-point primal-dual log-barrier interior point method to expedite AC optimal power flow convergence; (4) adapting big data standards and capability maturity models to PV systems; (5) using K-nearest neighbors and random forests to impute missing values in PV big data; and (6) a hybrid data-model method that takes PV system derating factors and historical data to estimate generation and evaluate system performance using advanced metrics. These objectives were validated on three real-world case studies comprising grid-tied commercial PV systems. The results show that the proposed imputation approach improved accuracy by 91%, the estimation method performed better by 75% and 10% for two PV systems, and the proposed forecasting model improved generalization performance and reduced the likelihood of overfitting. The primal-dual log-barrier interior point method improved the convergence of AC optimal power flow by 0.7 and 0.6 times relative to the currently used deterministic models. Through the use of advanced performance metrics, it is shown how PV systems of different nameplate capacities installed at different geographical locations can be directly evaluated and compared over both instantaneous and extended periods of time. The results of this dissertation will be of particular use to multiple stakeholders in the PV domain including, but not limited to, utility network and security operation centers, standards working groups, utility equipment and service providers, data consultants, system integrators, regulators and public service commissions, government bodies, and end-consumers.
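    Of the methods named above, the K-nearest-neighbors imputation step is the easiest to illustrate. Below is a minimal sketch using scikit-learn's KNNImputer on a few fabricated PV telemetry rows; the columns and values are invented for illustration, as the dissertation's actual features and data are not given in the abstract.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical PV telemetry: [irradiance W/m^2, module temp C, power kW].
# np.nan marks sensor dropouts to be imputed.
readings = np.array([
    [820.0, 41.2, 5.1],
    [790.0, np.nan, 4.9],
    [np.nan, 39.8, 4.7],
    [610.0, 35.0, 3.6],
    [605.0, 34.7, np.nan],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, measured on the features that are observed.
imputer = KNNImputer(n_neighbors=2)
completed = imputer.fit_transform(readings)
print(completed)
```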

    Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

    In recent years there has been an extraordinary growth in the demand for Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow, and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption of cloud-based applications makes understanding application behavior through expert examination difficult. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the number of different scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be learned automatically from data. Firstly, this thesis proposes an unsupervised learning technique to learn high-level representations from workload traces. This technique provides a fast methodology for characterizing workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task, where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that predicts the expected resource usage of applications sharing hardware resources. The proposed technique gives resource managers the ability to predict resource usage over time as well as the completion time of the running applications, and yields lower prediction error than other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters. The proposed technique gathers information from an application's logs, generating a feature descriptor that captures the relevant information about the application to be tuned. Using this information, we demonstrate that performance models can generalize up to 34% better than other state-of-the-art solutions. Moreover, the search time to find a suitable configuration can be drastically reduced, with up to a 12x speedup while achieving results of almost equal quality to modern solutions. These results prove that modern learning algorithms, given the right feature information, provide powerful techniques for managing resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms enable relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to using computing resources efficiently. We demonstrate this thesis in three areas that orbit around resource management in server environments.
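    The abstract does not say which unsupervised method learns the phase representation, so the sketch below uses k-means over sliding windows of a single monitored metric purely as a stand-in illustration of the idea: a raw time-series becomes a sequence of abstract phase labels. All data and parameters are fabricated.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy monitored trace (e.g. CPU utilisation sampled once a minute);
# a real trace would carry many metrics, not one.
trace = np.concatenate([
    np.random.normal(20, 2, 60),   # idle-ish phase
    np.random.normal(80, 5, 60),   # compute-heavy phase
    np.random.normal(45, 3, 60),   # mixed phase
])

# Fixed-size, non-overlapping windows become the observations to cluster.
W = 10
windows = np.lib.stride_tricks.sliding_window_view(trace, W)[::W]

# k-means stands in for the thesis's (unspecified) phase-learning method:
# each cluster id acts as an abstract phase label for its window.
phases = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(windows)
print("phase sequence:", phases)
```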

    The state of SQL-on-Hadoop in the cloud

    Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS) offerings, analytical services like Hive and Spark come preconfigured for general-purpose use and ready to run, giving companies a quick entry point and on-demand deployment of ready-made SQL-like solutions for their big data needs. This study evaluates such cloud services from an end-user perspective, comparing providers including Microsoft Azure, Amazon Web Services, Google Cloud, and Rackspace. The study focuses on the performance, readiness, scalability, and cost-effectiveness of the different solutions at entry/test-level cluster sizes. Results are based on over 15,000 Hive queries derived from the industry-standard TPC-H benchmark. The study is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has recently been extended to support SQL-on-Hadoop engines. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and to study their performance characteristics for optimization. The study benchmarks cloud providers across a diverse range of instance types, using input data scales from 1 GB to 1 TB, in order to survey the popular entry-level PaaS SQL-on-Hadoop solutions and thereby establish a common results base upon which subsequent research by the project can build. Initial results already show the main performance trends with respect to hardware and software configuration, pricing, and the similarities and architectural differences of the evaluated PaaS solutions. Whereas some providers focus on decoupling storage and computing resources while offering network-based elastic storage, others keep the local processing model of Hadoop for high performance, at the cost of flexibility. Results also show the importance of application-level tuning and how keeping hardware and software stacks up to date can influence performance even more than replicating the on-premises model in the cloud. This work is partially supported by the Microsoft Azure for Research program, the European Research Council (ERC) under the EU's Horizon 2020 programme (GA 639595), the Spanish Ministry of Education (TIN2015-65316-P), and the Generalitat de Catalunya (2014-SGR-1051).
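    As a small illustration of the cost-effectiveness comparison such a study implies, the sketch below computes the cost of one full benchmark run from a cluster's measured runtime and its hourly on-demand price. Every number and provider name here is fabricated; the study's actual runtimes and prices live in the project's results.

```python
# Hypothetical runtimes (seconds for the full query set) and hourly
# cluster prices; real values come from the providers' pricing sheets.
results = {
    "provider_a": {"runtime_s": 5400, "usd_per_hour": 4.20},
    "provider_b": {"runtime_s": 7200, "usd_per_hour": 2.80},
}

for name, r in results.items():
    hours = r["runtime_s"] / 3600
    cost = hours * r["usd_per_hour"]  # USD per full benchmark run
    print(f"{name}: {cost:.2f} USD per run "
          f"({hours:.2f} h at {r['usd_per_hour']:.2f} USD/h)")
```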

    Towards Providing Hadoop Storage and Computing as Services


    Effectiveness of NoSQL and NewSQL Databases in Mobile Network Event Data: Cassandra and ParStream/Kinetic

    The continuously growing amount of data has motivated the search for ever more efficient database solutions for storing and manipulating data. For big data sets, NoSQL databases have established themselves as alternatives to traditional SQL databases. The effectiveness of these databases has been widely tested, but the tests have focused only on key-value data that is structurally very simple. Many application domains, such as telecommunications, involve more complex data structures. A huge amount of Mobile Network Event (MNE) data is produced by an increasing number of mobile and ubiquitous applications. MNE data is structurally predetermined and typically contains a large number of columns. Applications that handle MNE data are usually insert-intensive, as huge amounts of data are generated during rush hours. NoSQL provides high scalability, and its column-family stores suit MNE data well, but NoSQL does not support the ACID features of traditional relational databases. NewSQL is a newer class of databases that provides the high scalability of NoSQL while maintaining the ACID guarantees of a traditional DBMS. In this paper, we evaluate the MNE data storage and aggregation efficiency of the Cassandra and ParStream/Kinetic databases, and aim to find out whether the new kind of database technology brings clear performance advantages over legacy database technology and offers an alternative to existing solutions. Among the column-family stores of NoSQL, Cassandra is an especially good choice for insert-intensive applications due to the way it handles data insertions. ParStream is a novel, advanced NewSQL-like database that has recently been integrated into Cisco Kinetic. The results of the evaluation show that ParStream is much faster than Cassandra at storing and aggregating MNE data, and that NewSQL is a very strong alternative to existing database solutions for insert-intensive applications.
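    The paper's own benchmark schema and harness are not given in the abstract; the sketch below shows what a minimal insert-throughput test against Cassandra could look like using the DataStax Python driver, assuming a reachable local node and a pre-created keyspace and table (all names and column types here are hypothetical).

```python
import time
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

# Assumes a local Cassandra node and a pre-created table, e.g.:
#   CREATE TABLE mne.events (event_id bigint PRIMARY KEY,
#                            cell_id int, ts bigint, payload text);
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mne")

insert = session.prepare(
    "INSERT INTO events (event_id, cell_id, ts, payload) VALUES (?, ?, ?, ?)"
)
rows = [(i, i % 100, i * 1000, "x" * 64) for i in range(10_000)]

start = time.perf_counter()
# Drive many prepared inserts with bounded concurrency, mimicking an
# insert-intensive MNE workload.
execute_concurrent_with_args(session, insert, rows, concurrency=100)
elapsed = time.perf_counter() - start

print(f"{len(rows) / elapsed:,.0f} inserts/s")
cluster.shutdown()
```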

    A performance comparison of data lake table formats in cloud object storages

    The increasing informatization of the processes involved in our daily lives has generated a significant increase in the amount of software developed to meet these needs. As a result, the volume of data generated by applications is growing, which has created greater interest in using that data for analytical purposes, with the objective of gaining insights and extracting valuable information from it. This growth, however, has created new challenges related to the storage, organization, and processing of the data. In general, the interest is in obtaining relevant information quickly, consistently, and at the lowest possible cost. In this context, new approaches have emerged to facilitate the organization of, and access to, data at massive scale. An already widespread concept is to have a central repository, known as a Data Lake, in which data from different sources, with variable characteristics, are stored massively so that they can be explored and processed in order to obtain new, relevant information. These environments have lately been implemented on object storages, especially in the cloud, given the rise of this model in recent years. The data stored in these environments is often structured as tables and stored in files, which can be text files, such as CSVs, or binary files, such as Parquet, Avro, or ORC, that implement specific properties such as data compression. The modeling of these tables resembles, in some respects, the structures of Data Warehouses, which are frequently implemented using Database Management Systems (DBMSs) and are used to make data available in a structured way for analytical purposes. Given these characteristics, new table format specifications have emerged, applied as layers above these files, which aim to support the usual operations and properties of DBMSs while using object storages as the storage layer. In practice, the intent is to guarantee the ACID properties during these operations, as well as the ability to perform operations that involve mutability, like updates, upserts, or merges, in a simpler way. Thus, this work aims to evaluate and compare the performance of some of these formats, defined as Data Lake table formats: Delta Lake, Apache Hudi, and Apache Iceberg, in order to identify how each format behaves when performing the usual operations in these environments: inserting data, updating data, and querying data.
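    To make the three operations under test concrete, here is a minimal sketch of insert, update, and query against one of the compared formats, Delta Lake, via PySpark and the delta-spark package. The path and the tiny table are hypothetical; Apache Hudi and Apache Iceberg expose analogous write and merge APIs.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session wired up for Delta Lake (requires the delta-spark package).
spark = (SparkSession.builder.appName("delta-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Insert: write a small table in the Delta format. In a cloud setting the
# path would be an object-store URI (e.g. s3a://bucket/orders).
path = "/tmp/delta/orders"
spark.createDataFrame([(1, "open"), (2, "open")], ["id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Update: mutate rows in place, the capability these table formats add
# on top of immutable Parquet files.
DeltaTable.forPath(spark, path).update(
    condition="id = 2",
    set={"status": "'closed'"},
)

# Query: read back the current snapshot of the table.
spark.read.format("delta").load(path).show()
```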