5 research outputs found

    ALOJA: A benchmarking and predictive platform for big data performance analysis

    The main goals of the ALOJA research project from BSC-MSR are to explore and automate the characterization of the cost-effectiveness of Big Data deployments. During its first year, the project has produced an open source benchmarking platform, an online public repository of results with over 42,000 Hadoop job runs, and web-based analytic tools to gather insights about system's cost-performance. This article describes the evolution of the project's focus and research lines over more than a year of continuously benchmarking Hadoop under different configuration and deployment options, presents results, and discusses the motivation, both technical and market-based, for such changes. During this time, ALOJA's target has evolved from an initial low-level profiling of the Hadoop runtime, through extensive benchmarking and evaluation of a large body of results via aggregation, to currently leveraging Predictive Analytics (PA) techniques. Modeling benchmark executions allows us to estimate the results of new or untested configurations or hardware set-ups automatically by applying learning techniques to past observations, saving benchmarking time and costs. This work is partially supported by the BSC-Microsoft Research Centre, the Spanish Ministry of Education (TIN2012-34557), the MINECO Severo Ochoa Research program (SEV-2011-0067) and the Generalitat de Catalunya (2014-SGR-1051). Peer reviewed. Postprint (author's final draft).

    The state of SQL-on-Hadoop in the cloud

    Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS), analytical services like Hive and Spark come preconfigured for general-purpose use and ready to run, giving companies a quick entry point and on-demand deployment of ready-made SQL-like solutions for their big data needs. This study evaluates cloud services from an end-user perspective, comparing providers including Microsoft Azure, Amazon Web Services, Google Cloud, and Rackspace. The study focuses on the performance, readiness, scalability, and cost-effectiveness of the different solutions at entry/test-level cluster sizes. Results are based on over 15,000 Hive queries derived from the industry standard TPC-H benchmark. The study is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines. The ALOJA Project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. The study benchmarks cloud providers across a diverse range of instance types, and uses input data scales from 1GB to 1TB, in order to survey the popular entry-level PaaS SQL-on-Hadoop solutions, thereby establishing a common results base upon which subsequent research can be carried out by the project. Initial results already show the main performance trends with respect to hardware and software configuration and pricing, as well as the similarities and architectural differences of the evaluated PaaS solutions. Whereas some providers focus on decoupling storage and computing resources while offering network-based elastic storage, others choose to keep the local processing model from Hadoop for high performance, at the cost of reduced flexibility. Results also show the importance of application-level tuning and how keeping hardware and software stacks up to date can influence performance even more than replicating the on-premises model in the cloud. This work is partially supported by the Microsoft Azure for Research program, the European Research Council (ERC) under the EU's Horizon 2020 programme (GA 639595), the Spanish Ministry of Education (TIN2015-65316-P), and the Generalitat de Catalunya (2014-SGR-1051). Peer reviewed. Postprint (author's final draft).

    Optimizing cost and performance trade-offs for MapReduce job processing in the cloud

    No full text