1,856 research outputs found

    ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads

    Full text link
    ARM processors have dominated the mobile device market in the last decade due to their favorable computing to energy ratio. In this age of Cloud data centers and Big Data analytics, the focus is increasingly on power efficient processing, rather than just high throughput computing. ARM's first commodity server-grade processor is the recent AMD A1100-series processor, based on a 64-bit ARM Cortex A57 architecture. In this paper, we study the performance and energy efficiency of a server based on this ARM64 CPU, relative to a comparable server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads. Specifically, we study these for Intel's HiBench suite of web, query and machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed setup, for data sizes up to 20GB20GB files, 5M5M web pages and 500M500M tuples. Our results show that the ARM64 server's runtime performance is comparable to the x64 server for integer-based workloads like Sort and Hive queries, and only lags behind for floating-point intensive benchmarks like PageRank, when they do not exploit data parallelism adequately. We also see that the ARM64 server takes 13rd\frac{1}{3}^{rd} the energy, and has an Energy Delay Product (EDP) that is 5071%50-71\% lower than the x64 server. These results hold promise for ARM64 data centers hosting Big Data workloads to reduce their operational costs, while opening up opportunities for further analysis.Comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 201

    Towards Efficient Resource Provisioning in Hadoop

    Get PDF
    Considering recent exponential growth in the amount of information processed in Big Data, the high energy consumed by data processing engines in datacenters has become a major issue, underlining the need for efficient resource allocation for better energy-efficient computing. This thesis proposes the Best Trade-off Point (BToP) method which provides a general approach and techniques based on an algorithm with mathematical formulas to find the best trade-off point on an elbow curve of performance vs. resources for efficient resource provisioning in Hadoop MapReduce and Apache Spark. Our novel BToP method is expected to work for any applications and systems which rely on a tradeoff curve with an elbow shape, non-inverted or inverted, for making good decisions. This breakthrough method for optimal resource provisioning was not available before in the scientific, computing, and economic communities. To illustrate the effectiveness of the BToP method on the ubiquitous Hadoop MapReduce, our Terasort experiment shows that the number of task resources recommended by the BToP algorithm is always accurate and optimal when compared to the ones suggested by three popular rules of thumbs. We also test the BToP method on the emerging cluster computing framework Apache Spark running in YARN cluster mode. Despite the effectiveness of Spark’s robust and sophisticated built-in dynamic resource allocation mechanism, which is not available in MapReduce, the BToP method could still consistently outperform it according to our Spark-Bench Terasort test results. The performance efficiency gained from the BToP method not only leads to significant energy saving but also improves overall system throughput and prevents cluster underutilization in a multi-tenancy environment. In General, the BToP method is preferable for workloads with identical resource consumption signatures in production environment where job profiling for behavioral replication will lead to the most efficient resource provisioning

    BigDataBench: a Big Data Benchmark Suite from Internet Services

    Full text link
    As architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure of benchmarking and evaluating these systems rises. Considering the broad use of big data systems, big data benchmarks must include diversity of data and workloads. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench . Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache misses per 1000 instructions of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.Comment: 12 pages, 6 figures, The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, US

    E-EON : Energy-Efficient and Optimized Networks for Hadoop

    Get PDF
    Energy efficiency and performance improvements have been two of the major concerns of current Data Centers. With the advent of Big Data, more information is generated year after year, and even the most aggressive predictions of the largest network equipment manufacturer have been surpassed due to the non-stop growing network traffic generated by current Big Data frameworks. As, currently, one of the most famous and discussed frameworks designed to store, retrieve and process the information that is being consistently generated by users and machines, Hadoop has gained a lot of attention from the industry in recent years and presently its name describes a whole ecosystem designed to tackle the most varied requirements of today’s cloud applications. This thesis relates to Hadoop clusters, mainly focused on their interconnects, which is commonly considered to be the bottleneck of such ecosystem. We conducted research focusing on energy efficiency and also on performance optimizations as improvements on cluster throughput and network latency. Regarding the energy consumption, a significant proportion of a data center's energy consumption is caused by the network, which stands for 12% of the total system power at full load. With the non-stop growing network traffic, it is desired by industry and academic community that network energy consumption should be proportional to its utilization. Considering cluster performance, although Hadoop is a network throughput-sensitive workload with less stringent requirements for network latency, there is an increasing interest in running batch and interactive workloads concurrently on the same cluster. Doing so maximizes system utilization, to obtain the greatest benefits from the capital and operational expenditures. For this to happen, cluster throughput should not be impacted when network latency is minimized. The two biggest challenges faced during the development of this thesis were related to achieving near proportional energy consumption for the interconnects and also improving the network latency found on Hadoop clusters, while having virtually no loss on cluster throughput. Such challenges led to comparable sized opportunity: proposing new techniques that must solve such problems from the current generation of Hadoop clusters. We named E-EON the set of techniques presented in this work, which stands for Energy Efficient and Optimized Networks for Hadoop. E-EON can be used to reduce the network energy consumption and yet, to reduce network latency while cluster throughput is improved at the same time. Furthermore, such techniques are not exclusive to Hadoop and they are also expected to have similar benefits if applied to any other Big Data framework infrastructure that fits the problem characterization we presented throughout this thesis. With E-EON we were able to reduce the energy consumption by up to 80% compared to the state-of-the art technique. We were also able to reduce network latency by up to 85% and in some cases, even improve cluster throughput by 10%. Although these were the two major accomplishment from this thesis, we also present minor benefits which translate to easier configuration compared to the stat-of-the-art techniques. Finally, we enrich the discussions found in this thesis with recommendations targeting network administrators and network equipment manufacturers.La eficiencia energética y las mejoras de rendimiento han sido dos de las principales preocupaciones de los Data Centers actuales. Con el arribo del Big Data, se genera más información año con año, incluso las predicciones más agresivas de parte del mayor fabricante de dispositivos de red se han superado debido al continuo tráfico de red generado por los sistemas de Big Data. Actualmente, uno de los más famosos y discutidos frameworks desarrollado para almacenar, recuperar y procesar la información generada consistentemente por usuarios y máquinas, Hadoop acaparó la atención de la industria en los últimos años y actualmente su nombre describe a todo un ecosistema diseñado para abordar los requisitos más variados de las aplicaciones actuales de Cloud Computing. Esta tesis profundiza sobre los clusters Hadoop, principalmente enfocada a sus interconexiones, que comúnmente se consideran el cuello de botella de dicho ecosistema. Realizamos investigaciones centradas en la eficiencia energética y también en optimizaciones de rendimiento como mejoras en el throughput de la infraestructura y de latencia de la red. En cuanto al consumo de energía, una porción significativa de un Data Center es causada por la red, representada por el 12 % de la potencia total del sistema a plena carga. Con el tráfico constantemente creciente de la red, la industria y la comunidad académica busca que el consumo energético sea proporcional a su uso. Considerando las prestaciones del cluster, a pesar de que Hadoop mantiene una carga de trabajo sensible al rendimiento de red aunque con requisitos menos estrictos sobre la latencia de la misma, existe un interés creciente en ejecutar aplicaciones interactivas y secuenciales de manera simultánea sobre dicha infraestructura. Al hacerlo, se maximiza la utilización del sistema para obtener los mayores beneficios al capital y gastos operativos. Para que esto suceda, el rendimiento del sistema no puede verse afectado cuando se minimiza la latencia de la red. Los dos mayores desafíos enfrentados durante el desarrollo de esta tesis estuvieron relacionados con lograr un consumo energético cercano a la cantidad de interconexiones y también a mejorar la latencia de red encontrada en los clusters Hadoop al tiempo que la perdida del rendimiento de la infraestructura es casi nula. Dichos desafíos llevaron a una oportunidad de tamaño semejante: proponer técnicas novedosas que resuelven dichos problemas a partir de la generación actual de clusters Hadoop. Llamamos a E-EON (Energy Efficient and Optimized Networks) al conjunto de técnicas presentadas en este trabajo. E-EON se puede utilizar para reducir el consumo de energía y la latencia de la red al mismo tiempo que el rendimiento del cluster se mejora. Además tales técnicas no son exclusivas de Hadoop y también se espera que tengan beneficios similares si se aplican a cualquier otra infraestructura de Big Data que se ajuste a la caracterización del problema que presentamos a lo largo de esta tesis. Con E-EON pudimos reducir el consumo de energía hasta en un 80% en comparación con las técnicas encontradas en la literatura actual. También pudimos reducir la latencia de la red hasta en un 85% y, en algunos casos, incluso mejorar el rendimiento del cluster en un 10%. Aunque estos fueron los dos principales logros de esta tesis, también presentamos beneficios menores que se traducen en una configuración más sencilla en comparación con las técnicas más avanzadas. Finalmente, enriquecimos las discusiones encontradas en esta tesis con recomendaciones dirigidas a los administradores de red y a los fabricantes de dispositivos de red

    Big Data and the Internet of Things

    Full text link
    Advances in sensing and computing capabilities are making it possible to embed increasing computing power in small devices. This has enabled the sensing devices not just to passively capture data at very high resolution but also to take sophisticated actions in response. Combined with advances in communication, this is resulting in an ecosystem of highly interconnected devices referred to as the Internet of Things - IoT. In conjunction, the advances in machine learning have allowed building models on this ever increasing amounts of data. Consequently, devices all the way from heavy assets such as aircraft engines to wearables such as health monitors can all now not only generate massive amounts of data but can draw back on aggregate analytics to "improve" their performance over time. Big data analytics has been identified as a key enabler for the IoT. In this chapter, we discuss various avenues of the IoT where big data analytics either is already making a significant impact or is on the cusp of doing so. We also discuss social implications and areas of concern.Comment: 33 pages. draft of upcoming book chapter in Japkowicz and Stefanowski (eds.) Big Data Analysis: New algorithms for a new society, Springer Series on Studies in Big Data, to appea

    Analyze Large Multidimensional Datasets Using Algebraic Topology

    Get PDF
    This paper presents an efficient algorithm to extract knowledge from high-dimensionality, high- complexity datasets using algebraic topology, namely simplicial complexes. Based on concept of isomorphism of relations, our method turn a relational table into a geometric object (a simplicial complex is a polyhedron). So, conceptually association rule searching is turned into a geometric traversal problem. By leveraging on the core concepts behind Simplicial Complex, we use a new technique (in computer science) that improves the performance over existing methods and uses far less memory. It was designed and developed with a strong emphasis on scalability, reliability, and extensibility. This paper also investigate the possibility of Hadoop integration and the challenges that come with the framework

    How can SMEs benefit from big data? Challenges and a path forward

    Get PDF
    Big data is big news, and large companies in all sectors are making significant advances in their customer relations, product selection and development and consequent profitability through using this valuable commodity. Small and medium enterprises (SMEs) have proved themselves to be slow adopters of the new technology of big data analytics and are in danger of being left behind. In Europe, SMEs are a vital part of the economy, and the challenges they encounter need to be addressed as a matter of urgency. This paper identifies barriers to SME uptake of big data analytics and recognises their complex challenge to all stakeholders, including national and international policy makers, IT, business management and data science communities. The paper proposes a big data maturity model for SMEs as a first step towards an SME roadmap to data analytics. It considers the ‘state-of-the-art’ of IT with respect to usability and usefulness for SMEs and discusses how SMEs can overcome the barriers preventing them from adopting existing solutions. The paper then considers management perspectives and the role of maturity models in enhancing and structuring the adoption of data analytics in an organisation. The history of total quality management is reviewed to inform the core aspects of implanting a new paradigm. The paper concludes with recommendations to help SMEs develop their big data capability and enable them to continue as the engines of European industrial and business success. Copyright © 2016 John Wiley & Sons, Ltd.Peer ReviewedPostprint (author's final draft

    Energy Efficient Ethernet on MapReduce Clusters: Packet Coalescing To Improve 10GbE Links

    Get PDF
    An important challenge of modern data centers is to reduce energy consumption, of which a substantial proportion is due to the network. Switches and NICs supporting the recent energy efficient Ethernet (EEE) standard are now available, but current practice is to disable EEE in production use, since its effect on real world application performance is poorly understood. This paper contributes to this discussion by analyzing the impact of EEE on MapReduce workloads, in terms of performance overheads and energy savings. MapReduce is the central programming model of Apache Hadoop, one of the most widely used application frameworks in modern data centers. We find that, while 1GbE links (edge links) achieve good energy savings using the standard EEE implementation, optimum energy savings in the 10 GbE links (aggregation and core links) are only possible, if these links employ packet coalescing. Packet coalescing must, however, be carefully configured in order to avoid excessive performance degradation. With our new analysis of how the static parameters of packet coalescing perform under different cluster loads, we were able to cover both idle and heavy load periods that can exist on this type of environment. Finally, we evaluate our recommendation for packet coalescing for 10 GbE links using the energy-delay metric. This paper is an extension of our previous work [1], which was published in the Proceedings of the 40th Annual IEEE Conference on Local Computer Networks (LCN 2015).This work was supported in part by the European Union’s Seventh Framework Programme (FP7/2007-2013) under Grant 610456 (EUROSERVER), in part by the Spanish Government through the Severo Ochoa programme (SEV-2011-00067 and SEV-2015-0493), in part by the Spanish Ministry of Economy a nd Competitiveness under Contract TIN2012-34557 and Contract TIN2015-65316-P, and in part by the Generalitat de Catalunya under Contract 2014-SGR-1051 and Contract 2014-SGR-1272.Peer ReviewedPostprint (author's final draft

    Exploring interconnect energy savings under East-West traffic pattern of MapReduce clusters

    Get PDF
    An important challenge of modern data centers is to reduce energy consumption, of which a substantial proportion is due to the network. Energy Efficient Ethernet (EEE) is a recent standard that aims to reduce network power consumption, but current practice is to disable it in production use, since it has a poorly understood impact on real world application performance. An important application framework commonly used in modern data centers is Apache Hadoop, which implements the MapReduce programming model. This paper is the first to analyse the impact of EEE on MapReduce workloads, in terms of performance overheads and energy savings. We find that optimum energy savings are possible if the links use packet coalescing. Packet coalescing must, however, be carefully configured in order to avoid excessive performance degradation.The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013) under grant agreement number 610456 (Euroserver). The research was also supported by the Ministry of Economy and Competitiveness of Spain under the contract TIN2012-34557, HiPEAC-3 Network of Excellence (ICT-287759), and the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.Postprint (author's final draft
    corecore