    Model and simulation of power consumption and power saving potential of energy efficient cluster hardware

    In recent years the power consumption of high performance computing clusters has become a growing problem, because the number and size of cluster installations has risen and is still rising. The high power consumption of these clusters results from their main goal: high performance. At low utilization the cluster hardware consumes nearly as much energy as when it is fully utilized. In these low-utilization phases the hardware could theoretically be turned off or switched to a lower-power mode. In this thesis a model is designed to estimate the power consumption of the hardware with and without energy-saving mechanisms. With the resulting software it is possible to estimate the cluster power consumption for different configurations of a parallel program. Furthermore, energy-aware hardware can be simulated to determine an upper bound for energy savings without performance loss. The results show that there is great energy-saving potential for energy-aware hardware even in high performance computing. This potential should motivate research into mechanisms for controlling energy-aware hardware in high performance clusters.
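
    The kind of estimate such a model produces can be sketched with a toy calculation. The wattages, node count, and utilization trace below are illustrative assumptions, not figures from the thesis; they only show how an idle low-power mode bounds the possible savings.

```python
# Toy estimate of cluster energy with and without an idle low-power mode.
# All figures are illustrative assumptions, not values from the thesis.

P_BUSY = 250.0    # W per node when fully utilized (assumed)
P_IDLE = 200.0    # W per node when idle but powered on (assumed)
P_SLEEP = 15.0    # W per node in a low-power state (assumed)
NODES = 100
PHASE_HOURS = 1.0

# Fraction of nodes busy in each phase of a hypothetical workload trace.
utilization = [1.0, 0.9, 0.3, 0.1, 0.1, 0.6, 1.0, 0.2]

def energy_kwh(use_sleep_mode: bool) -> float:
    """Sum energy over all phases; idle nodes either stay powered on or sleep."""
    total_wh = 0.0
    for u in utilization:
        idle_power = P_SLEEP if use_sleep_mode else P_IDLE
        total_wh += (u * NODES * P_BUSY + (1.0 - u) * NODES * idle_power) * PHASE_HOURS
    return total_wh / 1000.0

baseline = energy_kwh(use_sleep_mode=False)
energy_aware = energy_kwh(use_sleep_mode=True)
print(f"baseline: {baseline:.1f} kWh, energy-aware: {energy_aware:.1f} kWh, "
      f"upper-bound saving: {100.0 * (1.0 - energy_aware / baseline):.1f}%")
```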

    A storage architecture for data-intensive computing

    The assimilation of computing into our daily lives is enabling the generation of data at unprecedented rates. In 2008, IDC estimated that the "digital universe" contained 486 exabytes of data [9]. The computing industry is being challenged to develop methods for the cost-effective processing of data at these large scales. The MapReduce programming model has emerged as a scalable way to perform data-intensive computations on commodity cluster computers. Hadoop is a popular open-source implementation of MapReduce. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem --- HDFS --- is written in Java and designed for portability across heterogeneous hardware and software platforms. The efficiency of a Hadoop cluster depends heavily on the performance of this underlying storage system. This thesis is the first to analyze the interactions between Hadoop and storage. It describes how the user-level Hadoop filesystem, instead of efficiently capturing the full performance potential of the underlying cluster hardware, actually degrades application performance significantly. Architectural bottlenecks in the Hadoop implementation result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Further, HDFS implicitly makes assumptions about how the underlying native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. Methods to eliminate these bottlenecks in HDFS are proposed and evaluated both in terms of their application performance improvement and impact on the portability of the Hadoop framework. In addition to improving the performance and efficiency of the Hadoop storage system, this thesis also focuses on improving its flexibility. The goal is to allow Hadoop to coexist in cluster computers shared with a variety of other applications through the use of virtualization technology. The introduction of virtualization breaks the traditional Hadoop storage architecture, where persistent HDFS data is stored on local disks installed directly in the computation nodes. To overcome this challenge, a new flexible network-based storage architecture is proposed, along with changes to the HDFS framework. Network-based storage enables Hadoop to operate efficiently in a dynamic virtualized environment and furthers the spread of the MapReduce parallel programming model to new applications.
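
    One way to see how task-scheduling delays translate into inefficient HDFS usage is a back-of-the-envelope utilization model. The block size, disk bandwidth, and scheduling gap below are illustrative assumptions, not measurements from the thesis; they merely show how idle gaps between block reads erode effective storage bandwidth.

```python
# Toy model: effective HDFS read bandwidth when the framework leaves the disk
# idle between tasks. All figures are illustrative assumptions.

DISK_BW_MBPS = 100.0   # sequential read bandwidth of one local disk (assumed)
BLOCK_MB = 64.0        # data processed per map task, one HDFS block (assumed)

def effective_bandwidth(sched_gap_s: float) -> float:
    """Bandwidth seen by the application when each block read is followed by
    an idle gap while the framework schedules the next task."""
    read_time_s = BLOCK_MB / DISK_BW_MBPS
    return BLOCK_MB / (read_time_s + sched_gap_s)

gap_s = 2.0  # assumed scheduling delay before the next task starts
print(f"raw disk:  {DISK_BW_MBPS:.0f} MB/s")
print(f"with gaps: {effective_bandwidth(gap_s):.1f} MB/s "
      f"({100.0 * effective_bandwidth(gap_s) / DISK_BW_MBPS:.0f}% of raw)")
```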

    GreenHDFS: data-centric and cyber-physical energy management system for big data clouds

    The explosion in Big Data has led to a rapid increase in the popularity of Big Data analytics. With the increase in the sheer volume of data that needs to be stored and processed, the storage and computing demands of Big Data analytics workloads are growing exponentially, leading to a surge in extremely large-scale Big Data cloud platforms and resulting in burgeoning energy costs and environmental impact. The sheer size of Big Data lends it significant data-movement inertia, and that, coupled with the network bandwidth constraints inherent in the cloud's cost-efficient, scale-out economic paradigm, makes data-locality a necessity for high performance in Big Data environments. Instead of sending data to the computations, as has been the norm, computations are sent to the data to take advantage of the higher data-local performance. State-of-the-art run-time energy management techniques are job-centric in nature and rely on thermal- and energy-aware job placement, job consolidation, or job migration to derive energy cost savings. Unfortunately, the data-locality requirement of the compute model limits the applicability of these techniques: they are inherently data-placement-agnostic and provide energy savings only at significant performance cost in the Big Data environment. Big Data analytics clusters have moved away from the shared network-attached storage (NAS) or storage area network (SAN) model to a completely clustered, commodity storage model that allows a direct access path between the storage servers and the clients in the interest of high scalability and performance. The underlying storage system distributes file chunks and replicas across the servers for high performance, load-balancing, and resiliency. However, with files distributed across all servers, any server may be participating in the reading, writing, or computation of a file chunk at any time. Such a storage model complicates scale-down-based power management by making it hard to generate significant periods of idleness in Big Data analytics clusters. GreenHDFS is based on the observation that data needs to be a first-class object in energy management in Big Data environments to allow high data access performance. GreenHDFS takes a novel data-centric, cyber-physical approach to reduce compute (i.e., server) and cooling operating energy costs. On the physical side, GreenHDFS is cognizant that all servers are not alike in the Big Data analytics cloud and is aware of the variations in the thermal profiles of the servers. On the cyber side, GreenHDFS is aware that all data is not alike and knows the differences in the data semantics (i.e., computational job arrival rate, size, popularity, and evolution life spans) of the data placed in the Big Data analytics cloud. Armed with this cyber-physical knowledge, and coupled with its insights, predictive data models, and run-time information, GreenHDFS performs proactive, cyber-physical, thermal- and energy-aware file placement and data-classification-driven scale-down, which implicitly results in thermal- and energy-aware job placement in the Big Data analytics cloud compute model. GreenHDFS's data-centric energy and thermal management approach results in a reduction in energy costs without any associated performance impact, allows scale-down of a subset of servers in spite of the unique challenges Big Data analytics clouds pose to scale-down, and ensures thermal reliability of the servers in the cluster.
    GreenHDFS evaluation results with one-month-long real-world traces from a production Big Data analytics cluster at Yahoo! show up to a 59% reduction in cooling energy costs while performing 9x better than state-of-the-art data-agnostic cooling techniques, up to a 26% reduction in server operating energy costs, and a significant reduction in the total cost of ownership (TCO) of the Big Data analytics cluster. GreenHDFS provides a software-based mechanism to increase energy-proportionality even with non-energy-proportional server components. Free cooling, or air- and water-side economization (i.e., using outside air or natural water resources to cool the data center), is gaining popularity as it can result in significant cooling energy cost savings. There is also a drive towards increasing the cooling set point of the cooling systems to make them more efficient. If the ambient temperature of the outside air or the cooling set point temperature is high, the inlet temperatures of the servers rise, which reduces their ability to dissipate computational heat and results in an increase in server temperatures. Servers are rated to operate safely only within a certain temperature range, beyond which failure rates increase. GreenHDFS considers the differences in the thermal-reliability-driven load-tolerance upper bounds of the servers in its predictive thermal-aware file placement and places file chunks in a manner that ensures server temperatures do not exceed the temperature upper bound. Thus, by ensuring thermal reliability at all times and by lowering the overall temperature of the servers, GreenHDFS enables data centers to enjoy the energy-saving economizer mode for longer periods of time and also enables an increase in the cooling set point. A substantial number of data centers still rely fully on traditional air conditioning. These data centers cannot always be retrofitted with economizer modes or hot- and cold-aisle air containment, as incorporating the economizer and air containment may require space for duct-work and heat exchangers that may not be available in the data center. Existing data centers may also not be favorably located geographically; air-side economization is more viable in locations where ambient air temperatures are low for most of the year and humidity is in the tolerable range. GreenHDFS provides a software-based approach to enhance the cooling efficiency of such traditional data centers, as it lowers the overall temperature in the cluster, makes the thermal profile much more uniform, and reduces hot-air recirculation, resulting in lowered cooling energy costs.
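
    The data-centric placement idea can be sketched, in very simplified form, as choosing a target server for each chunk from the data's temperature (hot vs. cold) and each server's thermal headroom, while never exceeding a reliability bound. The server attributes, threshold, and selection rule below are illustrative assumptions, not GreenHDFS's actual policy.

```python
# Simplified sketch of data-centric, thermal-aware chunk placement.
# Server temperatures, the threshold, and the selection rule are assumptions
# for illustration; they are not GreenHDFS's actual algorithm.

from dataclasses import dataclass

TEMP_UPPER_BOUND_C = 35.0   # thermal-reliability limit per server (assumed)

@dataclass
class Server:
    name: str
    predicted_temp_c: float   # predicted temperature after placing the chunk
    in_cold_zone: bool        # servers eligible for scale-down

servers = [
    Server("s1", 31.0, in_cold_zone=False),
    Server("s2", 34.5, in_cold_zone=False),
    Server("s3", 27.0, in_cold_zone=True),
]

def place_chunk(servers, data_is_hot: bool) -> Server:
    """Hot (frequently computed-on) data goes to the coolest active server;
    cold data is concentrated in the scale-down zone. Servers predicted to
    exceed the thermal upper bound are never chosen."""
    safe = [s for s in servers if s.predicted_temp_c < TEMP_UPPER_BOUND_C]
    zone = [s for s in safe if s.in_cold_zone == (not data_is_hot)]
    candidates = zone or safe
    return min(candidates, key=lambda s: s.predicted_temp_c)

print(place_chunk(servers, data_is_hot=True).name)   # -> s1 (coolest active server)
print(place_chunk(servers, data_is_hot=False).name)  # -> s3 (cold zone, scale-down candidate)
```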

    Systems support for genomics computing in cloud environments

    Genomics research has enormous applications in many areas such as health care, forensics, and agriculture. Most recent achievements in this field come from the availability of unprecedented genomic data. However, new sequencing technologies in genomics keep producing data at a faster pace, resulting in very large amounts of data. This poses great challenges in how to store, manage, process, and analyze the data efficiently. To deal with these challenges, genomics research groups often equip themselves with a small-scale server room composed of machines with high storage capacity and computing ability. This solution is not only costly and unscalable but also inefficient. A better solution is Cloud Computing, with its elasticity and pay-as-you-go economic model. Nevertheless, Cloud Computing only provides a potential infrastructure solution. To address the high-throughput processing challenges, we need a suitable programming model. The fundamental idea is to process data in parallel. Among existing models, MapReduce appears to be the best candidate because of its extreme scalability. In this work, we develop a domain-specific system to support data management and analysis in genomics using Cloud Computing and MapReduce. Starting from the application layer, we developed a fundamental alignment tool called CloudAligner based on the MapReduce framework that outperformed its counterparts. After that, we continued seeking solutions to improve the system at the infrastructure level. Observing that scientists spend too much time accessing data from low-speed archives (tapes), we developed the Distributed Disk Cache (DiSK), which was covered in a Master's thesis. Another challenge is to enable the system to support differentiated services, which are prevalent in Cloud Computing. To address this, we proposed a Differentiated Replication (DiR) mechanism allowing data to be inserted and retrieved with different availability. Another problem that greatly reduces the performance of the system is the heterogeneity of the Cloud. To tame it, we created an Open Reputation model called Opera. It employs vectors to record the behaviors (reputations) of nodes from different aspects. We modified the Hadoop MapReduce scheduler to make use of this information. The results proved that under heterogeneous environments, our system is better than the original Hadoop in terms of job execution time, number of failed/killed tasks, and energy consumption. The last challenge we have dealt with is data movement, since the data in our targeted domain (genomics) is extremely large and is generated at an exponential rate. We divided the issue into two categories: internal and external movement. We have successfully developed a cached system to minimize internal data movement and an easy-to-use tool called SPBD to handle external data movement with minimal response time.
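
    The differentiated-replication (DiR) idea of serving different availability classes can be sketched as mapping a requested availability target to a replication factor. The per-replica availability figure and the service classes below are illustrative assumptions, not the actual DiR design.

```python
# Toy differentiated-replication calculator: choose the smallest replication
# factor whose expected availability meets a requested service class.
# The per-replica availability figure is an illustrative assumption.

REPLICA_AVAILABILITY = 0.99   # probability a single replica is reachable (assumed)

def replicas_for(target_availability: float, max_replicas: int = 10) -> int:
    """Smallest r such that 1 - (1 - p)^r meets the target availability."""
    for r in range(1, max_replicas + 1):
        if 1.0 - (1.0 - REPLICA_AVAILABILITY) ** r >= target_availability:
            return r
    return max_replicas

for service_class, target in [("best-effort", 0.95),
                              ("standard", 0.999),
                              ("critical", 0.99999)]:
    print(f"{service_class:>11}: target {target} -> {replicas_for(target)} replica(s)")
```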

    FAWNdamentally Power-Efficient Clusters

    As a power-efficient alternative for data-intensive computing, we propose a cluster architecture called a Fast Array of Wimpy Nodes, or FAWN. A FAWN consists of a large number of slower but efficient nodes that each draw only a few watts of power, coupled with low-power storage; our prototype FAWN nodes are built from 500MHz embedded devices with CompactFlash storage that are typically used as wireless routers, Internet gateways, or thin clients. Through our preliminary evaluation, we demonstrate that a FAWN can be up to six times more efficient than traditional systems with Flash storage in terms of queries per joule for seek-bound applications, and between two and eight times more efficient for I/O throughput-bound applications.
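
    The queries-per-joule metric can be made concrete with a back-of-the-envelope comparison. The wattages and query rates below are illustrative assumptions chosen only to mirror the reported ratio, not the paper's measured numbers.

```python
# Back-of-the-envelope queries-per-joule comparison between a wimpy FAWN-style
# node and a traditional server. All figures are illustrative assumptions.

def queries_per_joule(queries_per_sec: float, power_watts: float) -> float:
    """Joules are watt-seconds, so queries/J = queries/s divided by watts."""
    return queries_per_sec / power_watts

fawn_node = queries_per_joule(queries_per_sec=288.0, power_watts=4.0)        # assumed wimpy node
traditional = queries_per_joule(queries_per_sec=3000.0, power_watts=250.0)   # assumed server

print(f"FAWN node:          {fawn_node:.1f} queries/J")
print(f"traditional server: {traditional:.1f} queries/J")
print(f"ratio:              {fawn_node / traditional:.1f}x")
```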