
    Overview of Caching Mechanisms to Improve Hadoop Performance

    In today's distributed computing environments, large amounts of data are generated from different sources at high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits such as flexibility, scalability, and high fault tolerance; however, it faces challenges in terms of data access time, I/O operations, and duplicate computations, resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance the data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.
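    The abstract gives no implementation details for H-SVM-LRU or HIC; purely as a hypothetical illustration of the underlying idea of steering LRU eviction with an SVM that predicts block reuse, consider the Python sketch below. The class name, feature layout, and training data are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch: an LRU cache whose eviction is biased by an SVM
# that predicts whether a cached block will be reused. All names,
# features, and training data are illustrative assumptions.
from collections import OrderedDict

import numpy as np
from sklearn.svm import SVC


class SVMAssistedLRU:
    def __init__(self, capacity, classifier):
        self.capacity = capacity
        self.clf = classifier        # trained SVC: features -> reuse label
        self.cache = OrderedDict()   # key -> (value, features), LRU order

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # standard LRU bookkeeping
            return self.cache[key][0]
        return None

    def put(self, key, value, features):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self._evict()
        self.cache[key] = (value, features)

    def _evict(self):
        # Scan from least to most recently used and evict the first
        # entry the SVM predicts will NOT be reused; if the SVM expects
        # everything to be reused, fall back to plain LRU.
        for key, (_, feats) in self.cache.items():
            if self.clf.predict([feats])[0] == 0:
                del self.cache[key]
                return
        self.cache.popitem(last=False)


# Toy usage: train on synthetic (access_count, seconds_since_access)
# features, where label 1 means "likely to be reused".
X = np.array([[1, 90], [2, 80], [9, 10], [8, 20]])
y = np.array([0, 0, 1, 1])
cache = SVMAssistedLRU(capacity=2, classifier=SVC().fit(X, y))
cache.put("blk_1", b"...", [9, 10])   # predicted reusable
cache.put("blk_2", b"...", [1, 90])   # predicted not reusable
cache.put("blk_3", b"...", [2, 80])   # evicts blk_2 rather than blk_1
```

    The point of the sketch is only the division of labor: LRU ordering supplies the eviction candidates, while the classifier vetoes evictions of blocks it expects to be reused.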

    Minimize Execution Latency in Hadoop Using SVM


    LenticularFS: Scalable filesystem for the cloud

    The Hadoop platform is the most common solution to handle the explosion of big data that both companies and research institutions are facing. In order to store such data, the Hadoop platform provides HDFS, a scalable distributed filesystem, which runs on commodity hardware and enables linear scalability by adding new storage nodes. While the storage capacity of the system can be increased by adding new storage nodes, the component that handles metadata for the filesystem, the namenode, is a single point of failure and cannot easily be replaced or linearly scaled. The Hops project provides an alternative implementation of the namenode, which increases performance and scalability by storing metadata on an external distributed NewSQL database called MySQL Cluster. With the new architecture, the system is much more scalable and can transparently manage the failover of namenodes, which are now stateless components. HopsFS is, however, still limited to running within a single datacenter, which can cause severe outages if the entire datacenter becomes unavailable. Cloud-native storage systems, such as Amazon’s Simple Storage Service (S3), solve this problem by replicating data across different, geographically distant datacenters, so that the failure of any given zone does not cause data unavailability. The objective of this thesis is to enable HopsFS to work across geographical regions while, as far as possible, maintaining the semantics of a POSIX-style hierarchical filesystem. We leverage the asynchronous replication functionality provided by MySQL Cluster to replicate metadata across geographical regions, and we present a detailed analysis of how to maintain the consistency properties of HDFS in such an environment. Furthermore, we analyze the issue of split-brain scenarios and propose a way for namenodes to detect this condition and continue operating correctly. Finally, we discuss the changes to the codebase that are required to implement the proposed plan.
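    As a rough, hypothetical sketch of the kind of split-brain safeguard the abstract alludes to (not HopsFS's or MySQL Cluster's actual interface), a stateless namenode could fence itself whenever it can no longer reach a strict majority of the metadata database's replicas. The probing mechanism, function names, and addresses below are assumptions.

```python
# Hypothetical quorum check a namenode might run before serving writes.
# Probe details and addresses are illustrative assumptions.
import socket


def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Probe one metadata-database replica with a plain TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_serve_writes(replicas: list[tuple[str, int]]) -> bool:
    """Serve writes only while a strict majority of replicas answer.

    A namenode stranded on the minority side of a network partition
    fails this check and fences itself instead of letting the two
    sides of the partition diverge.
    """
    up = sum(reachable(host, port) for host, port in replicas)
    return 2 * up > len(replicas)


# Toy usage with placeholder addresses for three replicas:
replicas = [("10.0.0.1", 1186), ("10.0.1.1", 1186), ("10.0.2.1", 1186)]
if not can_serve_writes(replicas):
    print("minority partition suspected: refusing writes")
```

    A majority rule of this shape guarantees that at most one side of any partition keeps accepting writes, which is the property a split-brain detector needs.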

    Improving Hadoop Performance by Using Metadata of Related Jobs in Text Datasets Via Enhancing MapReduce Workflow

    Cloud Computing provides different services to users for processing data. One of the main concepts in Cloud Computing is BigData and BigData analysis. BigData is complex, unstructured, or very large data. Hadoop is a tool or environment that is used to process BigData in parallel. The idea behind Hadoop is that, rather than sending data to servers for processing, Hadoop divides a job into small tasks and sends them to the servers that already hold the data; these servers process the tasks and send the results back to the master node. Hadoop has some limitations that could be addressed to achieve higher performance in executing jobs. These limitations mostly stem from data locality in the cluster, job and task scheduling, CPU execution time, or resource allocation in Hadoop. Data locality and efficient resource allocation remain challenges for the MapReduce platform in cloud computing. We propose an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. At the same time, the proposed architecture addresses the issue of resource allocation in native Hadoop. The proposed architecture provides an efficient distributed clustering approach for dedicated cloud computing environments. The enhanced Hadoop architecture leverages the NameNode’s ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding controlling features to the NameNode, it can intelligently direct and assign tasks to the DataNodes that contain the required data. Our focus is on extracting features and building a metadata table that carries information about the existence and the location of the data blocks in the cluster. This enables the NameNode to direct jobs to specific DataNodes without going through the whole data set in the cluster, as the sketch below illustrates. It should be noted that the newly built lookup table is an addition to the metadata table that already exists in native Hadoop. Our development is about processing real text in text datasets that might be human-readable, such as books, or not, such as DNA datasets. To test the performance of the proposed architecture, we perform DNA sequence matching and alignment of various short genome sequences. Compared with native Hadoop, the proposed Hadoop reduced CPU time, the number of read operations, input data size, and other factors.
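    The abstract does not specify how the lookup table is keyed; as a purely illustrative Python sketch (the feature choice, class names, and routing API are assumptions), one could index the k-mers of each data block so that the NameNode forwards a matching job only to DataNodes whose blocks can possibly contain the query pattern.

```python
# Hypothetical feature-based lookup table mapping content features
# (here, k-mers of a text/DNA block) to the DataNodes holding blocks
# that contain them. All names and the feature choice are assumptions.
from collections import defaultdict

K = 8  # illustrative k-mer length for DNA-like text


def extract_features(block_text: str, k: int = K) -> set[str]:
    """All length-k substrings present in a data block."""
    return {block_text[i:i + k] for i in range(len(block_text) - k + 1)}


class FeatureLookupTable:
    def __init__(self):
        self.index = defaultdict(set)   # feature -> {datanode ids}

    def register_block(self, datanode: str, block_text: str):
        for feat in extract_features(block_text):
            self.index[feat].add(datanode)

    def route_query(self, pattern: str) -> set[str]:
        """DataNodes that could contain `pattern`; tasks go only to
        these instead of scanning every block in the cluster."""
        feats = extract_features(pattern)
        if not feats:
            return set()
        # A block must contain every k-mer of the pattern to match it.
        # (Patterns spanning block boundaries would need overlap
        # handling, omitted in this sketch.)
        candidates = self.index.get(next(iter(feats)), set()).copy()
        for feat in feats:
            candidates &= self.index.get(feat, set())
        return candidates


# Toy usage with two placeholder DataNodes:
table = FeatureLookupTable()
table.register_block("dn1", "ACGTACGTTTAGGC")
table.register_block("dn2", "GGGCCCAAATTTAC")
print(table.route_query("ACGTACGT"))   # -> {'dn1'}
```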

    Hadoop MapReduce for Mobile Cloud

    The new generations of mobile devices have high processing power and storage, but they lag behind in terms of software systems for big data storage and processing. Hadoop is a scalable platform that provides distributed storage and computational capabilities on clusters of commodity hardware. Building Hadoop on a mobile network enables the devices to run data-intensive computing applications without direct knowledge of the underlying distributed systems' complexities. However, these applications have severe energy and reliability constraints (e.g., caused by unexpected device failures or topology changes in a dynamic network). As mobile devices are more susceptible to unauthorized access than traditional servers, security is also a concern for sensitive data. Hence, it is paramount to consider reliability, energy efficiency, and security for such applications. The goal of this thesis is to bring the Hadoop MapReduce framework to a mobile cloud environment such that it resolves these bottlenecks in big data processing. The Mobile Distributed File System (MDFS) addresses these issues for big data processing in mobile clouds. We have developed the Hadoop MapReduce framework over MDFS and have evaluated its performance by varying input workloads in a real heterogeneous mobile cluster. Our evaluation shows that the implementation addresses all constraints in processing large amounts of data in mobile clouds. Thus, our system is a viable solution to meet the growing demands of data processing in a mobile environment.