31 research outputs found

    Performance evaluation of Map-reduce jar pig hive and spark with machine learning using big data

    Big data is one of the biggest challenges today, as it requires huge processing power and good algorithms to support decision making. We need a Hadoop environment with Pig, Hive, machine learning, and the other Hadoop ecosystem components. The data comes from industry, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies for solving the problem of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyse the processing speed for 4 GB of data on CloudxLab using Hadoop MapReduce with varying numbers of mappers and reducers, Pig scripts, Hive queries, and a Spark environment combined with machine learning. From the results we can say that machine learning with Hadoop enhances processing performance along with Spark, that Spark is better than Hadoop MapReduce, Pig, and Hive, and that Spark with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar
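    The benchmark varies the number of mappers and reducers for the same 4 GB workload and compares MapReduce against Pig, Hive, and Spark. As a minimal sketch of how such a timing comparison might be scripted (the HDFS paths, the word-count workload, and the mapper/reducer scripts are assumptions, not the authors' actual jobs), one could time a Hadoop streaming run at several reducer counts against a PySpark equivalent:

```python
# Hedged sketch: time a Hadoop streaming job with varying reducer counts
# against a PySpark equivalent. Paths, scripts, and the word-count workload
# are illustrative assumptions, not the paper's actual benchmark.
import subprocess, time

HDFS_INPUT = "/data/input_4gb"          # assumed HDFS path for the 4 GB dataset
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"

def time_mapreduce(num_reducers):
    """Run a streaming word count with a given number of reducers."""
    start = time.time()
    subprocess.run([
        "hadoop", "jar", STREAMING_JAR,
        "-D", f"mapreduce.job.reduces={num_reducers}",
        "-input", HDFS_INPUT,
        "-output", f"/out/mr_{num_reducers}",
        "-mapper", "wc_mapper.py",      # assumed to be available on the nodes
        "-reducer", "wc_reducer.py",
    ], check=True)
    return time.time() - start

def time_spark():
    """Run the same word count with Spark for comparison."""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("wordcount-bench").getOrCreate()
    start = time.time()
    counts = (spark.sparkContext.textFile(HDFS_INPUT)
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("/out/spark_wc")
    elapsed = time.time() - start
    spark.stop()
    return elapsed

for r in (2, 4, 8):
    print(f"MapReduce, {r} reducers: {time_mapreduce(r):.1f}s")
print(f"Spark: {time_spark():.1f}s")
```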

    On the Energy Efficiency of MapReduce Shuffling Operations in Data Centers

    This paper aims to quantitatively measure the impact of different data center networking topologies on the performance and energy efficiency of shuffling operations in MapReduce. Mixed Integer Linear Programming (MILP) models are utilized to optimize the shuffling in several data center topologies with electronic, hybrid, and all-optical switching while maximizing the throughput and reducing the power consumption. The results indicate that the networking topology has a significant impact on the performance of MapReduce. They also indicate that, with comparable performance, optical-based data centers can achieve an average of 54% reduction in energy consumption when compared to electronic switching data centers.
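    The MILP models jointly decide where shuffle traffic is routed and how much power the switching fabric draws. A toy model in the same spirit (illustrative only; the flows, capacities, and power coefficients below are assumptions, not values from the paper) can be written with PuLP: each mapper-to-reducer flow is assigned to either an electronic or an optical fabric so that total power is minimized subject to fabric capacity.

```python
# Toy MILP sketch (not the paper's full model): route MapReduce shuffle
# flows over either an electronic or an optical fabric, minimizing power
# subject to per-fabric capacity. All coefficients are assumed values.
import pulp

# Shuffle flow demands in Gbps for mapper->reducer pairs (assumed).
flows = {"m1_r1": 10, "m1_r2": 6, "m2_r1": 8, "m2_r2": 4}

# Candidate switching fabrics: capacity (Gbps) and power cost (W per Gbps).
fabrics = {"electronic": {"capacity": 20, "watts_per_gbps": 5.0},
           "optical":    {"capacity": 20, "watts_per_gbps": 2.3}}

prob = pulp.LpProblem("shuffle_power", pulp.LpMinimize)

# x[f][k] = 1 if flow f is carried by fabric k.
x = {f: {k: pulp.LpVariable(f"x_{f}_{k}", cat="Binary") for k in fabrics}
     for f in flows}

# Objective: total switching power for the shuffle phase.
prob += pulp.lpSum(flows[f] * fabrics[k]["watts_per_gbps"] * x[f][k]
                   for f in flows for k in fabrics)

# Each flow is routed over exactly one fabric.
for f in flows:
    prob += pulp.lpSum(x[f][k] for k in fabrics) == 1

# Fabric capacity constraints keep the shuffle throughput feasible.
for k in fabrics:
    prob += pulp.lpSum(flows[f] * x[f][k] for f in flows) <= fabrics[k]["capacity"]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for f in flows:
    chosen = next(k for k in fabrics if x[f][k].value() == 1)
    print(f, "->", chosen)
print("total shuffle power (W):", pulp.value(prob.objective))
```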

    Analysis of Combining the Delay Scheduling and Fair Share Scheduling Algorithms with Several Job Characteristics on Hadoop

    Abstract: Scheduling in Hadoop is the mechanism for managing every job submitted to a Hadoop system so that all jobs get a turn to be executed on the available resources. The default Hadoop scheduler is FIFO, in which the first job to arrive is executed immediately and is entitled to monopolize an entire resource. However, FIFO penalizes short jobs whenever a long job is being executed. Delay-improved Fair Share is a job scheduler that divides the jobs of a cluster into several pools and, within each pool, delays the launch of subsequent jobs in order to improve data locality. Near-optimal resource sharing and data allocation affect the Job Fail Rate, Job Throughput, and Average Completion Time. Delay-improved Fair Share performs more effectively than Fair Share and Delay Scheduling for randomtextwriter jobs on .txt data, with a 0.3% reduction in job fail rate, a throughput of 2.73 jobs/minute, and completion 273.59 minutes faster than Delay Scheduling and 128.15 minutes faster than Fair Share. Keywords: hadoop, hadoop multi-node, Fair share improve Delay Scheduling improve capacity, Delay Scheduling improve capacity
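    The combined scheduler keeps the fair-share pools but applies delay scheduling inside them: a job offered a slot on a node that holds none of its input data may skip the offer a bounded number of times before accepting a non-local task. The following is a simplified Python sketch of that decision rule (the skip threshold, fair-share weights, and job/node names are assumptions, not the authors' implementation):

```python
# Toy sketch of the delay-scheduling rule layered on fair sharing:
# when a node frees a slot, the job farthest below its fair share is
# offered the slot, but it may skip the offer (up to MAX_SKIPS times)
# if none of its pending tasks have input data on that node.
# Illustrative simplification, not the paper's implementation.

MAX_SKIPS = 3  # assumed delay threshold (number of offers a job may skip)

class Job:
    def __init__(self, name, pending_tasks, fair_share):
        self.name = name
        self.pending_tasks = pending_tasks  # each task: set of preferred nodes
        self.fair_share = fair_share        # slots this job "deserves"
        self.running = 0
        self.skips = 0

def assign_slot(node, jobs):
    """Pick a task for a free slot on `node`, preferring data-local tasks."""
    # Fair share: consider jobs in order of how far below their share they are.
    for job in sorted(jobs, key=lambda j: j.running / max(j.fair_share, 1)):
        local = [t for t in job.pending_tasks if node in t]
        if local:
            job.pending_tasks.remove(local[0])
            job.running += 1
            job.skips = 0
            return job.name, "local"
        if job.skips >= MAX_SKIPS and job.pending_tasks:
            # Delay exhausted: accept a non-local task rather than starve.
            job.pending_tasks.pop(0)
            job.running += 1
            return job.name, "non-local"
        job.skips += 1
    return None

jobs = [Job("short", [{"node1"}, {"node2"}], fair_share=1),
        Job("long", [{"node3"}] * 5, fair_share=2)]
print(assign_slot("node1", jobs))   # short job gets a data-local task
print(assign_slot("node3", jobs))   # long job, now below its share, goes next
```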

    Proposed Energy Aware Scheduling Algorithm in Data Center by using Map Reduce

    The majority of large-scale data-intensive applications executed by data centers are based on MapReduce or its open-source implementation, Hadoop. Such applications are executed on large clusters requiring large amounts of energy, making the energy costs a considerable fraction of the data center's overall costs. Therefore, minimizing the energy consumption when executing each MapReduce job is a critical concern for data centers. We propose a framework for improving the energy efficiency of MapReduce applications while satisfying the service level agreement (SLA). We first model the problem of energy-aware scheduling of a single MapReduce job as an Integer Program. We then propose two heuristic algorithms, called Energy-aware MapReduce Scheduling Algorithms (EMRSA-I and EMRSA-II), that find the assignments of map and reduce tasks to the machine slots in order to minimize the energy consumed when executing the application. We perform extensive experiments on a Hadoop cluster to determine the energy consumption and execution time for several workloads from the HiBench benchmark suite, including TeraSort, PageRank, and K-means Clustering, and then use this data in an extensive simulation study to analyze the performance of the proposed algorithms. The results show that EMRSA-I and EMRSA-II are able to find near-optimal job schedules consuming approximately 40% less energy on average than the schedules obtained by a common practice scheduler that minimizes the makespan.
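    EMRSA-I and EMRSA-II assign map and reduce tasks to machine slots so that energy is minimized while the SLA is met. The sketch below is a deliberately simple greedy placement in the same spirit, not the paper's heuristics; the machine power ratings, task sizes, and deadline are illustrative assumptions:

```python
# Toy greedy sketch of energy-aware task placement (illustrative only,
# not the EMRSA-I/II heuristics): each task goes to the feasible machine
# slot with the lowest energy cost, provided the slot's finish time stays
# within the SLA deadline. All numbers below are assumed.

# Per-machine characteristics: power draw (W) and relative speed.
machines = {"m1": {"power": 200, "speed": 1.0},
            "m2": {"power": 100, "speed": 0.6},
            "m3": {"power": 330, "speed": 1.5}}

tasks = {"map1": 100, "map2": 80, "reduce1": 150}   # work units (assumed)
SLA_DEADLINE = 260.0                                 # seconds (assumed)

finish = {m: 0.0 for m in machines}                  # accumulated load per machine
schedule = {}

# Place the largest tasks first, each on the cheapest feasible slot.
for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
    best, best_energy, best_runtime = None, float("inf"), 0.0
    for m, spec in machines.items():
        runtime = work / spec["speed"]
        energy = runtime * spec["power"]
        if finish[m] + runtime <= SLA_DEADLINE and energy < best_energy:
            best, best_energy, best_runtime = m, energy, runtime
    if best is None:
        raise RuntimeError(f"{task} cannot meet the SLA deadline")
    schedule[task] = best
    finish[best] += best_runtime

print(schedule)
print(finish)
```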