21 research outputs found

    ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ๋ฅผ ์ด์šฉํ•œ ํ•˜๋‘ก ๋งต๋ฆฌ๋“€์Šค ์„ฑ๋Šฅ ๊ฐœ์„ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2015. 2. ๊น€ํ˜•์ฃผ.๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜๊ณผ ๊ธฐ๊ธฐ๋“ค์ด ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์ธ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ƒ์„ฑํ•œ๋‹ค. ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ๋ถ„์„์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅ ๋ฐ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ธฐ์ˆ ์ด ์š”๊ตฌ๋˜๊ณ  ์žˆ๋‹ค. ๋ฐ์ดํ„ฐ ๋ถ„์„์€ ์ฃผ์–ด์ง„ ํ•˜๋“œ์›จ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ—ˆ์šฉ๋œ ๋ฒ”์œ„ ์‹œ๊ฐ„ ์•ˆ์— ์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ํ•˜๋‘ก์€ ํšจ์œจ์ ์ธ ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ๋ถ„์‚ฐ ์ €์žฅ๊ณผ ๋ถ„์‚ฐ ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์„ ์ง€์›ํ•œ๋‹ค. ๋งต๋ฆฌ๋“€์Šค๋Š” ํ•˜๋‘ก์ด ์ง€์›ํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ถ„์‚ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ๋กœ์„œ ๋‹ค์–‘ํ•œ ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ๋ณ‘๋ชฉ์œผ๋กœ ์ง€๋ชฉ๋˜๋Š” I/O ๋น„์šฉ์— ๋Œ€ํ•œ ๊ฐœ์„  ๋ฐฉ์•ˆ์„ ์ œ์‹œํ•œ๋‹ค. ๋งŽ์€ ์—ฐ๊ตฌ๊ฐ€ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์บ์‹œํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋งคํ•‘ ๋‹จ๊ณ„์˜ ๋””์Šคํฌ I/O๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ํšจ์œจ์„ฑ์„ ์ฆ๋ช…ํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์…”ํ”Œ ๋‹จ๊ณ„์˜ I/O๋ฅผ ์ค„์ด๋Š” ๋ฐฉ์•ˆ์œผ๋กœ ๋™์ผํ•œ ๋…ธ๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๋ชจ๋“  ๋งคํผ(Mapper)์˜ ๊ฒฐ๊ณผ ๊ฐ’์„ ์ธ๋ฉ”๋ชจ๋ฆฌ ์บ์‹œ์— ์ €์žฅํ•˜์—ฌ ๋…ธ๋“œ๋ณ„ ๊ฒฐ๊ณผ ๊ฐ’ ํฌ๊ธฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ(In-Node Combiner)๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ ๊ธฐ์กด ์—ฐ๊ตฌ์— ๋น„ํ•ด ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์„ ๊ฒฝ์šฐ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ์„ฑ๋Šฅ์ด 20% ์ด์ƒ ํ–ฅ์ƒํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.Overwhelming amount of data is being generated by various applications and devices in real-time. While advanced analysis of large dataset is in high demand, data sizes have surpassed capabilities of conventional software and hardware. Data-intensive analytics should be processed in tolerable elapsed time using commodity hardware. 
The Hadoop framework efficiently distributes large datasets over multiple commodity servers, and the MapReduce framework performs parallel computations. We discuss the I/O bottlenecks of the Hadoop MapReduce framework and propose methods for enhancing I/O performance in common MapReduce jobs. A proven approach is to cache input data to maximize the memory-locality of all map tasks. We introduce an approach to optimize I/O in the shuffle phase, the in-node combining design, which extends the scope of the traditional combiner to the node level. The in-node combiner reduces the total number of emitted intermediate results and curtails network traffic between mappers and reducers.
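The shuffle reduction described above can be illustrated with a toy word count: a traditional combiner aggregates within each map task, while an in-node combiner shares one in-memory cache across all mappers on a node, so each distinct key is emitted at most once per node. The sketch below is a minimal, hypothetical simulation of that difference in plain Python, not the thesis's Hadoop implementation.

```python
from collections import Counter

# Hypothetical scenario: three mappers on one node, each processing a text split.
splits = [
    ["hadoop", "mapreduce", "hadoop"],
    ["mapreduce", "shuffle", "hadoop"],
    ["shuffle", "shuffle", "mapreduce"],
]

# Traditional combiner: aggregate within each mapper, so every mapper emits
# one record per distinct key it saw.
per_mapper = [Counter(split) for split in splits]
traditional_emitted = sum(len(c) for c in per_mapper)

# In-node combiner: all mappers on the node share one in-memory cache, so each
# distinct key leaves the node at most once.
node_cache = Counter()
for split in splits:
    node_cache.update(split)
in_node_emitted = len(node_cache)

print(traditional_emitted, in_node_emitted)  # 7 3
```

Fewer emitted records directly translate into less shuffle I/O, which is the effect the thesis measures.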

    Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database

    Full text link
    The Apache Accumulo database excels at distributed storage and indexing and is ideally suited for storing graph data. Many big data analytics compute on graph data and persist their results back to the database. These graph calculations are often best performed inside the database server. The GraphBLAS standard provides a compact and efficient basis for a wide range of graph applications through a small number of sparse matrix operations. In this article, we implement GraphBLAS sparse matrix multiplication server-side by leveraging Accumulo's native, high-performance iterators. We compare the mathematics and performance of inner and outer product implementations, and show how an outer product implementation achieves optimal performance near Accumulo's peak write rate. We offer our work as a core component to the Graphulo library that will deliver matrix math primitives for graph analytics within Accumulo.Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
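The inner/outer product distinction can be sketched in plain Python, with coordinate dictionaries standing in for Accumulo tables (a toy illustration on assumed data, not Graphulo's iterator-based implementation). The outer-product formulation computes C as the sum over k of A(:,k) B(k,:), emitting partial sums incrementally, which is what allows it to stream results at near a write-optimized store's peak rate.

```python
from collections import defaultdict

# Sparse matrices as {(row, col): value} dicts -- toy stand-ins for Accumulo tables.
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}

def outer_product(A, B):
    """Outer-product formulation: for each shared index k, multiply column k of A
    by row k of B and accumulate the partial products. Each k yields a batch of
    partial sums that can be written out immediately."""
    cols_of_A = defaultdict(list)   # k -> [(i, value)]
    rows_of_B = defaultdict(list)   # k -> [(j, value)]
    for (i, k), v in A.items():
        cols_of_A[k].append((i, v))
    for (k, j), v in B.items():
        rows_of_B[k].append((j, v))
    C = defaultdict(float)
    for k in cols_of_A:
        for i, a in cols_of_A[k]:
            for j, b in rows_of_B.get(k, []):
                C[(i, j)] += a * b
    return dict(C)

print(outer_product(A, B))  # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

An inner-product formulation would instead gather a full row of A and column of B per output cell before emitting it, which requires random reads rather than streaming writes.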

    Systems For Delivering Electric Vehicle Data Analytics

    Get PDF
    In recent times, advances in scientific research related to electric vehicles have led to the generation of large amounts of data. This data is mostly logger data collected from various sensors in the vehicle. It is predominantly unstructured and non-relational in nature, also called Big Data. Analysis of such data needs a high-performance information technology infrastructure that provides superior computational efficiency and storage capacity. It should be scalable to accommodate the growing data and ensure its security over a network. This research proposes an architecture built on Hadoop to effectively support distributed data management over a network for real-time data collection and storage, parallel processing, and faster random read access for information retrieval for decision-making. Once imported into a database, the system can support efficient analysis and visualization of data as per user needs. These analytics can help understand correlations between data parameters under various circumstances. The system provides scalability to support future data accumulation while still performing analytics with low overhead. Overall, these open problems in EV data analytics are taken into consideration and a low-cost architecture for data management is proposed.

    Pig Squeal: Bridging Batch and Stream Processing Using Incremental Updates

    Get PDF
    As developers shift from batch MapReduce to stream processing for better latency, they are faced with the dilemma of changing tools and maintaining multiple code bases. In this work we present a method for converting arbitrary chains of MapReduce jobs into pipelined, incremental processes to be executed in a stream processing framework. Pig Squeal is an enhancement of the Pig execution framework that runs lightly modified user scripts on Storm. The contributions of this work include: an analysis that tracks how information flows through MapReduce computations, along with the influence of adding and deleting data from the input; a structure to generically handle these changes, with a description of the criteria for re-enabling the efficiencies of combiners; case studies running word count and the more complex NationMind algorithms within Squeal; and a performance model that examines the execution times of MapReduce algorithms after conversion. A general solution to converting analytics from batch to streaming lets developers with expertise in batch systems apply that expertise in a new environment. Imagine a medical researcher who develops a model for predicting emergency situations in a hospital on historical data (in a batch system); they could quickly deploy these detectors on live patient feeds. It also significantly benefits organizations with large investments in batch code by providing a tool for rapid prototyping and by significantly lowering the cost of experimenting in these new environments.
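The core idea of propagating additions and deletions through a MapReduce computation can be sketched for word count: keep the reducer state, apply signed deltas, and emit only the keys whose counts changed. This is a minimal illustration of incremental updates, not Pig Squeal's actual Storm-based machinery.

```python
from collections import Counter

# The batch word-count state becomes a Counter; each streamed update carries a
# sign: +1 for inserted input records, -1 for deleted ones.
state = Counter()

def apply_delta(state, words, sign):
    """Propagate an insertion (sign=+1) or deletion (sign=-1) through the
    word-count computation, returning only the keys whose counts changed."""
    changed = {}
    for w in words:
        state[w] += sign
        changed[w] = state[w]
    return changed

first = apply_delta(state, ["storm", "pig", "storm"], +1)   # initial batch
second = apply_delta(state, ["pig"], -1)                    # a record is retracted
print(first, second)
```

Downstream stages only see the changed keys, which is what makes the pipelined execution incremental rather than a full recomputation.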

    A resource aware distributed LSI algorithm for scalable information retrieval

    Get PDF
    Latent Semantic Indexing (LSI) is one of the popular techniques in the information retrieval field. Unlike traditional information retrieval techniques, LSI is not based simply on keyword matching; it uses statistical and algebraic computations. Based on Singular Value Decomposition (SVD), a higher-dimensional matrix is converted to a lower-dimensional approximate matrix from which noise can be filtered, and the issues of synonymy and polysemy that affect traditional techniques can be overcome by examining the terms associated with the documents. However, LSI suffers from a scalability issue due to the computational complexity of SVD. This thesis presents MR-LSI, a resource-aware distributed LSI algorithm that addresses the scalability issue using the Hadoop framework and its MapReduce distributed computing model; it also addresses the overhead introduced by the clustering algorithm involved. The evaluations indicate that MR-LSI achieves significant improvement over other strategies in processing large document collections. One remarkable advantage of Hadoop is its support for heterogeneous computing environments, which highlights the issue of unbalanced load among nodes. A load balancing algorithm based on a genetic algorithm for balancing load in a static environment is therefore proposed; the results show that it can improve cluster performance across heterogeneity levels. For dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is proposed. The algorithm works by making data-selection decisions and by modeling Hadoop parameters and working mechanisms. Employing an improved genetic algorithm to derive an optimized scheduler, it enhances the performance of a cluster at given heterogeneity levels. EThOS - Electronic Theses Online Service, United Kingdom.

    Cache-conscious Splitting of MapReduce Tasks and its Application to Stencil Computations

    Get PDF
    Modern cluster systems are typically composed of nodes with multiple processing units and memory hierarchies comprising multiple cache levels of various sizes. To leverage the full potential of these architectures it is necessary to explore concepts such as parallel programming and the layout of data in the memory hierarchy. However, the inherent complexity of these concepts and the heterogeneity of the target architectures raise several challenges at the application development and performance portability levels, respectively. Concerning parallel programming, several models and frameworks are available, of which MapReduce [16] is one of the most popular. It was developed at Google [16] for the parallel and distributed processing of large amounts of data on large clusters of commodity machines. Although very powerful tools, the reference MapReduce frameworks, such as Hadoop and Spark, do not leverage the characteristics of the underlying memory hierarchy. This shortcoming is particularly noticeable in computations that benefit from temporal locality, such as stencil computations. In this context, the goal of this thesis is to improve the performance of MapReduce computations that benefit from temporal locality. To that end, we optimize the mapping of MapReduce computations onto a machine's cache hierarchy by applying cache-aware tiling techniques. We prototyped our solution on top of the Hadoop MapReduce framework, incorporating cache-awareness in the splitting stage. To validate the solution and assess its benefits, we developed an API for expressing stencil computations on top of the developed framework. The experimental results show that, for a typical stencil computation, our solution delivers an average speed-up of 1.77 while reaching a peak speed-up of 3.2. These findings allow us to conclude that cache-aware decomposition considerably boosts the execution of this class of MapReduce computations.
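A cache-aware split can be illustrated on a one-dimensional three-point stencil: the grid is processed in fixed-size tiles, each with a one-cell halo, so a tile's working set stays cache-resident while it is updated. The sketch below uses an arbitrary tile size and an averaging kernel purely for illustration; the thesis's framework sizes splits against the actual cache hierarchy.

```python
# Cache-conscious splitting of a 1-D three-point stencil (illustrative sketch;
# in practice the tile size would be chosen to fit the target cache level).
TILE = 4

def stencil_step(grid):
    """One averaging step of u[i] <- (u[i-1] + u[i] + u[i+1]) / 3, computed
    tile by tile so each tile plus its one-cell halo stays cache-resident."""
    n = len(grid)
    out = grid[:]                       # boundary cells stay fixed
    for start in range(1, n - 1, TILE):
        end = min(start + TILE, n - 1)
        # Working set: cells [start-1, end], i.e. the tile plus its halo.
        for i in range(start, end):
            out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out

grid = [0.0] * 4 + [9.0] + [0.0] * 4
print(stencil_step(grid))  # [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0]
```

Because successive time steps revisit the same cells, keeping a tile hot in cache across iterations is where the temporal-locality gains come from.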

    Computing resources sensitive parallelization of neural networks for large scale diabetes data modelling, diagnosis and prediction

    Get PDF
    Diabetes has become one of the most severe diseases due to an increasing number of diabetes patients globally. A large amount of digital data on diabetes has been collected through various channels. How to utilize these data sets to help doctors make decisions on the diagnosis, treatment and prediction of diabetic patients poses many challenges to the research community. This thesis investigates mathematical models, with a focus on neural networks, for large-scale diabetes data modelling and analysis by utilizing modern computing technologies such as grid computing and cloud computing. These technologies provide users with an inexpensive way to access extensive computing resources over the Internet for solving data- and computationally-intensive problems. The thesis evaluates the performance of seven representative machine learning techniques in the classification of diabetes data; the results show that a neural network produces the best classification accuracy but incurs high overhead in data training. As a result, the thesis develops MRNN, a parallel neural network model based on the MapReduce programming model, which has become an enabling technology for data-intensive applications in the cloud. By partitioning the diabetic data set into a number of equally sized data blocks, the training workload is distributed among a number of computing nodes for speedup. MRNN is first evaluated in small-scale experimental environments using 12 mappers and subsequently in large-scale simulated environments using up to 1000 mappers. Both the experimental and simulation results show the effectiveness of MRNN in classification and its high scalability in data training. MapReduce does not have a sophisticated job scheduling scheme for heterogeneous computing environments in which the computing nodes may have varied computing capabilities.
For this purpose, this thesis develops a load balancing scheme based on genetic algorithms with the aim of balancing the training workload among heterogeneous computing nodes: nodes with more computing capacity receive more MapReduce jobs for execution. Divisible load theory is employed to guide the evolutionary process of the genetic algorithm toward fast convergence. The proposed load balancing scheme is evaluated in large-scale simulated MapReduce environments with varied levels of heterogeneity using different sizes of data sets. All the results show that the genetic-algorithm-based load balancing scheme significantly reduces the makespan of job execution in comparison with the time consumed without load balancing. EThOS - Electronic Theses Online Service; EPSRC; China Market Association; United Kingdom.
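A genetic-algorithm balancer of the kind described can be sketched as follows: task-to-node assignments are the chromosomes, makespan under hypothetical node speeds is the fitness, and one-point crossover plus random mutation drive the search toward the divisible-load optimum (here 28 equal tasks over nodes with total speed 7 gives an ideal makespan of 4.0). This is an illustrative toy, not the thesis's actual scheme.

```python
import random

random.seed(0)

# Hypothetical heterogeneous node speeds (tasks per unit time) and equal tasks.
speeds = [1.0, 2.0, 4.0]
n_tasks = 28

def makespan(assign):
    """Fitness: completion time of the slowest node; assign[i] is task i's node."""
    loads = [0.0] * len(speeds)
    for node in assign:
        loads[node] += 1.0 / speeds[node]
    return max(loads)

def evolve(pop_size=30, generations=200):
    """Elitist GA: keep the best half, breed children by one-point crossover
    with a single random mutation per child."""
    pop = [[random.randrange(len(speeds)) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        survivors = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(n_tasks)
            child = a[:cut] + b[cut:]
            child[random.randrange(n_tasks)] = random.randrange(len(speeds))
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(makespan(best))
```

The divisible-load bound (total tasks divided by total speed) gives the target the search converges toward; faster nodes naturally end up holding more tasks.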