
    Content-aware partial compression for textual big data analysis in Hadoop

    A substantial amount of information in companies and on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. Compression is an effective means of reducing data size and has been employed by many emerging data analytic platforms, where its main purpose is to save storage space and reduce data transmission cost over the network. Since general-purpose compression methods strive for higher compression ratios by leveraging data transformation techniques and contextual data, this context dependency forces access to the compressed data to be sequential. Processing such compressed data in parallel, as is desirable in a distributed environment, is therefore extremely challenging. This work proposes techniques for more efficient textual big data analysis, with an emphasis on content-aware compression schemes suitable for the Hadoop analytic platform. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of public and private real-world datasets. In comparison with existing solutions, they show substantial improvement in performance and a significant reduction in system resource requirements.
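
    The core idea described above, compressing only the informational content while leaving functional content such as record and field delimiters untouched, can be illustrated with a small sketch. The Python example below is an assumption-based toy, not the content-aware scheme from the work itself: it uses zlib plus Base64 per field purely to show that the delimiters, and therefore the record structure, survive compression and remain visible to line-oriented tools.

```python
# Toy sketch of content-aware partial compression (not the paper's scheme):
# only the informational payload of each field is encoded, while functional
# content (tabs and newlines) stays as plain text, so line- and field-oriented
# tools can still split and parse the records. A real scheme would use a
# shared dictionary coder; zlib + Base64 is used here only for illustration.
import base64
import zlib


def compress_record(line: str, field_sep: str = "\t") -> str:
    """Compress each field's payload but keep the delimiters untouched."""
    fields = line.rstrip("\n").split(field_sep)
    encoded = [
        base64.b64encode(zlib.compress(f.encode("utf-8"))).decode("ascii")
        for f in fields
    ]
    # The Base64 alphabet contains no tabs or newlines, so the delimiters
    # remain unambiguous in the compressed record.
    return field_sep.join(encoded) + "\n"


def decompress_record(line: str, field_sep: str = "\t") -> str:
    fields = line.rstrip("\n").split(field_sep)
    decoded = [zlib.decompress(base64.b64decode(f)).decode("utf-8") for f in fields]
    return field_sep.join(decoded) + "\n"


if __name__ == "__main__":
    record = "user42\t2015-03-01\tthe quick brown fox jumps over the lazy dog\n"
    assert decompress_record(compress_record(record)) == record
    print(compress_record(record), end="")
```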

    Distributed Partitioning and Processing of Large Spatial Datasets

    Data collection is one of the most common practices in today’s world. The data collection rate has rapidly increased over the past decade and shows no signs of decline. Data sources are many: Internet of Things devices, mobile gadgets, social media posts, connected cars, and web servers constantly report on their users’ interactions and habits. Much of the collected data is spatial data, which contains attributes that denote the physical origin of the data. As a result of the tremendous growth in data collection, demand has emerged for new techniques to efficiently process the data and extract valuable insights within an acceptable time frame. The current standard approach to large-scale data analysis uses distributed parallel processing systems such as Apache Hadoop and Apache Spark. However, these systems are designed for general-purpose parallel processing and require an additional layer to recognize and efficiently process spatial datasets. Motivated by its many applications, we examine several challenges facing spatial data partitioning and processing and propose solutions customized for each task. We detail our techniques for building spatial partitioners over large datasets for use with spatial queries such as map-matching and kNN spatial join. Additionally, we present an accuracy benchmarking framework for comparing and classifying the results of two input files based on specific criteria. Our proposed work targets batch processing of large spatial datasets, including structured, unstructured, and semi-structured datasets.
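
    As a rough illustration of the partitioning step, the sketch below builds a uniform grid partitioner over a set of points, one simple way to group nearby records into partitions before running queries such as a kNN spatial join. The grid approach and the cell size are assumptions chosen for the example; they are not the partitioners proposed in the thesis.

```python
# Illustrative grid-based spatial partitioner: points are grouped by the grid
# cell that contains them, and each cell becomes one partition that can be
# processed independently by a distributed spatial query. The uniform grid
# and cell size are assumptions for this example only.
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def grid_cell(p: Point, cell_size: float) -> Cell:
    """Map a point to the integer id of the grid cell that contains it."""
    return (int(p[0] // cell_size), int(p[1] // cell_size))


def partition(points: List[Point], cell_size: float) -> Dict[Cell, List[Point]]:
    """Group points by grid cell; each cell becomes one spatial partition."""
    cells: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        cells[grid_cell(p, cell_size)].append(p)
    return cells


if __name__ == "__main__":
    pts = [(0.5, 0.2), (1.7, 0.9), (0.4, 1.1), (3.3, 3.8)]
    for cell, members in sorted(partition(pts, cell_size=1.0).items()):
        print(cell, members)
```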

    Market Basket Analysis Algorithm with MapReduce Using HDFS

    Market basket analysis techniques are substantially important to everyday business decisions. The traditional single-processor, main-memory-based computing approach is not capable of handling the ever-increasing volume of transactional data. The MapReduce approach has become popular for computing huge volumes of data, and existing sequential algorithms can be converted into the MapReduce framework for big data. This paper presents a Market Basket Analysis (MBA) algorithm with MapReduce on Hadoop that generates the complete set of maximal frequent itemsets. The algorithm sorts the data sets and converts them into (key, value) pairs to fit the MapReduce concept. The framework sorts the outputs of the maps, which are then input to the “reduce” tasks. The experimental results show that the MapReduce implementation improves performance as more nodes are added, until it reaches saturation.
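
    The (key, value) formulation described above can be sketched in a Hadoop-Streaming style: a mapper emits keys from each transaction, the framework sorts the map output, and a reducer aggregates counts per key. The example below counts co-occurring item pairs as a stand-in for the paper's maximal frequent itemset mining; the transaction format and the absence of a support threshold are assumptions made for illustration.

```python
# Hadoop-Streaming-style sketch of the MapReduce (key, value) pattern used in
# market basket analysis. The mapper emits candidate item pairs from each
# transaction, the framework sorts the map output by key, and the reducer sums
# the counts of equal keys. Pair counting stands in here for the full maximal
# frequent itemset mining described in the paper.
from itertools import combinations


def mapper(lines):
    """Each input line is one transaction: comma-separated item names."""
    for line in lines:
        items = sorted(set(line.strip().split(",")))
        for a, b in combinations(items, 2):
            yield f"{a},{b}", 1


def reducer(sorted_pairs):
    """Pairs arrive sorted by key, so equal keys are adjacent."""
    current, total = None, 0
    for key, count in sorted_pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += count
    if current is not None:
        yield current, total


if __name__ == "__main__":
    transactions = ["milk,bread,butter", "bread,butter", "milk,bread"]
    shuffled = sorted(mapper(transactions))  # stands in for the shuffle/sort step
    for key, count in reducer(shuffled):
        print(key, count)
```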

    Content-aware compression for big textual data analysis

    A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content, and only the informational content is compressed. Thus, the compressed data remains transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free, bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they show substantial improvement in performance and a significant reduction in system resource requirements.
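
    One of the properties listed above, splitting compressed data safely with respect to logical records, can be illustrated by framing the output as independently decodable blocks that always end on a record boundary. The sketch below is a simplified, assumption-based illustration rather than the thesis's extended compression formats: it writes length-prefixed zlib blocks so that any block can be decompressed without its neighbours.

```python
# Rough sketch of record-aware splittable compression: the input is written as
# a sequence of independently decodable blocks, each aligned to a logical
# record (line) boundary, so a distributed file system can hand different
# blocks to different workers. The framing format and block size are
# assumptions for illustration only.
import struct
import zlib

BLOCK_RECORDS = 1000  # records per compressed block (illustrative)


def compress_splittable(lines, out):
    """Write [4-byte length][zlib payload] frames, each ending on a full line."""
    buf = []
    for i, line in enumerate(lines, 1):
        buf.append(line)
        if i % BLOCK_RECORDS == 0:
            _flush(buf, out)
            buf = []
    if buf:
        _flush(buf, out)


def _flush(buf, out):
    payload = zlib.compress("".join(buf).encode("utf-8"))
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)


def read_blocks(inp):
    """Each frame decompresses independently, so blocks can be split safely."""
    while True:
        header = inp.read(4)
        if not header:
            break
        (length,) = struct.unpack(">I", header)
        yield zlib.decompress(inp.read(length)).decode("utf-8")


if __name__ == "__main__":
    import io

    data = [f"record {i}\n" for i in range(2500)]
    out = io.BytesIO()
    compress_splittable(data, out)
    out.seek(0)
    assert "".join(read_blocks(out)) == "".join(data)
    print("round trip ok")
```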

    Deep Data Locality on Apache Hadoop

    The amount of data being collected in areas such as social media, networks, scientific instruments, mobile devices, and sensors is growing continuously, and the technology to process it is also advancing rapidly. One of the fundamental technologies for processing big data is Apache Hadoop, which has been adopted by many commercial products, such as IBM's InfoSphere and Cloudera's distribution of Spark. MapReduce on Hadoop has been widely used in many data science applications. Since Hadoop is a dominant big data processing platform, the performance of MapReduce on Hadoop has a significant impact on big data processing capability across multiple industries. Most of the research on improving the speed of big data analysis has focused on Hadoop modules such as Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop Yet Another Resource Negotiator (YARN), and Hadoop MapReduce. In this research, we focus on data locality in HDFS to improve the performance of MapReduce. MapReduce exploits data locality to reduce the amount of data transfer. However, even though the majority of the processing cost occurs in the later stages, data locality has been exploited only in the early stages, a situation we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We explore a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Specifically, we introduce two implementation methods of DDL: block-based DDL and key-based DDL. In block-based DDL, the data blocks are pre-arranged to reduce block copying time in two ways. First, Rack-Local Map (RLM) blocks are eliminated: under the conventional default block placement policy (DBPP), data blocks are placed randomly on any available slave nodes, which requires copying RLM blocks; block-based DDL places blocks so as to avoid RLMs and thus reduces block copy time. Second, block-based DDL concentrates the blocks on a smaller number of nodes, reducing the data transfer time among them. We analyzed the block distribution with customer review data from TripAdvisor and measured performance with the TeraSort benchmark. Our test results show that the execution times of the Map and Shuffle stages improved by up to 25% and 31%, respectively. In key-based DDL, the input data is divided into several blocks and stored in HDFS before entering the Map stage. In contrast to conventional blocks, which hold random keys, each of our blocks holds a unique key. This requires pre-sorting the key-value pairs, which can be done during the ETL process. It eliminates some data movement in the Map, Shuffle, and Reduce stages and thereby improves performance. In our experiments, MapReduce with key-based DDL performed 21.9% faster than default MapReduce and 13.3% faster than MapReduce with block-based DDL. Additionally, key-based DDL can be combined with other methods to further improve performance: when key-based DDL and block-based DDL were combined, Hadoop performance improved by 34.4%. In this research, we also developed MapReduce workflow models based on a novel computational model, together with a numerical simulator that integrates these models. The model faithfully predicts Hadoop performance under various conditions.
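
    The key-based DDL idea, pre-arranging key-value pairs by key range during ETL so that each HDFS block holds a single key range, can be sketched as a simple range partitioner. The boundaries, record format, and in-memory grouping below are assumptions made for illustration and are a simplification of the method developed in this research.

```python
# Illustrative sketch of the key-based pre-arrangement behind key-based DDL:
# key-value pairs are range-partitioned and sorted by key during an assumed
# ETL step, so that each block holds one key range and later map/shuffle
# stages move less data. This is a simplification, not the thesis's method.
from bisect import bisect_right
from collections import defaultdict


def key_based_blocks(pairs, boundaries):
    """Assign each (key, value) pair to the block covering its key range."""
    blocks = defaultdict(list)
    for key, value in pairs:
        blocks[bisect_right(boundaries, key)].append((key, value))
    # Sorting within each block mimics the pre-sorting done during ETL.
    return {block_id: sorted(kvs) for block_id, kvs in blocks.items()}


if __name__ == "__main__":
    pairs = [("hotel_9", 3), ("hotel_2", 5), ("hotel_5", 1), ("hotel_7", 4)]
    boundaries = ["hotel_4", "hotel_8"]  # illustrative range split points
    for block_id, kvs in sorted(key_based_blocks(pairs, boundaries).items()):
        print(block_id, kvs)
```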