21 research outputs found

    ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ๋ฅผ ์ด์šฉํ•œ ํ•˜๋‘ก ๋งต๋ฆฌ๋“€์Šค ์„ฑ๋Šฅ ๊ฐœ์„ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2015. 2. ๊น€ํ˜•์ฃผ.๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜๊ณผ ๊ธฐ๊ธฐ๋“ค์ด ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์ธ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ƒ์„ฑํ•œ๋‹ค. ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ๋ถ„์„์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅ ๋ฐ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ธฐ์ˆ ์ด ์š”๊ตฌ๋˜๊ณ  ์žˆ๋‹ค. ๋ฐ์ดํ„ฐ ๋ถ„์„์€ ์ฃผ์–ด์ง„ ํ•˜๋“œ์›จ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ—ˆ์šฉ๋œ ๋ฒ”์œ„ ์‹œ๊ฐ„ ์•ˆ์— ์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ํ•˜๋‘ก์€ ํšจ์œจ์ ์ธ ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ๋ถ„์‚ฐ ์ €์žฅ๊ณผ ๋ถ„์‚ฐ ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์„ ์ง€์›ํ•œ๋‹ค. ๋งต๋ฆฌ๋“€์Šค๋Š” ํ•˜๋‘ก์ด ์ง€์›ํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ถ„์‚ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ๋กœ์„œ ๋‹ค์–‘ํ•œ ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ๋ณ‘๋ชฉ์œผ๋กœ ์ง€๋ชฉ๋˜๋Š” I/O ๋น„์šฉ์— ๋Œ€ํ•œ ๊ฐœ์„  ๋ฐฉ์•ˆ์„ ์ œ์‹œํ•œ๋‹ค. ๋งŽ์€ ์—ฐ๊ตฌ๊ฐ€ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์บ์‹œํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋งคํ•‘ ๋‹จ๊ณ„์˜ ๋””์Šคํฌ I/O๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ํšจ์œจ์„ฑ์„ ์ฆ๋ช…ํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์…”ํ”Œ ๋‹จ๊ณ„์˜ I/O๋ฅผ ์ค„์ด๋Š” ๋ฐฉ์•ˆ์œผ๋กœ ๋™์ผํ•œ ๋…ธ๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๋ชจ๋“  ๋งคํผ(Mapper)์˜ ๊ฒฐ๊ณผ ๊ฐ’์„ ์ธ๋ฉ”๋ชจ๋ฆฌ ์บ์‹œ์— ์ €์žฅํ•˜์—ฌ ๋…ธ๋“œ๋ณ„ ๊ฒฐ๊ณผ ๊ฐ’ ํฌ๊ธฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ(In-Node Combiner)๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ ๊ธฐ์กด ์—ฐ๊ตฌ์— ๋น„ํ•ด ๋…ธ๋“œ๊ธฐ๋ฐ˜ ์ปด๋ฐ”์ด๋„ˆ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์„ ๊ฒฝ์šฐ ๋งต๋ฆฌ๋“€์Šค ์ž‘์—…์˜ ์„ฑ๋Šฅ์ด 20% ์ด์ƒ ํ–ฅ์ƒํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.Overwhelming amount of data is being generated by various applications and devices in real-time. While advanced analysis of large dataset is in high demand, data sizes have surpassed capabilities of conventional software and hardware. Data-intensive analytics should be processed in tolerable elapsed time using commodity hardware. 
The Hadoop framework efficiently distributes large datasets over multiple commodity servers, and the MapReduce framework performs parallel computations. We discuss the I/O bottlenecks of the Hadoop MapReduce framework and propose methods for enhancing I/O performance in common MapReduce jobs. A proven approach is to cache input data to maximize the memory-locality of all map tasks. We introduce an approach to optimize I/O in the shuffle phase, the in-node combining design, which extends the scope of the traditional combiner to the node level. The in-node combiner reduces the total number of emitted intermediate results and curtails network traffic between mappers and reducers.
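The shuffle reduction described above can be illustrated with a toy word count: a traditional combiner aggregates within each map task, while an in-node combiner shares one in-memory cache across all mappers on a node, so each distinct key is emitted at most once per node. The sketch below is a minimal, hypothetical simulation of that difference in plain Python, not the thesis's Hadoop implementation.

```python
from collections import Counter

# Hypothetical scenario: three mappers on one node, each processing a text split.
splits = [
    ["hadoop", "mapreduce", "hadoop"],
    ["mapreduce", "shuffle", "hadoop"],
    ["shuffle", "shuffle", "mapreduce"],
]

# Traditional combiner: aggregate within each mapper, so every mapper emits
# one record per distinct key it saw.
per_mapper = [Counter(split) for split in splits]
traditional_emitted = sum(len(c) for c in per_mapper)

# In-node combiner: all mappers on the node share one in-memory cache, so each
# distinct key leaves the node at most once.
node_cache = Counter()
for split in splits:
    node_cache.update(split)
in_node_emitted = len(node_cache)

print(traditional_emitted, in_node_emitted)  # 7 3
```

Fewer emitted records directly translate into less shuffle I/O, which is the effect the thesis measures.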

    Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database

    Full text link
    The Apache Accumulo database excels at distributed storage and indexing and is ideally suited for storing graph data. Many big data analytics compute on graph data and persist their results back to the database. These graph calculations are often best performed inside the database server. The GraphBLAS standard provides a compact and efficient basis for a wide range of graph applications through a small number of sparse matrix operations. In this article, we implement GraphBLAS sparse matrix multiplication server-side by leveraging Accumulo's native, high-performance iterators. We compare the mathematics and performance of inner and outer product implementations, and show how an outer product implementation achieves optimal performance near Accumulo's peak write rate. We offer our work as a core component to the Graphulo library that will deliver matrix math primitives for graph analytics within Accumulo.Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
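The inner/outer product distinction can be sketched in plain Python, with coordinate dictionaries standing in for Accumulo tables (a toy illustration on assumed data, not Graphulo's iterator-based implementation). The outer-product formulation computes C as the sum over k of A(:,k) B(k,:), emitting partial sums incrementally, which is what allows it to stream results at near a write-optimized store's peak rate.

```python
from collections import defaultdict

# Sparse matrices as {(row, col): value} dicts -- toy stand-ins for Accumulo tables.
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}

def outer_product(A, B):
    """Outer-product formulation: for each shared index k, multiply column k of A
    by row k of B and accumulate the partial products. Each k yields a batch of
    partial sums that can be written out immediately."""
    cols_of_A = defaultdict(list)   # k -> [(i, value)]
    rows_of_B = defaultdict(list)   # k -> [(j, value)]
    for (i, k), v in A.items():
        cols_of_A[k].append((i, v))
    for (k, j), v in B.items():
        rows_of_B[k].append((j, v))
    C = defaultdict(float)
    for k in cols_of_A:
        for i, a in cols_of_A[k]:
            for j, b in rows_of_B.get(k, []):
                C[(i, j)] += a * b
    return dict(C)

print(outer_product(A, B))  # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

An inner-product formulation would instead gather a full row of A and column of B per output cell before emitting it, which requires random reads rather than streaming writes.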

    Systems For Delivering Electric Vehicle Data Analytics

    Get PDF
    In recent times, advances in scientific research related to electric vehicles have led to the generation of large amounts of data. This data is mostly logger data collected from various sensors in the vehicle. It is predominantly unstructured and non-relational in nature, also called Big Data. Analysis of such data needs a high-performance information technology infrastructure that provides superior computational efficiency and storage capacity. It should be scalable to accommodate the growing data and ensure its security over a network. This research proposes an architecture built on Hadoop to effectively support distributed data management over a network for real-time data collection and storage, parallel processing, and faster random read access for information retrieval for decision-making. Once imported into a database, the system can support efficient analysis and visualization of data as per user needs. These analytics can help understand correlations between data parameters under various circumstances. The system provides scalability to support future data accumulation while still performing analytics with low overhead. Overall, these open problems in EV data analytics are taken into consideration and a low-cost architecture for data management is proposed.

    Pig Squeal: Bridging Batch and Stream Processing Using Incremental Updates

    Get PDF
    As developers shift from batch MapReduce to stream processing for better latency, they are faced with the dilemma of changing tools and maintaining multiple code bases. In this work we present a method for converting arbitrary chains of MapReduce jobs into pipelined, incremental processes to be executed in a stream processing framework. Pig Squeal is an enhancement of the Pig execution framework that runs lightly modified user scripts on Storm. The contributions of this work include: an analysis that tracks how information flows through MapReduce computations, along with the influence of adding and deleting data from the input; a structure to generically handle these changes, with a description of the criteria for re-enabling the efficiencies of combiners; case studies running word count and the more complex NationMind algorithms within Squeal; and a performance model that examines the execution times of MapReduce algorithms after conversion. A general solution to converting analytics from batch to streaming lets developers with expertise in batch systems apply that expertise in a new environment. Imagine a medical researcher who develops a model for predicting emergency situations in a hospital on historical data (in a batch system); they could quickly deploy these detectors on live patient feeds. It also significantly benefits organizations with large investments in batch code by providing a tool for rapid prototyping and by significantly lowering the cost of experimenting in these new environments.
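The core idea of propagating additions and deletions through a MapReduce computation can be sketched for word count: keep the reducer state, apply signed deltas, and emit only the keys whose counts changed. This is a minimal illustration of incremental updates, not Pig Squeal's actual Storm-based machinery.

```python
from collections import Counter

# The batch word-count state becomes a Counter; each streamed update carries a
# sign: +1 for inserted input records, -1 for deleted ones.
state = Counter()

def apply_delta(state, words, sign):
    """Propagate an insertion (sign=+1) or deletion (sign=-1) through the
    word-count computation, returning only the keys whose counts changed."""
    changed = {}
    for w in words:
        state[w] += sign
        changed[w] = state[w]
    return changed

first = apply_delta(state, ["storm", "pig", "storm"], +1)   # initial batch
second = apply_delta(state, ["pig"], -1)                    # a record is retracted
print(first, second)
```

Downstream stages only see the changed keys, which is what makes the pipelined execution incremental rather than a full recomputation.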

    A resource aware distributed LSI algorithm for scalable information retrieval

    Get PDF
    Latent Semantic Indexing (LSI) is one of the popular techniques in the information retrieval field. Unlike traditional information retrieval techniques, LSI is not based simply on keyword matching; it uses statistical and algebraic computations. Based on Singular Value Decomposition (SVD), a higher-dimensional matrix is converted to a lower-dimensional approximate matrix from which noise can be filtered, and the issues of synonymy and polysemy that affect traditional techniques can be overcome by examining the terms associated with the documents. However, LSI suffers from a scalability issue due to the computational complexity of SVD. This thesis presents MR-LSI, a resource-aware distributed LSI algorithm that addresses the scalability issue using the Hadoop framework and its MapReduce distributed computing model; it also addresses the overhead introduced by the clustering algorithm involved. The evaluations indicate that MR-LSI achieves significant improvement over other strategies in processing large document collections. One remarkable advantage of Hadoop is its support for heterogeneous computing environments, which highlights the issue of unbalanced load among nodes. A load balancing algorithm based on a genetic algorithm for balancing load in a static environment is therefore proposed; the results show that it can improve cluster performance across heterogeneity levels. For dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is proposed. The algorithm works by making data-selection decisions and by modeling Hadoop parameters and working mechanisms. Employing an improved genetic algorithm to derive an optimized scheduler, it enhances the performance of a cluster at given heterogeneity levels. EThOS - Electronic Theses Online Service, United Kingdom.

    Cache-conscious Splitting of MapReduce Tasks and its Application to Stencil Computations

    Get PDF
    Modern cluster systems are typically composed of nodes with multiple processing units and memory hierarchies comprising multiple cache levels of various sizes. To leverage the full potential of these architectures it is necessary to explore concepts such as parallel programming and the layout of data in the memory hierarchy. However, the inherent complexity of these concepts and the heterogeneity of the target architectures raise several challenges at the application development and performance portability levels, respectively. Concerning parallel programming, several models and frameworks are available, of which MapReduce [16] is one of the most popular. It was developed at Google [16] for the parallel and distributed processing of large amounts of data on large clusters of commodity machines. Although very powerful tools, the reference MapReduce frameworks, such as Hadoop and Spark, do not leverage the characteristics of the underlying memory hierarchy. This shortcoming is particularly noticeable in computations that benefit from temporal locality, such as stencil computations. In this context, the goal of this thesis is to improve the performance of MapReduce computations that benefit from temporal locality. To that end, we optimize the mapping of MapReduce computations onto a machine's cache hierarchy by applying cache-aware tiling techniques. We prototyped our solution on top of the Hadoop MapReduce framework, incorporating cache-awareness in the splitting stage. To validate the solution and assess its benefits, we developed an API for expressing stencil computations on top of the developed framework. The experimental results show that, for a typical stencil computation, our solution delivers an average speed-up of 1.77 while reaching a peak speed-up of 3.2. These findings allow us to conclude that cache-aware decomposition considerably boosts the execution of this class of MapReduce computations.
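A cache-aware split can be illustrated on a one-dimensional three-point stencil: the grid is processed in fixed-size tiles, each with a one-cell halo, so a tile's working set stays cache-resident while it is updated. The sketch below uses an arbitrary tile size and an averaging kernel purely for illustration; the thesis's framework sizes splits against the actual cache hierarchy.

```python
# Cache-conscious splitting of a 1-D three-point stencil (illustrative sketch;
# in practice the tile size would be chosen to fit the target cache level).
TILE = 4

def stencil_step(grid):
    """One averaging step of u[i] <- (u[i-1] + u[i] + u[i+1]) / 3, computed
    tile by tile so each tile plus its one-cell halo stays cache-resident."""
    n = len(grid)
    out = grid[:]                       # boundary cells stay fixed
    for start in range(1, n - 1, TILE):
        end = min(start + TILE, n - 1)
        # Working set: cells [start-1, end], i.e. the tile plus its halo.
        for i in range(start, end):
            out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out

grid = [0.0] * 4 + [9.0] + [0.0] * 4
print(stencil_step(grid))  # [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0]
```

Because successive time steps revisit the same cells, keeping a tile hot in cache across iterations is where the temporal-locality gains come from.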

    Computing resources sensitive parallelization of neural networks for large scale diabetes data modelling, diagnosis and prediction

    Get PDF
    Diabetes has become one of the most severe diseases due to an increasing number of diabetes patients globally. A large amount of digital data on diabetes has been collected through various channels. How to utilize these data sets to help doctors make decisions on the diagnosis, treatment and prediction of diabetic patients poses many challenges to the research community. This thesis investigates mathematical models, with a focus on neural networks, for large-scale diabetes data modelling and analysis by utilizing modern computing technologies such as grid computing and cloud computing. These technologies provide users with an inexpensive way to access extensive computing resources over the Internet for solving data- and computationally-intensive problems. The thesis evaluates the performance of seven representative machine learning techniques in the classification of diabetes data; the results show that a neural network produces the best classification accuracy but incurs high overhead in data training. As a result, the thesis develops MRNN, a parallel neural network model based on the MapReduce programming model, which has become an enabling technology for data-intensive applications in the cloud. By partitioning the diabetic data set into a number of equally sized data blocks, the training workload is distributed among a number of computing nodes for speedup. MRNN is first evaluated in small-scale experimental environments using 12 mappers and subsequently in large-scale simulated environments using up to 1000 mappers. Both the experimental and simulation results show the effectiveness of MRNN in classification and its high scalability in data training. MapReduce does not have a sophisticated job scheduling scheme for heterogeneous computing environments in which the computing nodes may have varied computing capabilities.
For this purpose, this thesis develops a load balancing scheme based on genetic algorithms with the aim of balancing the training workload among heterogeneous computing nodes: nodes with more computing capacity receive more MapReduce jobs for execution. Divisible load theory is employed to guide the evolutionary process of the genetic algorithm toward fast convergence. The proposed load balancing scheme is evaluated in large-scale simulated MapReduce environments with varied levels of heterogeneity using different sizes of data sets. All the results show that the genetic-algorithm-based load balancing scheme significantly reduces the makespan of job execution in comparison with the time consumed without load balancing. EThOS - Electronic Theses Online Service; EPSRC; China Market Association; United Kingdom.
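A genetic-algorithm balancer of the kind described can be sketched as follows: task-to-node assignments are the chromosomes, makespan under hypothetical node speeds is the fitness, and one-point crossover plus random mutation drive the search toward the divisible-load optimum (here 28 equal tasks over nodes with total speed 7 gives an ideal makespan of 4.0). This is an illustrative toy, not the thesis's actual scheme.

```python
import random

random.seed(0)

# Hypothetical heterogeneous node speeds (tasks per unit time) and equal tasks.
speeds = [1.0, 2.0, 4.0]
n_tasks = 28

def makespan(assign):
    """Fitness: completion time of the slowest node; assign[i] is task i's node."""
    loads = [0.0] * len(speeds)
    for node in assign:
        loads[node] += 1.0 / speeds[node]
    return max(loads)

def evolve(pop_size=30, generations=200):
    """Elitist GA: keep the best half, breed children by one-point crossover
    with a single random mutation per child."""
    pop = [[random.randrange(len(speeds)) for _ in range(n_tasks)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        survivors = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(n_tasks)
            child = a[:cut] + b[cut:]
            child[random.randrange(n_tasks)] = random.randrange(len(speeds))
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(makespan(best))
```

The divisible-load bound (total tasks divided by total speed) gives the target the search converges toward; faster nodes naturally end up holding more tasks.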