
    Evaluation of Hadoop/Mapreduce Framework Migration Tools

    In distributed systems, database migration is not an easy task. Companies encounter challenges when moving data, including legacy data, to a big data platform. This paper reviews several tools for migrating from traditional databases to the big data platform and, based on that review, suggests a migration model.
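
    The abstract does not name the specific tools reviewed; as one common, illustrative approach to this kind of migration (not necessarily one of the tools evaluated in the paper), a relational table can be pulled over JDBC with Spark and landed on HDFS as Parquet. The connection URL, credentials, table name and output path below are placeholders, and the appropriate JDBC driver is assumed to be on the Spark classpath.

    # Illustrative sketch only: copy a relational table onto HDFS with PySpark's JDBC reader.
    # The JDBC URL, credentials, table name and HDFS path are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-hdfs-migration").getOrCreate()

    legacy_df = (spark.read.format("jdbc")
                 .option("url", "jdbc:mysql://legacy-db:3306/sales")  # hypothetical source
                 .option("dbtable", "orders")
                 .option("user", "etl_user")
                 .option("password", "***")
                 .load())

    # Land the data on the big data platform as columnar Parquet files.
    legacy_df.write.mode("overwrite").parquet("hdfs:///warehouse/migrated/orders")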

    Hive on spark and MapReduce: a methodology for parameter tuning

    Project work presented as a partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management. As the era of “big data” has arrived, more and more companies have started using distributed file systems to manage and process their data streams, such as the Hadoop Distributed File System (HDFS). This software library offers a way to store large files across multiple machines, and large data sets are processed with its inherent programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost of up to 10 times for certain applications, while maintaining automatic fault tolerance. To leverage data warehouse capabilities on Hadoop, Apache Hive was introduced. It is a Big Data analytics layer that works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve the execution time of queries; applying it sped up a real-world batch-processing query by a factor of five. Moreover, it gives insights into the underlying reasons for this improvement by using the Apache Spark monitoring tools. The results can be helpful for practitioners and researchers who would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.
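
    The abstract does not list the specific parameters the tuning methodology adjusts; the sketch below only illustrates the kind of session-level, executor-sizing settings that Hive-on-Spark tuning typically targets, issued through PyHive. The host, table and values are assumptions, not recommendations from the project work.

    # Illustrative sketch, not the project's methodology: session-level Hive-on-Spark
    # settings issued through PyHive. Host, table and values are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="hive-server", port=10000, username="analyst")
    cur = conn.cursor()

    # Run queries on the Spark execution engine instead of classic MapReduce.
    cur.execute("SET hive.execution.engine=spark")

    # Executor sizing is the usual tuning lever; these numbers are purely illustrative.
    cur.execute("SET spark.executor.instances=8")
    cur.execute("SET spark.executor.cores=4")
    cur.execute("SET spark.executor.memory=6g")
    cur.execute("SET spark.serializer=org.apache.spark.serializer.KryoSerializer")

    cur.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id")
    rows = cur.fetchall()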

    Towards Efficient Resource Provisioning in Hadoop

    Considering the recent exponential growth in the amount of information processed as Big Data, the high energy consumed by data processing engines in datacenters has become a major issue, underlining the need for efficient resource allocation and more energy-efficient computing. This thesis proposes the Best Trade-off Point (BToP) method, which provides a general approach and techniques, based on an algorithm with mathematical formulas, for finding the best trade-off point on an elbow curve of performance versus resources for efficient resource provisioning in Hadoop MapReduce and Apache Spark. Our novel BToP method is expected to work for any application or system that relies on a trade-off curve with an elbow shape, non-inverted or inverted, for making good decisions. This method for optimal resource provisioning was not previously available to the scientific, computing, and economic communities. To illustrate the effectiveness of the BToP method on the ubiquitous Hadoop MapReduce, our Terasort experiment shows that the number of task resources recommended by the BToP algorithm is consistently accurate and optimal compared with those suggested by three popular rules of thumb. We also test the BToP method on the emerging cluster computing framework Apache Spark running in YARN cluster mode. Despite the effectiveness of Spark’s robust and sophisticated built-in dynamic resource allocation mechanism, which is not available in MapReduce, the BToP method still consistently outperforms it in our Spark-Bench Terasort test results. The performance efficiency gained from the BToP method not only leads to significant energy savings but also improves overall system throughput and prevents cluster underutilization in a multi-tenant environment. In general, the BToP method is preferable for workloads with identical resource consumption signatures in production environments, where job profiling for behavioral replication leads to the most efficient resource provisioning.
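
    The BToP formulas themselves are not reproduced in the abstract; a common generic way to locate the elbow of a runtime-versus-resources curve (not necessarily the published BToP algorithm) is to pick the point farthest from the chord joining the curve's endpoints, as sketched below with hypothetical Terasort-style measurements.

    # Generic elbow-point sketch (not the published BToP algorithm): choose the point
    # on a runtime-vs-resources curve farthest from the line joining its endpoints.
    import numpy as np

    def elbow_point(resources, runtimes):
        """Return the resource count at the curve's elbow."""
        x = np.asarray(resources, dtype=float)
        y = np.asarray(runtimes, dtype=float)
        # Normalise both axes so distances are comparable.
        xn = (x - x.min()) / (x.max() - x.min())
        yn = (y - y.min()) / (y.max() - y.min())
        # Perpendicular distance of each point from the chord between the endpoints.
        p0, p1 = np.array([xn[0], yn[0]]), np.array([xn[-1], yn[-1]])
        chord = (p1 - p0) / np.linalg.norm(p1 - p0)
        vecs = np.stack([xn, yn], axis=1) - p0
        proj = np.outer(vecs @ chord, chord)
        dist = np.linalg.norm(vecs - proj, axis=1)
        return resources[int(np.argmax(dist))]

    # Hypothetical measurements: runtime flattens out after about 32 tasks.
    tasks   = [4, 8, 16, 32, 64, 128]
    runtime = [1200, 700, 420, 260, 245, 240]
    print(elbow_point(tasks, runtime))  # -> 32 for these illustrative numbers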

    Efficient Parallel Processing of k-Nearest Neighbor Queries by Using a Centroid-based and Hierarchical Clustering Algorithm

    The k-Nearest Neighbor method is one of the most popular techniques for both classification and regression. Because of the way it operates, its application may be limited to problems with a modest number of instances, particularly when run time is a consideration. However, the classification of large amounts of data has become a fundamental task in many real-world applications, so it is logical to scale the k-Nearest Neighbor method to large-scale datasets. This paper proposes a new k-Nearest Neighbor classification method (KNN-CCL) which uses a parallel centroid-based and hierarchical clustering algorithm to partition the training dataset into multiple parts. The introduced clustering algorithm uses four stages of successive refinement and generates high-quality clusters, which the k-Nearest Neighbor approach subsequently uses to predict labels for the test datasets. Finally, a set of experiments is conducted on UCI datasets. The experimental results confirm that the proposed k-Nearest Neighbor classification method performs well with regard to both classification accuracy and performance.
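
    The four-stage clustering refinement is not detailed in the abstract; the sketch below only illustrates the general cluster-partitioned k-NN idea, with plain k-means from scikit-learn standing in for the paper's centroid-based hierarchical clustering algorithm.

    # Illustrative cluster-partitioned k-NN (not KNN-CCL itself): KMeans stands in for
    # the paper's four-stage centroid-based hierarchical clustering.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    def fit_partitioned_knn(X_train, y_train, n_clusters=10, k=5):
        """Split the training set into clusters and fit one k-NN model per cluster."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
        models = []
        for c in range(n_clusters):
            mask = km.labels_ == c
            knn = KNeighborsClassifier(n_neighbors=min(k, int(mask.sum())))
            knn.fit(X_train[mask], y_train[mask])
            models.append(knn)
        return km, models

    def predict_partitioned_knn(km, models, X_test):
        """Route each test point to its nearest centroid and query only that cluster."""
        clusters = km.predict(X_test)
        preds = np.empty(len(X_test), dtype=object)
        for c in np.unique(clusters):
            idx = clusters == c
            preds[idx] = models[c].predict(X_test[idx])
        return preds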

    Doctor of Philosophy

    Dissertation. The Active Traffic and Demand Management (ATDM) initiative aims to integrate various management strategies and control measures so as to achieve mobility, environmental and sustainability goals. To support the active monitoring and management of real-world complex traffic conditions, the first objective of this dissertation is to develop a travel time reliability estimation and prediction methodology that can inform decisions by management and operations agencies and by travelers. A systematic modeling framework was developed for a corridor with multiple bottlenecks, and a series of closed-form formulas was derived to quantify the travel time distribution under both stochastic demand and stochastic capacity, with possible on-ramp and off-ramp flow changes. Traffic state estimation techniques are often used to guide operational management decisions, and accurate traffic estimates are critically needed in ATDM applications designed to reduce instability, volatility and emissions in the transportation system. By capturing the essential forward and backward wave propagation characteristics under possible random measurement errors, this dissertation proposes a unified representation with a simple but theoretically sound explanation for traffic observations under free-flow, congested and dynamic transient conditions. This study also presents a linear programming model to quantify the value of traffic measurements in a heterogeneous data environment with fixed sensors, Bluetooth readers and GPS sensors. It is important to design comprehensive traffic control measures that can systematically address deteriorating congestion and environmental issues. To better evaluate and assess the mobility and environmental benefits of transportation improvement plans, this dissertation also discusses a cross-resolution modeling framework for integrating a microscopic emission model with an existing mesoscopic traffic simulation model. A simplified car-following-model-based vehicle trajectory construction method is used to generate high-resolution vehicle trajectory profiles and the resulting emission output. In addition, this dissertation discusses a number of important issues for a cloud computing-based software system implementation. A prototype of a reliability-based traveler information provision and dissemination system is developed to offer a rich set of travel reliability information to the general public and to traffic management and planning organizations.
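
    The closed-form travel time formulas are not given in the abstract; as a rough illustration of the underlying idea only, the Monte Carlo sketch below estimates a travel time distribution for a single bottleneck under stochastic demand and capacity using a simple deterministic point-queue delay. All numbers and distributions are hypothetical.

    # Rough illustration only (not the dissertation's closed-form model): Monte Carlo
    # travel time reliability for one bottleneck with stochastic demand and capacity.
    import numpy as np

    rng = np.random.default_rng(0)

    free_flow_min = 12.0   # hypothetical free-flow travel time (minutes)
    peak_hours = 1.0       # duration of the modelled peak period (hours)

    demand   = rng.normal(4200, 400, size=10_000)   # vehicles/hour, hypothetical
    capacity = rng.normal(3800, 300, size=10_000)   # vehicles/hour, hypothetical

    # Deterministic point queue: arrivals above capacity accumulate during the peak,
    # and the average vehicle waits roughly half the time needed to clear the queue.
    excess = np.maximum(demand - capacity, 0.0)
    queue_clear_hours = excess * peak_hours / capacity
    avg_delay_min = 0.5 * queue_clear_hours * 60.0

    travel_time = free_flow_min + avg_delay_min
    mean, p95 = travel_time.mean(), np.percentile(travel_time, 95)
    print(f"mean {mean:.1f} min, 95th percentile {p95:.1f} min, "
          f"buffer index {(p95 - mean) / mean:.2f}")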

    New and Existing Approaches Reviewing of Big Data Analysis with Hadoop Tools

    Everybody is connected to social media (Facebook, Twitter, LinkedIn, Instagram, etc.), which generate quantities of data that traditional applications are inadequate to process. Social media are regarded as an important platform for sharing the information, opinions and knowledge of many subscribers. Despite these useful attributes, Big Data also brings many issues, such as data collection, storage, transfer, updating, reviewing, posting, scanning, visualization, data protection, etc. To deal with all these problems, there is a need for an adequate system that not only prepares the details but also provides meaningful analysis to take advantage of difficult situations, whether related to business, decision making, health, social media, science, telecommunications or the environment. By reading previous studies, the authors note that various analyses have been carried out with Hadoop and its tools, such as real-time sentiment analysis. Dealing with this Big Data, however, is a challenging task, and such analysis is efficiently possible only through the Hadoop ecosystem. The purpose of this paper is to analyze the literature on big data analysis of social media using the Hadoop framework, in order to survey the analysis tools that exist under the Hadoop umbrella and their orientations, as well as their difficulties and the modern methods used to overcome the challenges of big data in offline and real-time processing. Real-time analytics accelerates decision-making and provides access to business metrics and reporting. A comparison between Hadoop and Spark is also presented.

    DEVELOPMENT OF MAP/REDUCE BASED MICROARRAY ANALYSIS TOOLS

    High-density oligonucleotide arrays (microarrays) from the Affymetrix GeneChip® system have been widely used to measure gene expression. Public data repositories, such as the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI), have accumulated a very large amount of microarray data; for example, there are 84389 human and 9654 Arabidopsis microarray experiments in the GEO database. Efficient integrative analysis of large amounts of microarray data will provide more knowledge about biological systems. Traditional microarray analysis tools implement sequential algorithms and can only run on a single processor, so they are not able to handle very large microarray data sets with thousands of experiments; new microarray analysis tools built on a parallel framework are needed. In this thesis, I implemented microarray quality assessment, background correction, normalization and summarization algorithms using the Map/Reduce framework. The Map/Reduce framework, first introduced by Google in 2004, offers a promising paradigm for developing scalable parallel applications for large-scale data. Evaluation of the new implementations on large rice and Arabidopsis microarray data sets showed good speedups. For example, running the rice microarray data through our implementation of the MAS5.0 algorithms on 20 compute nodes (320 processors in total) gives a 28-fold speedup over the previous C++ implementation on a single processor. These new microarray tools will make it possible to utilize the valuable experiments in the public repositories.
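
    The thesis implementations themselves are not shown in the abstract; as a minimal sketch of the pattern, the Hadoop Streaming mapper and reducer below compute a per-array mean intensity, one building block of scaling-style normalization, assuming tab-separated input lines of array_id, probe_id and intensity. This is not the thesis code. Such a script is typically launched through the hadoop-streaming jar, passing the same file as the mapper (with the "map" argument) and as the reducer (with the "reduce" argument).

    # Minimal Hadoop Streaming sketch (not the thesis implementation): per-array mean
    # intensity as a building block of scaling normalization. Input lines are assumed
    # to be tab-separated "array_id<TAB>probe_id<TAB>intensity".
    import sys

    def mapper():
        for line in sys.stdin:
            array_id, _probe_id, intensity = line.rstrip("\n").split("\t")
            # Key by array so one reducer group sees all intensities of one array.
            print(f"{array_id}\t{intensity}")

    def reducer():
        current, total, count = None, 0.0, 0
        for line in sys.stdin:
            array_id, intensity = line.rstrip("\n").split("\t")
            if array_id != current:
                if current is not None:
                    print(f"{current}\t{total / count}")
                current, total, count = array_id, 0.0, 0
            total += float(intensity)
            count += 1
        if current is not None:
            print(f"{current}\t{total / count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()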