
    MR MAQ: a Read Mapping algorithm using the Hadoop platform

    The success of the Human Genome Project (HGP) in 2000 brought "personalized medicine" closer to reality. The HGP's discoveries have simplified sequencing techniques to such a degree that today anyone can obtain their complete DNA sequence. Read Mapping technology stands out among these techniques and is characterized by handling very large amounts of data. Hadoop, the Apache framework for data-intensive applications under the MapReduce paradigm, is a perfect ally for this kind of technology and was the option chosen for this project. Throughout the work we carry out the study, analysis, and experimentation needed to obtain an innovative Genetic Algorithm that exploits the full potential of Hadoop.
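
    As a rough illustration of how a read-mapping step can be phrased on Hadoop, the sketch below shows a MapReduce mapper that matches each input read against an in-memory reference. It is a minimal sketch under stated assumptions: the class name and configuration key are hypothetical, reads are assumed to arrive one per line, and naive exact matching stands in for the MAQ-style alignment the project actually develops.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: emits (read, position) for every read that occurs verbatim in the reference.
    public class ReadMappingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private String reference;

        @Override
        protected void setup(Context context) {
            // Assumption: a small reference sequence is passed through the job configuration;
            // a real job would load it from HDFS or the distributed cache instead.
            reference = context.getConfiguration().get("read.mapping.reference", "");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String read = line.toString().trim();
            if (read.isEmpty()) {
                return;
            }
            // Naive exact search; a real read mapper would use seeded, mismatch-tolerant alignment.
            int position = reference.indexOf(read);
            if (position >= 0) {
                context.write(new Text(read), new LongWritable(position));
            }
        }
    }

    Packaged into a Hadoop job, each map task would then process one split of the read file in parallel across the cluster.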

    Processing of Large Satellite Images using Hadoop Distributed Technology and Mapreduce : A Case of Edge Detection

    Nowadays the amount of data continues to grow as more information becomes available. The exponential growth of data and users' increasing demand for real-time satellite data have forced remote sensing service providers to deliver the required services. Processing a large number of images is necessary when satellite images are involved. This paper presents a distributed technology, the MapReduce programming paradigm based on the Hadoop platform, to process large-scale satellite images. The main aim of this Hadoop approach is to take advantage of high reliability and high scalability in the field of remote sensing in order to achieve fast processing of large satellite images. The Hadoop Streaming technology is used in the model, and the main operations are written in Java as the mapper and reducer. The model has been implemented using virtual machines, where the large number of images is delivered to the multi-cluster nodes for concurrent processing. This paper presents a MapReduce-based processing of large satellite images using edge detection methods; Sobel, Laplacian, and Canny edge detection methods are implemented in this model. DOI: 10.17762/ijritcc2321-8169.150520
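
    For reference, the Sobel operator named in the abstract computes a per-pixel gradient magnitude from two 3x3 convolutions. The framework-free Java sketch below (class and method names are ours, not the paper's) shows the per-tile computation a mapper could apply to one grayscale image block.

    // Illustrative Sobel gradient magnitude over a grayscale tile; border pixels are left at zero.
    public final class SobelSketch {

        public static double[][] gradientMagnitude(double[][] img) {
            int h = img.length, w = img[0].length;
            double[][] out = new double[h][w];
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    // Horizontal and vertical Sobel kernels applied around (x, y).
                    double gx = -img[y - 1][x - 1] + img[y - 1][x + 1]
                              - 2 * img[y][x - 1] + 2 * img[y][x + 1]
                              - img[y + 1][x - 1] + img[y + 1][x + 1];
                    double gy = -img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1]
                              + img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1];
                    out[y][x] = Math.sqrt(gx * gx + gy * gy);
                }
            }
            return out;
        }
    }

    In a Hadoop Streaming setup, each mapper would read one image tile, apply a kernel like this, and emit the resulting edge map.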

    Integrating big data and blockchain to manage energy smart grid - TOTEM framework

    The demand for electricity is increasing exponentially, especially with the arrival of electric vehicles. In the smart community neighborhood project, electricity should be produced at the household or community level and sold or bought according to demand. Since the actors can produce, sell, and buy, they are called prosumers. ICT solutions can contribute to this in several ways, such as machine learning for analyzing household data to determine customer demand and peak hours of electricity usage, blockchain as a trustworthy platform for selling or buying, a data hub, and ensuring the data security and privacy of prosumers. TOTEM (Token for Controlled Computation) is a framework that allows users to analyze data without moving it from the data owner's environment, while ensuring data security and privacy. In this article, we show the importance of the TOTEM architecture in the EnergiX project and how an extended version of TOTEM can be efficiently merged with the demands of this and similar projects.

    Workload Schedulers - Genesis, Algorithms and Comparisons

    In this article we provide brief descriptions of three classes of schedulers: Operating System Process Schedulers, Cluster Systems Jobs Schedulers, and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss the differences between all presented classes of schedulers and their chronological development. In conclusion, we highlight similarities in the focus of scheduling strategy design, applicable to both local and distributed systems.

    ALLOCATION OF THE LARGE CLUSTER SETUPS IN MAPREDUCE

    Running multiple instances of the MapReduce framework concurrently in a multicluster system or datacenter enables data, failure, and version isolation, which is attractive for many organizations. It may also provide some form of performance isolation, but in order to achieve this in the face of time-varying workloads submitted to the MapReduce instances, a mechanism for dynamic resource (re-)allocation to those instances is required. In this paper, we present such a mechanism, called Fawkes, that attempts to balance the allocations to MapReduce instances so that they experience similar service levels. Fawkes proposes a new abstraction for deploying MapReduce instances on physical resources, the MR-cluster, which represents a set of resources that can grow and shrink, and that has a core on which MapReduce is installed with the usual data locality assumptions, but that relaxes those assumptions for nodes outside the core. Fawkes dynamically grows and shrinks the active MR-clusters based on a family of weighting policies, with weights derived from monitoring their operation. Implementing MapReduce in the cloud requires the creation of clusters where the Map and Reduce operations can be performed. Optimizing overall resource utilization without compromising the efficiency of the services provided is the need of the hour. Selecting the right set of nodes to form a cluster plays a major role in improving the performance of the cloud. As a huge amount of data transfer takes place during the data analysis phase, network latency becomes the defining factor in improving the QoS of the cloud. In this paper we propose a novel Cluster Configuration algorithm that selects optimal nodes in a dynamic cloud environment to configure a cluster for running MapReduce jobs. The algorithm is cost-optimized, respects global resource utilization, and provides high performance to clients. The proposed algorithm gives a performance benefit of 35% in all reconfiguration-based cases and 45% in the best cases.
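
    The balancing idea behind the weighting policies can be pictured with a small, hypothetical sketch: given a weight per MapReduce instance derived from monitoring (for example, pending jobs or queued data), nodes are divided in proportion to those weights. The names below are illustrative and do not reproduce Fawkes or the proposed Cluster Configuration algorithm.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative proportional-share allocation: each MapReduce instance receives nodes
    // in proportion to its monitored weight (how weights are derived is policy-specific).
    public final class WeightedAllocationSketch {

        public static Map<String, Integer> allocate(Map<String, Double> weights, int totalNodes) {
            double sum = weights.values().stream().mapToDouble(Double::doubleValue).sum();
            Map<String, Integer> allocation = new HashMap<>();
            for (Map.Entry<String, Double> entry : weights.entrySet()) {
                // Rounding may leave a node or two unassigned; a real allocator would redistribute the remainder.
                allocation.put(entry.getKey(), (int) Math.round(totalNodes * entry.getValue() / sum));
            }
            return allocation;
        }
    }

    For example, weights of 2.0 and 1.0 over 30 nodes would yield roughly 20 and 10 nodes; as the monitored weights drift, the corresponding MR-clusters grow or shrink.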

    Large-scale Data Analysis and Deep Learning Using Distributed Cyberinfrastructures and High Performance Computing

    Data in many research fields continues to grow in both size and complexity. For instance, recent technological advances have increased data throughput in various biology-related endeavors, such as DNA sequencing, molecular simulations, and medical imaging. In addition, the variety of data types (textual, signal, image, etc.) adds further complexity to analyzing the data. As such, there is a need for applications developed specifically for each type of data. Several considerations must be made when attempting to create a tool for a particular dataset. First, we must consider the type of algorithm required for analyzing the data. Next, since the size and complexity of the data impose high computation and memory requirements, it is important to select a proper hardware environment on which to build the application. By carefully developing the algorithm and selecting the hardware, we can provide an effective environment in which to analyze huge amounts of highly complex data at large scale. In this dissertation, I detail my applications of big data and deep learning techniques to the analysis of complex and large data. I investigate how big data frameworks, such as Hadoop, can be applied to problems such as large-scale molecular dynamics simulations. Following this, many popular deep learning frameworks are evaluated and compared to find those that suit certain hardware setups and deep learning models. Then, we explore an application of deep learning to a biomedical problem, namely ADHD diagnosis from fMRI data. Lastly, I demonstrate a framework for real-time and fine-grained vehicle detection and classification. In each of these works, a unique large-scale analysis algorithm or deep learning model is implemented that caters to the problem and leverages specialized computing resources.

    Using Workload Prediction and Federation to Increase Cloud Utilization

    The widespread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency. The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies, as opportunities for workload consolidation in private clouds are limited. Federation and cycle harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability. We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization, and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluate the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.
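
    The admission-control step can be pictured with a hedged sketch: given utilization samples produced by the faster-than-realtime simulation, a federated job is admitted only if its estimated chance of being preempted stays under a bound. The risk estimate and names below are assumptions for illustration, not the dissertation's actual model.

    import java.util.List;

    // Illustrative admission control: estimate preemption risk from simulated utilization
    // samples and admit a federated job only if the risk stays within the offered bound.
    public final class FederationAdmissionSketch {

        // Fraction of simulated time steps in which the cloud would be over capacity after admitting the job.
        public static double preemptionRisk(List<Double> predictedUtilization, double jobShare) {
            long risky = predictedUtilization.stream().filter(u -> u + jobShare > 1.0).count();
            return (double) risky / predictedUtilization.size();
        }

        public static boolean admit(List<Double> predictedUtilization, double jobShare, double bound) {
            return preemptionRisk(predictedUtilization, jobShare) <= bound;
        }
    }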

    Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

    One of the most significant shifts in next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which provide Hadoop 3.x with the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting the BD and large-scale data analytics realm using the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for data streaming to cloud computing are the main contributions of this dissertation.

    Developing a trusted computational grid

    Within institutional computing infrastructure, currently available grid middlewares are considered to be overly complex. This is largely due to behaviours required for untrusted networks. These behaviours, however, are an integral part of grid systems and cannot be removed. Within this work the development of a grid middleware suitable for unifying institutional resources is proposed. The proposed system should be capable of interfacing with all Linux-based systems within the QueensGate Grid (QGG) campus grid, automatically determining the best resource for a given job. This allocation should be done without requiring any additional user effort, or impacting established user workflows. The framework was developed to tackle this problem. It was simulated, utilising real usage data, in order to assess suitability for deployment. The results gained from simulation were encouraging. There is a close match between real usage data and data generated through simulation. Furthermore the proposed framework will enable better utilisation of campus grid resources, will not require modification of user workflows, and will maintain the security and integrity of user accounts.
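
    One simple way to picture the "best resource for a given job" decision is a filter-and-rank pass over the available nodes, as in the hypothetical sketch below; the node attributes, record names, and most-free-cores heuristic are illustrative assumptions, not the framework's actual policy.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Hypothetical resource selection: keep nodes that satisfy the job's requirements,
    // then prefer the one with the most free cores.
    public final class ResourceSelectorSketch {

        record Node(String name, int freeCores, long freeMemMb) {}

        record Job(int cores, long memMb) {}

        public static Optional<Node> bestNode(List<Node> nodes, Job job) {
            return nodes.stream()
                    .filter(n -> n.freeCores() >= job.cores() && n.freeMemMb() >= job.memMb())
                    .max(Comparator.comparingInt(Node::freeCores));
        }
    }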