128 research outputs found

    A Survey on Big data Analytics in Cloud Environment

    Get PDF
    The continuous and rapid growth in the volume of data captured by organizations, such as social media, Internet of Things (IoT), machines, multimedia, GPS has produced an overwhelming flow of data. Data creation is occurring at a record rate, referred to as big data, and has emerged as a widely recognized trend. To take advantage of big data, real-time analysis and reporting must be provided in tandem with the massive capacity needed to store and process the data. Big data is affecting organization such as Banking, Education, Government, Health care, Manufacturing, retails and eventually, the society. On the other hand, Cloud computing eliminates the need to maintain expensive computing hardware, dedicated space, and software. Cloud provides larger volume of space for the storage and different set of services for all kind of applications to the cloud customers. Therefore, all the companies are nowadays migrating their applications towards cloud environment, because of the huge reduce in the overall investment and greater flexibility provided by the cloud

    Scientific Workflows for Metabolic Flux Analysis

    Get PDF
    Metabolic engineering is a highly interdisciplinary research domain that interfaces biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13 C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, the transferability of these approaches are only partially applicable to 13C-MFA workflows. While problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools like BLAST. Typically, the next workflow task in the pipeline can be automatically determined by the outcome of the previous step. Five computational challenges have been identified in the endeavor of conducting 13 C-MFA studies: organization of heterogeneous data, standardization of processes and the unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored for the specific requirements of 13 C-MFA applications. The proposed approach – namely, designing the SWF as a collection of loosely-coupled modules that are glued together with web services – alleviates the realization of 13C-MFA workflows by offering several features. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python). Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources. An important building block for allowing interactive researcher-driven workflows is the ability to track all data that is needed to understand and reproduce a workflow. The standardization of 13 C-MFA studies using a folder structure template and the corresponding services and web interfaces improves the exchange of information for a group of researchers. Finally, several auxiliary tools are developed in the course of this work to complement the SWF modules, i.e., ranging from simple helper scripts to visualization or data conversion programs. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely-coupled components that are flexibly arranged to match the typical requirements in the metabolic engineering domain. Being a modern and service-oriented software framework, new applications are easily composed by reusing existing components

    Performance Analysis of Hadoop MapReduce And Apache Spark for Big Data

    Get PDF
    In the recent era, information has evolved at an exponential rate. In order to obtain new insights, this information must be carefully interpreted and analyzed. There is, therefore, a need for a system that can process data efficiently all the time. Distributed cloud computing data processing platforms are important tools for data analytics on a large scale. In this area, Apache Hadoop (High-Availability Distributed Object-Oriented Platform) MapReduce has evolved as the standard. The MapReduce job reads, processes its input data and then returns it to Hadoop Distributed Files Systems (HDFS). Although there is limitation to its programming interface, this has led to the development of modern data flow-oriented frameworks known as Apache Spark, which uses Resilient Distributed Datasets (RDDs) to execute data structures in memory. Since RDDs can be stored in the memory, algorithms can iterate very efficiently over its data many times. Cluster computing is a major investment for any organization that chooses to perform Big Data Analysis. The MapReduce and Spark were indeed two famous open-source cluster-computing frameworks for big data analysis. Cluster computing hides the task complexity and low latency with simple user-friendly programming. It improves performance throughput, and backup uptime should the main system fail. Its features include flexibility, task scheduling, higher availability, and faster processing speed. Big Data analytics has become more computer-intensive as data management becomes a big issue for scientific computation. High-Performance Computing is undoubtedly of great importance for big data processing. The main application of this research work is towards the realization of High-Performance Computing (HPC) for Big Data Analysis. This thesis work investigates the processing capability and efficiency of Hadoop MapReduce and Apache Spark using Cloudera Manager (CM). The Cloudera Manager provides end-to-end cluster management for Cloudera Distribution for Apache Hadoop (CDH). The implementation was carried out with Amazon Web Services (AWS). Amazon Web Service is used to configure window Virtual Machine (VM). Four Linux In-stances of free tier eligible t2.micro were launched using Amazon Elastic Compute Cloud (EC2). The Linux Instances were configured into four cluster nodes using Secure Socket Shell (SSH). A Big Data application is generated and injected while both MapReduce and Spark job are run with different queries such as scan, aggregation, two way and three-way join. The time taken for each task to be completed are recorded, observed, and thoroughly analyzed. It was observed that Spark executes job faster than MapReduce

    Big Data Analytics and Application Deployment on Cloud Infrastructure

    Get PDF
    This dissertation describes a project began in October 2016. It was born from the collaboration between Mr.Alessandro Bandini and me, and has been developed under the supervision of professor Gianluigi Zavattaro. The main objective was to study, and in particular to experiment with, the cloud computing in general and its potentiality in the data elaboration field. Cloud computing is a utility-oriented and Internet-centric way of delivering IT services on demand. The first chapter is a theoretical introduction on cloud computing, analyzing the main aspects, the keywords, and the technologies behind clouds, as well as the reasons for the success of this technology and its problems. After the introduction section, I will briefly describe the three main cloud platforms in the market. During this project we developed a simple Social Network. Consequently in the third chapter I will analyze the social network development, with the initial solution realized through Amazon Web Services and the steps we took to obtain the final version using Google Cloud Platform with its charateristics. To conclude, the last section is specific for the data elaboration and contains a initial theoretical part that describes MapReduce and Hadoop followed by a description of our analysis. We used Google App Engine to execute these elaborations on a large dataset. I will explain the basic idea, the code and the problems encountered

    From cluster databases to cloud storage: Providing transactional support on the cloud

    Get PDF
    Durant les últimes tres dècades, les limitacions tecnològiques (com per exemple la capacitat dels dispositius d'emmagatzematge o l'ample de banda de les xarxes de comunicació) i les creixents demandes dels usuaris (estructures d'informació, volums de dades) han conduït l'evolució de les bases de dades distribuïdes. Des dels primers repositoris de dades per arxius plans que es van desenvolupar en la dècada dels vuitanta, s'han produït importants avenços en els algoritmes de control de concurrència, protocols de replicació i en la gestió de transaccions. No obstant això, els reptes moderns d'emmagatzematge de dades que plantegen el Big Data i el cloud computing—orientats a millorar la limitacions pel que fa a escalabilitat i elasticitat de les bases de dades estàtiques—estan empenyent als professionals a relaxar algunes propietats importants dels sistemes transaccionals clàssics, cosa que exclou a diverses aplicacions les quals no poden encaixar en aquesta estratègia degut a la seva alta dependència transaccional. El propòsit d'aquesta tesi és abordar dos reptes importants encara latents en el camp de les bases de dades distribuïdes: (1) les limitacions pel que fa a escalabilitat dels sistemes transaccionals i (2) el suport transaccional en repositoris d'emmagatzematge en el núvol. Analitzar les tècniques tradicionals de control de concurrència i de replicació, utilitzades per les bases de dades clàssiques per suportar transaccions, és fonamental per identificar les raons que fan que aquests sistemes degradin el seu rendiment quan el nombre de nodes i / o quantitat de dades creix. A més, aquest anàlisi està orientat a justificar el disseny dels repositoris en el núvol que deliberadament han deixat de banda el suport transaccional. Efectivament, apropar el paradigma de l'emmagatzematge en el núvol a les aplicacions que tenen una forta dependència en les transaccions és fonamental per a la seva adaptació als requeriments actuals pel que fa a volums de dades i models de negoci. Aquesta tesi comença amb la proposta d'un simulador de protocols per a bases de dades distribuïdes estàtiques, el qual serveix com a base per a la revisió i comparativa de rendiment dels protocols de control de concurrència i les tècniques de replicació existents. Pel que fa a la escalabilitat de les bases de dades i les transaccions, s'estudien els efectes que té executar diferents perfils de transacció sota diferents condicions. Aquesta anàlisi contínua amb una revisió dels repositoris d'emmagatzematge de dades en el núvol existents—que prometen encaixar en entorns dinàmics que requereixen alta escalabilitat i disponibilitat—, el qual permet avaluar els paràmetres i característiques que aquests sistemes han sacrificat per tal de complir les necessitats actuals pel que fa a emmagatzematge de dades a gran escala. Per explorar les possibilitats que ofereix el paradigma del cloud computing en un escenari real, es presenta el desenvolupament d'una arquitectura d'emmagatzematge de dades inspirada en el cloud computing la qual s’utilitza per emmagatzemar la informació generada en les Smart Grids. Concretament, es combinen les tècniques de replicació en bases de dades transaccionals i la propagació epidèmica amb els principis de disseny usats per construir els repositoris de dades en el núvol. Les lliçons recollides en l'estudi dels protocols de replicació i control de concurrència en el simulador de base de dades, juntament amb les experiències derivades del desenvolupament del repositori de dades per a les Smart Grids, desemboquen en el que hem batejat com Epidemia: una infraestructura d'emmagatzematge per Big Data concebuda per proporcionar suport transaccional en el núvol. A més d'heretar els beneficis dels repositoris en el núvol en quant a escalabilitat, Epidemia inclou una capa de gestió de transaccions que reenvia les transaccions dels clients a un conjunt jeràrquic de particions de dades, cosa que permet al sistema oferir diferents nivells de consistència i adaptar elàsticament la seva configuració a noves demandes de càrrega de treball. Finalment, els resultats experimentals posen de manifest la viabilitat de la nostra contribució i encoratgen als professionals a continuar treballant en aquesta àrea.Durante las últimas tres décadas, las limitaciones tecnológicas (por ejemplo la capacidad de los dispositivos de almacenamiento o el ancho de banda de las redes de comunicación) y las crecientes demandas de los usuarios (estructuras de información, volúmenes de datos) han conducido la evolución de las bases de datos distribuidas. Desde los primeros repositorios de datos para archivos planos que se desarrollaron en la década de los ochenta, se han producido importantes avances en los algoritmos de control de concurrencia, protocolos de replicación y en la gestión de transacciones. Sin embargo, los retos modernos de almacenamiento de datos que plantean el Big Data y el cloud computing—orientados a mejorar la limitaciones en cuanto a escalabilidad y elasticidad de las bases de datos estáticas—están empujando a los profesionales a relajar algunas propiedades importantes de los sistemas transaccionales clásicos, lo que excluye a varias aplicaciones las cuales no pueden encajar en esta estrategia debido a su alta dependencia transaccional. El propósito de esta tesis es abordar dos retos importantes todavía latentes en el campo de las bases de datos distribuidas: (1) las limitaciones en cuanto a escalabilidad de los sistemas transaccionales y (2) el soporte transaccional en repositorios de almacenamiento en la nube. Analizar las técnicas tradicionales de control de concurrencia y de replicación, utilizadas por las bases de datos clásicas para soportar transacciones, es fundamental para identificar las razones que hacen que estos sistemas degraden su rendimiento cuando el número de nodos y/o cantidad de datos crece. Además, este análisis está orientado a justificar el diseño de los repositorios en la nube que deliberadamente han dejado de lado el soporte transaccional. Efectivamente, acercar el paradigma del almacenamiento en la nube a las aplicaciones que tienen una fuerte dependencia en las transacciones es crucial para su adaptación a los requerimientos actuales en cuanto a volúmenes de datos y modelos de negocio. Esta tesis empieza con la propuesta de un simulador de protocolos para bases de datos distribuidas estáticas, el cual sirve como base para la revisión y comparativa de rendimiento de los protocolos de control de concurrencia y las técnicas de replicación existentes. En cuanto a la escalabilidad de las bases de datos y las transacciones, se estudian los efectos que tiene ejecutar distintos perfiles de transacción bajo diferentes condiciones. Este análisis continua con una revisión de los repositorios de almacenamiento en la nube existentes—que prometen encajar en entornos dinámicos que requieren alta escalabilidad y disponibilidad—, el cual permite evaluar los parámetros y características que estos sistemas han sacrificado con el fin de cumplir las necesidades actuales en cuanto a almacenamiento de datos a gran escala. Para explorar las posibilidades que ofrece el paradigma del cloud computing en un escenario real, se presenta el desarrollo de una arquitectura de almacenamiento de datos inspirada en el cloud computing para almacenar la información generada en las Smart Grids. Concretamente, se combinan las técnicas de replicación en bases de datos transaccionales y la propagación epidémica con los principios de diseño usados para construir los repositorios de datos en la nube. Las lecciones recogidas en el estudio de los protocolos de replicación y control de concurrencia en el simulador de base de datos, junto con las experiencias derivadas del desarrollo del repositorio de datos para las Smart Grids, desembocan en lo que hemos acuñado como Epidemia: una infraestructura de almacenamiento para Big Data concebida para proporcionar soporte transaccional en la nube. Además de heredar los beneficios de los repositorios en la nube altamente en cuanto a escalabilidad, Epidemia incluye una capa de gestión de transacciones que reenvía las transacciones de los clientes a un conjunto jerárquico de particiones de datos, lo que permite al sistema ofrecer distintos niveles de consistencia y adaptar elásticamente su configuración a nuevas demandas cargas de trabajo. Por último, los resultados experimentales ponen de manifiesto la viabilidad de nuestra contribución y alientan a los profesionales a continuar trabajando en esta área.Over the past three decades, technology constraints (e.g., capacity of storage devices, communication networks bandwidth) and an ever-increasing set of user demands (e.g., information structures, data volumes) have driven the evolution of distributed databases. Since flat-file data repositories developed in the early eighties, there have been important advances in concurrency control algorithms, replication protocols, and transactions management. However, modern concerns in data storage posed by Big Data and cloud computing—related to overcome the scalability and elasticity limitations of classic databases—are pushing practitioners to relax some important properties featured by transactions, which excludes several applications that are unable to fit in this strategy due to their intrinsic transactional nature. The purpose of this thesis is to address two important challenges still latent in distributed databases: (1) the scalability limitations of transactional databases and (2) providing transactional support on cloud-based storage repositories. Analyzing the traditional concurrency control and replication techniques, used by classic databases to support transactions, is critical to identify the reasons that make these systems degrade their throughput when the number of nodes and/or amount of data rockets. Besides, this analysis is devoted to justify the design rationale behind cloud repositories in which transactions have been generally neglected. Furthermore, enabling applications which are strongly dependent on transactions to take advantage of the cloud storage paradigm is crucial for their adaptation to current data demands and business models. This dissertation starts by proposing a custom protocol simulator for static distributed databases, which serves as a basis for revising and comparing the performance of existing concurrency control protocols and replication techniques. As this thesis is especially concerned with transactions, the effects on the database scalability of different transaction profiles under different conditions are studied. This analysis is followed by a review of existing cloud storage repositories—that claim to be highly dynamic, scalable, and available—, which leads to an evaluation of the parameters and features that these systems have sacrificed in order to meet current large-scale data storage demands. To further explore the possibilities of the cloud computing paradigm in a real-world scenario, a cloud-inspired approach to store data from Smart Grids is presented. More specifically, the proposed architecture combines classic database replication techniques and epidemic updates propagation with the design principles of cloud-based storage. The key insights collected when prototyping the replication and concurrency control protocols at the database simulator, together with the experiences derived from building a large-scale storage repository for Smart Grids, are wrapped up into what we have coined as Epidemia: a storage infrastructure conceived to provide transactional support on the cloud. In addition to inheriting the benefits of highly-scalable cloud repositories, Epidemia includes a transaction management layer that forwards client transactions to a hierarchical set of data partitions, which allows the system to offer different consistency levels and elastically adapt its configuration to incoming workloads. Finally, experimental results highlight the feasibility of our contribution and encourage practitioners to further research in this area
    corecore