111 research outputs found

    Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

    Get PDF
    Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers. Our evaluation shows that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x. Tachyon is open source and is deployed at multiple companies.National Science Foundation (U.S.) (CISE Expeditions Award CCF-1139158)Lawrence Berkeley National Laboratory (Award 7076018)United States. Defense Advanced Research Projects Agency (XData Award FA8750-12-2-0331

    Towards Comparative Analysis of Resumption Techniques in ETL

    Get PDF
    Data warehouses are loaded with data from sources such as operational data bases. Failure of loading process or failure of any of the process such as extraction or transformation is expensive because of the non-availability of data for analysis. With the advent of e-commerce and many real time application analysis of data in real time becomes a norm and hence any misses while the data is being loaded into data warehouse needs to be handled in an efficient and optimized way. The techniques to handle failure of process to populate the data are very much important as the actual loading process. Alternative arrangement needs to be made for in case of failure so that processes of populating the data warehouse are done in time. This paper explores the various ways through which a failed process of populating the data warehouse could be resumed. Various resumption techniques are compared and a novel block based technique is proposed to improve one of the existing resumption techniques

    Extract, Transform, and Load data from Legacy Systems to Azure Cloud

    Get PDF
    Internship report presented as partial requirement for obtaining the Master’s degree in Information Management, with a specialization in Knowledge Management and Business IntelligenceIn a world with continuously evolving technologies and hardened competitive markets, organisations need to continually be on guard to grasp cutting edge technology and tools that will help them to surpass any competition that arises. Modern data platforms that incorporate cloud technologies, support organisations to strive and get ahead of their competitors by providing solutions that help them capture and optimally use untapped data, and scalable storages to adapt to ever-growing data quantities. Also, adopt data processing and visualisation tools that help to improve the decision-making process. With many cloud providers available in the market, from small players to major technology corporations, this offers much flexibility to organisations to choose the best cloud technology that will align with their use cases and overall products and services strategy. This internship came up at the time when one of Accenture’s significant client in the financial industry decided to migrate from legacy systems to a cloud-based data infrastructure that is Microsoft Azure cloud. During this internship, development of the data lake, which is a core part of the MDP, was done to understand better the type of challenges that can be faced when migrating data from on-premise legacy systems to a cloud-based infrastructure. Also, provided in this work, are the main recommendations and guidelines when it comes to performing a large scale data migration

    Adaptive Big Data Pipeline

    Get PDF
    Over the past three decades, data has exponentially evolved from being a simple software by-product to one of the most important companies’ assets used to understand their customers and foresee trends. Deep learning has demonstrated that big volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entail new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate data pipelines creation. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.ITESO, A. C

    Upgrading decision support systems with Cloud-based environments and machine learning

    Get PDF
    Business Intelligence (BI) is a process for analyzing raw data and displaying it in order to make it easier for business users to take the right decision at the right time. Inthe market we can find several BI platforms. One commonly used BI solution is calledMicroStrategy, which allows users to build and display reports.Machine Learning (ML) is a process of using algorithms to search for patterns in data which are used to predict and/or classify other data.In recent years, these two fields have been integrated into one another in order to try to complement the prediction side of BI to enable higher quality results for the client.The consulting company (CC) where I have worked on has several solutions related to Data & Analytics built on top of Micro Strategy. Those solutions were all demonstrable in a server installed on-premises. This server was also utilized to build proofs of concept(PoC) to be used as demos for other potential clients. CC also develops new PoCs for clients from the ground up, with the objective of show casing what is possible to display to the client in order to optimize business management.CC was using a local, out of date server to demo the PoCs to clients, which suffered from stability and reliability issues. To address these issues, the server has been migrated and set up in a cloud based solution using a Microsoft Azure-based Virtual Machine,where it now performs similar functions compared to its previous iteration. This move has made the server more reliable, as well as made developing new solutions easier forthe team and enabled a new kind of service (Analytics as a Service).My work at CC was focused on one main task: Migration of the demo server for CCsolutions (which included PoCs for testing purposes, one of which is a machine learning model to predict wind turbine failures). The migration was successful as previously stated and the prediction models, albeit with mostly negative results, demonstrated successfully the development of large PoCs.Business Intelligence (BI) é um processo para analizar dados não tratados e mostrá-los para ajudar gestores a fazer a decisão correcta no momento certo. No mercado, pode-se encontrar várias plataformas de BI. Uma solução de BI comum chama-se MicroStrategy,que permite com que os utilizadores construam e mostrem relatórios.Machine Learning (ML) é um processo de usar algoritmos para procurar padrões em dados que por sua vez são usados para prever e/ou classificar outros dados.Nos últimos anos, estes campos foram integrados um no outro para tentar complementar o lado predictivo de BI para possibilitar resultados de mais alta qualidade para o cliente.A empresa de consultoria (EC) onde trabalhei tem várias soluções relacionadas com Data e Analytics construídas com base no MicroStrategy. Essas soluções eram todas demonstráveis num servidor instalado no local. Este servidor também era usado para criar provas de conceito (PoC) para serem usadas como demos para outros potenciais clientes.A EC também desenvolve novas PoCs para clientes a partir do zero, com o objectivo de demonstrar ao cliente o que é possível mostrar para optimizar a gestão do negócio.A EC estava a utilizar um servidor local desactualizado para demonstrar os PoCs aos clientes, que tinha problemas de estabilidade e fiabilidade. Para resolver estes problemas,o servidor foi migrado e configurado numa solução baseada na cloud com o uso de uma Máquina Virtual baseada no Microsoft Azure, onde executa funções semelhantes à versão anterior. Esta migração tornou o servidor mais fiável, simplificou o processo de desenvolver novas soluções para a equipa e disponibilizou um novo tipo de serviço (Analytics as a Service).O meu trabalho na EC foi focado numa tarefas principal: Migração do servidor de demonstrações de soluções CC (que inclui PoCs para propósitos de testes, uma das quais é um modelo de aprendizagem de máquina para prever falhas em turbinas eólicas). A migração foi efectuada com sucesso (como mencionado previamente) e os modelos testados,apesar de terem maioritariamente resultados negativos, demonstraram com sucesso que é possível desenvolver PoCs de grande dimensão

    New techniques to integrate blockchain in Internet of Things scenarios for massive data management

    Get PDF
    Mención Internacional en el título de doctorNowadays, regardless of the use case, most IoT data is processed using workflows that are executed on different infrastructures (edge-fog-cloud), which produces dataflows from the IoT through the edge to the fog/cloud. In many cases, they also involve several actors (organizations and users), which poses a challenge for organizations to establish verification of the transactions performed by the participants in the dataflows built by the workflow engines and pipeline frameworks. It is essential for organizations, not only to verify that the execution of applications is performed in the strict sequence previously established in a DAG by authenticated participants, but also to verify that the incoming and outgoing IoT data of each stage of a workflow/pipeline have not been altered by third parties or by the users associated to the organizations participating in a workflow/pipeline. Blockchain technology and its mechanism for recording immutable transactions in a distributed and decentralized manner, characterize it as an ideal technology to support the aforementioned challenges and challenges since it allows the verification of the records generated in a secure manner. However, the integration of blockchain technology with workflows for IoT data processing is not trivial considering that it is a challenge not to lose the generalization of workflows and/or pipeline engines, which must be modified to include the embedded blockchain module. The main objective of this doctoral research was to create new techniques to use blockchain in the Internet of Things (IoT). Thus, we defined the main goal of this thesis is to develop new techniques to integrate blockchain in Internet of Things scenarios for massive data management in edge-fog-cloud environments. To fulfill this general objective, we have designed a content delivery model for processing big IoT data in Edge-Fog-Cloud computing by using micro/nanoservice composition, a continuous verification model based on blockchain to register significant events from the continuous delivery model, selecting techniques to integrate blockchain in quasi-real systems that allow ensuring traceability and non-repudiation of data obtained from devices and sensors. The evaluation proposed has been thoroughly evaluated, showing its feasibility and good performance.Hoy en día, independientemente del caso de uso, la mayoría de los datos de IoT se procesan utilizando flujos de trabajo que se ejecutan en diferentes infraestructuras (edge-fog-cloud) desde IoT a través del edge hasta la fog/cloud. En muchos casos, también involucran a varios actores (organizaciones y usuarios), lo que plantea un desafío para las organizaciones a la hora de verificar las transacciones realizadas por los participantes en los flujos de datos. Es fundamental para las organizaciones, no solo para verificar que la ejecución de aplicaciones se realiza en la secuencia previamente establecida en un DAG y por participantes autenticados, sino también para verificar que los datos IoT entrantes y salientes de cada etapa de un flujo de trabajo no han sido alterados por terceros o por usuarios asociados a las organizaciones que participan en el mismo. La tecnología Blockchain, gracias a su mecanismo para registrar transacciones de manera distribuida y descentralizada, es un tecnología ideal para soportar los retos y desafíos antes mencionados ya que permite la verificación de los registros generados de manera segura. Sin embargo, la integración de la tecnología blockchain con flujos de trabajo para IoT no es baladí considerando que es un desafío proporcionar el rendimiento necesario sin perder la generalización de los motores de flujos de trabajo, que deben ser modificados para incluir el módulo blockchain integrado. El objetivo principal de esta investigación doctoral es desarrollar nuevas técnicas para integrar blockchain en Internet de las Cosas (IoT) para la gestión masiva de datos en un entorno edge-fog-cloud. Para cumplir con este objetivo general, se ha diseñado un modelo de flujos para procesar grandes datos de IoT en computación Edge-Fog-Cloud mediante el uso de la composición de micro/nanoservicio, un modelo de verificación continua basado en blockchain para registrar eventos significativos de la modelo de entrega continua de datos, seleccionando técnicas para integrar blockchain en sistemas cuasi-reales que permiten asegurar la trazabilidad y el no repudio de datos obtenidos de dispositivos y sensores, La evaluación propuesta ha sido minuciosamente evaluada, mostrando su factibilidad y buen rendimiento.This work has been partially supported by the project "CABAHLA-CM: Convergencia Big data-Hpc: de los sensores a las Aplicaciones" S2018/TCS-4423 from Madrid Regional Government.Programa de Doctorado en Ciencia y Tecnología Informática por la Universidad Carlos III de MadridPresidente: Paolo Trunfio.- Secretario: David Exposito Singh.- Vocal: Rafael Mayo Garcí

    Big Data Now, 2015 Edition

    Get PDF
    Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction. Our list of 2015 topics include: Data-driven cultures Data science Data pipelines Big data architecture and infrastructure The Internet of Things and real time Applications of big data Security, ethics, and governance Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data

    Koneoppimiskehys petrokemianteollisuuden sovelluksille

    Get PDF
    Machine learning has many potentially useful applications in process industry, for example in process monitoring and control. Continuously accumulating process data and the recent development in software and hardware that enable more advanced machine learning, are fulfilling the prerequisites of developing and deploying process automation integrated machine learning applications which improve existing functionalities or even implement artificial intelligence. In this master's thesis, a framework is designed and implemented on a proof-of-concept level, to enable easy acquisition of process data to be used with modern machine learning libraries, and to also enable scalable online deployment of the trained models. The literature part of the thesis concentrates on studying the current state and approaches for digital advisory systems for process operators, as a potential application to be developed on the machine learning framework. The literature study shows that the approaches for process operators' decision support tools have shifted from rule-based and knowledge-based methods to machine learning. However, no standard methods can be concluded, and most of the use cases are quite application-specific. In the developed machine learning framework, both commercial software and open source components with permissive licenses are used. Data is acquired over OPC UA and then processed in Python, which is currently almost the de facto standard language in data analytics. Microservice architecture with containerization is used in the online deployment, and in a qualitative evaluation, it proved to be a versatile and functional solution.Koneoppimisella voidaan osoittaa olevan useita hyödyllisiä käyttökohteita prosessiteollisuudessa, esimerkiksi prosessinohjaukseen liittyvissä sovelluksissa. Jatkuvasti kerääntyvä prosessidata ja toisaalta koneoppimiseen soveltuvien ohjelmistojen sekä myös laitteistojen viimeaikainen kehitys johtavat tilanteeseen, jossa prosessiautomaatioon liitettyjen koneoppimissovellusten avulla on mahdollista parantaa nykyisiä toiminnallisuuksia tai jopa toteuttaa tekoälysovelluksia. Tässä diplomityössä suunniteltiin ja toteutettiin prototyypin tasolla koneoppimiskehys, jonka avulla on helppo käyttää prosessidataa yhdessä nykyaikaisten koneoppimiskirjastojen kanssa. Kehys mahdollistaa myös koneopittujen mallien skaalautuvan käyttöönoton. Diplomityön kirjallisuusosa keskittyy prosessioperaattoreille tarkoitettujen digitaalisten avustajajärjestelmien nykytilaan ja toteutustapoihin, avustajajärjestelmän tai sen päätöstukijärjestelmän ollessa yksi mahdollinen koneoppimiskehyksen päälle rakennettava ohjelma. Kirjallisuustutkimuksen mukaan prosessioperaattorin päätöstukijärjestelmien taustalla olevat menetelmät ovat yhä useammin koneoppimiseen perustuvia, aiempien sääntö- ja tietämyskantoihin perustuvien menetelmien sijasta. Selkeitä yhdenmukaisia lähestymistapoja ei kuitenkaan ole helposti pääteltävissä kirjallisuuden perusteella. Lisäksi useimmat tapausesimerkit ovat sovellettavissa vain kyseisissä erikoistapauksissa. Kehitetyssä koneoppimiskehyksessä on käytetty sekä kaupallisia että avoimen lähdekoodin komponentteja. Prosessidata haetaan OPC UA -protokollan avulla, ja sitä on mahdollista käsitellä Python-kielellä, josta on muodostunut lähes de facto -standardi data-analytiikassa. Kehyksen käyttöönottokomponentit perustuvat mikropalveluarkkitehtuuriin ja konttiteknologiaan, jotka osoittautuivat laadullisessa testauksessa monipuoliseksi ja toimivaksi toteutustavaksi
    corecore