111 research outputs found
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers. Our evaluation shows that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x. Tachyon is open source and is deployed at multiple companies.
Funding: National Science Foundation (U.S.) (CISE Expeditions Award CCF-1139158); Lawrence Berkeley National Laboratory (Award 7076018); United States. Defense Advanced Research Projects Agency (XData Award FA8750-12-2-0331)
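To illustrate the idea of lineage-based recovery with checkpointing described in the abstract above, here is a minimal sketch in Python. It is not Tachyon's API or its checkpointing algorithm: the class `LineageStore` and all of its methods are hypothetical, and a plain dictionary stands in for durable storage. The sketch only shows that a lost dataset can be rebuilt from the function and inputs that produced it, and that a checkpoint cuts the recomputation chain.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class LineageStore:
    """Toy store (hypothetical) that recovers lost data by recomputation."""

    def __init__(self) -> None:
        self.memory: Dict[str, list] = {}       # volatile in-memory copies
        self.checkpoints: Dict[str, list] = {}  # stand-in for durable storage
        # dataset name -> (producer function, names of its input datasets)
        self.lineage: Dict[str, Tuple[Callable[..., list], List[str]]] = {}
        self.recomputations = 0

    def write(self, name: str, fn: Callable[..., list],
              inputs: Sequence[str] = ()) -> None:
        # Record how the dataset was produced, then keep it only in memory.
        self.lineage[name] = (fn, list(inputs))
        self.memory[name] = fn(*[self.read(i) for i in inputs])

    def checkpoint(self, name: str) -> None:
        # Persist a copy so recovery never has to go further back than this.
        self.checkpoints[name] = list(self.read(name))

    def lose_memory(self) -> None:
        # Simulate a failure that wipes all volatile copies.
        self.memory.clear()

    def read(self, name: str) -> list:
        if name in self.memory:
            return self.memory[name]
        if name in self.checkpoints:
            self.memory[name] = list(self.checkpoints[name])
            return self.memory[name]
        # Not in memory and not checkpointed: recompute from lineage.
        fn, inputs = self.lineage[name]
        self.recomputations += 1
        self.memory[name] = fn(*[self.read(i) for i in inputs])
        return self.memory[name]


store = LineageStore()
store.write("raw", lambda: [1, 2, 3, 4])
store.write("squared", lambda xs: [x * x for x in xs], ["raw"])
store.write("total", lambda xs: [sum(xs)], ["squared"])
store.checkpoint("squared")
store.lose_memory()
result = store.read("total")  # recomputes "total" only; "squared" is restored
```

In this toy run, only `total` is recomputed after the simulated failure because `squared` is restored from its checkpoint. Without the checkpoint, all three datasets would have to be recomputed.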
Towards Comparative Analysis of Resumption Techniques in ETL
Data warehouses are loaded with data from sources such as operational databases. A failure of the loading process, or of any preceding step such as extraction or transformation, is expensive because the data is unavailable for analysis. With the advent of e-commerce and many real-time applications, analysing data in real time has become the norm, so any failure while data is being loaded into the data warehouse needs to be handled efficiently. Techniques for handling failures of the process that populates the warehouse are as important as the loading process itself. Alternative arrangements need to be made in case of failure so that the data warehouse is populated in time. This paper explores the various ways in which a failed process of populating the data warehouse can be resumed. Various resumption techniques are compared, and a novel block-based technique is proposed to improve one of the existing resumption techniques
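The abstract above does not describe how the proposed block-based technique works, so the following Python sketch is only a generic illustration of resuming a load at block granularity, not the paper's method. The function `load_in_blocks`, the `state` dictionary used as a checkpoint, and the `fail_at_block` parameter for simulating a failure are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional


def load_in_blocks(rows: List[int], target: List[int], state: Dict[str, int],
                   block_size: int, transform: Callable[[int], int],
                   fail_at_block: Optional[int] = None) -> None:
    """Load rows block by block, resuming from the last committed block."""
    start_block = state.get("next_block", 0)
    n_blocks = (len(rows) + block_size - 1) // block_size
    for b in range(start_block, n_blocks):
        if fail_at_block is not None and b == fail_at_block:
            raise RuntimeError(f"simulated failure before block {b}")
        block = rows[b * block_size:(b + 1) * block_size]
        target.extend(transform(r) for r in block)
        # In a real system the load and this checkpoint update would have
        # to be committed atomically; here both are plain in-memory writes.
        state["next_block"] = b + 1


rows = list(range(10))
target: List[int] = []
state: Dict[str, int] = {}

try:
    load_in_blocks(rows, target, state, 3, lambda r: r * 10, fail_at_block=2)
except RuntimeError:
    pass  # the first two blocks (rows 0 to 5) are already loaded

loaded_before_resume = len(target)
load_in_blocks(rows, target, state, 3, lambda r: r * 10)  # resumes at block 2
```

The second call skips the blocks recorded as committed, so no row is loaded twice and the work repeated after a failure is bounded by one block.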
Extract, Transform, and Load data from Legacy Systems to Azure Cloud
Internship report presented as partial requirement for obtaining the Master’s degree in Information Management, with a specialization in Knowledge Management and Business Intelligence.
In a world of continuously evolving technologies and increasingly competitive markets, organisations need to stay alert to cutting-edge technology and tools that will help them surpass any competition that arises. Modern data platforms (MDPs) that incorporate cloud technologies help organisations get ahead of their competitors by providing solutions to capture and make optimal use of untapped data, and scalable storage that adapts to ever-growing data quantities. They also offer data processing and visualisation tools that help to improve the decision-making process. With many cloud providers available in the market, from small players to major technology corporations, organisations have considerable flexibility to choose the cloud technology that best aligns with their use cases and their overall products and services strategy. This internship took place at a time when one of Accenture’s significant clients in the financial industry decided to migrate from legacy systems to a cloud-based data infrastructure, namely the Microsoft Azure cloud. During the internship, the data lake, a core part of the MDP, was developed in order to better understand the challenges that can arise when migrating data from on-premise legacy systems to a cloud-based infrastructure. This work also provides the main recommendations and guidelines for performing a large-scale data migration
Adaptive Big Data Pipeline
Over the past three decades, data has evolved exponentially from being a simple software by-product to one of a company's most important assets, used to understand its customers and foresee trends. Deep learning has demonstrated that big volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate the creation of data pipelines. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
ITESO, A. C
Upgrading decision support systems with Cloud-based environments and machine learning
Business Intelligence (BI) is a process for analyzing raw data and displaying it in order to make it easier for business users to take the right decision at the right time. Several BI platforms are available on the market. One commonly used BI solution is MicroStrategy, which allows users to build and display reports. Machine Learning (ML) is a process of using algorithms to search for patterns in data, which are then used to predict and/or classify other data. In recent years, these two fields have been integrated with one another in order to complement the prediction side of BI and enable higher quality results for the client. The consulting company (CC) where I worked has several solutions related to Data & Analytics built on top of MicroStrategy. Those solutions were all demonstrable on a server installed on-premises. This server was also used to build proofs of concept (PoCs) to be used as demos for other potential clients. CC also develops new PoCs for clients from the ground up, with the objective of showcasing what can be displayed to the client in order to optimize business management. CC was using a local, out-of-date server to demo the PoCs to clients, which suffered from stability and reliability issues. To address these issues, the server has been migrated to and set up in a cloud-based solution using a Microsoft Azure-based Virtual Machine, where it now performs functions similar to those of its previous iteration. This move has made the server more reliable, made developing new solutions easier for the team, and enabled a new kind of service (Analytics as a Service). My work at CC was focused on one main task: migration of the demo server for CC solutions (which included PoCs for testing purposes, one of which is a machine learning model to predict wind turbine failures).
The migration was successful, as previously stated, and the prediction models, albeit with mostly negative results, successfully demonstrated the development of large PoCs
New techniques to integrate blockchain in Internet of Things scenarios for massive data management
International Mention in the doctoral degree.
Nowadays, regardless of the use case, most IoT data is processed using workflows that are executed on different infrastructures (edge-fog-cloud), which produces dataflows from the IoT through the edge to the fog/cloud. In many cases, they also involve several actors (organizations and users), which makes it challenging for organizations to verify the transactions performed by the participants in the dataflows built by the workflow engines and pipeline frameworks. It is essential for organizations not only to verify that applications are executed in the strict sequence previously established in a DAG by authenticated participants, but also to verify that the incoming and outgoing IoT data of each stage of a workflow/pipeline have not been altered by third parties or by the users associated with the organizations participating in the workflow/pipeline. Blockchain technology, with its mechanism for recording immutable transactions in a distributed and decentralized manner, is well suited to addressing these challenges, since it allows the generated records to be verified in a secure manner. However, integrating blockchain technology with workflows for IoT data processing is not trivial: the challenge is to provide the necessary performance without losing the generality of workflow and/or pipeline engines, which must be modified to include the embedded blockchain module. The main objective of this doctoral research is to develop new techniques to integrate blockchain in Internet of Things (IoT) scenarios for massive data management in edge-fog-cloud environments. To fulfill this general objective, we have designed a content delivery model for processing big IoT data in edge-fog-cloud computing using micro/nanoservice composition, and a continuous verification model based on blockchain to register significant events from the continuous delivery model, selecting techniques to integrate blockchain in quasi-real systems that ensure traceability and non-repudiation of data obtained from devices and sensors. The proposal has been thoroughly
evaluated, showing its feasibility and good performance.
This work has been partially supported by the project "CABAHLA-CM: Convergencia Big data-Hpc: de los sensores a las Aplicaciones" S2018/TCS-4423 from Madrid Regional Government.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid.
President: Paolo Trunfio. Secretary: David Exposito Singh. Member: Rafael Mayo Garcí
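The verification requirement described in this abstract (stages executed in a fixed order, with data unaltered between stages) can be sketched with a simple hash chain. The Python example below is not the thesis's design and is not a blockchain: it has no consensus and no distribution, and the function names, record fields, and stage and actor names are hypothetical. It only shows how chained records make tampering with a stage's recorded data detectable.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_hash(body: Dict[str, str]) -> str:
    # Hash a canonical JSON encoding of the record body.
    return digest(json.dumps(body, sort_keys=True).encode("utf-8"))


def append_record(ledger: List[Dict[str, str]], stage: str, actor: str,
                  data_in: bytes, data_out: bytes) -> None:
    body = {
        "stage": stage,
        "actor": actor,
        "in": digest(data_in),
        "out": digest(data_out),
        "prev": ledger[-1]["hash"] if ledger else GENESIS,
    }
    ledger.append({**body, "hash": record_hash(body)})


def verify(ledger: List[Dict[str, str]], expected_order: List[str]) -> bool:
    """Check stage order, chain integrity, and stage-to-stage data hand-off."""
    if len(ledger) != len(expected_order):
        return False
    prev_hash, prev_out = GENESIS, None
    for rec, stage in zip(ledger, expected_order):
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["stage"] != stage:
            return False
        if rec["prev"] != prev_hash or rec["hash"] != record_hash(body):
            return False
        if prev_out is not None and rec["in"] != prev_out:
            return False  # data changed between two consecutive stages
        prev_hash, prev_out = rec["hash"], rec["out"]
    return True


ledger: List[Dict[str, str]] = []
append_record(ledger, "edge", "sensor-gw", b"raw", b"filtered")
append_record(ledger, "fog", "org-a", b"filtered", b"aggregated")
append_record(ledger, "cloud", "org-b", b"aggregated", b"report")
order = ["edge", "fog", "cloud"]
ok = verify(ledger, order)

tampered = [dict(r) for r in ledger]
tampered[1]["out"] = digest(b"forged")  # alter a recorded output hash
tampered_ok = verify(tampered, order)
```

Changing any recorded field invalidates that record's hash, and recomputing the hash would break the link from the following record, which is the property a real blockchain additionally protects through replication and consensus.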
Big Data Now, 2015 Edition
Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction.
Our list of 2015 topics includes:
Data-driven cultures
Data science
Data pipelines
Big data architecture and infrastructure
The Internet of Things and real time
Applications of big data
Security, ethics, and governance
Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data
Koneoppimiskehys petrokemianteollisuuden sovelluksille (Machine learning framework for petrochemical industry applications)
Machine learning has many potentially useful applications in the process industry, for example in process monitoring and control. Continuously accumulating process data, together with recent developments in software and hardware that enable more advanced machine learning, are fulfilling the prerequisites for developing and deploying machine learning applications integrated with process automation, which improve existing functionalities or even implement artificial intelligence.
In this master's thesis, a framework is designed and implemented on a proof-of-concept level, to enable easy acquisition of process data to be used with modern machine learning libraries, and to also enable scalable online deployment of the trained models. The literature part of the thesis concentrates on studying the current state and approaches for digital advisory systems for process operators, as a potential application to be developed on the machine learning framework.
The literature study shows that the approaches for process operators' decision support tools have shifted from rule-based and knowledge-based methods to machine learning. However, no standard methods can be identified, and most of the use cases are quite application-specific.
In the developed machine learning framework, both commercial software and open source components with permissive licenses are used. Data is acquired over OPC UA and then processed in Python, which is currently almost the de facto standard language in data analytics. Microservice architecture with containerization is used in the online deployment, and in a qualitative evaluation, it proved to be a versatile and functional solution