42 research outputs found

    Enabling data-driven decision-making for a Finnish SME: a data lake solution

    In the era of big data, data-driven decision-making has become a key success factor for companies of all sizes. Technological development has made it possible to store, process and analyse vast amounts of data effectively, and the availability of cloud computing services has lowered the cost of data analysis, so even small businesses have access to advanced technical solutions such as data lakes and machine learning applications. Data-driven decision-making requires integrating relevant data from various sources: data has to be extracted from distributed internal and external systems and stored in a centralised system where it can be processed and analysed for meaningful insights. Data can be structured, semi-structured or unstructured, and data lakes have emerged as a cost-effective way to store vast amounts of data, including a growing amount of unstructured data. The rise of the SaaS model has led companies to abandon on-premise software, which blurs the line between internal and external data because the company’s own data is actually maintained by a third party. Most enterprise software targeted at small businesses is provided through the SaaS model, so small businesses face the challenge of adopting data-driven decision-making while having limited visibility into their own data. In this thesis, we study how small businesses can take advantage of data-driven decision-making by leveraging cloud computing services. We found that the reporting features of the SaaS-based business applications used by our case company, a sales-oriented SME, were insufficient for detailed analysis, and that data-driven decision-making required aggregating data from multiple systems, causing excessive manual labour. A cloud-based data lake proved to be a cost-effective solution for creating a centralised repository with automated data integration; it enabled management to visualise customer and sales data and to assess the effectiveness of marketing efforts. Better data-analysis skills among the managers of the case company would, however, have been needed to obtain the full benefits of the solution
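
    As an illustration of the kind of automated data integration the thesis describes, the sketch below pulls one day's records from a hypothetical SaaS reporting endpoint and lands them unmodified in a cloud object-storage bucket; the endpoint, bucket name and credential handling are assumptions made for the example, not details from the thesis.
```python
# Minimal sketch of automated SaaS-to-data-lake ingestion (hypothetical endpoint and bucket).
import json
from datetime import date

import boto3     # AWS SDK for Python
import requests

API_URL = "https://api.example-saas.com/v1/sales"   # hypothetical SaaS reporting API
BUCKET = "sme-data-lake-raw"                         # hypothetical raw-zone bucket

def ingest_daily_sales(api_token: str) -> str:
    """Fetch one day's sales records and store them unmodified in the raw zone."""
    resp = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        params={"date": date.today().isoformat()},
        timeout=30,
    )
    resp.raise_for_status()
    key = f"sales/{date.today().isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )
    return key

if __name__ == "__main__":
    print("stored", ingest_daily_sales(api_token="..."))
```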

    Cooperative caching for object storage

    Data is increasingly stored in data lakes: vast immutable object stores that can be accessed from anywhere in the data center. By providing low-cost and scalable storage, today's immutable object-storage-based data lakes are used by a wide range of applications with diverse access patterns. Unfortunately, performance can suffer for applications that do not match the access patterns for which the data lake was designed. Moreover, in many of today's (non-hyperscale) data centers, limited bisection bandwidth will limit data lake performance. Today many compute clusters integrate caches both to address the mismatch between application performance requirements and the capabilities of the shared data lake, and to reduce the demand on the data center network. However, per-cluster caching i) means the expensive cache resources cannot be shifted between clusters based on demand, ii) makes sharing expensive because data accessed by multiple clusters is independently cached by each of them, and iii) makes it difficult for clusters to grow and shrink if their servers are being used to cache storage. In this dissertation, we present two novel data-center-wide cooperative cache architectures, Datacenter-Data-Delivery Network (D3N) and Directory-Based Datacenter-Data-Delivery Network (D4N), that are designed to be part of the data lake itself rather than part of the compute clusters that use it. D3N and D4N distribute caches across the data center to enable data sharing and elasticity of cache resources, with requests transparently directed to nearby cache nodes. They dynamically adapt to changes in access patterns and accelerate workloads while providing the same consistency, trust, availability, and resilience guarantees as the underlying data lake. We find that exploiting the immutability of object stores significantly reduces complexity and provides opportunities for cache management strategies that were not feasible for previous cooperative cache systems for file- or block-based storage. D3N is a multi-layer cooperative cache that targets workloads with large read-only datasets, such as big data analytics. It is designed to be easily integrated into existing data lakes, with only limited support for write caching of intermediate data, and it avoids any global state by, for example, using consistent hashing for locating blocks and making all caching decisions based purely on local information. Our prototype is performant enough to fully exploit the (5 GB/s read) SSDs and (40 Gbit/s) NICs in our system and improves the runtime of realistic workloads by up to 3x. The simplicity of D3N has enabled us, in collaboration with industry partners, to upstream the two-layer version of D3N into the existing code base of the Ceph object store as a new experimental feature, making it available to the many data lakes around the world based on Ceph. D4N is a directory-based cooperative cache that provides a reliable write tier and a distributed directory that maintains global state. It explores the use of global state to implement more sophisticated cache management policies and enables application-specific tuning of caching policies to support a wider range of applications than D3N. In contrast to previous cache systems that implement their own mechanism for maintaining dirty data redundantly, D4N re-uses the existing data lake (Ceph) software to implement the write tier and exploits the semantics of immutable objects to move aged objects to the shared data lake. This design greatly reduces the barrier to adoption and enables D4N to take advantage of sophisticated data lake features such as erasure coding. We demonstrate that D4N is performant enough to saturate the bandwidth of the SSDs, that it automatically adapts replication to the working set it serves, and that it outperforms the state-of-the-art cluster cache Alluxio. While it will be substantially more complicated to integrate the D4N prototype into production-quality code that can be adopted by the community, these results are compelling enough that our partners are starting that effort. D3N and D4N demonstrate that cooperative caching techniques, originally designed for file systems, can be employed to integrate caching into today's immutable object-based data lakes. We find that the properties of immutable object storage greatly simplify the adoption of these techniques and enable integration of caching in a fashion that re-uses existing battle-tested software, greatly reducing the barrier to adoption. By integrating caching into the data lake rather than the compute cluster, this research opens the door to efficient data-center-wide sharing of data and resources
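
    The abstract notes that D3N avoids global state by, for example, using consistent hashing to locate blocks. The sketch below shows that general technique for mapping block identifiers to cache nodes; it is a minimal illustration of consistent hashing, not D3N's or Ceph's actual implementation.
```python
# Generic consistent-hashing ring for mapping block IDs to cache nodes
# (illustrates the technique named in the abstract, not D3N's code).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, block_id: str) -> str:
        """Return the cache node responsible for a block, using only local state."""
        idx = bisect.bisect(self._keys, self._hash(block_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("bucket/object-42/block-0"))
```
    Because every client computes the same ring locally, blocks can be located without consulting a central directory, which is the property the abstract attributes to D3N.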

    Client analysis

    Master's project report (Trabalho de Projeto de Mestrado), Informática, 2022, Universidade de Lisboa, Faculdade de Ciências. Data visualization existed long before modern data visualization tools were invented; it was initially used on maps featuring markers for land, cities, roads and other features. As the need for more accurate physical mappings and measurements grew, data visualization had to grow accordingly, continually enhancing the best existing solutions. Today the market offers a variety of open-source, partially open and paid (closed-source) data visualization tools with a wide range of features for extracting and displaying all kinds of data, and the recent, more powerful tools can connect to different data sources such as SQL databases, Excel, JSON and CSV files, among many others. Their powerful, reactive visuals are a major upgrade for analysis in many large companies: they help analysts identify patterns by presenting complicated data in a very readable form, make users more aware of the parts of the data that need attention so that better business decisions can be taken, and make it easier to find errors by visually highlighting the areas that show warning signs. Decision-making through visual analysis works better because humans process visuals far more easily than tabular data reports, and as the "Big Data era" accelerates, visualization is becoming an ever more important tool for making sense of the billions of rows of data created every day. As humans we are wired to engage with stories, and in its most basic form data is anything but a story; data visualization helps tell stories by turning data into a more understandable format that shows patterns, and a good visualization tells a narrative by reducing the noise in the data and emphasising the most important facts. However, creating a visualization is not as simple as throwing the "information" element of an infographic on top of a graphic to make it look nicer. A careful balancing act between form and function is necessary for effective data display. The most basic graphic may be too uninteresting to look at or may send too strong a message; the most impressive representation may completely fail to convey the proper idea, or it may say too much. Information and images must complement each other, and combining strong analysis with good storytelling is an art. In short, the purpose of data visualization tools is to make sense of data, and how well that data is integrated into the visuals depends drastically on the tool used and on the person using it; every company should incorporate at least one data visualization tool to make significant gains in its crucial areas of operations. Client analysis, in correlation with behavioural analysis, is one of the key aspects for companies: analysing clients may be one of the most important business topics inside every company, since it is a critical part of the business model and the marketing plan. A deeper insight into the trends and patterns occurring in one of its key products helps the company adapt, take its clients' needs into account in more detail, and further improve its IP services. In particular, the company on which this work is based also regards its clients as partners, and they need access to a wide diversity of data on the trademarks they hold and that are being managed on their behalf: the fees applied, late instructions, payment and instruction dates, the best country-law modelling available on the market, trends, and so on. The company's client intelligence unit (CIU) recently began analysing the available data extensively in order to offer better products to its clients, but while extensive work has been done in some areas (especially the best-selling core products), the other services cannot be neglected; after examining which areas could be improved, we decided to focus on the Trademark Renewals department, producing analytical tools for its users while becoming familiar with the technologies in use. Knowing clients' behaviour and wishes is crucial to the quality of the business, not only to serve them better but also to adapt to a constantly changing global landscape that became even more challenging with the 2020 pandemic. Another important issue detected was that, owing to the unit's complexity and flexibility, the department was maintaining two portals for analytical purposes: users had to log in to both portals to access all functionality instead of finding every analytical module in a single place, which brings the usual problems of non-harmonised security handling and report creation. Moreover, the designs, code structure, modules and version control differ between the two portals, and the fact that they use different technologies does not make the problem easier to solve. Finding a way to merge the two portals, simplifying role and account creation and providing an interface where a single login gives access to all the analytical tools, is the goal of the third part of the project. Finally, regarding analytical visualization, the unit currently either codes it with Python libraries or relies on the limited possibilities that Pentaho offers; an analysis of the benefits of a more sophisticated data visualization tool, and of how to integrate it, would help the unit plan for a future with a broader range of tools and more efficient reporting and visualization. Consequently, three powerful and widely used data visualization tools, Power BI, Alteryx and Qlik, were tested and compared by first defining a list of essential features and then researching, and partially testing, whether each tool supports them; several demonstrations of the required capabilities were given to the CIU team, with emphasis on Power BI, the tool for which a PRO licence is already paid for each user. In summary, this thesis shows how to build analytical modules for analysing clients and their behaviour in Web2Py and Pentaho, two open-source technologies: one portal uses Web2Py, written in Python and based on the MVC pattern, while the other, Pentaho, is written in Java, uses JavaScript for its analytical modules and is connected to the B.I.R.T. reporting software. Besides a clear description of the analytical modules created, the thesis explains how the two existing portals are merged into one, unifying their interfaces so that the user appears to be browsing a single portal when exploring the analytical modules; the merged implementation simplifies user management for both portals and eases the department's work by automating some procedures. It closes with a study of powerful visualization tools that could replace existing modules and charts and of how they could be integrated into the merged interface. The work focuses on the trademark department and allowed me to acquire a great deal of knowledge about working in a team and individually, managing my time, setting priorities, guiding the relevant people, and being critical in matters related to the project
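
    As a rough illustration of the kind of analytical module described above, the sketch below aggregates hypothetical trademark-renewal fees by country with pandas and charts them; the column names and figures are invented for the example and do not reflect the company's actual schema or data.
```python
# Hypothetical client-analysis aggregation (assumed column names and sample data).
import matplotlib.pyplot as plt
import pandas as pd

renewals = pd.DataFrame({
    "client":  ["A", "A", "B", "C", "C", "C"],
    "country": ["PT", "ES", "PT", "FR", "FR", "DE"],
    "fee_eur": [420.0, 310.0, 560.0, 275.0, 300.0, 480.0],
})

# Total renewal fees per country, largest first.
fees_by_country = renewals.groupby("country")["fee_eur"].sum().sort_values(ascending=False)
fees_by_country.plot(kind="bar", title="Renewal fees by country (hypothetical data)")
plt.ylabel("EUR")
plt.tight_layout()
plt.show()
```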

    Metadata Management for Data Lake Governance

    In the era of Big Data, data is characterized by volume, velocity, variety, veracity and value (5V). The major challenge of Big Data, beyond storage, is to extract quality value through advanced analytics on voluminous, fast and varied data. Over the past decade, the Data Lake (DL) has emerged as a new solution to this Big Data analytics challenge. As a relatively new concept, the data lake has no standard definition or recognized architecture, and the proposals in the literature are insufficient given the breadth of the context. Our first contribution is a comprehensive definition and a generic architecture of the data lake comprising an ingestion zone, a preparation zone, an analysis zone and a data governance zone. Furthermore, governance is vital to ensure that a data lake is neither invisible nor inaccessible to its various users, and the central element of good governance is a metadata management system. In the literature, approaches to metadata management are fragmented and not necessarily generic for DLs. The major contribution of this thesis is a comprehensive metadata management solution that allows users to easily find, access, interoperate and reuse data as well as the processes and analyses performed in the DL. As a first step, we proposed a metadata model to manage the entire data life cycle in a DL: (i) metadata representing the different types of ingested data (structured, semi-structured and unstructured) and the different ingestion modes (batch and real-time), (ii) metadata representing the different data transformation processes (ETL, statistical exploration and the preparation phase of data science) through the specification of high-level operations, and (iii) analysis-oriented metadata, in particular for machine learning, to characterize the analyses performed in the DL so that future analyses can be quickly reused and parameterized. As a second step, we defined a metadata management system, named DAMMS. DAMMS allows users (i) to ingest metadata semi-automatically and (ii) to explore the content of the DL (data, transformation processes or analyses) in an ergonomic way in order to reuse or adapt them, and it thus has the advantage of responding to the need for the industrialization of data science. Finally, to evaluate the feasibility and usability of our proposal, we jointly conducted a performance study of metadata ingestion and a study of the user experience of DAMMS
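
    To make the three metadata categories concrete, the sketch below models dataset, transformation and analysis metadata as plain Python dataclasses; the field names are illustrative assumptions, not the actual DAMMS model.
```python
# Illustrative metadata records for the three categories described in the abstract
# (field names are assumptions, not the DAMMS schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetMetadata:            # (i) ingested data
    name: str
    structure: str                # "structured" | "semi-structured" | "unstructured"
    ingestion_mode: str           # "batch" | "real-time"
    source: str

@dataclass
class ProcessMetadata:            # (ii) transformation processes
    name: str
    operation: str                # high-level operation, e.g. "clean", "join", "filter"
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

@dataclass
class AnalysisMetadata:           # (iii) analyses, e.g. machine-learning runs
    name: str
    algorithm: str
    parameters: dict = field(default_factory=dict)
    input_datasets: List[str] = field(default_factory=list)

sales = DatasetMetadata("sales_raw", "semi-structured", "batch", "crm_export")
clean = ProcessMetadata("clean_sales", "clean", inputs=["sales_raw"], outputs=["sales_clean"])
model = AnalysisMetadata("churn_model", "random_forest", {"n_estimators": 100}, ["sales_clean"])
```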

    System architecture for sensor data stream processing using machine learning capabilities

    The process of selecting the right tools for an IT system architecture is becoming increasingly difficult. With modern technology trends such as machine learning and Big Data, it is easy to get lost in the myriad of tools capable of serving our purposes. This thesis exemplifies how thorough the analysis required to design a relatively simple yet demanding use case can be. Our main goal is to familiarise the reader with the technologies commonly used for building cutting-edge stream processing systems. The thesis explains the general concepts and trends in designing such an architecture and then specialises them for real-world use cases involving sensor data collection, processing, storage and analysis. After drafting a potential architecture, we theoretically evaluate the candidate tools and select those best suited to our purposes. These tools are then proposed for use in various system configurations; one of these configurations is selected and further evaluated on real-life data and a real-life scenario
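
    As a minimal sketch of the kind of sensor stream processing the thesis evaluates, the snippet below consumes JSON temperature readings from a Kafka topic and flags values that deviate from a rolling average; the broker address, topic name and message format are assumptions, not the configuration chosen in the thesis.
```python
# Minimal sensor-stream consumer with a rolling average
# (assumes a Kafka topic "sensor-readings" carrying JSON messages).
import json
from collections import deque

from kafka import KafkaConsumer   # kafka-python

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

window = deque(maxlen=100)         # rolling window over the last 100 readings
for message in consumer:
    reading = message.value        # e.g. {"sensor_id": "s1", "temperature": 21.7}
    window.append(reading["temperature"])
    rolling_avg = sum(window) / len(window)
    if reading["temperature"] > rolling_avg + 5:
        print(f"anomaly from {reading['sensor_id']}: {reading['temperature']:.1f}")
```
    In a fuller pipeline this consumer would sit between the collection layer and the storage and analysis layers the thesis compares.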

    Data Spaces

    This open access book aims to educate data space designers to understand what is required to create a successful data space. It explores cutting-edge theory, technologies, methodologies, and best practices for data spaces for both industrial and personal data, and provides the reader with a basis for understanding the design, deployment, and future directions of data spaces. The book captures the early lessons and experience in creating data spaces and arranges these contributions into three parts covering design, deployment, and future directions respectively. The first part explores the design space of data spaces; its chapters detail the organisational design for data spaces, data platforms, data governance, federated learning, personal data sharing, data marketplaces, and hybrid artificial intelligence for data spaces. The second part describes the use of data spaces within real-world deployments; its chapters are co-authored with industry experts and include case studies of data spaces in sectors including Industry 4.0, food safety, FinTech, health care, and energy. The third and final part details future directions for data spaces, including challenges and opportunities for common European data spaces and privacy-preserving techniques for trustworthy data sharing. The book is of interest to two primary audiences: first, researchers interested in data management and data sharing, and second, practitioners and industry experts engaged in data-driven systems where the sharing and exchange of data within an ecosystem are critical

    Adaptive Big Data Pipeline

    Over the past three decades, data has evolved from being a simple software by-product to one of companies' most important assets, used to understand their customers and foresee trends. Deep learning has demonstrated that large volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can reliably take such pipelines into production by leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform that automates data pipeline creation. It provides an interface to capture the data sources, transformations, destinations and execution schedule; the system then builds the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. The system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention
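
    The abstract says the platform captures sources, transformations, destinations and an execution schedule through an interface; the snippet below sketches what such a declarative pipeline specification could look like as a plain Python dictionary. The keys and values are illustrative assumptions, not the system's actual format.
```python
# Hypothetical declarative pipeline specification of the kind the abstract describes
# (keys and values are illustrative, not the platform's actual schema).
pipeline_spec = {
    "name": "daily_sales_pipeline",
    "schedule": "0 2 * * *",                       # run daily at 02:00 (cron syntax)
    "sources": [
        {"type": "s3", "uri": "s3://raw-zone/sales/*.json"},
        {"type": "jdbc", "uri": "jdbc:postgresql://crm/db", "table": "customers"},
    ],
    "transformations": [
        {"op": "join", "on": "customer_id"},
        {"op": "filter", "condition": "amount > 0"},
        {"op": "aggregate", "group_by": ["country"], "metrics": {"amount": "sum"}},
    ],
    "destination": {"type": "parquet", "uri": "s3://curated-zone/sales_by_country/"},
}

# A scheduler could validate this spec, provision the infrastructure, and record the
# lineage graph (sources -> transformations -> destination) before execution.
```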
