33 research outputs found

    How to feed the Squerall with RDF and other data nuts?

    Advances in data management methods have resulted in a wide array of storage solutions with varying query capabilities and support for different data formats. Traditionally, heterogeneous data was transformed offline into a single format and migrated to a single data management system before being uniformly queried. However, with the increasing number of heterogeneous data sources, many of which are dynamic, modern applications prefer to access the original, fresh data directly. Addressing this requirement, we designed and developed Squerall, a software framework that enables querying large, heterogeneous data on the fly, in its original form and without prior transformation. Squerall is built from the ground up with extensibility in mind, e.g., to support more data sources. Here, we explain Squerall's extensibility and demonstrate step by step how to add support for RDF data, a new extension to the previously supported range of data sources.
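
    The following is a minimal, hedged sketch of the general idea of making RDF queryable alongside other wrapped sources; it is not Squerall's actual connector API (which the abstract does not show). An N-Triples file is parsed into a Spark DataFrame that a mediator could then join with data from other stores. The file path and column names are hypothetical.

        # Hedged illustration (not Squerall's connector code): parse an N-Triples
        # file into a Spark DataFrame so RDF can be queried next to other sources.
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import regexp_extract

        spark = SparkSession.builder.appName("rdf-as-dataframe").getOrCreate()

        # Each N-Triples line looks like: <subject> <predicate> <object> .
        lines = spark.read.text("people.nt")  # hypothetical input file
        triples = lines.select(
            regexp_extract("value", r"^<([^>]+)>", 1).alias("subject"),
            regexp_extract("value", r"^<[^>]+>\s+<([^>]+)>", 1).alias("predicate"),
            regexp_extract("value", r"^<[^>]+>\s+<[^>]+>\s+(.+)\s+\.$", 1).alias("object"),
        )
        triples.createOrReplaceTempView("rdf_triples")
        spark.sql("SELECT subject, object FROM rdf_triples "
                  "WHERE predicate = 'http://xmlns.com/foaf/0.1/name'").show()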

    Semantic Data Management in Data Lakes

    In recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. Some approaches propose linking metadata to knowledge graphs based on the Linked Data principles to provide more meaning and semantics to the data in the lake. Such a semantic layer may be utilized not only for data management but also to tackle the problem of data integration from heterogeneous sources, in order to make data access more expressive and interoperable. In this survey, we review recent approaches with a specific focus on their application within data lake systems and their scalability to Big Data. We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access. In each category, we cover the main techniques and their background and compare the latest research. Finally, we point out challenges for future work in this research area, which needs a closer integration of Big Data and Semantic Web technologies.
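
    As a minimal illustration of the metadata-linkage idea described above (not taken from any specific surveyed system), the sketch below uses the rdflib Python library to attach a DCAT-style description to a data lake file and link it to a knowledge graph concept; the dataset IRI and storage path are hypothetical.

        # Hedged sketch: describe a data lake file as a dcat:Dataset and link it
        # to a knowledge graph concept (DBpedia) using Linked Data vocabularies.
        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import DCTERMS, RDF

        DCAT = Namespace("http://www.w3.org/ns/dcat#")

        g = Graph()
        dataset = URIRef("http://example.org/lake/sales-2023")  # hypothetical IRI

        g.add((dataset, RDF.type, DCAT.Dataset))
        g.add((dataset, DCTERMS.title, Literal("Sales transactions 2023")))
        g.add((dataset, DCTERMS.subject, URIRef("http://dbpedia.org/resource/Retail")))
        g.add((dataset, DCAT.downloadURL, URIRef("s3://lake/raw/sales_2023.parquet")))

        print(g.serialize(format="turtle"))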

    Deep Lake: a Lakehouse for Deep Learning

    Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, unlock data-driven decision-making, improve operational efficiency, and reduce costs. However, as deep learning takes over common analytical workflows, traditional data lakes become less useful for applications such as natural language processing (NLP), audio processing, computer vision, and applications involving non-tabular datasets. This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop. Deep Lake maintains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, and annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) the Tensor Query Language, (b) an in-browser visualization engine, or (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from PyTorch, TensorFlow, and JAX, and integrated with numerous MLOps tools.
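
    As a rough illustration of the streaming access described above, the sketch below loads a public dataset with the open-source deeplake package and wraps it as a PyTorch loader; it assumes the deeplake 3.x Python API (deeplake.load and Dataset.pytorch) and the public hub://activeloop/mnist-train dataset path.

        # Hedged sketch (assumes the deeplake 3.x API): stream a Deep Lake dataset
        # straight into a PyTorch loop without materializing it locally first.
        import deeplake

        ds = deeplake.load("hub://activeloop/mnist-train")  # public example dataset

        # .pytorch() wraps the dataset as a torch DataLoader that streams tensors
        # over the network in the background.
        loader = ds.pytorch(batch_size=32, shuffle=True)

        for batch in loader:
            images, labels = batch["images"], batch["labels"]
            # ... feed images/labels to a model here
            break  # a single batch is enough to show the access pattern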

    A performance comparison of data lake table formats in cloud object storages

    The increasing informatization of processes involved in our daily lives has generated a significant increase in the number of software applications developed to meet these needs. As a consequence, the volume of data generated by applications is growing, which has raised interest in using it for analytical purposes, with the objective of gaining insights and extracting valuable information from it. This increase, however, has created new challenges related to the storage, organization, and processing of the data. In general, the interest is in obtaining relevant information quickly, consistently, and at the lowest possible cost. In this context, new approaches have emerged to facilitate the organization of and access to data at a massive scale. An already widespread concept is to have a central repository, known as a Data Lake, in which data from different sources, with variable characteristics, are massively stored so that they can be explored and processed in order to obtain new relevant information. These environments have lately been implemented on object storages, especially in the cloud, given the rise of this model in recent years. Frequently, the data stored in these environments is structured as tables and stored in files, which can be text files, such as CSVs, or binary files, such as Parquet, Avro, or ORC, that implement specific properties, like data compression. The modeling of these tables resembles, in some aspects, the structures of Data Warehouses, which are frequently implemented using Database Management Systems (DBMSs) and are used to make data available in a structured way for analytical purposes. Given these characteristics, new table format specifications have emerged, applied as layers above these files, which aim to support the usual operations and properties of DBMSs while using object storages as the storage layer. In practice, the intent is to guarantee the ACID properties during these operations, as well as the ability to perform operations that involve mutability, like updates, upserts, or merges, in a simpler way. Thus, this work aims to evaluate and compare the performance of some of these formats, defined as Data Lake Table Formats: Delta Lake, Apache Hudi, and Apache Iceberg, in order to identify how each format behaves when performing usual operations in these environments, such as inserting data, updating data, and querying data.
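
    To make the kinds of operations compared in this work concrete, here is a minimal, hedged PySpark sketch using one of the three formats (Delta Lake) to insert, upsert, and query a table on object storage; the bucket path and schema are hypothetical, and the session is assumed to have the delta-spark package available.

        # Hedged sketch with Delta Lake (one of the three compared formats): insert,
        # merge (upsert), and query a table stored as files on an object store path.
        from delta.tables import DeltaTable
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder.appName("table-format-demo")
                 .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                 .config("spark.sql.catalog.spark_catalog",
                         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                 .getOrCreate())

        path = "s3a://my-bucket/lake/customers"  # hypothetical object storage path

        # Insert: write an initial version of the table.
        spark.createDataFrame([(1, "Ana"), (2, "Bruno")], ["id", "name"]) \
             .write.format("delta").mode("overwrite").save(path)

        # Upsert: merge new and changed rows into the existing table (ACID).
        updates = spark.createDataFrame([(2, "Bruno S."), (3, "Carla")], ["id", "name"])
        (DeltaTable.forPath(spark, path).alias("t")
            .merge(updates.alias("u"), "t.id = u.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

        # Query: read the current snapshot of the table.
        spark.read.format("delta").load(path).show()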

    Adaptive Big Data Pipeline

    Over the past three decades, data has evolved from being a simple software by-product to one of a company's most important assets, used to understand customers and foresee trends. Deep learning has demonstrated that big volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate data pipeline creation. It provides an interface to capture the data sources, transformations, destinations, and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
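
    As a purely hypothetical illustration of the kind of declarative pipeline specification the abstract describes (sources, transformations, destinations, and a schedule), the sketch below is not the paper's actual interface; every field name is invented.

        # Hypothetical pipeline specification (illustrative only, not the paper's API):
        # capture sources, transformations, destinations, and an execution schedule,
        # which an automation layer could turn into cloud infrastructure and jobs.
        pipeline_spec = {
            "name": "daily-sales-enrichment",
            "sources": [
                {"type": "s3", "path": "s3://raw/sales/", "format": "csv"},
                {"type": "jdbc", "url": "jdbc:postgresql://crm/prod", "table": "customers"},
            ],
            "transformations": [
                {"op": "join", "on": "customer_id"},
                {"op": "filter", "expr": "amount > 0"},
                {"op": "aggregate", "group_by": ["region"], "metrics": {"amount": "sum"}},
            ],
            "destination": {"type": "warehouse", "table": "analytics.daily_sales"},
            "schedule": "0 3 * * *",  # run daily at 03:00 (cron syntax)
        }

        def lineage(spec: dict) -> list[tuple[str, str]]:
            """Derive a simple source -> destination lineage edge list from the spec."""
            dest = spec["destination"]["table"]
            return [(s.get("path") or s.get("table"), dest) for s in spec["sources"]]

        print(lineage(pipeline_spec))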

    LST-Bench: Benchmarking Log-Structured Tables in the Cloud

    Log-Structured Tables (LSTs), also commonly referred to as table formats, have recently emerged to bring consistency and isolation to object stores. With the separation of compute and storage, object stores have become the go-to for highly scalable and durable storage. However, this comes with its own set of challenges, such as the lack of the recovery and concurrency management that traditional database management systems provide. This is where LSTs such as Delta Lake, Apache Iceberg, and Apache Hudi come into play, providing an automatic metadata layer that manages tables defined over object stores and effectively addressing these challenges. A paradigm shift in the design of these systems necessitates updating evaluation methodologies. In this paper, we examine the characteristics of LSTs and propose extensions to existing benchmarks, including workload patterns and metrics, to accurately capture their performance. We introduce our framework, LST-Bench, which enables users to execute benchmarks tailored for the evaluation of LSTs. Our evaluation demonstrates how these benchmarks can be utilized to evaluate the performance, efficiency, and stability of LSTs. The code for LST-Bench is open-sourced and available at https://github.com/microsoft/lst-bench/.
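
    To give a feel for the longitudinal workload patterns mentioned above, here is a hedged sketch (not LST-Bench's actual workload definitions) that times repeated merge-heavy phases against an LST table to observe performance drift across table versions; the table names and the Delta-style MERGE syntax are illustrative.

        # Hedged sketch: time repeated merge phases of the kind an LST benchmark runs.
        import time
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("lst-phase-timing").getOrCreate()

        timings = []
        for phase in range(10):
            start = time.perf_counter()
            # A maintenance phase of the kind such benchmarks repeat many times;
            # lake.orders and staging.order_updates are hypothetical tables.
            spark.sql("""
                MERGE INTO lake.orders AS t
                USING staging.order_updates AS u
                ON t.order_id = u.order_id
                WHEN MATCHED THEN UPDATE SET *
                WHEN NOT MATCHED THEN INSERT *
            """)
            timings.append(time.perf_counter() - start)

        print([round(t, 2) for t in timings])  # per-phase latency, e.g. to spot drift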

    Extract, Transform, and Load data from Legacy Systems to Azure Cloud

    Internship report presented as a partial requirement for obtaining the Master's degree in Information Management, with a specialization in Knowledge Management and Business Intelligence. In a world of continuously evolving technologies and hardened competitive markets, organisations need to be continually on guard to grasp the cutting-edge technologies and tools that will help them surpass any competition that arises. Modern data platforms that incorporate cloud technologies help organisations get ahead of their competitors by providing solutions that capture and make optimal use of untapped data, scalable storage that adapts to ever-growing data quantities, and data processing and visualisation tools that improve the decision-making process. With many cloud providers available in the market, from small players to major technology corporations, organisations have much flexibility to choose the cloud technology that best aligns with their use cases and overall product and service strategy. This internship came about when one of Accenture's significant clients in the financial industry decided to migrate from legacy systems to a cloud-based data infrastructure, namely the Microsoft Azure cloud. During this internship, the data lake, a core part of the modern data platform (MDP), was developed in order to better understand the type of challenges that can be faced when migrating data from on-premise legacy systems to a cloud-based infrastructure. This work also provides the main recommendations and guidelines for performing a large-scale data migration.
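
    As a minimal, hedged sketch of the kind of ingestion step involved in such a migration (not the report's actual implementation), the snippet below uploads a file extracted from a legacy system into Azure Data Lake Storage Gen2 using the azure-storage-file-datalake Python SDK; the account, container, and path names are hypothetical.

        # Hedged sketch: land an extracted legacy file in Azure Data Lake Storage Gen2.
        # Account URL, credential, file system, and paths are hypothetical examples.
        from azure.identity import DefaultAzureCredential
        from azure.storage.filedatalake import DataLakeServiceClient

        service = DataLakeServiceClient(
            account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical
            credential=DefaultAzureCredential(),
        )

        file_system = service.get_file_system_client("raw")          # landing zone
        file_client = file_system.get_file_client("legacy/accounts/2020-01.csv")

        with open("extracts/accounts_2020-01.csv", "rb") as data:    # legacy extract
            file_client.upload_data(data, overwrite=True)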

    Proposal of an approach for the design and implementation of a data mesh

    Integrated Master's dissertation in Information Systems Engineering and Management. Currently there is an increasingly accentuated trend towards the use of software by most of the population (social applications, management software, e-commerce platforms, among others), leading to the creation and storage of data that, due to its characteristics (volume, variety, and velocity), gives rise to the concept of Big Data. In this area, and to support data storage, Big Data Warehouses and Data Lakes are solid concepts implemented by various organizations to serve their decision-making needs. However, despite these concepts being established and accepted by most of the scientific community and by several organizations worldwide, this does not eliminate the need for improvement and innovation in the field. It is in this context that the Data Mesh concept emerges, proposing decentralized data architectures. After analyzing the limitations demonstrated by monolithic architectures (e.g., the difficulty of changing the storage technologies used to implement the data system), it is possible to conclude that a paradigm shift is needed to make organizations truly data-driven. 
    A Data Mesh consists in the implementation of an architecture where data is intentionally distributed over several nodes of the mesh without descending into chaos, since centralized data governance strategies ensure that the fundamental principles of the domains are shared throughout the architecture. This dissertation proposes an approach for the implementation of a Data Mesh, seeking to define the concept's domain model. After this definition, a conceptual and a technological architecture are proposed, which aim to help materialize the concepts presented in the domain model and thus assist in the design and implementation of a Data Mesh. Afterwards, a proof of concept is carried out to validate the aforementioned models, contributing technical and scientific knowledge related to this emerging concept.
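
    As a purely illustrative sketch of the decentralization-plus-central-governance idea discussed above (not the dissertation's proposed models), the snippet below describes a domain-owned data product and runs it through a shared governance check; all field names and rules are invented.

        # Hypothetical data product descriptor and a shared governance check,
        # illustrating domain ownership with centrally defined rules (invented names).
        from dataclasses import dataclass, field

        @dataclass
        class DataProduct:
            domain: str
            name: str
            owner: str
            output_port: str                 # where consumers read the product
            schema_version: str = "1.0.0"
            tags: list[str] = field(default_factory=list)

        # Central governance policy shared by every node of the mesh.
        REQUIRED_TAGS = {"pii-classified", "sla-defined"}

        def governance_check(product: DataProduct) -> list[str]:
            """Return the list of violated mesh-wide rules for this data product."""
            violations = []
            if not product.owner:
                violations.append("data product must have an accountable owner")
            missing = REQUIRED_TAGS - set(product.tags)
            if missing:
                violations.append(f"missing mandatory tags: {sorted(missing)}")
            return violations

        sales_orders = DataProduct(
            domain="sales", name="orders", owner="sales-data-team",
            output_port="s3://mesh/sales/orders/", tags=["pii-classified", "sla-defined"],
        )
        print(governance_check(sales_orders))  # [] means the product is compliant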

    Design of a reference architecture for an IoT sensor network
