11 research outputs found

    Cold Storage Data Archives: More Than Just a Bunch of Tapes

    The abundance of available sensor and derived data from large scientific experiments, such as earth observation programs, radio astronomy sky surveys, and high-energy physics, already exceeds the volume of storage hardware fabricated globally per year. To that end, cold storage data archives are the---often overlooked---spearheads of modern big data analytics in scientific, data-intensive application domains. While high-performance data analytics has received much attention from the research community, the growing number of problems in designing and deploying cold storage archives has received very little attention. In this paper, we take the first step towards bridging this gap in knowledge by presenting an analysis of four real-world cold storage archives from three different application domains. In doing so, we highlight (i) workload characteristics that differentiate these archives from traditional, performance-sensitive data analytics, (ii) design trade-offs involved in building cold storage systems for these archives, and (iii) deployment trade-offs with respect to migration to the public cloud. Based on our analysis, we discuss several other important research challenges that need to be addressed by the data management community.
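    To make the cloud-migration trade-off above concrete, the following is a minimal sketch of a break-even comparison between an on-premises tape archive and a cloud cold-storage tier. All prices, capacities, and function names are invented placeholders for illustration, not figures from the paper.

    # Illustrative only: a toy cost model for the on-premises-vs-cloud
    # deployment trade-off. All prices below are assumed placeholders.

    def tape_archive_cost(capacity_tb, years,
                          media_cost_per_tb=10.0,      # assumed tape media price
                          drive_and_library=50_000.0,  # assumed fixed hardware cost
                          ops_per_year=5_000.0):       # assumed staffing/power cost
        """Rough total cost of an on-premises tape archive."""
        return drive_and_library + capacity_tb * media_cost_per_tb + years * ops_per_year

    def cloud_archive_cost(capacity_tb, years,
                           price_per_tb_month=1.0,     # assumed cold-tier storage price
                           reads_tb_per_year=10.0,     # assumed recall volume
                           retrieval_per_tb=20.0):     # assumed retrieval fee
        """Rough total cost of a cloud cold-storage tier (pay-as-you-go)."""
        storage = capacity_tb * price_per_tb_month * 12 * years
        retrieval = reads_tb_per_year * retrieval_per_tb * years
        return storage + retrieval

    if __name__ == "__main__":
        # Compare a 1 PB archive over several horizons: the fixed cost of
        # tape amortizes over time, while cloud cost grows linearly.
        for years in (1, 5, 10):
            print(years, tape_archive_cost(1000, years), cloud_archive_cost(1000, years))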

    Adaptive Management of Multimodel Data and Heterogeneous Workloads

    Data management systems are facing a growing demand for a tighter integration of heterogeneous data from different applications and sources for both operational and analytical purposes in real time. However, the vast diversification of the data management landscape has led to a situation where there is a trade-off between high operational performance and a tight integration of data. The difference between the growth of data volume and the growth of computational power demands a new approach for managing multimodel data and handling heterogeneous workloads. With PolyDBMS, we present a novel class of database management systems that bridges the gap between multimodel database and polystore systems. This new kind of database system combines the operational capabilities of traditional database systems with the flexibility of polystore systems, including support for data modifications, transactions, and schema changes at runtime. With native support for multiple data models and query languages, a PolyDBMS presents a holistic solution for the management of heterogeneous data. This not only enables a tight integration of data across different applications, it also allows a more efficient usage of resources. By leveraging and combining highly optimized database systems as storage and execution engines, this novel class of database system takes advantage of decades of database systems research and development. In this thesis, we present the conceptual foundations and models for building a PolyDBMS. This includes a holistic model for maintaining and querying multiple data models in one logical schema that enables cross-model queries. With the PolyAlgebra, we present a solution for representing queries based on one or multiple data models while preserving their semantics. Furthermore, we introduce a concept for the adaptive planning and decomposition of queries across heterogeneous database systems with different capabilities and features. The conceptual contributions presented in this thesis materialize in Polypheny-DB, the first implementation of a PolyDBMS. Supporting the relational, document, and labeled property graph data models, Polypheny-DB is a suitable solution for structured, semi-structured, and unstructured data. This is complemented by an extensive type system that includes support for binary large objects. With support for multiple query languages, industry-standard query interfaces, and a rich set of domain-specific data stores and data sources, Polypheny-DB offers a flexibility unmatched by existing data management solutions.
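    As a rough illustration of the PolyDBMS idea of one logical schema served by specialized engines, the following sketch routes requests over a relational engine and a stand-in document store. It is a hypothetical toy that assumes nothing about Polypheny-DB's actual interfaces or the PolyAlgebra.

    # A minimal, hypothetical sketch of the polystore routing idea: one
    # logical catalog, multiple underlying engines. Not Polypheny-DB's API.
    import sqlite3

    class PolyRouter:
        """Routes logical entities to the engine that stores them."""
        def __init__(self):
            self.relational = sqlite3.connect(":memory:")  # relational engine
            self.documents = {}                            # stand-in document store

        def execute(self, entity, query, params=()):
            # Dispatch on where the entity lives -- the essence of a polystore.
            if entity in self.documents:
                return [d for d in self.documents[entity] if query(d)]
            return self.relational.execute(query, params).fetchall()

    router = PolyRouter()
    router.relational.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    router.relational.execute("INSERT INTO users VALUES (1, 'alice')")
    router.documents["events"] = [{"user_id": 1, "kind": "login"}]

    # A cross-model lookup: fetch a relational row, then matching documents.
    (uid, name), = router.execute("users", "SELECT id, name FROM users")
    events = router.execute("events", lambda d: d["user_id"] == uid)
    print(name, events)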

    ICSEA 2021: the sixteenth international conference on software engineering advances

    The Sixteenth International Conference on Software Engineering Advances (ICSEA 2021), held on October 3-7, 2021 in Barcelona, Spain, continued a series of events covering a broad spectrum of software-related topics. The conference covered fundamentals of designing, implementing, testing, validating, and maintaining various kinds of software. The tracks treated the topics from theory to practice, in terms of methodologies, design, implementation, testing, use cases, tools, and lessons learnt. The conference topics covered classical and advanced methodologies, open source, agile software, as well as software deployment and software economics and education. The conference had the following tracks:
    - Advances in fundamentals for software development
    - Advanced mechanisms for software development
    - Advanced design tools for developing software
    - Software engineering for service computing (SOA and Cloud)
    - Advanced facilities for accessing software
    - Software performance
    - Software security, privacy, safeness
    - Advances in software testing
    - Specialized software advanced applications
    - Web Accessibility
    - Open source software
    - Agile and Lean approaches in software engineering
    - Software deployment and maintenance
    - Software engineering techniques, metrics, and formalisms
    - Software economics, adoption, and education
    - Business technology
    - Improving productivity in research on software engineering
    - Trends and achievements
    Similar to the previous edition, this event continued to be very competitive in its selection process and very well perceived by the international software engineering community, attracting excellent contributions and active participation from all over the world. We were very pleased to receive a large number of top-quality contributions. We take this opportunity to warmly thank all the members of the ICSEA 2021 technical program committee as well as the numerous reviewers. The creation of such a broad and high-quality conference program would not have been possible without their involvement. We also kindly thank all the authors who dedicated much of their time and effort to contribute to ICSEA 2021. We truly believe that, thanks to all these efforts, the final conference program consisted of top-quality contributions. This event also could not have been a reality without the support of many individuals, organizations, and sponsors. We gratefully thank the members of the ICSEA 2021 organizing committee for their help in handling the logistics and for their work in making this professional meeting a success. We hope ICSEA 2021 was a successful international forum for the exchange of ideas and results between academia and industry and that it promoted further progress in software engineering research.

    Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original formats, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation in the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies, as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow access to these relevant entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
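    The following sketch illustrates the fine-grained source selection idea behind RDF-MTs: sources are described by the semantic concepts and predicates they cover, and triple patterns are routed only to sources whose descriptions match. The source names, concepts, and predicates are invented for illustration; this is not the MULDER, Ontario, or BOUNCER implementation.

    # A hedged sketch of RDF-MT-style source selection: each source is
    # described by the concepts (molecule templates) and predicates it
    # covers, and triple patterns are matched against those descriptions.
    from collections import defaultdict

    # Hypothetical source descriptions: source -> concept -> predicates.
    RDF_MTS = {
        "src_clinical": {"Patient": {"hasDiagnosis", "hasAge"}},
        "src_genomic":  {"Patient": {"hasMutation"}, "Gene": {"locatedOn"}},
    }

    def select_sources(triple_patterns):
        """Map each (concept, predicate) pattern to the sources that can answer it."""
        selected = defaultdict(list)
        for concept, predicate in triple_patterns:
            for source, mts in RDF_MTS.items():
                if predicate in mts.get(concept, set()):
                    selected[(concept, predicate)].append(source)
        return dict(selected)

    query = [("Patient", "hasDiagnosis"), ("Patient", "hasMutation")]
    print(select_sources(query))
    # Fine-grained descriptions avoid shipping a pattern to sources that
    # share the concept but lack the predicate, cutting useless transfers.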

    Advanced distributed data integration infrastructure and research data management portal

    The amount of data available due to the rapid spread of advanced information technology is exploding. At the same time, continued research on data integration systems aims to provide users with uniform data access and efficient data sharing. The ability to share data is particularly important for interdisciplinary research, where a comprehensive picture of the subject requires large amounts of data from disparate data sources from a variety of disciplines. While there are numerous datasets available from various groups worldwide, the existing data sources are principally oriented toward regional comparative efforts rather than global applications. They vary widely in both content and format. Such data sources cannot easily be integrated and maintained by small groups of developers. I propose an advanced infrastructure for large-scale data integration based on crowdsourcing. In particular, I propose a novel architecture and algorithms to efficiently store dynamically incoming heterogeneous datasets, enabling both data integration and data autonomy. My proposed infrastructure combines machine learning algorithms and human expertise to perform efficient schema alignment and maintain relationships between the datasets. It provides efficient data exploration functionality without requiring users to write complex queries, and performs approximate information fusion when an exact match does not exist. Finally, I introduce the Col*Fusion system, which implements the proposed advanced data integration infrastructure.
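    The combination of machine learning and human expertise for schema alignment described above can be pictured with a small sketch: automatic matches above a confidence threshold are accepted, while ambiguous ones are queued for human review. The similarity measure and thresholds here are invented placeholders, not the Col*Fusion algorithms.

    # Hypothetical human-in-the-loop schema alignment: auto-accept
    # high-confidence column matches, queue ambiguous ones for review.
    from difflib import SequenceMatcher

    def similarity(a, b):
        """String similarity in [0, 1]; a stand-in for a learned matcher."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def align_schemas(cols_a, cols_b, accept=0.9, review=0.6):
        matches, review_queue = [], []
        for a in cols_a:
            best = max(cols_b, key=lambda b: similarity(a, b))
            score = similarity(a, best)
            if score >= accept:
                matches.append((a, best))       # trusted automatic match
            elif score >= review:
                review_queue.append((a, best))  # ask a human expert
        return matches, review_queue

    print(align_schemas(["country", "pop_total"], ["Country", "population"]))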

    Designing Data Spaces

    This open access book provides a comprehensive view on data ecosystems and platform economics, from methodical and technological foundations up to reports from practical implementations and applications in various industries. To this end, the book is structured in four parts: Part I, “Foundations and Contexts”, provides a general overview of building, running, and governing data spaces and an introduction to the IDS and GAIA-X projects. Part II, “Data Space Technologies”, subsequently details various implementation aspects of IDS and GAIA-X, including, e.g., data usage control, the usage of blockchain technologies, and semantic data integration and interoperability. Next, Part III describes various “Use Cases and Data Ecosystems” from application areas such as agriculture, healthcare, industry, energy, and mobility. Part IV eventually offers an overview of several “Solutions and Applications”, e.g., including products and experiences from companies like Google, SAP, Huawei, T-Systems, Innopay, and many more. Overall, the book provides professionals in industry with an encompassing overview of the technological and economic aspects of data spaces, based on the International Data Spaces and Gaia-X initiatives. It presents implementations and business cases and gives an outlook on future developments. In doing so, it aims to promote the vision of a social data market economy based on data spaces that embrace trust and data sovereignty.

    The Circular Economy Challenge: Towards a Sustainable Development

    Many recent events, including the COVID-19 pandemic and climate change, have proven the necessity of transforming the current economic system, which is based on the linear schema of “take”, “make”, “use”, and “dispose”. This radical change should involve all of the actors in the economic system: institutions, industries, consumers, and scientific research. Only cooperation among these stakeholders can ensure an effective shift toward a circular model. However, which kinds of actions can be performed to implement an effective circular economy? The present Special Issue collects nine papers that prove the possibility of implementing the circular economy from different points of view. The authors analyze all of the spheres of sustainability (environmental, economic, and social) in a variety of contexts, evaluating the effects of the circular choices. The nine papers cover several key product value chains, in agreement with the most recent European Circular Economy Action Plan (e.g., electronics and ICT, batteries, plastics, construction and buildings, and food). The present paper collection proves that the circular economy is not merely a simple business model; rather, it involves the integration of many strategies for the protection of the natural ecosystem and the maintenance of worldwide economic stability. A holistic approach is essential for a successful business model, and innovation has an indispensable role in the transition. In this context, the present Special Issue aims to be a multidisciplinary collection of innovations useful for all of the stakeholders involved in the circular economy.

    Market driven elastic secure infrastructure

    In today’s Data Centers, a combination of factors leads to the static allocation of physical servers and switches into dedicated clusters, such that it is difficult to add or remove hardware from these clusters for short periods of time. This siloing of the hardware leads to inefficient use of clusters. This dissertation proposes a novel architecture for improving the efficiency of clusters by enabling them to add or remove bare-metal servers for short periods of time. By implementing a working prototype of the architecture, we demonstrate that such silos can be broken and that it is possible to share servers between clusters that are managed by different tools, have different security requirements, and are operated by tenants of the Data Center who may not trust each other.
    Physical servers and switches in a Data Center are grouped for a combination of reasons. They are used for different purposes (staging, production, research, etc.); host applications required for servicing specific workloads (HPC, Cloud, Big Data, etc.); and/or are configured to meet stringent security and compliance requirements. Additionally, the different provisioning systems and tools, such as OpenStack Ironic, MaaS, and Foreman, that are used to manage these clusters take control of the servers, making it difficult to add or remove hardware from their control. Moreover, these clusters are typically stood up with sufficient capacity to meet the anticipated peak workload. This leads to inefficient usage of the clusters: they are under-utilized during off-peak hours, and in cases where demand exceeds capacity, the clusters suffer from degraded quality of service (QoS) or may violate service level objectives (SLOs). Although today’s clouds offer huge benefits in terms of on-demand elasticity, economies of scale, and a pay-as-you-go model, many organizations are reluctant to move their workloads to the cloud. Organizations that (i) need total control of their hardware, (ii) have custom deployment practices, (iii) need to meet stringent security and compliance requirements, or (iv) do not want to pay the high costs incurred from running workloads in the cloud prefer to own their hardware and host it in a data center. This includes a large section of the economy, including financial companies, medical institutions, and government agencies, that continues to host its own clusters outside of the public cloud. Considering that all the clusters may not undergo peak demand at the same time provides an opportunity to improve the efficiency of clusters by sharing resources between them.
    This dissertation describes the design and implementation of the Market Driven Elastic Secure Infrastructure (MESI) as an alternative to the public cloud and as an architecture for the lowest layer of the public cloud to improve its efficiency. It allows mutually non-trusting physically deployed services to share the physical servers of a data center efficiently. The approach proposed here is to build a system composed of a set of services, each fulfilling a specific functionality. A tenant of MESI has to trust only a minimal functionality of the tenant that offers the hardware resources; the rest of the services can be deployed by each tenant themselves. MESI is based on the idea of enabling tenants to share hardware they own with tenants they may not trust and between clusters with different security requirements. The architecture provides control and freedom of choice to the tenants, whether they wish to deploy and manage these services themselves or use them from a trusted third party. MESI services fit into three layers that build on each other to provide: (1) Elastic Infrastructure, (2) Elastic Secure Infrastructure, and (3) Market-driven Elastic Secure Infrastructure.
    (1) The Hardware Isolation Layer (HIL), the bottommost layer of MESI, is designed for moving nodes between the multiple tools and schedulers used for managing the clusters. HIL controls the layer-2 switches and bare-metal servers such that tenants can elastically adjust the size of their clusters in response to the changing demand of the workload. It enables the movement of nodes between clusters with minimal to no modifications to the tools and workflows used for managing these clusters.
    (2) The Elastic Secure Infrastructure (ESI) builds on HIL to enable sharing of servers between clusters with different security requirements and mutually non-trusting tenants of the Data Center. ESI enables the borrowing tenant to minimize its trust in the node provider and take control of the trade-offs between cost, performance, and security. This enables sharing of nodes between tenants that are not only part of the same organization but may belong to different organizations in a co-located Data Center.
    (3) The Bare-metal Marketplace is an incentive-based system that uses the economic principles of the marketplace to encourage tenants to share their servers with others, not just when they do not need them but also when others need them more. It gives tenants the ability to define their own cluster objectives and sharing constraints, and the freedom to decide the number of nodes they wish to share with others.
    MESI is evaluated using prototype implementations at each layer of the architecture. (i) The HIL prototype, implemented in only 3000 lines of code (LOC), is able to support many provisioning tools and schedulers with little to no modification, adds no overhead to the performance of the clusters, and is in active production use at the MOC, managing over 150 servers and 11 switches. (ii) The ESI prototype builds on the HIL prototype and adds to it an attestation service, a provisioning service, and a deterministically built open-source firmware. Results demonstrate that it is possible to build a cluster that is secure, elastic, and fairly quick to set up, and that the tenant requires only minimal trust in the provider for the availability of the node. (iii) The MESI prototype demonstrates the feasibility of a one-of-a-kind multi-provider marketplace for trading bare-metal servers in which the providers also use the nodes. It uses agents that trade bare-metal servers in the marketplace to meet the requirements of their clusters. The evaluation of the MESI prototype shows that all the clusters benefit from participating in the marketplace: compared to operating as silos, individual clusters see a 50% improvement in the total work done, up to a 75% reduction in time spent waiting in queues, and up to a 60% improvement in the aggregate utilization of the testbed.
    This dissertation makes the following contributions: (i) It defines the MESI architecture, which allows mutually non-trusting tenants of the data center to share resources between clusters with different security requirements. (ii) It demonstrates that it is possible to design a service that breaks the silos of static cluster allocation yet has a small Trusted Computing Base (TCB) and adds no overhead to the performance of the clusters. (iii) It provides a unique architecture that puts the tenant in control of its own security and minimizes the trust needed in the provider for sharing nodes. (iv) It presents a working prototype of a multi-provider marketplace for bare-metal servers, a first proof-of-concept demonstrating that it is possible to trade real bare-metal nodes at practical time scales, such that moving nodes between clusters is fast enough to get useful work done. (v) Finally, the results show that it is possible to encourage even mutually non-trusting tenants to share their nodes with each other without any central authority making allocation decisions. Many smart, dedicated engineers and researchers have contributed to this work over the years. I jointly led the efforts to design the HIL and ESI layers, and led the design and implementation of the bare-metal marketplace and the overall MESI architecture.
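    As an illustration of the HIL idea of moving bare-metal nodes between per-tenant networks, here is a minimal sketch. The class, method names, and data structures are hypothetical stand-ins, not HIL's actual API; per the abstract, the real layer reconfigures layer-2 switches and leaves provisioning to each tenant's own tools.

    # A minimal sketch of an isolation layer that moves bare-metal nodes
    # between per-tenant clusters. Names and structures are hypothetical.
    class HardwareIsolationLayer:
        def __init__(self):
            self.free_pool = {"node1", "node2"}
            self.projects = {}  # project -> set of nodes on its L2 network

        def allocate(self, project, node):
            # Detach the node from the free pool and attach its NIC to the
            # tenant's network; provisioning stays with the tenant's tools.
            self.free_pool.remove(node)
            self.projects.setdefault(project, set()).add(node)

        def release(self, project, node):
            # Return the node to the free pool; in the ESI layer it would
            # be wiped and re-attested before reuse by another tenant.
            self.projects[project].remove(node)
            self.free_pool.add(node)

    hil = HardwareIsolationLayer()
    hil.allocate("hpc-cluster", "node1")  # cluster grows for peak demand
    hil.release("hpc-cluster", "node1")   # and shrinks when demand drops
    print(hil.free_pool)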

    The Circular Economy Challenge: Towards a Sustainable Development

    As is now known, we have only one Earth available for our life, and it is our duty to preserve it [...]