52 research outputs found

    Polyflow: a Polystore-compliant mechanism to provide interoperability to heterogeneous provenance graphs

    Many scientific experiments are modeled as workflows. Workflows usually output massive amounts of data. To guarantee the reproducibility of workflows, they are usually orchestrated by Workflow Management Systems (WfMSs), which capture provenance data. Provenance represents the lineage of a data fragment throughout its transformations by activities in a workflow. Provenance traces are usually represented as graphs. These graphs allow scientists to analyze and evaluate results produced by a workflow. However, each WfMS has a proprietary format for provenance and captures it at a different level of granularity. Therefore, a challenge arises in more complex scenarios in which the scientist needs to interpret provenance graphs generated by multiple WfMSs and workflows. To first understand the research landscape, we conduct a Systematic Literature Mapping, assessing existing solutions under several different lenses. With a clearer understanding of the state of the art, we propose a tool called Polyflow, which is based on the concept of Polystore systems, integrating several databases of heterogeneous origin by adopting a global ProvONE schema. Polyflow allows scientists to query multiple provenance graphs in an integrated way. Polyflow was evaluated by experts using provenance data collected from real experiments that generate phylogenetic trees through workflows. The experiment results suggest that Polyflow is a viable solution for interoperating heterogeneous provenance data generated by different WfMSs, from both a usability and a performance standpoint.
    Many scientific experiments are modeled as workflows. Workflows commonly produce a large volume of data. To guarantee the reproducibility of these workflows, they are usually orchestrated by Workflow Management Systems (WfMSs), ensuring that provenance data are captured. Provenance data represent the derivation history of a piece of data throughout the execution of the workflow. The derivation history of the data can therefore be represented as a provenance graph. This graph enables scientists to analyze and evaluate results produced by a workflow. However, each WfMS has its own proprietary format for representing provenance data and stores them at different granularities. Consequently, in more complex scenarios in which a scientist needs to analyze, in an integrated way, provenance graphs generated by multiple workflows, this becomes challenging. First, to understand the research field, we conducted a Systematic Literature Mapping, evaluating existing solutions through different lenses. With a clearer understanding of the current state of the art, we propose a tool called Polyflow, inspired by concepts of Polystore systems, which enables the integration of several heterogeneous databases through a single query interface that uses ProvONE as the global schema. Polyflow allows scientists to submit queries over multiple provenance graphs in an integrated manner. Polyflow was evaluated together with experts using provenance data collected from real workflows that support the study of phylogenetic tree generation. The evaluation results showed the viability of Polyflow for semantically interoperating provenance data generated by different WfMSs, from both a performance and a usability standpoint.

    A scalable database model of RFI data for the MeerKAT radio telescope

    In radio astronomy, radio frequency interference (RFI) refers to any signal captured by a radio telescope that did not originate from the observed target in the sky. As RFI corrupts observational data and may damage radio telescope equipment, astronomers seek to store data on RFI, with the aim of mitigating or preventing future interference events. This is a concern for the MeerKAT telescope, a precursor to the planned powerful Square Kilometre Array telescope. Currently, RFI data at MeerKAT is collected in many different file formats that do not fit into traditional database models created to store data in a fixed schema. Here, we design a scalable database model for RFI storage that supports many databases and many data models. The database is deployed in a Dockerized environment. Preliminary testing of our design shows linear scaling of data ingestion as data size increases, as well as fast query processing.

    Performance evaluation of an integrated RFI database for the MeerKAT/SKA radio telescope

    For radio telescopes, radio frequency interference (RFI) from terrestrial and other sources is a recognized problem: it contaminates the signal and must be tracked and ultimately removed. At the MeerKAT/SKA telescope, RFI is recorded with a variety of devices, including telescopes, sensors, and scanners, but combining data from these multiple sources to yield a unified view of RFI remains a challenging problem. Previously, we demonstrated that a scalable database model with an implementation based on the Polystore framework is a potential solution for RFI monitoring. Here we extend this work, implementing the database model in an integrated environment and evaluating its performance across a range of workloads with three data stores: SciDB, PSQL, and Accumulo. We find that SciDB and Accumulo scale better than PSQL in multi-user environments. Results show latency as low as 0.02 seconds, irrespective of location and data store type. Further, integrated APIs provide a single notation and are 5% faster than third-party APIs. Our findings thus provide a guide for the proposed integrated RFI system at the MeerKAT/SKA radio telescope.

    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization. Comment: SIGMOD'1

    Processing Analytical Queries in the AWESOME Polystore [Information Systems Architectures]

    Modern big data applications usually involve heterogeneous data sources and analytical functions, leading to increasing demand for polystore systems, especially analytical polystore systems. This paper presents the AWESOME system along with a domain-specific language, ADIL. ADIL is a powerful language that supports 1) native heterogeneous data models such as Corpus, Graph, and Relation; 2) a rich set of analytical functions; and 3) clear and rigorous semantics. AWESOME is an efficient tri-store middleware that 1) is built on top of three heterogeneous DBMSs (Postgres, Solr, and Neo4j) and is easy to extend to incorporate other systems; 2) supports in-memory query engines and is equipped with analytical capability; 3) applies a cost model to efficiently execute workloads written in ADIL; and 4) fully exploits machine resources to improve scalability. A set of experiments on real workloads demonstrates the capability, efficiency, and scalability of AWESOME.