181 research outputs found

    Management of data provenance for bioinformatics workflows in a federated cloud environment

    Get PDF
    Dissertation (master's)—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2019. Bioinformatics workflows primarily aim to treat, process, and analyze data from DNA/RNA sequencing. The diversity of these workflows depends on the biological question to be answered, and they can therefore be quite complex. The computational environment is part of the in silico experiment, and regardless of the biological question, the workflow's documentation has particularities that must be preserved to promote its reproducibility. Data provenance models address this problem by providing a structure for storing and querying provenance data while maintaining its meaning. Moreover, technological aspects influence how provenance data is stored. Using a federated cloud for bioinformatics workflows offers flexibility to the user but can increase the environment-configuration effort compared to a single cloud computing environment. In this context, this work proposes a solution for managing the provenance data of bioinformatics workflows in a federated cloud environment, storing provenance in distributed NoSQL database systems using data schemas based on PROV-DM. The results explore aspects of cloud federation that reduce dependence on a single provider for the hosted services. Concerning the databases, this work presents three database technologies for storing provenance data with the PROV-DM data model, including the specific data schema for each database, which can be chosen according to the researcher's preference or integrated into workflow management systems. Finally, the proposed solution proved suitable for managing provenance data for bioinformatics workflows in a federated cloud.
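    As an illustration of the kind of PROV-DM-based schema the abstract describes, the sketch below stores a workflow activity and its input/output entities as documents in MongoDB. All database, collection, and field names are hypothetical; PROV-DM prescribes the concepts (Entity, Activity, Agent and their relations), not this particular layout.

```python
# Minimal sketch: persisting PROV-DM-style provenance records in MongoDB.
# Collection and field names are illustrative, not the dissertation's schema.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["provenance"]

# prov:Entity documents for the input and output files of one task
db.entities.insert_many([
    {"_id": "entity:sample.fastq", "type": "prov:Entity", "path": "/data/sample.fastq"},
    {"_id": "entity:sample.bam", "type": "prov:Entity", "path": "/data/sample.bam"},
])

# A prov:Activity document linking the entities it used and generated
db.activities.insert_one({
    "_id": "activity:bwa-mem-001",
    "type": "prov:Activity",
    "startedAtTime": datetime(2019, 5, 20, 10, 0, tzinfo=timezone.utc),
    "endedAtTime": datetime(2019, 5, 20, 10, 42, tzinfo=timezone.utc),
    "used": ["entity:sample.fastq"],          # prov:used relation
    "generated": ["entity:sample.bam"],       # prov:wasGeneratedBy, inverted
    "cloudProvider": "provider-A",            # environment metadata per federated node
})

# Lineage query: which activities consumed a given entity?
for act in db.activities.find({"used": "entity:sample.fastq"}):
    print(act["_id"])
```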

    Workflow Provenance: from Modeling to Reporting

    Get PDF
    Workflow provenance is a crucial part of a workflow system, as it enables data lineage analysis, error tracking, workflow monitoring, usage pattern discovery, and so on. Integrating provenance into a workflow system, or modifying a workflow system to capture or analyze different provenance information, is burdensome and requires extensive development, because provenance mechanisms rely heavily on the modelling, architecture, and design of the workflow system. Various tools and technologies exist for logging events in a software system. Unfortunately, logging tools and technologies are not designed for capturing and analyzing provenance information: workflow provenance is not only about logging, but also about retrieving workflow-related information from logs. In this work, we propose a taxonomy of provenance questions and, guided by these questions, create a workflow programming model, ProvMod, with a supporting run-time library that provides automated provenance and log analysis for any workflow system. The design and provenance mechanism of ProvMod are based on recommendations from prominent research, and ProvMod is easy to integrate into any workflow system. ProvMod offers Neo4j graph database support to manage semi-structured heterogeneous JSON logs; the log structure is adaptable to any NoSQL technology. For each provenance question in our taxonomy, ProvMod provides the answer with data visualization using Neo4j and the ELK Stack. Besides analyzing performance from various angles, we demonstrate the ease of integration by integrating ProvMod with Apache Taverna, and we evaluate ProvMod's usability with users. Finally, we present two software engineering research cases (clone detection and architecture extraction) where the proposed ProvMod model and provenance-question taxonomy can be applied to discover meaningful insights.
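    For flavor, here is a sketch of the kind of log-backed lineage query that Neo4j support enables. The node labels, relationship types, and connection details are assumptions for illustration; the abstract only states that semi-structured JSON logs are managed in Neo4j.

```python
# Minimal sketch: answering a lineage question over workflow logs in Neo4j.
# Labels, relationship types, and credentials are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LINEAGE_QUERY = """
MATCH (t:Task)-[:PRODUCED]->(d:Data {name: $name})
OPTIONAL MATCH (t)-[:CONSUMED]->(src:Data)
RETURN t.name AS task, collect(src.name) AS inputs
"""

with driver.session() as session:
    # Which task produced this artifact, and from which inputs?
    for record in session.run(LINEAGE_QUERY, name="aligned.bam"):
        print(record["task"], "<-", record["inputs"])

driver.close()
```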

    Improving Usability And Scalability Of Big Data Workflows In The Cloud

    Get PDF
    Big data workflows have recently emerged as the next generation of data-centric workflow technologies to address the five "V" challenges of big data: volume, variety, velocity, veracity, and value. More formally, a big data workflow is the computerized modeling and automation of a process consisting of a set of computational tasks and their data interdependencies, used to process and analyze data of ever-increasing scale, complexity, and rate of acquisition. The convergence of big data and workflows creates new challenges for the workflow community. First, the variety of big data creates a need to integrate large numbers of remote Web services and other heterogeneous task components, which consume and produce data in various formats and models, into a uniform and interoperable workflow. Existing approaches address the so-called shimming problem only in an ad hoc manner and are unable to provide a generic solution; we automatically insert pieces of code called shims, or adaptors, to resolve data type mismatches. Second, the volume of big data produces a large number of datasets that need to be queried and analyzed in an effective and personalized manner, and there is a strong need for sharing, reusing, and repurposing existing tasks and workflows across different users and institutes. To overcome these limitations, we propose a folksonomy-based social workflow recommendation system to improve workflow design productivity and enable efficient dataset querying and analysis. Third, the volume of big data demands processing and analysis of data of ever-increasing scale, complexity, and rate of acquisition, but a scalable distributed data model that abstracts and automates data distribution, parallelism, and scalable processing is still missing; we propose a NoSQL collectional data model that addresses this limitation. Finally, the volume of big data, combined with the unbounded resource-leasing capability foreseen in the cloud, enables data scientists to wring actionable insights from the data in a time- and cost-efficient manner. We propose the BARENTS scheduler, which supports high-performance workflow scheduling in a heterogeneous cloud computing environment with a single objective: to minimize the workflow makespan under a user-provided budget constraint.
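    To make the shimming idea concrete, here is a minimal sketch of the kind of adaptor that might be inserted between two tasks whose formats disagree. The function name and formats are hypothetical: the thesis describes automatic shim insertion in general, not this specific adaptor.

```python
# Minimal sketch of a "shim": an adaptor inserted between two workflow tasks
# whose data formats disagree (here CSV output -> JSON input). Hypothetical.
import csv
import io
import json

def csv_to_json_shim(csv_text: str) -> str:
    """Adapt a task that emits CSV to a downstream task that expects JSON."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

upstream_output = "gene,count\nBRCA1,42\nTP53,17\n"
downstream_input = csv_to_json_shim(upstream_output)
print(downstream_input)  # [{"gene": "BRCA1", "count": "42"}, ...]
```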

    Management of data provenance for bioinformatics workflows in a cloud computing environment

    Get PDF
    Dissertation (master's)—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2018. Molecular biology experiments are often presented in the form of scientific workflows: sets of activities performed by different processing entities through managed tasks. Knowledge of the data's trajectory through a given workflow enables reproducibility via data provenance. To reproduce an in silico bioinformatics experiment, one must consider aspects beyond the steps followed by the workflow; indeed, the computational settings in which the involved programs run are a requirement for reproducibility. Cloud computing technology can hide the technical details and make it easier for the user to set up such an environment on demand. NoSQL database systems have also gained popularity, particularly in the cloud. Considering this scenario, a model for the provenance of data from scientific experiments in a cloud computing environment is proposed, using PROV-DM and mapping it to three different families of NoSQL database systems. Two bioinformatics workflows involving different phases were executed and used to test the NoSQL databases Cassandra, MongoDB, and OrientDB, followed by an analysis of these executions and tests. The results showed that provenance storage times are minimal compared to the execution times of the workflows without provenance; therefore, the proposed models for the NoSQL databases proved to be a good option for storing and managing the provenance of biological data. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
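    As a sketch of what mapping PROV-DM onto a column-family store such as Cassandra might look like (the keyspace, table, and column names are assumptions, not the dissertation's actual schema):

```python
# Minimal sketch: a PROV-DM-inspired activity table in Cassandra.
# Keyspace, table, and column names are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS provenance
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS provenance.activities (
        workflow_id text,
        activity_id text,
        started_at timestamp,
        ended_at timestamp,
        used set<text>,           -- ids of prov:Entity inputs
        generated set<text>,      -- ids of prov:Entity outputs
        PRIMARY KEY (workflow_id, activity_id)
    )
""")
session.execute(
    "INSERT INTO provenance.activities "
    "(workflow_id, activity_id, started_at, ended_at, used, generated) "
    "VALUES (%s, %s, toTimestamp(now()), toTimestamp(now()), %s, %s)",
    ("wf-001", "act-trimming", {"raw.fastq"}, {"trimmed.fastq"}),
)
cluster.shutdown()
```

    Partitioning by workflow_id keeps all of one run's provenance on the same replica set, so per-workflow lineage queries stay single-partition; that choice is also an assumption of this sketch.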

    Managing Workflows on top of a Cloud Computing Orchestrator for using heterogeneous environments on e-Science

    Full text link
    [EN] Scientific workflows (SWFs) are widely used to model processes in e-Science. SWFs are executed by means of workflow management systems (WMSs), which orchestrate the workload on top of computing infrastructures. The advent of cloud computing has opened the door to using on-demand infrastructures to complement or even replace local infrastructures. However, new issues have arisen, such as the integration of hybrid resources and the compromise between infrastructure reutilisation and elasticity. In this article, we present an ad hoc solution for managing workflows that exploits the capabilities of cloud orchestrators to deploy resources on demand according to the workload and to combine heterogeneous cloud providers (such as on-premise and public clouds) and traditional infrastructures (clusters) to minimise cost and response time. The work does not propose yet another WMS but demonstrates the benefits of integrating cloud orchestration when running complex workflows. The article presents several configuration experiments with a realistic comparative genomics workflow called Orthosearch, migrating memory-intensive workload to public infrastructures while keeping other blocks of the experiment running locally, and reports running time and cost, suggesting best practices.
    The authors acknowledge the support of the EUBrazilCC project, funded by the European Commission (STREP 614048) and the Brazilian MCT/CNPq N. 13/2012, for the use of its infrastructure. The authors also thank the Spanish 'Ministerio de Economia y Competitividad' for the project 'Clusters Virtuales Elasticos y Migrables sobre Infraestructuras Cloud Hibridas' (TIN2013-44390-R).
    Carrión Collado, AA.; Caballer Fernández, M.; Blanquer Espert, I.; Kotowski, N.; Jardim, R.; Dávila, AMR. (2017). Managing Workflows on top of a Cloud Computing Orchestrator for using heterogeneous environments on e-Science. International Journal of Web and Grid Services. 13(4):375-402. doi:10.1504/IJWGS.2017.10003225
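    As a toy illustration of the placement policy the abstract describes (memory-intensive blocks offloaded to a public cloud, the rest kept on premises); the threshold, block names, and memory figures are invented for illustration, not taken from the paper:

```python
# Toy sketch of a hybrid placement policy: memory-hungry workflow blocks go
# to a public cloud, the rest stay on the local cluster. All thresholds and
# names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    mem_gb: float  # estimated peak memory of the block

LOCAL_MEM_LIMIT_GB = 16.0  # hypothetical capacity of a local cluster node

def place(block: Block) -> str:
    """Decide where a workflow block should run."""
    return "public-cloud" if block.mem_gb > LOCAL_MEM_LIMIT_GB else "local-cluster"

workflow = [Block("hmmer-search", 48.0), Block("parse-results", 2.0)]
for b in workflow:
    print(f"{b.name} -> {place(b)}")
```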

    Analysis of document-oriented NoSQL data models for bioinformatics workflows

    Get PDF
    Undergraduate thesis—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2018. The ever-increasing amount of data generated by several areas of knowledge is called Big Data. In this scenario, bioinformatics research needs provenance data, since it provides the history of the information collected in the research workflow and answers questions about the origin of the data. Big Data brought the emergence of the NoSQL (Not Only SQL) approach as an alternative to relational database models, since it does not present the limitations observed in the relational model when applied to large datasets. Focusing on MongoDB, this work presents a program that can automatically execute a workflow and store its provenance and raw data in three different document formats: referenced, embedded, and hybrid. The three formats are compared and analyzed using parameters such as execution time and query capabilities. The results show some particularities of bioinformatics and the advantages and disadvantages of each model.
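    A sketch of what the three document layouts might look like for one workflow step; the field names are illustrative, not the thesis's actual schema:

```python
# Illustrative shapes of the three MongoDB document layouts compared in the
# thesis: referenced, embedded, and hybrid. Field names are assumptions.

# Referenced: raw data lives in its own collection; provenance points to it.
referenced_activity = {
    "_id": "act-1",
    "program": "bwa mem",
    "output_ref": "raw-42",      # _id of a document in a raw_data collection
}
referenced_raw = {"_id": "raw-42", "content": "<aligned reads...>"}

# Embedded: provenance and raw data live in a single document.
embedded_activity = {
    "_id": "act-1",
    "program": "bwa mem",
    "output": {"content": "<aligned reads...>"},
}

# Hybrid: small outputs embedded, large payloads referenced.
hybrid_activity = {
    "_id": "act-1",
    "program": "bwa mem",
    "summary": {"reads_aligned": 981234},   # small, embedded
    "output_ref": "raw-42",                 # large payload, referenced
}
```

    The trade-off these shapes expose: embedding answers a lineage query in one read but bloats documents toward MongoDB's per-document size limit, while referencing keeps provenance documents small at the cost of an extra lookup per raw payload.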

    Web technologies for environmental big data

    Get PDF
    Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis, which pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations of web-based technologies for processing large and heterogeneous datasets and discusses their relevance to environmental data processing, simulation, and prediction. We found that the processing of the simple datasets used in the pilot was relatively straightforward using a combination of R, RPy2, PyWPS, and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks, such as implementations based on OGC standards, may provide a wider and more flexible set of features, particularly for working with larger volumes and more heterogeneous data sources.
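    As a taste of the R-from-Python glue in the stack the pilot used, here is a minimal rpy2 example; the dataset and statistic are placeholders, not the pilot's actual analysis.

```python
# Minimal sketch of the R-from-Python glue mentioned in the abstract:
# rpy2 evaluates R code over data passed in from Python. Values are placeholders.
import rpy2.robjects as ro
from rpy2.robjects import FloatVector

rainfall_mm = FloatVector([12.1, 0.0, 3.4, 27.8, 5.5])  # hypothetical readings
ro.globalenv["rainfall"] = rainfall_mm

# Run a simple R summary over the vector and pull the result back into Python
mean_mm = ro.r("mean(rainfall)")[0]
print(f"mean rainfall: {mean_mm:.2f} mm")
```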

    A Survey of Semantic Integration Approaches in Bioinformatics

    Get PDF
    Technological advances in computer science and data analysis are helping to provide continuously growing volumes of biological data, which are available on the web. Such advances require powerful data-integration techniques to extract the knowledge and information pertinent to a specific question. Biomedical exploration of these big data often requires complex queries across multiple autonomous, heterogeneous, and distributed data sources. Semantic integration is an active area of research in several disciplines, such as databases, information integration, and ontology. We provide a survey of approaches and techniques for integrating biological data, focusing on those developed in the ontology community.

    Leveraging Metadata in NoSQL Storage Systems

    Get PDF
    NoSQL systems have grown in popularity for storing big data because they offer high availability, i.e., operations with high throughput and low latency. However, metadata in these systems are handled today in ad hoc ways. We present Wasef, a system that treats metadata in a NoSQL database system as a first-class citizen. Metadata may include information such as the operational history of portions of a database table (e.g., columns), placement information for ranges of keys, and operational logs for data items (key-value pairs). Wasef allows the NoSQL system to store and query this metadata efficiently. We integrate Wasef into Apache Cassandra, one of the most popular key-value stores, and implement three important use cases: dropping columns in a flexible manner, verifying data durability during migratory operations such as node decommissioning, and maintaining data provenance. Our experimental evaluation uses AWS EC2 instances and YCSB workloads. Our results show that Wasef: i) scales well with the size of the data and the metadata; ii) reduces throughput by only 9%; and iii) increases operational latencies by only 3%.
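    A rough sketch of the metadata-as-first-class-citizen idea: a dedicated, queryable table of operational history kept alongside the data itself. The table layout and names are assumptions; the abstract does not describe Wasef's actual schema or API.

```python
# Rough sketch of first-class metadata in Cassandra: a queryable table of
# operational history per target (a column, key range, or data item).
# Table layout and names are assumptions, not Wasef's actual schema.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS meta
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS meta.operations (
        target text,          -- e.g. 'users.email' or 'keys[100..200]'
        op_time timeuuid,
        operation text,       -- e.g. 'drop_column', 'decommission', 'write'
        details text,
        PRIMARY KEY (target, op_time)
    ) WITH CLUSTERING ORDER BY (op_time DESC)
""")
session.execute(
    "INSERT INTO meta.operations (target, op_time, operation, details) "
    "VALUES (%s, now(), %s, %s)",
    ("users.email", "drop_column", "dropped by admin; tombstoned, not purged"),
)
# Query the operational history of one column, newest first
for row in session.execute(
    "SELECT op_time, operation, details FROM meta.operations WHERE target=%s",
    ("users.email",),
):
    print(row.operation, row.details)
cluster.shutdown()
```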