196 research outputs found

    BioCloud Search EnGene: Surfing Biological Data on the Cloud

    Get PDF
    The massive production and spread of biomedical data around the web introduces new challenges related to identify computational approaches for providing quality search and browsing of web resources. This papers presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching and integration of the many layers of biological information offered by public large-scale genomic repositories. Grounding on the concept of dataspace, BSE is built on top of a cloud platform that severely curtails issues associated with scalability and performance. Like popular online gene portals, BSE adopts a gene-centric approach: researchers can find their information of interest by means of a simple “Google-like” query interface that accepts standard gene identification as keywords. We present BSE architecture and functionality and discuss how our strategies contribute to successfully tackle big data problems in querying gene-based web resources. BSE is publically available at: http://biocloud-unica.appspot.com/

    A Survey of the State of Dataspaces

    Get PDF
    Published in International Journal of Computer and Information Technology.This paper presents a survey of the state of dataspaces. With dataspaces becoming the modern technique of systems integration, the achievement of complete dataspace development is a critical issue. This has led to the design and implementation of dataspace systems using various approaches. Dataspaces are data integration approaches that target for data coexistence in the spatial domain. Unlike traditional data integration techniques, they do not require up front semantic integration of data. In this paper, we outline and compare the properties and implementations of dataspaces including the approaches of optimizing dataspace development. We finally present actual dataspace development recommendations to provide a global overview of this significant research topic.This paper presents a survey of the state of dataspaces . With dataspaces becoming the modern technique of systems integration, the ach ievement of complete dataspace development is a critical issue. This has led to the design and implementation of dataspace systems using various approaches. Dataspaces are data integration approaches that target for data coexistence in the spatial domain. Unlike traditional data integration techniques, they do not require up front semantic integration of data. In this paper, we outline and compare the properties and implementations of dataspaces including the approaches of optimizing dataspace development. We finally present actual dataspace development recommendations to provide a global overview of this significant research topic

    Linked Data - the story so far

    No full text
    The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions— the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward

    Towards Knowledge in the Cloud

    Get PDF
    Knowledge in the form of semantic data is becoming more and more ubiquitous, and the need for scalable, dynamic systems to support collaborative work with such distributed, heterogeneous knowledge arises. We extend the “data in the cloud” approach that is emerging today to “knowledge in the cloud”, with support for handling semantic information, organizing and finding it efficiently and providing reasoning and quality support. Both the life sciences and emergency response fields are identified as strong potential beneficiaries of having ”knowledge in the cloud”

    Performance analysis and optimization of in-situ integration of simulation with data analysis: zipping applications up

    Get PDF
    This paper targets an important class of applications that requires combining HPC simulations with data analysis for online or real-time scientific discovery. We use the state-of-the-art parallel-IO and data-staging libraries to build simulation-time data analysis workflows, and conduct performance analysis with real-world applications of computational fluid dynamics (CFD) simulations and molecular dynamics (MD) simulations. Driven by in-depth performance inefficiency analysis, we design an end-to-end application-level approach to eliminating the interlocks and synchronizations existent in the present methods. Our new approach employs both task parallelism and pipeline parallelism to reduce synchronizations effectively. In addition, we design a fully asynchronous, fine-grain, and pipelining runtime system, which is named Zipper. Zipper is a multi-threaded distributed runtime system and executes in a layer below the simulation and analysis applications. To further reduce the simulation application's stall time and enhance the data transfer performance, we design a concurrent data transfer optimization that uses both HPC network and parallel file system for improved bandwidth. The scalability of the Zipper system has been verified by a performance model and various empirical large scale experiments. The experimental results on an Intel multicore cluster as well as a Knight Landing HPC system demonstrate that the Zipper based approach can outperform the fastest state-of-the-art I/O transport library by up to 220% using 13,056 processor cores

    Building a scientific workflow framework to enable real‐time machine learning and visualization

    Get PDF
    Nowadays, we have entered the era of big data. In the area of high performance computing, large‐scale simulations can generate huge amounts of data with potentially critical information. However, these data are usually saved in intermediate files and are not instantly visible until advanced data analytics techniques are applied after reading all simulation data from persistent storages (eg, local disks or a parallel file system). This approach puts users in a situation where they spend long time on waiting for running simulations while not knowing the status of the running job. In this paper, we build a new computational framework to couple scientific simulations with multi‐step machine learning processes and in‐situ data visualizations. We also design a new scalable simulation‐time clustering algorithm to automatically detect fluid flow anomalies. This computational framework is built upon different software components and provides plug‐in data analysis and visualization functions over complex scientific workflows. With this advanced framework, users can monitor and get real‐time notifications of special patterns or anomalies from ongoing extreme‐scale turbulent flow simulations

    Combining in-situ and in-transit processing to enable extreme-scale scientific analysis

    Get PDF
    pre-printWith the onset of extreme-scale computing, I/O constraints make it increasingly difficult for scientists to save a sufficient amount of raw simulation data to persistent storage. One potential solution is to change the data analysis pipeline from a post-process centric to a concurrent approach based on either in-situ or in-transit processing. In this context computations are considered in-situ if they utilize the primary compute resources, while in-transit processing refers to offloading computations to a set of secondary resources using asynchronous data transfers. In this paper we explore the design and implementation of three common analysis techniques typically performed on large-scale scientific simulations: topological analysis, descriptive statistics, and visualization. We summarize algorithmic developments, describe a resource scheduling system to coordinate the execution of various analysis workflows, and discuss our implementation using the DataSpaces and ADIOS frameworks that support efficient data movement between in-situ and in-transit computations. We demonstrate the efficiency of our lightweight, flexible framework by deploying it on the Jaguar XK6 to analyze data generated by S3D, a massively parallel turbulent combustion code. Our framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight

    LinkedScales : bases de dados em multiescala

    Get PDF
    Orientador: AndrĂ© SantanchĂšTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: As ciĂȘncias biolĂłgicas e mĂ©dicas precisam cada vez mais de abordagens unificadas para a anĂĄlise de dados, permitindo a exploração da rede de relacionamentos e interaçÔes entre elementos. No entanto, dados essenciais estĂŁo frequentemente espalhados por um conjunto cada vez maior de fontes com mĂșltiplos nĂ­veis de heterogeneidade entre si, tornando a integração cada vez mais complexa. Abordagens de integração existentes geralmente adotam estratĂ©gias especializadas e custosas, exigindo a produção de soluçÔes monolĂ­ticas para lidar com formatos e esquemas especĂ­ficos. Para resolver questĂ”es de complexidade, essas abordagens adotam soluçÔes pontuais que combinam ferramentas e algoritmos, exigindo adaptaçÔes manuais. Abordagens nĂŁo sistemĂĄticas dificultam a reutilização de tarefas comuns e resultados intermediĂĄrios, mesmo que esses possam ser Ășteis em anĂĄlises futuras. AlĂ©m disso, Ă© difĂ­cil o rastreamento de transformaçÔes e demais informaçÔes de proveniĂȘncia, que costumam ser negligenciadas. Este trabalho propĂ”e LinkedScales, um dataspace baseado em mĂșltiplos nĂ­veis, projetado para suportar a construção progressiva de visĂ”es unificadas de fontes heterogĂȘneas. LinkedScales sistematiza as mĂșltiplas etapas de integração em escalas, partindo de representaçÔes brutas (escalas mais baixas), indo gradualmente para estruturas semelhantes a ontologias (escalas mais altas). LinkedScales define um modelo de dados e um processo de integração sistemĂĄtico e sob demanda, atravĂ©s de transformaçÔes em um banco de dados de grafos. Resultados intermediĂĄrios sĂŁo encapsulados em escalas reutilizĂĄveis e transformaçÔes entre escalas sĂŁo rastreadas em um grafo de proveniĂȘncia ortogonal, que conecta objetos entre escalas. Posteriormente, consultas ao dataspace podem considerar objetos nas escalas e o grafo de proveniĂȘncia ortogonal. AplicaçÔes prĂĄticas de LinkedScales sĂŁo tratadas atravĂ©s de dois estudos de caso, um no domĂ­nio da biologia -- abordando um cenĂĄrio de anĂĄlise centrada em organismos -- e outro no domĂ­nio mĂ©dico -- com foco em dados de medicina baseada em evidĂȘnciasAbstract: Biological and medical sciences increasingly need a unified, network-driven approach for exploring relationships and interactions among data elements. Nevertheless, essential data is frequently scattered across sources with multiple levels of heterogeneity. Existing data integration approaches usually adopt specialized, heavyweight strategies, requiring a costly upfront effort to produce monolithic solutions for handling specific formats and schemas. Furthermore, such ad-hoc strategies hamper the reuse of intermediary integration tasks and outcomes. This work proposes LinkedScales, a multiscale-based dataspace designed to support the progressive construction of a unified view of heterogeneous sources. It departs from raw representations (lower scales) and goes towards ontology-like structures (higher scales). LinkedScales defines a data model and a systematic, gradual integration process via operations over a graph database. Intermediary outcomes are encapsulated as reusable scales, tracking the provenance of inter-scale operations. Later, queries can combine both scale data and orthogonal provenance information. Practical applications of LinkedScales are discussed through two case studies on the biology domain -- addressing an organism-centric analysis scenario -- and the medical domain -- focusing on evidence-based medicine dataDoutoradoCiĂȘncia da ComputaçãoDoutor em CiĂȘncia da Computação141353/2015-5CAPESCNP
    • 

    corecore