196 research outputs found
BioCloud Search EnGene: Surfing Biological Data on the Cloud
The massive production and spread of biomedical data around the web introduces new challenges related to identifying computational approaches that provide quality search and browsing of web resources. This paper presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching and integration of the many layers of biological information offered by public large-scale genomic repositories. Grounded in the concept of dataspace, BSE is built on top of a cloud platform that severely curtails issues associated with scalability and performance. Like popular online gene portals, BSE adopts a gene-centric approach: researchers can find their information of interest by means of a simple “Google-like” query interface that accepts standard gene identifiers as keywords. We present the BSE architecture and functionality and discuss how our strategies contribute to successfully tackling big data problems in querying gene-based web resources. BSE is publicly available at: http://biocloud-unica.appspot.com/
A Survey of the State of Dataspaces
Published in International Journal of Computer and Information Technology. This paper presents a survey of the state of dataspaces. With dataspaces becoming the modern technique of systems integration, the achievement of complete dataspace development is a critical issue. This has led to the design and implementation of dataspace systems using various approaches. Dataspaces are data integration approaches that aim for data coexistence in the spatial domain. Unlike traditional data integration techniques, they do not require up-front semantic integration of data. In this paper, we outline and compare the properties and implementations of dataspaces, including approaches to optimizing dataspace development. We finally present practical dataspace development recommendations to provide a global overview of this significant research topic.
Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Towards Knowledge in the Cloud
Knowledge in the form of semantic data is becoming more and more ubiquitous, and the need arises for scalable, dynamic systems to support collaborative work with such distributed, heterogeneous knowledge. We extend the “data in the cloud” approach that is emerging today to “knowledge in the cloud”, with support for handling semantic information, organizing and finding it efficiently, and providing reasoning and quality support. Both the life sciences and emergency response fields are identified as strong potential beneficiaries of having “knowledge in the cloud”.
Performance analysis and optimization of in-situ integration of simulation with data analysis: zipping applications up
This paper targets an important class of applications that requires combining HPC simulations with data analysis for online or real-time scientific discovery. We use state-of-the-art parallel-IO and data-staging libraries to build simulation-time data analysis workflows, and conduct performance analysis with real-world applications of computational fluid dynamics (CFD) simulations and molecular dynamics (MD) simulations. Driven by in-depth analysis of performance inefficiencies, we design an end-to-end application-level approach to eliminating the interlocks and synchronizations present in current methods. Our new approach employs both task parallelism and pipeline parallelism to reduce synchronizations effectively. In addition, we design a fully asynchronous, fine-grained, pipelining runtime system named Zipper. Zipper is a multi-threaded distributed runtime system that executes in a layer below the simulation and analysis applications. To further reduce the simulation application's stall time and enhance data transfer performance, we design a concurrent data transfer optimization that uses both the HPC network and the parallel file system for improved bandwidth. The scalability of the Zipper system has been verified by a performance model and various large-scale empirical experiments. The experimental results on an Intel multicore cluster as well as a Knights Landing HPC system demonstrate that the Zipper-based approach can outperform the fastest state-of-the-art I/O transport library by up to 220% using 13,056 processor cores.
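The pipeline parallelism described in this abstract can be illustrated with a minimal sketch. This is not Zipper's API or implementation; the names `simulate`, `analyze`, and `run_pipeline` are hypothetical, and a bounded in-memory queue stands in for the data-staging layer. It only shows the general idea of letting the simulation continue while earlier blocks are being analyzed.

```python
import queue
import threading


def simulate(num_steps, out_q):
    """Hypothetical producer: emits one data block per time step."""
    for step in range(num_steps):
        # Stand-in for real simulation output at this time step.
        block = [step * 0.5 + i for i in range(4)]
        # put() blocks only when the bounded queue is full, so the
        # simulation stalls only if analysis falls far behind.
        out_q.put((step, block))
    out_q.put(None)  # sentinel: no more data


def analyze(in_q, results):
    """Hypothetical consumer: analyzes blocks as they arrive."""
    while True:
        item = in_q.get()
        if item is None:
            break
        step, block = item
        results[step] = sum(block) / len(block)  # toy analysis: mean value


def run_pipeline(num_steps=5, queue_depth=2):
    """Run producer and consumer concurrently, joined by a bounded queue."""
    q = queue.Queue(maxsize=queue_depth)
    results = {}
    producer = threading.Thread(target=simulate, args=(num_steps, q))
    consumer = threading.Thread(target=analyze, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

The queue depth is the tuning knob in this sketch: a deeper queue tolerates more variation in analysis time before the producer has to wait, at the cost of more buffered data.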
Building a scientific workflow framework to enable real-time machine learning and visualization
We have entered the era of big data. In the area of high performance computing, large-scale simulations can generate huge amounts of data with potentially critical information. However, these data are usually saved in intermediate files and are not instantly visible until advanced data analytics techniques are applied after reading all simulation data from persistent storage (e.g., local disks or a parallel file system). This approach leaves users waiting a long time for running simulations without knowing the status of the running job. In this paper, we build a new computational framework to couple scientific simulations with multi-step machine learning processes and in-situ data visualizations. We also design a new scalable simulation-time clustering algorithm to automatically detect fluid flow anomalies. This computational framework is built upon different software components and provides plug-in data analysis and visualization functions over complex scientific workflows. With this advanced framework, users can monitor and get real-time notifications of special patterns or anomalies from ongoing extreme-scale turbulent flow simulations.
Combining in-situ and in-transit processing to enable extreme-scale scientific analysis
Pre-print. With the onset of extreme-scale computing, I/O constraints make it increasingly difficult for scientists to save a sufficient amount of raw simulation data to persistent storage. One potential solution is to change the data analysis pipeline from a post-process centric to a concurrent approach based on either in-situ or in-transit processing. In this context computations are considered in-situ if they utilize the primary compute resources, while in-transit processing refers to offloading computations to a set of secondary resources using asynchronous data transfers. In this paper we explore the design and implementation of three common analysis techniques typically performed on large-scale scientific simulations: topological analysis, descriptive statistics, and visualization. We summarize algorithmic developments, describe a resource scheduling system to coordinate the execution of various analysis workflows, and discuss our implementation using the DataSpaces and ADIOS frameworks that support efficient data movement between in-situ and in-transit computations. We demonstrate the efficiency of our lightweight, flexible framework by deploying it on the Jaguar XK6 to analyze data generated by S3D, a massively parallel turbulent combustion code. Our framework allows scientists dealing with the data deluge at extreme scale to perform analyses at increased temporal resolutions, mitigate I/O costs, and significantly improve the time to insight.
LinkedScales: multiscale databases
Advisor: André Santanchè. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Biological and medical sciences increasingly need a unified, network-driven approach for exploring relationships and interactions among data elements. Nevertheless, essential data is frequently scattered across a growing set of sources with multiple levels of heterogeneity, which makes integration increasingly complex. Existing data integration approaches usually adopt specialized, heavyweight strategies, requiring a costly upfront effort to produce monolithic solutions for handling specific formats and schemas. To cope with complexity, they resort to one-off solutions that combine tools and algorithms and require manual adaptation. Such ad-hoc strategies hamper the reuse of intermediary integration tasks and outcomes, even when these could be useful in future analyses, and make it hard to track transformations and other provenance information, which tends to be neglected. This work proposes LinkedScales, a multiscale-based dataspace designed to support the progressive construction of a unified view of heterogeneous sources. It organizes the integration steps into scales, departing from raw representations (lower scales) and moving gradually towards ontology-like structures (higher scales). LinkedScales defines a data model and a systematic, on-demand integration process via transformations over a graph database. Intermediary outcomes are encapsulated as reusable scales, and inter-scale transformations are tracked in an orthogonal provenance graph that connects objects across scales. Later, queries can combine both scale data and orthogonal provenance information. Practical applications of LinkedScales are discussed through two case studies: one in the biology domain, addressing an organism-centric analysis scenario, and one in the medical domain, focusing on evidence-based medicine data.
Doctorate. Computer Science. Doctor in Computer Science. 141353/2015-5. CAPES. CNP
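The idea of reusable scales linked by a provenance graph can be sketched in a few lines. This is only an illustration under stated assumptions, not the LinkedScales data model: the `Dataspace` class, its methods, and the scale names `raw` and `normalized` are all hypothetical, and plain dictionaries and lists stand in for the graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Dataspace:
    # scale name -> {object id: value}; lower scales hold rawer data
    scales: dict = field(default_factory=dict)
    # provenance edges: ((src scale, id), (dst scale, id), operation name)
    provenance: list = field(default_factory=list)

    def add_raw(self, scale, objects):
        """Register a lowest-level scale from already loaded objects."""
        self.scales[scale] = dict(objects)

    def transform(self, src, dst, operation, fn):
        """Derive scale `dst` from `src`, recording one provenance edge per object."""
        self.scales[dst] = {}
        for obj_id, value in self.scales[src].items():
            self.scales[dst][obj_id] = fn(value)
            self.provenance.append(((src, obj_id), (dst, obj_id), operation))

    def lineage(self, scale, obj_id):
        """Follow provenance edges from an object back to the lowest scale."""
        current = (scale, obj_id)
        chain = [current]
        while True:
            parents = [s for s, t, _ in self.provenance if t == current]
            if not parents:
                return chain
            current = parents[0]
            chain.append(current)


if __name__ == "__main__":
    ds = Dataspace()
    ds.add_raw("raw", {"g1": " brca1 "})
    ds.transform("raw", "normalized", "strip_upper", lambda v: v.strip().upper())
    print(ds.scales["normalized"]["g1"])
    print(ds.lineage("normalized", "g1"))
```

Keeping provenance in a structure separate from the scale data is what lets a query ask both "what is the value at this scale" and "which lower-scale object did it come from" without re-running the transformation.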
- …