
    Big Data and Analysis of Data Transfers for International Research Networks Using NetSage

    Modern science is increasingly data-driven and collaborative in nature. Many scientific disciplines, including genomics, high-energy physics, astronomy, and atmospheric science, produce petabytes of data that must be shared with collaborators all over the world. The National Science Foundation-supported International Research Network Connection (IRNC) links have been essential to enabling this collaboration, but as data sharing has increased, so has the amount of information being collected to understand network performance. New capabilities to measure and analyze the performance of international wide-area networks are essential to ensure end users are able to take full advantage of such infrastructure for their big data applications. NetSage is a project to develop a unified, open, privacy-aware network measurement and visualization service to address the needs of monitoring today's high-speed international research networks. NetSage collects data on both backbone links and exchange points, which can be as much as 1Tb per month. This puts a significant strain on hardware, not only in terms of the storage needed to hold multi-year historical data, but also in terms of the processor and memory needed to analyze the data to understand network behaviors. This paper addresses the basic NetSage architecture, its current data collection and archiving approach, and details the constraints of dealing with this big data problem of handling vast amounts of monitoring data while providing useful, extensible visualization to end users.

    Comprehensive Analysis of Non Redundant Protein Database

    Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web-interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to login. Source code and other documentation are also provided as a GitHub repository: https://github.com/boalang/NR_Dataset
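    The kinds of queries named in this abstract (average sequence length, most common taxonomic assignments, annotation redundancy within similarity clusters) can be sketched in plain Python. This is not BoaG code and does not reflect the BoaG language or the real NR schema: the records, field order, and cluster IDs below are hypothetical, and "redundant" is assumed here to mean a record whose annotation repeats one already seen in the same cluster.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy records: (sequence, taxon, annotation, cluster_id).
# Cluster IDs stand in for the output of a clustering step such as CD-HIT.
RECORDS = [
    ("MKTAYIAKQR", "Escherichia coli", "DNA polymerase", "c1"),
    ("MKTAYIAKQK", "Escherichia coli", "DNA polymerase", "c1"),
    ("MSLLTEVETY", "Influenza A virus", "matrix protein 1", "c2"),
    ("MGSSHHHHHHSS", "synthetic construct", "hypothetical protein", "c3"),
]


def average_length(records):
    """Mean sequence length over all records."""
    return mean(len(seq) for seq, _, _, _ in records)


def most_common_taxa(records, n=1):
    """The n most frequent taxonomic assignments with their counts."""
    return Counter(taxon for _, taxon, _, _ in records).most_common(n)


def redundant_annotation_fraction(records):
    """Fraction of records whose annotation repeats one already seen
    in the same cluster (an assumed definition of redundancy)."""
    seen = defaultdict(set)
    duplicates = 0
    for _, _, annotation, cluster in records:
        if annotation in seen[cluster]:
            duplicates += 1
        seen[cluster].add(annotation)
    return duplicates / len(records)


if __name__ == "__main__":
    print(average_length(RECORDS))
    print(most_common_taxa(RECORDS))
    print(redundant_annotation_fraction(RECORDS))
```

    On a dataset the size of NR, the same aggregations would have to run on distributed infrastructure rather than in memory, which is the role the abstract assigns to BoaG.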

    Grid infrastructures for secure access to and use of bioinformatics data: experiences from the BRIDGES project

    The BRIDGES project was funded by the UK Department of Trade and Industry (DTI) to address the needs of cardiovascular research scientists investigating the genetic causes of hypertension as part of the Wellcome Trust-funded (£4.34M) cardiovascular functional genomics (CFG) project. Security was at the heart of the BRIDGES project, and an advanced data and compute grid infrastructure incorporating the latest grid authorisation technologies was developed and delivered to the scientists. We outline these grid infrastructures and describe the perceived security requirements at the project start, including data classifications, and how these evolved throughout the lifetime of the project. The uptake and adoption of the project results are also presented, along with the challenges that must be overcome to support the secure exchange of life science data sets. We also present how we will use the BRIDGES experiences in future projects at the National e-Science Centre.

    Data as a Service (DaaS) for sharing and processing of large data collections in the cloud

    Data as a Service (DaaS) is among the latest kinds of service being investigated in the Cloud computing community. The main aim of DaaS is to overcome limitations of state-of-the-art approaches in data technologies, according to which data is stored and accessed from repositories whose location is known and is relevant for sharing and processing. Besides limiting data sharing, current approaches also fail to fully decouple software services from data and thus limit interoperability. In this paper we propose a DaaS approach for intelligent sharing and processing of large data collections, with the aim of abstracting the data location (by making it relevant to the needs of sharing and accessing) and fully decoupling the data from its processing. The aim of our approach is to build a Cloud computing platform offering DaaS to support large communities of users that need to share, access, and process data in order to collectively build knowledge from it. We exemplify the approach with large data collections from the health and biology domains.

    Towards data grids for microarray expression profiles

    The UK DTI-funded Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES) project developed a Grid infrastructure to support scientists within the large Wellcome Trust-funded Cardiovascular Functional Genomics project in their research into the genetic causes of hypertension. The BRIDGES project focused on developing a compute Grid and a data Grid infrastructure with security at its heart. Building on the work within BRIDGES, the BBSRC-funded Grid enabled Microarray Expression Profile Search (GEMEPS) project plans to provide an enhanced data Grid infrastructure to support the richer queries needed for the discovery and analysis of microarray data sets, also based upon a fine-grained security infrastructure. This paper outlines the experiences gained within BRIDGES and describes the status of the GEMEPS project, the open challenges that remain, and plans for the future.

    Grid Added Value to Address Malaria

    Through this paper, we call for a distributed, internet-based collaboration to address one of the worst plagues of our present world, malaria. The spirit is a non-proprietary peer-production of information-embedding goods, and we propose to use grid technology to enable such a worldwide "open source"-like collaboration. The first step towards this vision was achieved during the summer on the EGEE grid infrastructure, where 46 million ligands were docked for a total of 80 CPU years in 6 weeks in the quest for new drugs. Comment: 7 pages, 1 figure, 6th IEEE International Symposium on Cluster Computing and the Grid, Singapore, 16-19 May 2006, to appear in the proceedings.
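    The scale quoted in this abstract can be put in perspective with a rough back-of-the-envelope calculation. The sketch below assumes each of the 46 million ligands corresponds to one docking run, that the 6 weeks were used continuously, and that a year has 365.25 days; none of these assumptions comes from the abstract, so the results are order-of-magnitude estimates only.

```python
# Figures taken from the abstract.
LIGANDS = 46_000_000
CPU_YEARS = 80
WALL_CLOCK_WEEKS = 6

# Assumption: 365.25 days per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600

total_cpu_seconds = CPU_YEARS * SECONDS_PER_YEAR

# Assumption: one docking run per ligand.
cpu_seconds_per_ligand = total_cpu_seconds / LIGANDS

# Assumption: the grid was used continuously for the whole 6 weeks.
average_concurrent_cpus = total_cpu_seconds / (WALL_CLOCK_WEEKS * SECONDS_PER_WEEK)

if __name__ == "__main__":
    print(f"CPU seconds per ligand: {cpu_seconds_per_ligand:.1f}")
    print(f"Average concurrent CPUs: {average_concurrent_cpus:.0f}")
```

    Under these assumptions, each docking costs on the order of a minute of CPU time, and the campaign kept roughly 700 CPUs busy on average.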