HUDDL for description and archive of hydrographic binary data
Many of the attempts to introduce a universal hydrographic binary data format have failed or have been only partially successful. In essence, this is because such formats either have to simplify the data to such an extent that they only support the lowest common subset of all the formats covered, or they attempt to be a superset of all formats and quickly become cumbersome. Neither choice works well in practice. This paper presents a different approach: a standardized description of (past, present, and future) data formats using the Hydrographic Universal Data Description Language (HUDDL), a descriptive language implemented using the Extensible Markup Language (XML). That is, XML is used to provide a structural and physical description of a data format, rather than the content of a particular file. Done correctly, this opens the possibility of automatically generating both multi-language data parsers and format-specification documentation from the HUDDL descriptions, as well as providing easy version control of them. This solution also provides a powerful approach for archiving a structural description of data along with the data, so that binary data will remain easy to access in the future. Intending to provide a relatively low-effort solution for indexing the wide range of existing formats, we suggest the creation of a catalogue of format descriptions, each of them capturing the logical and physical specifications for a given data format (with its subsequent upgrades). A C/C++ parser code generator is presented as a prototype illustrating one of the possible advantages of adopting such a hydrographic data format catalogue.
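As a purely illustrative aside, the snippet below sketches the kind of C reader such a generator might emit for one record type. The record name, fields, and byte widths are invented for this example and are not taken from HUDDL; a real generated parser would follow exactly the layout declared in the format's HUDDL (XML) description, including byte order and optional fields.

    /*
     * Hypothetical sketch of a HUDDL-generated C record reader.
     * All names and field layouts below are illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {            /* hypothetical "depth ping" record */
        uint32_t timestamp;     /* seconds since epoch */
        double   latitude;      /* degrees */
        double   longitude;     /* degrees */
        float    depth_m;       /* metres */
    } hd_ping_t;

    /* Read one record, field by field, in the order declared by the format
       description.  Assumes the file's byte order matches the host's. */
    static int hd_read_ping(FILE *fp, hd_ping_t *out)
    {
        if (fread(&out->timestamp, sizeof out->timestamp, 1, fp) != 1) return -1;
        if (fread(&out->latitude,  sizeof out->latitude,  1, fp) != 1) return -1;
        if (fread(&out->longitude, sizeof out->longitude, 1, fp) != 1) return -1;
        if (fread(&out->depth_m,   sizeof out->depth_m,   1, fp) != 1) return -1;
        return 0;
    }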
Satellite data ingestion tool
The satellite data ingestion tool is an automated database application for processing and archiving the ocean and land data broadcast by a remote sensing satellite. The ingestion system has a two-tier architecture, with the data processing algorithms forming the first tier and the database server forming the second tier. The raw satellite data is an HDF (Hierarchical Data Format) file. An HDF reader has been developed that reads this satellite data file to extract the required data, and a program has been written that converts HDF files into images. A database schema has been designed so that all important parameters of the satellite file can be inserted into it, along with the locations of the data and image files.
This report gives a detailed description of the design and implementation of this ingestion tool, along with the design of the database schema.
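As a rough illustration of the kind of read such an HDF reader performs, the sketch below opens a file and extracts one dataset. The file name, the dataset name "sea_surface_temp", and the use of the HDF5 C API are assumptions for the example only; the original tool may target HDF4 and different dataset names.

    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        hid_t file = H5Fopen("satellite_pass.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (file < 0) { fprintf(stderr, "cannot open file\n"); return 1; }

        hid_t    dset  = H5Dopen2(file, "sea_surface_temp", H5P_DEFAULT);
        hid_t    space = H5Dget_space(dset);
        hssize_t n     = H5Sget_simple_extent_npoints(space);

        float *data = malloc((size_t)n * sizeof *data);
        H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
        if (n > 0)
            printf("read %lld values; first value = %f\n", (long long)n, data[0]);

        /* The extracted values, and the locations of the data and image files,
           would then be inserted into the database schema described above. */
        free(data);
        H5Sclose(space);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }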
Data Standards for the Genomes to Life Program
Existing GTL projects already have produced volumes of data and, over the course of the next five years, will produce an estimated hundreds, or possibly thousands, of terabytes of data from hundreds of experiments conducted at dozens of laboratories in national labs and universities across the nation. These data will be the basis for publications by individual researchers, research groups, and multi-institutional collaborations, and the basis for future DOE decisions on funding further research in bioremediation. The short-term and long-term value of the data to project participants, to the DOE, and to the nation depends, however, on being able to access the data and on how, or whether, the data are archived. The ability to access data is the starting point for data analysis and interpretation, data integration, data mining, and development of data-driven models. Limited or inefficient data access means that less data are analyzed in a cost-effective and timely manner. Data production in the GTL Program will likely outstrip, or may have already outstripped, the ability to analyze the data.
Being able to access data depends on two key factors: data standards and implementation of the data standards. For the purpose of this proposal, a data standard is defined as a standard, documented way in which data and information about the data are described. The attributes of the experiment in which the data were collected need to be known, and the measurements corresponding to the data collected need to be described. In general terms, a data standard could be a form (electronic or paper) that is completed by a researcher, or a document that prescribes how a protocol or experiment should be described in writing.
Data standards are critical to data access because they provide a framework for organizing and managing data. Researchers spend significant amounts of time managing data and information about experiments using lab notebooks, computer files, Excel spreadsheets, etc. In addition, data output formats vary across equipment and usually need to be reformatted for the variety of computer programs used to display and analyze the data. If, however, data for a given type of experiment were converted from the vendor format to a format defined by a data standard, then researchers and software developers could save time. In addition, if data and information describing how they were obtained were available in a consistent format throughout the GTL Program, comparison and integration of results would be facilitated and a data repository could be built to encourage project-wide data mining.
Data standards also are essential for archiving data sets. If data are stored together with the experiment metadata (i.e., information about the data) in an 'information/data package', then the data retain their value due to the accessibility of information about measurement and analysis procedures. DOE's commitment to developing data standards for the GTL Program is needed to ensure that the most value is obtained from DOE's expenditures on experimental work and to provide a data repository that can be used as the basis for ongoing model development. By developing data standards for experiments conducted as part of the GTL Program, DOE has the opportunity to facilitate data sharing not only within the DOE community, but also with research institutes throughout the world.
Data access layer optimization of the Gaia data processing in Barcelona for spatially arranged data
Gaia is an ambitious astrometric space mission adopted within the scientific programme
of the European Space Agency (ESA) in October 2000. It measures with very high
accuracy the positions and velocities of a large number of stars and astronomical objects.
At the end of the mission, a detailed three-dimensional map of more than one billion
stars will be obtained. The spacecraft is currently orbiting around the L2 Lagrangian
Point, 1.5 million kilometers from the Earth. It is providing a complete survey down to
the 20th magnitude. The two telescopes of Gaia will observe each object 85 times on
average during the 5 years of the mission, recording each time its brightness, color and,
most importantly, its position. This leads to an enormous quantity of complex, extremely
precise data, representing the multiple observations of a billion different objects by an
instrument that is spinning and precessing. The Gaia data challenge, processing raw
satellite telemetry to produce valuable science products, is a huge task in terms of
expertise, effort and computing power. To handle the reduction of the data, an iterative
process between several systems has been designed, each solving different aspects of the
mission.
The Data Analysis and Processing Consortium (DPAC), a large team of scientists and
software developers, is in charge of processing the Gaia data with the aim of producing
the Gaia Catalogue. It is organized in Coordination Units (CUs), responsible for science
and software development and validation, and Data Processing Centers (DPCs), which
actually operate and execute the software systems developed by the CUs. This project
has been developed within the frame of the Core Processing Unit (CU3) and the Data
Processing Center of Barcelona (DPCB).
One of the most important DPAC systems is the Intermediate Data Updating (IDU),
executed at the Marenostrum supercomputer hosted by the Barcelona Supercomputing
Center (BSC), which is the core of the DPCB hardware framework. It must reprocess,
once every few months, all raw data accumulated up to that moment, improving the coherence of the scientific results and correcting any errors or inaccurate approximations
from previous iterations. It has two main objectives: to refine the image
parameters from the astrometric images acquired by the instrument, and to refine the
Cross Match (XM) for all the detections. In particular, the XM will handle an enormous
number of detections at the end of the mission, so it will obviously not be possible to
handle them in a single process. Moreover, one should also consider some limitations
and constraints imposed by the features of the execution environment (the Marenostrum
supercomputer). Therefore, it is necessary to optimize the Data Access Layer (DAL) in
order to efficiently store the huge amount of data coming from the spacecraft, and to
access it in a smart manner. This is the main scope of this project. We have developed
and implemented an efficient and flexible file format based on Hierarchical Data Format
version 5 (HDF5), arranging the detections by a spatial index such as Hierarchical Equal
Area isoLatitude Pixelization (HEALPix) to tessellate the sphere. In this way it is possible
to distribute and process the detections separately and in parallel, according to
their distribution on the sky. Moreover, the HEALPix library and the framework implemented here allow the data to be handled at different resolution levels according to the desired precision. In this project we consider up to level 12, that is, about 201 million pixels on the sphere.
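As a minimal sketch of this spatial indexing step, assuming the standard chealpix C library and arbitrary example coordinates, the following maps a detection's sky position to its level-12 HEALPix pixel in the NESTED scheme; that pixel index is what selects the group or dataset holding the corresponding region of the sky.

    #include <chealpix.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int  level = 12;
        const long nside = 1L << level;              /* nside = 4096 */
        /* 12 * nside^2 = 201,326,592 pixels at level 12 */
        printf("pixels at level %d: %ld\n", level, nside2npix(nside));

        double ra_deg = 266.405, dec_deg = -28.936;      /* example position  */
        double theta  = (90.0 - dec_deg) * M_PI / 180.0; /* colatitude [rad]  */
        double phi    = ra_deg * M_PI / 180.0;           /* longitude  [rad]  */

        long ipix;
        ang2pix_nest(nside, theta, phi, &ipix);
        printf("detection falls in HEALPix pixel %ld\n", ipix);
        return 0;
    }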
Two different alternatives have been designed and developed, namely, a Flat solution and a Hierarchical solution. These names refer to how the data are distributed within the file. In the first case, the whole dataset is contained inside a single group; the Hierarchical solution, on the other hand, stores the groups of data hierarchically, following the HEALPix hierarchy.
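The sketch below shows one plausible way the two layouts could name their groups; the naming conventions are assumptions for illustration, not the ones actually used in the project. The Flat layout keeps one group per level-12 pixel under the root, while the Hierarchical layout unfolds the NESTED pixel index one HEALPix level per nesting step (parent pixel = child pixel / 4).

    #include <stdio.h>
    #include <string.h>

    static void flat_path(long ipix, char *buf, size_t len)
    {
        snprintf(buf, len, "/hp_%09ld", ipix);       /* one group per pixel */
    }

    static void hierarchical_path(long ipix, int level, char *buf, size_t len)
    {
        long idx[16];                        /* pixel index at each level */
        for (int l = level; l >= 0; --l) {
            idx[l] = ipix;
            ipix >>= 2;                      /* parent pixel one level up */
        }
        buf[0] = '\0';
        for (int l = 0; l <= level; ++l) {   /* nested group per level */
            char part[32];
            snprintf(part, sizeof part, "/%ld", idx[l]);
            strncat(buf, part, len - strlen(buf) - 1);
        }
    }

    int main(void)
    {
        char flat[32], hier[128];
        flat_path(123456789L, flat, sizeof flat);
        hierarchical_path(123456789L, 12, hier, sizeof hier);
        printf("Flat:         %s\nHierarchical: %s\n", flat, hier);
        return 0;
    }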
The Gaia DPAC software is implemented in Java, where support for the HDF5 Application Programming Interface (API) is quite limited. Thus, it has also been necessary to use the Java Native Interface (JNI) to bridge to the software developed in this project in C, which follows the HDF5 C API (a sketch of such a bridge is given below). On the Java side, two main classes have been implemented to read and write the data: FileHdf5Archiver and FileArchiveHdf5FileReader. The Java part of this project has been integrated into an existing operational software library, DpcbTools, in coordination with the Barcelona IDU/DPCB team. This has allowed the work done in this project to be integrated into the existing DAL architecture in the most efficient way.
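To illustrate the JNI bridging described above, the sketch shows the C side of a hypothetical native method wrapping an HDF5 write. The method name, its signature, and the (omitted) package prefix in the JNI symbol are assumptions, not the project's actual FileHdf5Archiver API; only the HDF5 and JNI calls themselves are standard.

    #include <jni.h>
    #include <hdf5.h>

    JNIEXPORT jint JNICALL
    Java_FileHdf5Archiver_nativeWriteDoubles(JNIEnv *env, jobject self,
                                             jlong fileId, jstring dsetName,
                                             jdoubleArray values)
    {
        const char *name = (*env)->GetStringUTFChars(env, dsetName, NULL);
        jsize       n    = (*env)->GetArrayLength(env, values);
        jdouble    *buf  = (*env)->GetDoubleArrayElements(env, values, NULL);

        /* Create a 1-D dataset and write the Java array into it. */
        hsize_t dims[1] = { (hsize_t)n };
        hid_t   space   = H5Screate_simple(1, dims, NULL);
        hid_t   dset    = H5Dcreate2((hid_t)fileId, name, H5T_NATIVE_DOUBLE,
                                     space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        herr_t  status  = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                                   H5P_DEFAULT, buf);

        H5Dclose(dset);
        H5Sclose(space);
        (*env)->ReleaseDoubleArrayElements(env, values, buf, JNI_ABORT);
        (*env)->ReleaseStringUTFChars(env, dsetName, name);
        return (jint)status;
    }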
Prior to testing the operational code, we have first evaluated the time required to create the whole empty structure of the file. This has been done with a simple C program which, depending on the requested HEALPix level, creates the skeleton of the file (a sketch of such a program is given below); it has been implemented for both alternatives mentioned above. Up to HEALPix level 6 there is no relevant difference between them. From level 7 onwards the difference becomes more and more important, especially from level 9, where the creation time becomes unmanageable for the Flat solution. In any case, creating the whole file in advance is not needed in the real use case. Therefore, in order to evaluate the most suitable alternative, we have simply considered the input/output performance.
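A minimal sketch of what such a skeleton-creation test could look like for the Flat layout is given here, assuming an illustrative group-naming scheme: one empty HDF5 group is created per HEALPix pixel at the requested level. Because the pixel count grows as 12 * 4^level, the number of groups explodes at high levels, which is consistent with the Flat creation time becoming unmanageable from level 9 onwards.

    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int  level = (argc > 1) ? atoi(argv[1]) : 6;   /* HEALPix level */
        long nside = 1L << level;
        long npix  = 12L * nside * nside;              /* 12 * 4^level pixels */

        hid_t file = H5Fcreate("skeleton_flat.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
        for (long ipix = 0; ipix < npix; ++ipix) {
            char name[32];
            snprintf(name, sizeof name, "hp_%08ld", ipix);
            hid_t grp = H5Gcreate2(file, name, H5P_DEFAULT, H5P_DEFAULT,
                                   H5P_DEFAULT);
            H5Gclose(grp);
        }
        H5Fclose(file);
        printf("created %ld empty groups at level %d\n", npix, level);
        return 0;
    }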
Finally, we have run the performance tests in order to evaluate how the two solutions perform when actually dealing with data contents. The TAR and ZIP solutions have also been tested in order to compare and appraise the speedup and efficiency of our two new alternatives. The analysis of the results has been based on the time to write and read data, the compression ratio, and the read/write rate. Moreover, the different alternatives have been evaluated on two systems with different sets of data as input. The speedup and the compression ratio improvement compared to the previously adopted solutions are considerable for both HDF5-based alternatives, whereas the difference between the two alternatives themselves is less marked. The integration of one of these two solutions will allow the Gaia IDU software to handle the data in a more efficient manner, increasing the final I/O performance remarkably.