HUDDL for description and archive of hydrographic binary data
Many of the attempts to introduce a universal hydrographic binary data format have failed or have been only partially successful. In essence, this is because such formats either have to simplify the data to such an extent that they only support the lowest common subset of all the formats covered, or they attempt to be a superset of all formats and quickly become cumbersome. Neither choice works well in practice. This paper presents a different approach: a standardized description of (past, present, and future) data formats using the Hydrographic Universal Data Description Language (HUDDL), a descriptive language implemented using the Extensible Markup Language (XML). That is, XML is used to provide a structural and physical description of a data format, rather than the content of a particular file. Done correctly, this opens the possibility of automatically generating both multi-language data parsers and format-specification documentation from the HUDDL descriptions, as well as providing easy version control of them. This solution also provides a powerful approach for archiving a structural description of data along with the data, so that binary data will remain easy to access in the future. Intending to provide a relatively low-effort solution for indexing the wide range of existing formats, we suggest the creation of a catalogue of format descriptions, each of them capturing the logical and physical specifications for a given data format (with its subsequent upgrades). A C/C++ parser code generator is presented as a prototype illustrating one of the possible advantages of adopting such a hydrographic data format catalogue.
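As a purely illustrative aside, the snippet below sketches the kind of C reader such a generator might emit for one record type. The record name, fields, and byte widths are invented for this example and are not taken from HUDDL; a real generated parser would follow exactly the layout declared in the format's HUDDL (XML) description, including byte order and optional fields.

    /*
     * Hypothetical sketch of a HUDDL-generated C record reader.
     * All names and field layouts below are illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {            /* hypothetical "depth ping" record */
        uint32_t timestamp;     /* seconds since epoch */
        double   latitude;      /* degrees */
        double   longitude;     /* degrees */
        float    depth_m;       /* metres */
    } hd_ping_t;

    /* Read one record, field by field, in the order declared by the format
       description.  Assumes the file's byte order matches the host's. */
    static int hd_read_ping(FILE *fp, hd_ping_t *out)
    {
        if (fread(&out->timestamp, sizeof out->timestamp, 1, fp) != 1) return -1;
        if (fread(&out->latitude,  sizeof out->latitude,  1, fp) != 1) return -1;
        if (fread(&out->longitude, sizeof out->longitude, 1, fp) != 1) return -1;
        if (fread(&out->depth_m,   sizeof out->depth_m,   1, fp) != 1) return -1;
        return 0;
    }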
Satellite data ingestion tool
The satellite data ingestion tool is an automated database application for processing and archiving the ocean and land data broadcast by a remote sensing satellite. The ingestion system has a two-tier architecture, with the data processing algorithms forming the first tier and the database server forming the second tier. The raw satellite data is an HDF (Hierarchical Data Format) file. An HDF reader has been developed that reads this satellite data file to extract the required data, and a program has been written that converts HDF files into images. A database schema has been designed so that all important parameters of the satellite file can be inserted into it, along with the locations of the data and image files.
This report gives a detailed description of the design and implementation of this ingestion tool, along with the design of the database schema.
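As a rough illustration of the kind of read such an HDF reader performs, the sketch below opens a file and extracts one dataset. The file name, the dataset name "sea_surface_temp", and the use of the HDF5 C API are assumptions for the example only; the original tool may target HDF4 and different dataset names.

    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        hid_t file = H5Fopen("satellite_pass.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (file < 0) { fprintf(stderr, "cannot open file\n"); return 1; }

        hid_t    dset  = H5Dopen2(file, "sea_surface_temp", H5P_DEFAULT);
        hid_t    space = H5Dget_space(dset);
        hssize_t n     = H5Sget_simple_extent_npoints(space);

        float *data = malloc((size_t)n * sizeof *data);
        H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
        if (n > 0)
            printf("read %lld values; first value = %f\n", (long long)n, data[0]);

        /* The extracted values, and the locations of the data and image files,
           would then be inserted into the database schema described above. */
        free(data);
        H5Sclose(space);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }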
Data Standards for the Genomes to Life Program
Existing GTL projects already have produced volumes of data and, over the course of the next five years, will produce an estimated hundreds, or possibly thousands, of terabytes of data from hundreds of experiments conducted at dozens of laboratories in national labs and universities across the nation. These data will be the basis for publications by individual researchers, research groups, and multi-institutional collaborations, and the basis for future DOE decisions on funding further research in bioremediation. The short-term and long-term value of the data to project participants, to the DOE, and to the nation depends, however, on being able to access the data and on how, or whether, the data are archived. The ability to access data is the starting point for data analysis and interpretation, data integration, data mining, and development of data-driven models. Limited or inefficient data access means that less data are analyzed in a cost-effective and timely manner. Data production in the GTL Program will likely outstrip, or may have already outstripped, the ability to analyze the data.
Being able to access data depends on two key factors: data standards and implementation of the data standards. For the purpose of this proposal, a data standard is defined as a standard, documented way in which data and information about the data are described. The attributes of the experiment in which the data were collected need to be known, and the measurements corresponding to the data collected need to be described. In general terms, a data standard could be a form (electronic or paper) that is completed by a researcher, or a document that prescribes how a protocol or experiment should be described in writing.
Data standards are critical to data access because they provide a framework for organizing and managing data. Researchers spend significant amounts of time managing data and information about experiments using lab notebooks, computer files, Excel spreadsheets, etc. In addition, data output formats vary across equipment and usually need to be reformatted for the variety of computer programs used to display and analyze the data. If, however, data for a given type of experiment were converted from the vendor format to a format defined by a data standard, then researchers and software developers could save time. In addition, if data and information describing how they were obtained were available in a consistent format throughout the GTL Program, comparison and integration of results would be facilitated and a data repository could be built to encourage project-wide data mining.
Data standards also are essential for archiving data sets. If data are stored together with the experiment metadata (i.e., information about the data) in an 'information/data package', then the data retain their value due to the accessibility of information about measurement and analysis procedures. DOE's commitment to developing data standards for the GTL Program is needed to ensure that the most value is obtained from DOE's expenditures on experimental work and to provide a data repository that can be used as the basis for ongoing model development. By developing data standards for experiments conducted as part of the GTL Program, DOE has the opportunity to facilitate data sharing not only within the DOE community, but also with research institutes throughout the world.
Data access layer optimization of the Gaia data processing in Barcelona for spatially arranged data
Gaia is an ambitious astrometric space mission adopted within the scientific programme
of the European Space Agency (ESA) in October 2000. It measures with very high
accuracy the positions and velocities of a large number of stars and astronomical objects.
At the end of the mission, a detailed three-dimensional map of more than one billion
stars will be obtained. The spacecraft is currently orbiting around the L2 Lagrangian
Point, 1.5 million kilometers from the Earth. It is providing a complete survey down to
the 20th magnitude. The two telescopes of Gaia will observe each object 85 times on
average during the 5 years of the mission, recording each time its brightness, color and,
most importantly, its position. This leads to an enormous quantity of complex, extremely
precise data, representing the multiple observations of a billion different objects by an
instrument that is spinning and precessing. The Gaia data challenge, processing raw
satellite telemetry to produce valuable science products, is a huge task in terms of
expertise, effort and computing power. To handle the reduction of the data, an iterative
process between several systems has been designed, each solving different aspects of the
mission.
The Data Analysis and Processing Consortium (DPAC), a large team of scientists and
software developers, is in charge of processing the Gaia data with the aim of producing
the Gaia Catalogue. It is organized in Coordination Units (CUs), responsible for science
and software development and validation, and Data Processing Centers (DPCs), which
actually operate and execute the software systems developed by the CUs. This project
has been developed within the frame of the Core Processing Unit (CU3) and the Data
Processing Center of Barcelona (DPCB).
One of the most important DPAC systems is the Intermediate Data Updating (IDU),
executed at the Marenostrum supercomputer hosted by the Barcelona Supercomputing
Center (BSC), which is the core of the DPCB hardware framework. It must reprocess,
once every few months, all raw data accumulated up to that moment, improving the coherence of the scientific results and correcting any errors or inaccurate approximations
from previous iterations. It has two main objectives: to refine the image
parameters from the astrometric images acquired by the instrument, and to refine the
Cross Match (XM) for all the detections. In particular, the XM will handle an enormous
number of detections at the end of the mission, so it will obviously not be possible to
handle them in a single process. Moreover, one should also consider some limitations
and constraints imposed by the features of the execution environment (the Marenostrum
supercomputer). Therefore, it is necessary to optimize the Data Access Layer (DAL) in
order to efficiently store the huge amount of data coming from the spacecraft, and to
access it in a smart manner. This is the main scope of this project. We have developed
and implemented an efficient and flexible file format based on Hierarchical Data Format
version 5 (HDF5), arranging the detections by a spatial index such as Hierarchical Equal
Area isoLatitude Pixelization (HEALPix) to tessellate the sphere. In this way it is possible
to distribute and process the detections separately and in parallel, according to
their distribution on the sky. Moreover, the HEALPix library and the framework implemented here allow the data to be handled at different resolution levels according to the desired precision. In this project we consider up to level 12, that is, about 201 million pixels on the sphere.
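As a minimal sketch of this spatial indexing step, assuming the standard chealpix C library and arbitrary example coordinates, the following maps a detection's sky position to its level-12 HEALPix pixel in the NESTED scheme; that pixel index is what selects the group or dataset holding the corresponding region of the sky.

    #include <chealpix.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int  level = 12;
        const long nside = 1L << level;              /* nside = 4096 */
        /* 12 * nside^2 = 201,326,592 pixels at level 12 */
        printf("pixels at level %d: %ld\n", level, nside2npix(nside));

        double ra_deg = 266.405, dec_deg = -28.936;      /* example position  */
        double theta  = (90.0 - dec_deg) * M_PI / 180.0; /* colatitude [rad]  */
        double phi    = ra_deg * M_PI / 180.0;           /* longitude  [rad]  */

        long ipix;
        ang2pix_nest(nside, theta, phi, &ipix);
        printf("detection falls in HEALPix pixel %ld\n", ipix);
        return 0;
    }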
Two different alternatives have been designed and developed, namely, a Flat solution and a Hierarchical solution. These names refer to how the data are distributed within the file. In the first case, the whole dataset is contained inside a single group; the Hierarchical solution, on the other hand, stores the groups of data hierarchically, following the HEALPix hierarchy.
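The sketch below shows one plausible way the two layouts could name their groups; the naming conventions are assumptions for illustration, not the ones actually used in the project. The Flat layout keeps one group per level-12 pixel under the root, while the Hierarchical layout unfolds the NESTED pixel index one HEALPix level per nesting step (parent pixel = child pixel / 4).

    #include <stdio.h>
    #include <string.h>

    static void flat_path(long ipix, char *buf, size_t len)
    {
        snprintf(buf, len, "/hp_%09ld", ipix);       /* one group per pixel */
    }

    static void hierarchical_path(long ipix, int level, char *buf, size_t len)
    {
        long idx[16];                        /* pixel index at each level */
        for (int l = level; l >= 0; --l) {
            idx[l] = ipix;
            ipix >>= 2;                      /* parent pixel one level up */
        }
        buf[0] = '\0';
        for (int l = 0; l <= level; ++l) {   /* nested group per level */
            char part[32];
            snprintf(part, sizeof part, "/%ld", idx[l]);
            strncat(buf, part, len - strlen(buf) - 1);
        }
    }

    int main(void)
    {
        char flat[32], hier[128];
        flat_path(123456789L, flat, sizeof flat);
        hierarchical_path(123456789L, 12, hier, sizeof hier);
        printf("Flat:         %s\nHierarchical: %s\n", flat, hier);
        return 0;
    }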
The Gaia DPAC software is implemented in Java, where support for the HDF5 Application Programming Interface (API) is quite limited. Thus, it has also been necessary to use the Java Native Interface (JNI) to bridge to the software developed in this project in C, which follows the HDF5 C API (a sketch of such a bridge is given below). On the Java side, two main classes have been implemented to read and write the data: FileHdf5Archiver and FileArchiveHdf5FileReader. The Java part of this project has been integrated into an existing operational software library, DpcbTools, in coordination with the Barcelona IDU/DPCB team. This has allowed the work done in this project to be integrated into the existing DAL architecture in the most efficient way.
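To illustrate the JNI bridging described above, the sketch shows the C side of a hypothetical native method wrapping an HDF5 write. The method name, its signature, and the (omitted) package prefix in the JNI symbol are assumptions, not the project's actual FileHdf5Archiver API; only the HDF5 and JNI calls themselves are standard.

    #include <jni.h>
    #include <hdf5.h>

    JNIEXPORT jint JNICALL
    Java_FileHdf5Archiver_nativeWriteDoubles(JNIEnv *env, jobject self,
                                             jlong fileId, jstring dsetName,
                                             jdoubleArray values)
    {
        const char *name = (*env)->GetStringUTFChars(env, dsetName, NULL);
        jsize       n    = (*env)->GetArrayLength(env, values);
        jdouble    *buf  = (*env)->GetDoubleArrayElements(env, values, NULL);

        /* Create a 1-D dataset and write the Java array into it. */
        hsize_t dims[1] = { (hsize_t)n };
        hid_t   space   = H5Screate_simple(1, dims, NULL);
        hid_t   dset    = H5Dcreate2((hid_t)fileId, name, H5T_NATIVE_DOUBLE,
                                     space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        herr_t  status  = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                                   H5P_DEFAULT, buf);

        H5Dclose(dset);
        H5Sclose(space);
        (*env)->ReleaseDoubleArrayElements(env, values, buf, JNI_ABORT);
        (*env)->ReleaseStringUTFChars(env, dsetName, name);
        return (jint)status;
    }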
Prior to testing the operational code, we have first evaluated the time required to create the whole empty structure of the file. This has been done with a simple C program which, depending on the requested HEALPix level, creates the skeleton of the file (a sketch of such a program is given below); it has been implemented for both alternatives mentioned above. Up to HEALPix level 6 there is no relevant difference between them. From level 7 onwards the difference becomes more and more important, especially from level 9, where the creation time becomes unmanageable for the Flat solution. In any case, creating the whole file in advance is not needed in the real use case. Therefore, in order to evaluate the most suitable alternative, we have simply considered the input/output performance.
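A minimal sketch of what such a skeleton-creation test could look like for the Flat layout is given here, assuming an illustrative group-naming scheme: one empty HDF5 group is created per HEALPix pixel at the requested level. Because the pixel count grows as 12 * 4^level, the number of groups explodes at high levels, which is consistent with the Flat creation time becoming unmanageable from level 9 onwards.

    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int  level = (argc > 1) ? atoi(argv[1]) : 6;   /* HEALPix level */
        long nside = 1L << level;
        long npix  = 12L * nside * nside;              /* 12 * 4^level pixels */

        hid_t file = H5Fcreate("skeleton_flat.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
        for (long ipix = 0; ipix < npix; ++ipix) {
            char name[32];
            snprintf(name, sizeof name, "hp_%08ld", ipix);
            hid_t grp = H5Gcreate2(file, name, H5P_DEFAULT, H5P_DEFAULT,
                                   H5P_DEFAULT);
            H5Gclose(grp);
        }
        H5Fclose(file);
        printf("created %ld empty groups at level %d\n", npix, level);
        return 0;
    }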
Finally, we have run the performance tests in order to evaluate how the two solutions perform when actually dealing with data contents. The TAR and ZIP solutions have also been tested in order to compare and appraise the speedup and efficiency of our two new alternatives. The analysis of the results has been based on the time to write and read data, the compression ratio, and the read/write rate. Moreover, the different alternatives have been evaluated on two systems with different sets of data as input. The speedup and the compression ratio improvement compared to the previously adopted solutions are considerable for both HDF5-based alternatives, whereas the difference between the two alternatives themselves is less marked. The integration of one of these two solutions will allow the Gaia IDU software to handle the data in a more efficient manner, increasing the final I/O performance remarkably.