Data Standards for the Genomes to Life Program

Abstract

Existing GTL Projects already have produced volumes of data and, over the course of the next five years, will produce an estimated hundreds, or possibly thousands, of terabytes of data from hundreds of experiments conducted at dozens of laboratories in National Labs and universities across the nation. These data will be the basis for publications by individual researchers, research groups, and multi-institutional collaborations, and the basis for future DOE decisions on funding further research in bioremediation. The short-term and long-term value of the data to project participants, to the DOE, and to the nation depends, however, on being able to access the data and on how, or whether, the data are archived. The ability to access data is the starting point for data analysis and interpretation, data integration, data mining, and development of data-driven models. Limited or inefficient data access means that less data are analyzed in a cost-effective and timely manner. Data production in the GTL Program will likely outstrip, or may have already outstripped, the ability to analyze the data. Being able to access data depends on two key factors: data standards and implementation of the data standards. For the purpose of this proposal, a data standard is defined as a standard, documented way in which data and information about the data are described. The attributes of the experiment in which the data were collected need to be known, and the measurements corresponding to the data collected need to be described. In general terms, a data standard could be a form (electronic or paper) that is completed by a researcher or a document that prescribes how a protocol or experiment should be described in writing.

Data standards are critical to data access because they provide a framework for organizing and managing data. Researchers spend significant amounts of time managing data and information about experiments using lab notebooks, computer files, Excel spreadsheets, etc. In addition, data output formats vary across different equipment and usually need to be formatted differently for the variety of computer programs used to display and analyze the data. If, however, data for a given type of experiment were converted from vendor format to a format defined by a data standard, then researchers and software developers could save time. In addition, if data and information describing how they were obtained were available in a consistent format throughout the GTL Program, comparison and integration of results would be facilitated and a data repository could be built to encourage project-wide data mining.

Data standards are also essential for archiving data sets. If data are stored together with the experiment metadata (i.e., information about the data) in an 'information/data package', then the data retain their value because information about measurement and analysis procedures remains accessible.

DOE's commitment to developing data standards for the GTL Program is needed to ensure that the most value is obtained from DOE's expenditures on experimental work and to provide a data repository that can be used as the basis for on-going model development. By developing data standards for experiments conducted as part of the GTL Program, DOE has the opportunity to facilitate data sharing not only within the DOE community, but also with research institutes throughout the world.
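
As an illustration of the 'information/data package' idea described above, the following sketch shows one possible way to bundle experiment metadata with the corresponding measurements in a single, self-describing, archivable unit. The field names, schema, and example values are hypothetical and are not drawn from any GTL-defined standard; they are a minimal sketch of the concept only.

```python
# Minimal, hypothetical "information/data package": experiment metadata stored
# alongside the measurements it describes. All field names and values below are
# illustrative assumptions, not part of any GTL schema.
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class ExperimentMetadata:
    """Attributes of the experiment in which the data were collected."""
    title: str
    investigator: str
    laboratory: str
    date_collected: str      # ISO 8601 date string
    instrument: str
    protocol: str            # reference to a written protocol document
    measurement_units: str


@dataclass
class DataPackage:
    """Bundles metadata and measurements into one archivable unit."""
    metadata: ExperimentMetadata
    measurements: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole package to a consistent, documented format."""
        return json.dumps(asdict(self), indent=2)


# Example usage with made-up values.
package = DataPackage(
    metadata=ExperimentMetadata(
        title="Protein abundance under metal stress (illustrative)",
        investigator="J. Doe",
        laboratory="Example National Laboratory",
        date_collected=date(2004, 6, 1).isoformat(),
        instrument="LC-MS/MS",
        protocol="protocol-042, rev. 2",
        measurement_units="relative abundance",
    ),
    measurements=[0.82, 1.04, 0.97],
)
print(package.to_json())
```

Because the metadata travel with the data in one serialized package, a later reader (or a project-wide repository) can interpret the measurements without consulting the originating lab's notebooks or vendor-specific files.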
