
Data submission and curation for caArray, a standards-based microarray data repository system

Abstract

caArray is an open-source, open-development array data management system, accessible through the web and programmatically, developed at the National Cancer Institute (NCI). It was developed to support the exchange of array data across the Cancer Biomedical Informatics Grid (caBIG™), a collaborative information network that connects scientists and practitioners through a shareable, interoperable infrastructure for exchanging data and knowledge. caArray adopts a federated model of local installations, in which deposited data are shareable across caBIG™.

Being comprehensive in annotation yet easy to use has always been a challenge for any data repository system. To ease this difficulty, caArray accepts data uploads in MAGE-TAB, a spreadsheet-based format for annotating and communicating microarray data in a MIAME-compliant fashion ("http://www.mged.org/mage-tab":http://www.mged.org/mage-tab). MAGE-TAB is built on community standards: MAGE, MIAME, and ontologies. The components and workflow of MAGE-TAB files are organized in a way that is already familiar to bench scientists, which minimizes the time and frustration of reorganizing data before submission. MAGE-TAB files are also structured to be machine-readable, so they can be easily parsed into a database. Users can control public access to experiment- and sample-level data and can create collaboration groups to support data exchange among a defined set of partners.
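To illustrate why a spreadsheet-style layout is straightforward to parse, here is a minimal Python sketch that reads a MAGE-TAB IDF-like text, in which each row holds a tag followed by tab-separated values. The function name @parse_idf@ and the sample content are hypothetical, chosen for illustration only; this is not caArray's own parser.

```python
import csv
import io


def parse_idf(text):
    """Parse IDF-style text: each row is a tag followed by tab-separated values.

    Hypothetical helper for illustration; not part of caArray.
    """
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank rows and comment rows.
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        tag = row[0].strip()
        # Keep only non-empty values; a tag may carry several (one per column).
        fields[tag] = [v.strip() for v in row[1:] if v.strip()]
    return fields


# Invented sample content, not taken from a real submission.
SAMPLE_IDF = (
    "Investigation Title\tExample tumor expression study\n"
    "Person Last Name\tDoe\tRoe\n"
    "SDRF File\texample.sdrf.txt\n"
)

idf = parse_idf(SAMPLE_IDF)
print(idf["Person Last Name"])
```

Because every row is self-describing, the same file can be edited in a spreadsheet by a bench scientist and loaded by a program without a separate conversion step.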

All data submitted to caArray at NCI go through strict curation against these standards by a group of scientists, to make sure that the data are correctly annotated with proper controlled-vocabulary terms and that all required information is provided. Two of the most commonly used ontology sources are the MGED Ontology ("http://mged.sourceforge.net/ontologies/MGEDontology.php":http://mged.sourceforge.net/ontologies/MGEDontology.php) and the NCI Thesaurus ("http://nciterms.nci.nih.gov/NCIBrowser/Dictionary.do":http://nciterms.nci.nih.gov/NCIBrowser/Dictionary.do). The purpose of data curation is to ensure that results from different labs can be compared easily and are reported unambiguously.

Data also undergo an automatic validation process before being parsed into the database, which checks the minimum information requirements and the consistency of the data with the array designs. Files in which errors are found during validation are flagged with an error message. Curators re-examine those files and make the necessary corrections before reloading them. This cycle repeats until the files validate successfully. The data are then imported into the system and are ready for access through the portal or through the API. Interested parties are encouraged to review the installation package, documentation, and source code available from "http://caarray.nci.nih.gov":http://caarray.nci.nih.gov
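The validate, correct, and reload cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the required-column list, the function name @validate_sdrf@, and the set of known array designs are hypothetical and do not reproduce caArray's actual validation rules.

```python
import csv
import io

# Hypothetical minimum set of SDRF columns; caArray's real rules may differ.
REQUIRED_COLUMNS = ["Source Name", "Array Design REF", "Array Data File"]


def validate_sdrf(text, known_array_designs):
    """Return a list of error messages; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing required column: " + c for c in missing]

    errors = []
    # Data rows start on line 2, after the header row.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                errors.append(f"line {line_no}: empty value in '{col}'")
        design = (row["Array Design REF"] or "").strip()
        if design and design not in known_array_designs:
            errors.append(f"line {line_no}: unknown array design '{design}'")
    return errors


# Invented sample: the second data row has an unknown design and no data file.
SUBMITTED = (
    "Source Name\tArray Design REF\tArray Data File\n"
    "tumor_1\tDesignA\ttumor_1.CEL\n"
    "tumor_2\tDesignX\t\n"
)
# The same file after a curator's corrections.
CORRECTED = SUBMITTED.replace("tumor_2\tDesignX\t", "tumor_2\tDesignA\ttumor_2.CEL")

errors = validate_sdrf(SUBMITTED, {"DesignA"})
for message in errors:
    print(message)
errors_after_fix = validate_sdrf(CORRECTED, {"DesignA"})
```

Returning all messages at once, rather than stopping at the first problem, lets a curator fix every flagged row before the next reload.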
