Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

By Wei Hu, Amrapali Zaveri, Honglei Qiu and Michel Dumontier

Abstract

Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because no structured vocabulary guides submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data.

Methods: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together.

Results: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were generally identified correctly as related to their cluster, but 13 keys were not related to the cluster they were assigned to. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63).

Conclusion: Our algorithm identified keys that were similar to each other and grouped them together. The intuition behind cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can then be found easily. Our algorithm can also be applied to other biomedical data types.
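
The clustering step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the three similarity measures or the scalable algorithm, so this example assumes a single stand-in name similarity (Python's difflib ratio), naive average-linkage merging, and a hypothetical threshold of 0.6. The function names name_similarity, average_linkage and cluster_keys are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in name similarity in [0, 1] (assumption, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the keys of two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster_keys(keys, sim=name_similarity, threshold=0.6):
    """Naive agglomerative clustering of metadata keys.

    Starts with one cluster per key and repeatedly merges the most
    similar pair of clusters until no pair reaches the (hypothetical)
    threshold. This is cubic in the number of keys, so it is a toy
    illustration rather than a scalable algorithm.
    """
    clusters = [[k] for k in keys]
    while len(clusters) > 1:
        # Score every pair of clusters and pick the most similar one.
        (i, j), best = max(
            (
                ((i, j), average_linkage(clusters[i], clusters[j], sim))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        # i < j, so removing cluster j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters


if __name__ == "__main__":
    for cluster in cluster_keys(["age", "Age", "tissue", "Tissue", "strain"]):
        print(cluster)
```

In this sketch, keys that differ only in capitalization end up in the same cluster, which is the kind of redundancy the abstract describes; a combined measure over names, core concepts and values would be passed in through the sim parameter.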

Topics: GEO, Metadata, Data quality, Clustering, Biomedical, Experimental data, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Publisher: BMC
Year: 2017
DOI identifier: 10.1186/s12859-017-1832-4
OAI identifier: oai:doaj.org/article:95b281a1929041f8a5d6ce36331bf14f
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • https://doaj.org/article/95b28... (external link)
  • https://doaj.org/toc/1471-2105 (external link)
  • http://link.springer.com/artic... (external link)