
Database federation, resource interoperability and digital identity, for management and exploitation of contemporary biological data

By Gudmundur A. Thorisson


Modern research into the genetic basis of human health and disease is increasingly dominated by high-throughput experimentation and routine generation of large volumes of complex genotype-to-phenotype (G2P) information. Efforts to effectively manage, integrate, analyse and interpret this wealth of data face substantial challenges. This thesis discusses informatics approaches to addressing some of these challenges, primarily in the context of disease genetics.

The genome-wide association study (GWAS) is widely used in the field, but translation of findings into scientific knowledge is hampered by heterogeneous and incomplete reporting, restrictions on sharing of primary data, publication bias and other factors. The central focus of the work was the design and implementation of a core informatics infrastructure for centralised gathering and presentation of GWAS results. The resulting open-access HGVbaseG2P genetic association database and web-based tools for search, retrieval and graphical genome viewing increase the overall usefulness of published GWAS findings.

HGVbaseG2P conceptual modelling activities were also merged into a collaborative standardisation effort with international partners. A key outcome of this joint work is a minimal model for phenotype data which, together with ontologies and other standards, lays the foundation for a federated network of semantically and syntactically interoperable, distributed G2P databases.

Attempts to gather complete aggregate representations of primary GWAS data into HGVbaseG2P were largely unsuccessful, chiefly due to concerns over re-identification of study participants. This led to a separate line of inquiry which explored, via in-depth field analysis, workshop organisation and other community outreach activities, potential applications of federated identity technologies for unambiguously identifying researchers online. Results suggest two broad use cases for user-centric researcher identities: i) practical, streamlined data access management and ii) tracking digital contributions for the purpose of attribution. Both are critical to facilitating and incentivising sharing of GWAS (and other) research data.

Publisher: University of Leicester
Year: 2011
OAI identifier: oai:lra.le.ac.uk:2381/8951
