One key challenge in Systems Biology is to provide mechanisms to collect and integrate the data needed
to meet multiple analysis requirements. Typically, biological contents are scattered over multiple data sources
and there is no easy way of comparing heterogeneous data contents. This work discusses ongoing standardisation
and interoperability efforts and exposes integration challenges for the model organism Escherichia coli K-12. The
goal is to analyse the major obstacles faced by integration processes, suggest ways to systematically identify them,
and, whenever possible, propose solutions or means to assist manual curation. Integration of gene, protein and compound
data was evaluated by performing comparisons over EcoCyc, KEGG, BRENDA, ChEBI, Entrez Gene and
UniProt contents. Cross-links, a number of standard nomenclatures and name information supported the comparisons.
Only in the gene integration scenario did a single element of integration perform well
enough to support the process by itself. Indeed, the integration of both enzyme and compound records requires considerable
curation. The results show that, even for a well-studied model organism, source contents are still far
from being as standardised as desired, and metadata varies considerably from source to source. Before
designing any data integration pipeline, researchers should decide on the sources that best fit the purpose of analysis
and be aware of existing conflicts and inconsistencies so that they can intervene in their resolution. Moreover, they should
understand the limits of automatic integration so that they can define the extent of manual curation needed
for each application.

Portuguese FCT-funded MIT-Portugal Program in Bioengineering (MIT-Pt/BS-BB/0082/2008); PhD grant from
FCT (ref. SFRH/BD/22863/2005) to S.