MilkMine: text-mining, milk proteins and hypothesis generation
The vast and increasing volume of biological data can make it a struggle for scientists to keep
up to date with the latest research and, as a consequence, they may miss significant biological
links, particularly those that extend outwith their own area of expertise. MilkMine is an attempt
to provide a single informatics resource to help milk protein scientists mine this information
mountain more effectively, by integrating standard experimental data types with data generated
by emerging text-mining techniques.
A method was initially developed to identify milk-related terminology from peer-reviewed
biological literature and this was used to complement the Unified Medical Language System
(UMLS), a large thesaurus of biological concepts, their variant names and their types. The
resultant enriched ontology was then mapped to the free text of peer-reviewed biological
literature using the MMTx program, producing a database of semantically enriched sentences.
A co-occurrence relation extraction algorithm was written to identify relationships between milk
proteins and peptides, and other biological concepts, such as diseases or biological processes.
Using these literature relation sets, new hypotheses can be generated from the basic principle
that if “A is linked to B” and “B is linked to C”, then we can infer an association between A
and C. Filtering and downstream processing of the many generated relationships then promotes
the most significant interactions. These literature relations and hypotheses are integrated with biological
data into the MilkMine database.
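
The co-occurrence and "A to B, B to C" steps described above can be illustrated with a minimal sketch. It is not the MilkMine algorithm itself: the toy sentences, the function names `cooccurrence_relations` and `infer_hypotheses`, and the `min_support` threshold are all assumptions made for illustration, and the real system works on MMTx-annotated sentences with further filtering.

```python
from collections import defaultdict
from itertools import combinations

# Toy input (invented for illustration): each sentence is reduced to the
# set of concept labels that were recognised in it.
sentences = [
    {"beta-lactoglobulin", "retinol"},
    {"retinol", "night blindness"},
    {"beta-lactoglobulin", "retinol", "transport"},
    {"casein", "calcium"},
]

def cooccurrence_relations(sentences):
    """Count how many sentences mention each unordered pair of concepts."""
    counts = defaultdict(int)
    for concepts in sentences:
        for a, b in combinations(sorted(concepts), 2):
            counts[(a, b)] += 1
    return dict(counts)

def infer_hypotheses(counts, min_support=1):
    """Propose A-C pairs that share a linking concept B but are not linked directly.

    min_support is a hypothetical filter: pairs seen in fewer sentences
    are not treated as links.
    """
    neighbours = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_support:
            neighbours[a].add(b)
            neighbours[b].add(a)
    linked = dict(neighbours)  # freeze so lookups cannot add keys
    hypotheses = defaultdict(set)
    for b, around_b in linked.items():
        for a, c in combinations(sorted(around_b), 2):
            if c not in linked.get(a, set()):
                hypotheses[(a, c)].add(b)  # remember the bridging concept
    return dict(hypotheses)

counts = cooccurrence_relations(sentences)
hypotheses = infer_hypotheses(counts)
for (a, c), bridges in sorted(hypotheses.items()):
    print(f"{a} ?-- {c} (via {', '.join(sorted(bridges))})")
```

Keeping the bridging concept B with each inferred pair makes later filtering easier, since a hypothesis supported by several independent bridges can be ranked above one supported by a single sentence.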
The MilkMine database is built upon a generic data warehousing system, InterMine. This tool
enabled the integration of traditional data types, such as protein sequence or structural data, from
a variety of sources (e.g. UniProt). However, the standard InterMine model was also extended by
the author to include other data sources (e.g. the Protein Data Bank) and to incorporate the
output of the text-mining algorithm. This integration of otherwise disparate information allows
more complex querying of the data, across many data types. For example, protein sequences are
mapped to instances of the names, synonyms or symbols of the protein in text; a raw
fragment of amino acid sequence (e.g. a particular binding region) can therefore be used to search the
MilkMine database for literature information, as well as the interactions and hypotheses, of those
proteins that contain the sequence. The MilkMine resource is accessible online
(www.bioinformatics.ed.ac.uk/milkmine) through a professional level query interface offering
many features such as an interactive query builder, standard ready-to-run queries, bulk
downloads and the ability to store user preferences and query histories. Evaluation of MilkMine
showed that the text-mining algorithm, as well as the data integration, could provide the user
with interesting connections for further study.
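
The sequence-fragment query mentioned above, in which a raw stretch of amino acids leads to literature about every protein containing it, can be sketched as follows. This is an in-memory illustration only and does not use the InterMine query interface; the `ProteinRecord` class, the `find_by_fragment` function, and the protein names, sequences and sentences are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ProteinRecord:
    """Hypothetical integrated record: sequence data plus text-mined sentences."""
    name: str
    sequence: str
    sentences: list = field(default_factory=list)

def find_by_fragment(records, fragment):
    """Return every record whose sequence contains the fragment (case-insensitive)."""
    fragment = fragment.upper()
    return [r for r in records if fragment in r.sequence.upper()]

# Invented toy data; these are not real protein sequences.
records = [
    ProteinRecord("protein_A", "MKLLVAGTRWQ", ["protein_A binds ligand X."]),
    ProteinRecord("protein_B", "MSTAGTRWPLE", ["protein_B is linked to process Y."]),
    ProteinRecord("protein_C", "MDDEEKKLLNN", []),
]

hits = find_by_fragment(records, "agtrw")
for record in hits:
    print(record.name, record.sentences)
```

Because sequence and literature sit on the same record, a single substring match is enough to move from experimental data to text-mined evidence, which is the kind of cross-type query the integration is meant to support.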