Transfer rule learning for biomarker discovery and verification from related data sets

Ganchev, Philip

thesis

Authors: Philip Ganchev
Publication date: 30 January 2011

Abstract

Biomarkers are a critical tool for the detection, diagnosis, monitoring and prognosis of diseases, and for understanding disease mechanisms in order to create treatments. Unfortunately, finding reliable biomarkers is often hampered by a number of practical problems, including scarcity of samples, the high dimensionality of the data, and measurement error. An important opportunity to make the most of these scarce data is to combine information from multiple related data sets for more effective biomarker discovery. Because the costs of creating large data sets for every disease of interest are likely to remain prohibitive, methods for making more effective use of related biomarker data sets continue to be important.

This thesis develops Transfer Rule Learning (TRL), a novel framework for integrative biomarker discovery from related but separate data sets, such as those generated for similar biomarker profiling studies. TRL alleviates the problem of data scarcity by providing a way to validate knowledge learned from one data set and simultaneously learn new knowledge on a related data set. Unlike other transfer learning approaches, TRL takes prior knowledge in the form of interpretable, modular classification rules, and uses them to seed learning on a new data set.

We evaluated TRL on 13 pairs of real-world biomarker discovery data sets, and found that TRL improves accuracy twice as often as it degrades it. TRL consists of four alternative methods for transfer and three measures of the amount of information transferred. By experimenting with these methods, we investigate the kinds of information that must be preserved for transfer learning from related data sets. We found it is important to keep track of the relationships between biomarker values and disease state, and to consider during learning how rules will interact in the final model. If the source and target data are drawn from the same distribution, we found that the performance improvement and the amount of transfer increase with increasing size of the source data relative to the target data.
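
The general idea of reusing classification rules from one data set to seed learning on a related one can be illustrated with a small sketch. This is not the TRL algorithm from the thesis, whose four transfer methods are not specified in the abstract. It is a simplified illustration under stated assumptions: rules are single-threshold tests on one biomarker, a rule is "validated" if its precision on the target data meets a cutoff, and all names (`Rule`, `precision`, `transfer_seeds`, `marker_a`, `marker_b`, `min_precision`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# A sample maps biomarker names to measured values (assumed representation).
Sample = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: IF feature <op> threshold THEN label."""
    feature: str
    op: str          # either ">" or "<="
    threshold: float
    label: str

    def covers(self, x: Sample) -> bool:
        # Assumes every sample contains the rule's feature.
        value = x[self.feature]
        return value > self.threshold if self.op == ">" else value <= self.threshold


def precision(rule: Rule, samples: List[Sample], labels: List[str]) -> float:
    """Fraction of samples covered by the rule whose label matches the rule's label."""
    covered = [y for x, y in zip(samples, labels) if rule.covers(x)]
    if not covered:
        return 0.0
    return sum(y == rule.label for y in covered) / len(covered)


def transfer_seeds(source_rules: List[Rule], samples: List[Sample],
                   labels: List[str], min_precision: float = 0.75) -> List[Rule]:
    """Keep source rules that still hold on the target data (assumed criterion)."""
    return [r for r in source_rules
            if precision(r, samples, labels) >= min_precision]


# Toy rules "learned" on a source data set (illustrative values only).
source_rules = [
    Rule("marker_a", ">", 2.0, "case"),
    Rule("marker_b", "<=", 1.0, "case"),
]

# Toy target data set (illustrative values only).
target_samples = [
    {"marker_a": 3.1, "marker_b": 0.5},
    {"marker_a": 2.6, "marker_b": 1.8},
    {"marker_a": 1.2, "marker_b": 0.7},
    {"marker_a": 0.9, "marker_b": 2.2},
]
target_labels = ["case", "case", "control", "control"]

seeds = transfer_seeds(source_rules, target_samples, target_labels)
```

In this toy data the `marker_a` rule is retained as a seed, while the `marker_b` rule is dropped because it covers one case and one control in the target data. A rule learner on the target data could then start from the retained seeds rather than from an empty rule set.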