thesis

Methods and tools for mining the transcriptomic landscape of human tissue and disease

Abstract

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from PDF version of thesis. Includes bibliographical references (p. 343-356).

Although a variety of high-throughput technologies are used to perform biological experiments, DNA microarrays have become a standard tool in the modern biologist's arsenal. Microarray experiments measure the expression of thousands of genes simultaneously and offer a snapshot view of transcriptomic activity. With the rapid growth in publicly available transcriptomic data, there is increasing recognition that large sets of such data can be mined to better understand disease states and mechanisms. Unfortunately, several challenges arise when attempting to perform such large-scale analyses. For instance, the public repositories to which the data are submitted were designed for storage rather than for data mining. As a result, the seemingly simple task of obtaining all data relating to a particular disease becomes arduous. Furthermore, prior gene expression analyses, both large and small, have been dichotomous in nature: phenotypes are compared using clearly defined controls. Such approaches may require arbitrary decisions about what is considered a "normal" phenotype and what each phenotype should be compared to. Addressing these issues, we introduce methods for creating a large curated gene expression database geared towards data mining, and explore methods for efficiently expanding this database using active learning. Leveraging our curated expression database, we adopt a holistic approach in which we characterize phenotypes in the context of a myriad of tissues and diseases.
We introduce scalable methods that associate expression patterns with phenotypes in order to assign phenotype labels to new expression samples and to select phenotypically meaningful gene signatures. By using a nonparametric statistical approach, we identify signatures that are more precise than those from existing approaches and that accurately reveal biological processes hidden in case vs. control studies. We conclude the work by exploring the applicability of the heterogeneous expression database to the analysis of clinical drugs for the purpose of drug repurposing.

by Patrick Raphael Schmid.

Ph.D.
