thesis

Biomedical data retrieval utilizing textual data in a gene expression database by Richard Lu, MD.

Abstract

Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 68-74).

Background: The commoditization of high-throughput gene expression sequencing and microarrays has led to a proliferation in the amount of both genomic and clinical data available. Descriptive textual information deposited with gene expression data in the Gene Expression Omnibus (GEO) is an underutilized resource because it is unstructured and difficult to query. Rendering this information in a structured format using standard medical terms would facilitate better searching and data reuse, and would significantly increase the clinical utility of biomedical data repositories.

Methods: The thesis is divided into two sections. The first section compares how well four medical terminologies represent textual information deposited in GEO. The second section implements free-text search and faceted search and evaluates how well each answers clinical queries of varying complexity.

Part I: 120 samples were randomly extracted from samples deposited in the GEO database from six clinical domains: breast cancer, colon cancer, rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), type I diabetes mellitus (IDDM), and asthma. These samples had previously been annotated manually, yielding structured textual information in a tag:value format. The data were mapped to four controlled terminologies: NCI Thesaurus, MeSH, SNOMED-CT, and ICD-10. Each sample was assigned a score on a three-point scale based on how well the terminology represented the descriptive textual information.

Part II: Faceted and free-text search tools were implemented, with 300 GEO samples included for querying. Eight natural language search questions were selected randomly from scientific journals. Academic researchers were recruited and asked to use the faceted and free-text search tools to locate samples matching the question criteria. Precision, recall, F-score, and search time were compared and analyzed for both free-text and faceted search.

Results: The NCI Thesaurus consistently ranked as the most comprehensive terminology across all domains, while ICD-10 consistently ranked as the least comprehensive. Using the faceted search tool augmented with the NCI Thesaurus, each researcher reached 100% precision and recall (F-score 1.0) for each of the eight search questions. Using free-text search, test users averaged 22.8% precision, 60.7% recall, and an F-score of 0.282. The mean search times per question using faceted search and free-text search were 116.7 seconds and 138.4 seconds, respectively. The difference in search time was not statistically significant (p=0.734). However, paired t-test analysis showed a statistically significant difference between the two search strategies with respect to precision (p=0.001), recall (p=0.042), and F-score (p<0.001).

Conclusion: This work demonstrates that biomedical terms included in a gene expression database can be adequately expressed using the NCI Thesaurus. It also shows that faceted searching using a controlled terminology is superior to conventional free-text searching when answering queries of varying levels of complexity.
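The retrieval metrics reported above can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the sample identifiers, and the two example queries are hypothetical, and the F-score is assumed to be the balanced F1 (harmonic mean of precision and recall). The sketch also shows that averaging per-query F-scores generally gives a different number than computing F from the averaged precision and recall, which is one possible reason a mean F-score need not equal the harmonic mean of the mean precision and mean recall.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    retrieved: sample IDs the searcher returned
    relevant:  sample IDs that truly match the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # correctly retrieved samples
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical queries: (retrieved IDs, relevant IDs). Not data from the thesis.
queries = [
    (["GSM1", "GSM2", "GSM9", "GSM10", "GSM11"], ["GSM1", "GSM2", "GSM3", "GSM4"]),
    (["GSM5"], ["GSM5", "GSM6"]),
]

scores = [precision_recall_f1(ret, rel) for ret, rel in queries]
mean_precision = mean(s[0] for s in scores)
mean_recall = mean(s[1] for s in scores)
mean_f1 = mean(s[2] for s in scores)  # average of per-query F1 values

# F1 computed from the averaged precision and recall, for comparison.
f1_of_means = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)

print(f"mean precision={mean_precision:.3f}, mean recall={mean_recall:.3f}")
print(f"mean of per-query F1={mean_f1:.3f}, F1 of the means={f1_of_means:.3f}")
```

Sets are used so that duplicate sample IDs in a result list are counted once; a query that retrieves nothing is scored as zero precision rather than raising a division error.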
