Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics

Extracting Protein Function Information from MEDLINE Using a Full-Sentence Parser

Abstract

The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Understanding protein functions and how proteins modify and regulate each other is the next great challenge for life science researchers. The collective knowledge about protein functions and pathways is scattered across numerous publications in scientific journals, and bringing the relevant information together creates a bottleneck in the research and discovery process. The volume of such information grows exponentially, which in turn renders manual curation impractical. As a viable alternative, automated literature processing tools could be employed to extract and organize biological data into a knowledge base, making it amenable to computational analysis and data mining. We present MedScan, a fully automated NLP-based information extraction system. We have used MedScan to extract about 280,000 mammalian protein functional links from the entire 2003 release of MEDLINE in only 21 hours. The precision of the extracted information was found to be 91%. We have compared the extracted data with protein co-occurrence data and with nine well-studied cellular signaling pathways, and estimated the recovery rate of MedScan for the entirety of MEDLINE to be between 30% and 50%. Further improvement of the MedScan technology is discussed.
