SparkText: Biomedical Text Mining on Big Data Framework
<div><p>Background</p><p>Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.</p><p>Results</p><p>In this study, we designed and developed an efficient text mining framework called SparkText on a <i>Big Data</i> infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.</p><p>Conclusions</p><p>This study demonstrates the potential for mining large-scale scientific articles on a <i>Big Data</i> infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.</p></div>
<p>Comparing the time efficiency results, SparkText outperformed other available text mining tools with speeds up to 132 times faster on the larger dataset that included 29,437 full-text articles.</p>
An example of unigrams and bigrams extracted from the sentence “The purpose of this study was to examine the incidence of breast cancer with triple negative phenotype.”
<p>The sentence was chosen from an abstract downloaded from PubMed.</p>
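The unigram/bigram extraction illustrated by this caption can be sketched in a few lines of Python. This is an illustrative sketch of the general n-gram technique, not the authors' implementation:

```python
# Minimal n-gram extraction over a whitespace-tokenized sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (space-joined strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("The purpose of this study was to examine the incidence "
            "of breast cancer with triple negative phenotype")
tokens = sentence.lower().split()

unigrams = ngrams(tokens, 1)  # single terms, e.g. "breast", "cancer"
bigrams = ngrams(tokens, 2)   # adjacent pairs, e.g. "breast cancer"

print(unigrams[:3])
print(bigrams[:3])
```

Bigrams such as "breast cancer" carry phrase-level signal that individual unigrams miss, which is why both are commonly used as classification features.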
The quantitative results for accuracy, precision, and recall of SparkText using three datasets.
<p>For each dataset, 80% was used to train a prediction model and the remaining 20% for testing.</p>
The datasets: all abstracts and full-text articles were downloaded from PubMed.
<p>The datasets included abstracts and full-text articles related to three types of cancer: breast, lung, and prostate. For each dataset, we used 80% of the data to train a prediction model and the remaining 20% for testing.</p>
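The 80/20 train/test split described here can be sketched as follows; the fixed seed and placeholder article IDs are assumptions for illustration, not details from the paper:

```python
import random

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle records and split them into train and test subsets."""
    rng = random.Random(seed)
    shuffled = records[:]      # copy so the input list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the 29,437 full-text articles.
articles = [f"article-{i}" for i in range(29437)]
train, test = train_test_split(articles)
print(len(train), len(test))  # 23549 5888
```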
<p>The ROC curves for the dataset “Full-text Articles II”: the area under the curve for the SVM classifier represents a better result compared to those of the Naïve Bayes and Logistic Regression algorithms.</p>
An example of a bag-of-words representation.
<p>The terms “biology”, “biopsy”, “biolab”, “biotin”, and “almost” are unigrams, while “cancer-surviv” and “cancer-stage” are bigrams. Using TF/IDF weighting scores, the feature value of the term “almost” equals zero.</p>
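A short sketch of why the TF/IDF score of “almost” is zero: under the common idf = log(N / df) convention, a term that occurs in every document has idf = log(1) = 0. The toy corpus below is hypothetical, not the paper's data:

```python
import math
from collections import Counter

docs = [
    ["biology", "biopsy", "almost"],
    ["biolab", "biotin", "almost"],
    ["cancer-surviv", "cancer-stage", "almost"],
]

def tf_idf(term, doc, docs):
    """TF-IDF with raw term frequency and idf = log(N / df)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)   # document frequency
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("almost", docs[0], docs))   # 0.0: appears in every document
print(tf_idf("biology", docs[0], docs))  # positive: discriminative term
```

Terms like “almost” thus contribute nothing to the feature vector, which is exactly the behavior the weighting scheme is designed to produce for non-discriminative words.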
<p>The number of publications in PubMed (<a href="http://www.ncbi.nlm.nih.gov/pubmed" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed</a>) over the last six years, obtained by submitting a query for “cancer” in all fields.</p>
The basic framework of SparkText: We first loaded structured and unstructured abstracts and/or full-text articles into a Cassandra database, which stored the data across multiple compute nodes.
<p>After that, we performed text preprocessing and feature extraction before building prediction models based on Apache Spark. The Apache Spark Core contains the main functionalities and APIs for distributed <i>Big Data</i> solutions. As part of the Apache Spark components, MLlib is a scalable machine learning library that includes common machine learning methods and utilities, such as classification, clustering, regression, collaborative filtering, dimensionality reduction, and underlying optimization primitives. The Standalone Scheduler enables a standalone cluster mode, which runs applications in first-in-first-out (FIFO) fashion, with each application deployed across multiple compute nodes. Spark Streaming handles real-time streaming of <i>Big Data</i> files based on a micro-batch style of processing.</p>
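The classification step above trains models such as Naïve Bayes on the extracted term features. The following is a tiny pure-Python sketch of multinomial Naïve Bayes with add-one smoothing, meant only to illustrate the idea on a single machine; SparkText itself uses MLlib's distributed implementations, and the example tokens and labels here are invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Returns a simple model tuple."""
    label_counts = Counter(label for _, label in examples)
    term_counts = defaultdict(Counter)   # per-label term frequencies
    vocab = set()
    for tokens, label in examples:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab, len(examples)

def predict_nb(model, tokens):
    """Pick the label with the highest log posterior."""
    label_counts, term_counts, vocab, n = model
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / n)      # log prior
        total = sum(term_counts[label].values())
        for t in tokens:
            # add-one (Laplace) smoothing over the vocabulary
            score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    (["breast", "tumor", "her2"], "breast"),
    (["lung", "smoking", "nodule"], "lung"),
    (["prostate", "psa", "gleason"], "prostate"),
]
model = train_nb(examples)
print(predict_nb(model, ["psa", "gleason"]))  # prostate
```

The distributed MLlib version performs the same counting, but partitions the corpus across compute nodes and aggregates the per-label statistics, which is what makes the framework scale to tens of thousands of articles.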
SparkText: Biomedical Text Mining on Big Data Framework - Fig 5
<p><b>Quantitative comparisons of the prediction models on text mining</b>: (A) the accuracy, precision, and recall obtained from 19,681 abstracts; (B) the accuracy, precision, and recall on 12,902 full-text articles; and (C) the accuracy, precision, and recall on 29,437 full-text articles. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0162721#pone.0162721.t002" target="_blank">Table 2</a> provides the details on these 3 datasets. Five-fold cross validation was used in all analyses.</p>
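The five-fold cross validation mentioned in this caption works by holding out each of 5 folds in turn as the test set while training on the other 4. A minimal sketch of how the folds are formed (indices only; the train-and-score step is a placeholder, not the authors' code):

```python
def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
for i, test_fold in enumerate(folds):
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # here one would train on train_idx and evaluate on test_fold,
    # then average the k scores to get the cross-validated metric
    print(i, test_fold, len(train_idx))
```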