SparkText: Biomedical Text Mining on Big Data Framework
<div><p>Background</p><p>Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.</p><p>Results</p><p>In this study, we designed and developed an efficient text mining framework called SparkText on a <i>Big Data</i> infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.</p><p>Conclusions</p><p>This study demonstrates the potential for mining large-scale scientific articles on a <i>Big Data</i> infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.</p></div>
<p>Comparing the time efficiency results, SparkText outperformed other available text mining tools with speeds up to 132 times faster on the larger dataset that included 29,437 full-text articles.</p>
An example of unigrams and bigrams extracted from the sentence “The purpose of this study was to examine the incidence of breast cancer with triple negative phenotype.”
<p>The sentence was chosen from an abstract downloaded from PubMed.</p>
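The unigram/bigram extraction illustrated by this caption can be sketched in a few lines of Python. This is an illustrative sketch of the general n-gram technique, not the authors' implementation:

```python
# Minimal n-gram extraction over a whitespace-tokenized sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (space-joined strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("The purpose of this study was to examine the incidence "
            "of breast cancer with triple negative phenotype")
tokens = sentence.lower().split()

unigrams = ngrams(tokens, 1)  # single terms, e.g. "breast", "cancer"
bigrams = ngrams(tokens, 2)   # adjacent pairs, e.g. "breast cancer"

print(unigrams[:3])
print(bigrams[:3])
```

Bigrams such as "breast cancer" carry phrase-level signal that individual unigrams miss, which is why both are commonly used as classification features.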
The quantitative results for accuracy, precision, and recall of SparkText using three datasets.
<p>For each dataset, 80% was used to train a prediction model and the remaining 20% for testing.</p>
The datasets: all abstracts and full-text articles were downloaded from PubMed.
<p>The datasets included abstracts and full-text articles related to three types of cancer: breast, lung, and prostate. For each dataset, we used 80% of the data to train a prediction model and the remaining 20% for testing.</p>
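The 80/20 train/test split described here can be sketched as follows; the fixed seed and placeholder article IDs are assumptions for illustration, not details from the paper:

```python
import random

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle records and split them into train and test subsets."""
    rng = random.Random(seed)
    shuffled = records[:]      # copy so the input list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the 29,437 full-text articles.
articles = [f"article-{i}" for i in range(29437)]
train, test = train_test_split(articles)
print(len(train), len(test))  # 23549 5888
```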
<p>The ROC curves for the dataset “Full-text Articles II”: the area under the curve for the SVM classifier represents a better result compared to those of the Naïve Bayes and Logistic Regression algorithms.</p>
An example of a bag-of-words representation.
<p>The terms “biology”, “biopsy”, “biolab”, “biotin”, and “almost” are unigrams, while “cancer-surviv” and “cancer-stage” are bigrams. Using TF/IDF weighting scores, the feature value of the term “almost” equals zero.</p>
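A short sketch of why the TF/IDF score of “almost” is zero: under the common idf = log(N / df) convention, a term that occurs in every document has idf = log(1) = 0. The toy corpus below is hypothetical, not the paper's data:

```python
import math
from collections import Counter

docs = [
    ["biology", "biopsy", "almost"],
    ["biolab", "biotin", "almost"],
    ["cancer-surviv", "cancer-stage", "almost"],
]

def tf_idf(term, doc, docs):
    """TF-IDF with raw term frequency and idf = log(N / df)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)   # document frequency
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("almost", docs[0], docs))   # 0.0: appears in every document
print(tf_idf("biology", docs[0], docs))  # positive: discriminative term
```

Terms like “almost” thus contribute nothing to the feature vector, which is exactly the behavior the weighting scheme is designed to produce for non-discriminative words.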
<p>The number of publications in PubMed (<a href="http://www.ncbi.nlm.nih.gov/pubmed" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed</a>) over the last six years, obtained by submitting a query for “cancer” in all fields.</p>
The basic framework of SparkText: We first loaded structured and unstructured abstracts and/or full-text articles into a Cassandra database, which stored the data across multiple compute nodes.
<p>After that, we performed text preprocessing and feature extraction before building prediction models based on Apache Spark. The Apache Spark Core contains the main functionalities and APIs for distributed <i>Big Data</i> solutions. As part of the Apache Spark components, MLlib is a scalable machine learning library that includes common machine learning methods and utilities, such as classification, clustering, regression, collaborative filtering, dimensionality reduction, and underlying optimization primitives. The Standalone Scheduler enables a standalone cluster mode, which runs applications in first-in-first-out (FIFO) fashion, with each application deployed across multiple compute nodes. Spark Streaming handles real-time streaming of <i>Big Data</i> files based on a micro-batch style of processing.</p>
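The classification step above trains models such as Naïve Bayes on the extracted term features. The following is a tiny pure-Python sketch of multinomial Naïve Bayes with add-one smoothing, meant only to illustrate the idea on a single machine; SparkText itself uses MLlib's distributed implementations, and the example tokens and labels here are invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Returns a simple model tuple."""
    label_counts = Counter(label for _, label in examples)
    term_counts = defaultdict(Counter)   # per-label term frequencies
    vocab = set()
    for tokens, label in examples:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab, len(examples)

def predict_nb(model, tokens):
    """Pick the label with the highest log posterior."""
    label_counts, term_counts, vocab, n = model
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / n)      # log prior
        total = sum(term_counts[label].values())
        for t in tokens:
            # add-one (Laplace) smoothing over the vocabulary
            score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    (["breast", "tumor", "her2"], "breast"),
    (["lung", "smoking", "nodule"], "lung"),
    (["prostate", "psa", "gleason"], "prostate"),
]
model = train_nb(examples)
print(predict_nb(model, ["psa", "gleason"]))  # prostate
```

The distributed MLlib version performs the same counting, but partitions the corpus across compute nodes and aggregates the per-label statistics, which is what makes the framework scale to tens of thousands of articles.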
SparkText: Biomedical Text Mining on Big Data Framework - Fig 5
<p><b>Quantitative comparisons of the prediction models on text mining</b>: (A) the accuracy, precision, and recall obtained from 19,681 abstracts; (B) the accuracy, precision, and recall on 12,902 full-text articles; and (C) the accuracy, precision, and recall on 29,437 full-text articles. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0162721#pone.0162721.t002" target="_blank">Table 2</a> provides the details on these 3 datasets. Five-fold cross validation was used in all analyses.</p>
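The five-fold cross validation mentioned in this caption works by holding out each of 5 folds in turn as the test set while training on the other 4. A minimal sketch of how the folds are formed (indices only; the train-and-score step is a placeholder, not the authors' code):

```python
def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
for i, test_fold in enumerate(folds):
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # here one would train on train_idx and evaluate on test_fold,
    # then average the k scores to get the cross-validated metric
    print(i, test_fold, len(train_idx))
```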