
    SparkText: Biomedical Text Mining on Big Data Framework

    Background: Many new biomedical research articles are published every day, accumulating rich information such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining of large-scale scientific literature can uncover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.

    Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure composed of Apache Spark data streaming and machine learning methods combined with a Cassandra NoSQL database. To demonstrate its performance in classifying cancer types, we extracted information (e.g., on breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models for mining the articles. The accuracy of predicting a cancer type with SVM on the 29,437 full-text articles was 93.81%. While competing text mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.

    Conclusions: This study demonstrates the potential of mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
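    The paper's implementation is not reproduced on this page, so the following is a minimal sketch of a comparable Spark MLlib text-classification pipeline in Scala. The input file, column names, and the choice of a Naïve Bayes final stage are illustrative assumptions, not SparkText's actual code.

        import org.apache.spark.ml.Pipeline
        import org.apache.spark.ml.classification.NaiveBayes
        import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
        import org.apache.spark.sql.SparkSession

        object CancerTypeClassifierSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("SparkTextSketch").getOrCreate()

            // Hypothetical input: one row per article, with the article text and a
            // numeric cancer-type label (e.g., 0 = breast, 1 = lung, 2 = prostate).
            val articles = spark.read.parquet("articles.parquet") // columns: text, label

            // Tokenize, hash term frequencies, and reweight with IDF to obtain
            // TF/IDF features, then train one of the classifiers from the abstract.
            val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
            val tf  = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
            val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
            val nb  = new NaiveBayes().setLabelCol("label").setFeaturesCol("features")

            val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, nb))
            val model = pipeline.fit(articles)
            model.transform(articles).select("prediction", "label").show(5)

            spark.stop()
          }
        }

    Swapping the final stage for LogisticRegression, or for LinearSVC (available from Spark 2.2 onward), covers the other two classifiers named in the abstract.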

    Comparing time efficiency, SparkText outperformed other available text mining tools, running up to 132 times faster on the larger dataset of 29,437 full-text articles.


    The quantitative results for accuracy, precision, and recall of SparkText using three datasets.

    For each dataset, 80% was used to train a prediction model and the remaining 20% for testing.

    The datasets: all abstracts and full-text articles were downloaded from PubMed.

    The datasets included abstracts and full-text articles related to three types of cancer: breast, lung, and prostate cancer. For each dataset, we employed 80% of the data to train a prediction model, while the remaining 20% was used for testing.
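    A minimal sketch of that 80/20 protocol in Spark, reusing the hypothetical articles DataFrame and pipeline from the earlier sketch; the seed is arbitrary.

        import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

        // Hold out 20% of the data for testing, as described above.
        val Array(train, test) = articles.randomSplit(Array(0.8, 0.2), seed = 42L)
        val model = pipeline.fit(train)

        // Score the held-out 20% with Spark's multiclass evaluator.
        val accuracy = new MulticlassClassificationEvaluator()
          .setLabelCol("label")
          .setPredictionCol("prediction")
          .setMetricName("accuracy")
          .evaluate(model.transform(test))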

    The ROC curves for the dataset “Full-text Articles II”: the area under the curve for the SVM classifier indicates a better result compared with those of the Naïve Bayes and Logistic Regression algorithms.


    An example of a bag-of-words representation.

    The terms “biology”, “biopsy”, “biolab”, “biotin”, and “almost” are unigrams, while “cancer-surviv” and “cancer-stage” are bigrams. Under TF/IDF weighting, the feature value of the term “almost” equals zero.
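    A minimal sketch of such a mixed unigram/bigram bag-of-words with TF/IDF in Spark MLlib, assuming a DataFrame docs with a "text" column. Stemmed forms like "cancer-surviv" would come from a stemming step not shown here, and concat on array columns requires Spark 2.4+.

        import org.apache.spark.ml.feature.{HashingTF, IDF, NGram, Tokenizer}
        import org.apache.spark.sql.functions.{col, concat}

        // Unigrams from the tokenizer, bigrams from NGram, concatenated into
        // one term list per document.
        val unigrams = new Tokenizer().setInputCol("text").setOutputCol("unigrams")
        val bigrams  = new NGram().setN(2).setInputCol("unigrams").setOutputCol("bigrams")

        val terms = bigrams.transform(unigrams.transform(docs))
          .withColumn("terms", concat(col("unigrams"), col("bigrams")))

        // TF/IDF weighting: a term such as "almost" that appears in virtually
        // every document gets an IDF of (near) zero, so its feature value vanishes.
        val tf    = new HashingTF().setInputCol("terms").setOutputCol("tf").transform(terms)
        val tfidf = new IDF().setInputCol("tf").setOutputCol("features").fit(tf).transform(tf)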

    The number of publications in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) over the last six years, obtained by submitting a query for “cancer” in all fields.


    The basic framework of SparkText: we first loaded structured and unstructured abstracts and/or full-text articles into a Cassandra database, which was then distributed across multiple compute nodes.

    After that, we performed text preprocessing and feature extraction before building prediction models on Apache Spark. The Apache Spark Core contains the main functionalities and APIs for distributed Big Data solutions. MLlib, one of the Apache Spark components, is a scalable machine learning library that includes common machine learning methods and utilities, such as classification, clustering, regression, collaborative filtering, dimensionality reduction, and underlying optimization primitives. The Standalone Scheduler provides a standalone-mode cluster that runs applications in first-in-first-out (FIFO) fashion, with each application deployed across multiple compute nodes. Spark Streaming handles real-time streaming of Big Data files using a micro-batch style of processing.
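    A sketch of the load step described above, assuming the DataStax spark-cassandra-connector is on the classpath; the connection host, keyspace, and table names are hypothetical.

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("SparkTextLoad")
          .config("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
          .getOrCreate()

        // Read the article table that was loaded into Cassandra and distributed
        // across the compute nodes; keyspace/table names are illustrative.
        val articles = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "sparktext", "table" -> "articles"))
          .load()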

    SparkText: Biomedical Text Mining on Big Data Framework - Fig 5

    Quantitative comparisons of the prediction models on text mining: (A) accuracy, precision, and recall obtained from 19,681 abstracts; (B) accuracy, precision, and recall on 12,902 full-text articles; and (C) accuracy, precision, and recall on 29,437 full-text articles. Table 2 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0162721#pone.0162721.t002) provides details on these three datasets. Five-fold cross-validation was used in all analyses.
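    A sketch of that five-fold cross-validation with Spark's CrossValidator, reusing the pipeline and articles from the earlier sketches; the empty parameter grid is a placeholder, since no tuning grid is given here.

        import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
        import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

        // Five-fold cross-validation over the pipeline from the earlier sketch.
        val cv = new CrossValidator()
          .setEstimator(pipeline)
          .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
          .setEstimatorParamMaps(new ParamGridBuilder().build()) // placeholder grid
          .setNumFolds(5)

        val cvModel = cv.fit(articles)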