8,236 research outputs found

    Predicting Pancreatic Cancer Using Support Vector Machine

    This report presents an approach to predicting pancreatic cancer using the Support Vector Machine classification algorithm. The research objective of this project is to predict pancreatic cancer from genomic data alone, from clinical data alone, and from a combination of genomic and clinical data. We used real genomic data with 22,763 samples and 154 features per sample. We also created synthetic clinical data with 400 samples and 7 features per sample in order to measure prediction accuracy on clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, prediction accuracy, precision, and recall with genomic data alone are 80.77%, 20%, and 4%. Prediction accuracy, precision, and recall with synthetic clinical data alone are 93.33%, 95%, and 30%. Prediction accuracy, precision, and recall for the combination of real genomic and synthetic clinical data are 90.83%, 10%, and 5%. The combination of real genomic and synthetic clinical data decreased the accuracy because the genomic data is weakly correlated. Thus we conclude that the combination of genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help to predict pancreatic cancer more accurately.
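
    The abstract above reports high accuracy alongside very low recall, which is typical when positive cases are rare. The following is a minimal sketch of how the three metrics relate to confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical illustrations, not the study's data or code.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Hypothetical counts (NOT taken from the report): 25 true cancer cases
    # among 130 samples, of which the classifier finds only one.
    metrics = classification_metrics(tp=1, fp=4, fn=24, tn=101)
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

    With counts like these, the many correctly classified negatives dominate accuracy, while precision and recall expose how few positive cases are actually detected.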

    Using probe electrospray ionization mass spectrometry and machine learning for detecting pancreatic cancer with high performance

    A rapid blood-based diagnostic modality to detect pancreatic ductal adenocarcinoma (PDAC) with high accuracy is an unmet medical need. The study aimed to validate a unique diagnosis system using Probe Electrospray Ionization Mass Spectrometry (PESI-MS) and Machine Learning for the diagnosis of PDAC. Peripheral blood samples were collected from a total of 322 consecutive PDAC patients and 265 controls with a family history of PDAC. Five µl of each serum sample was analyzed using the PESI-MS system. The mass spectra from each specimen were then fed into machine learning algorithms to discriminate between control and cancer cases. A total of 587 serum samples were analyzed. The sensitivity of the machine learning algorithm using PESI-MS profiles to identify PDAC is 90.8% with specificity of 91.7% (95% CI 83.9%-97.4% and 82.8%-97.7% respectively). With PESI-MS profiles combined with age and CA19-9 as predictors, the accuracy for stage 1 or 2 of PDAC is 92.9% and for stage 3 or 4 is 93% (95% CI 86.3-98.2; 87.9-97.4 respectively). The accuracy and simplicity of the PESI-MS profiles combined with machine learning provide an opportunity to detect PDAC at an early stage and must be applicable to the examination of at-risk populations. [Abstract copyright: AJTR Copyright © 2020.]

    ISOWN: accurate somatic mutation identification in the absence of normal tissue controls.

    Background: A key step in cancer genome analysis is the identification of somatic mutations in the tumor. This is typically done by comparing the genome of the tumor to the reference genome sequence derived from a normal tissue taken from the same donor. However, there are a variety of common scenarios in which matched normal tissue is not available for comparison. Results: In this work, we describe an algorithm to distinguish somatic single nucleotide variants (SNVs) in next-generation sequencing data from germline polymorphisms in the absence of normal samples using a machine learning approach. Our algorithm was evaluated using a family of supervised learning classifications across six different cancer types and ~1600 samples, including cell lines, fresh frozen tissues, and formalin-fixed paraffin-embedded tissues; we tested our algorithm with both deep targeted and whole-exome sequencing data. Our algorithm correctly classified between 95 and 98% of somatic mutations, with F1-measures ranging from 75.9 to 98.6% depending on the tumor type. We have released the algorithm as a software package called ISOWN (Identification of SOmatic mutations Without matching Normal tissues). Conclusions: In this work, we describe the development, implementation, and validation of ISOWN, an accurate algorithm for predicting somatic mutations in cancer tissues in the absence of matching normal tissues. ISOWN is available as Open Source under Apache License 2.0 from https://github.com/ikalatskaya/ISOWN

    Evidence-Based Detection of Pancreatic Canc

    This study is an effort to develop a tool for early detection of pancreatic cancer using evidential reasoning. An evidential reasoning model predicts the likelihood of an individual developing pancreatic cancer by processing the outputs of a Support Vector Classifier (SVC) and other input factors such as smoking history, drinking history, sequencing reads, biopsy location, and family and personal health history. Certain features of the genomic data, along with the mutated gene sequences of pancreatic cancer patients, were obtained from the National Cancer Institute (NCI) Genomic Data Commons (GDC). This data was used to train the SVC. A prediction accuracy of ~85% with a ROC AUC of 83.4% was achieved. Synthetic data was assembled in different combinations to evaluate the working of the evidential reasoning model. Using this, variations in the belief interval of developing pancreatic cancer are observed. When the model is provided with an input of high smoking history and family history of cancer, an increase in the evidential reasoning interval in belief of pancreatic cancer and in support of the machine learning model's prediction is observed. Likewise, a decrease in the quantity of genetic material and an irregularity in the cellular structure near the pancreas increase support for the machine learning classifier's prediction of pancreatic cancer. This evidence-based approach is an attempt to diagnose pancreatic cancer at a premalignant stage. Future work includes using real sequencing reads as well as accurate habits and real medical and family histories of individuals to increase the efficiency of the evidential reasoning model. Next steps also involve trying out different machine learning models to observe their performance on the dataset considered in this study.

    Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine

    Despite the tremendous success, pitfalls have been observed in every step of a clinical metabolomics workflow, which impede the internal validity of the study. Furthermore, the demand for logistics, instrumentation, and computational resources for metabolic phenotyping studies has far exceeded our expectations. In this conceptual review, we will cover inclusive barriers of a metabolomics-based clinical study and suggest potential solutions in the hope of enhancing study robustness, usability, and transferability. The importance of quality assurance and quality control procedures is discussed, followed by a practical rule containing five phases, including two additional "pre-pre-" and "post-post-" analytical steps. In addition, we will elucidate the potential involvement of machine learning and demonstrate that the need for automated data mining algorithms to improve the quality of future research is undeniable. Consequently, we propose a comprehensive metabolomics framework, along with an appropriate checklist refined from current guidelines and our previously published assessment, in an attempt to accurately translate achievements in metabolomics into clinical and epidemiological research. Furthermore, the integration of multifaceted multi-omics approaches, with metabolomics as the pillar member, is urgently needed. When these are combined with other social or nutritional factors, we can gather complete omics profiles for a particular disease. Our discussion reflects the current obstacles and potential solutions toward the progressing trend of utilizing metabolomics in clinical research to create the next-generation healthcare system.

    Mining Oncology Data: Knowledge Discovery in Clinical Performance of Cancer Patients

    Our goal in this research is twofold: to develop clinical performance databases of cancer patients, and to conduct data mining and machine learning studies on collected patient records. We use these studies to develop models for predicting cancer patient medical outcomes. The clinical database is developed in conjunction with surgeons and oncologists at UMass Memorial Hospital. Aspects of the database design and representation of patient narrative are discussed here. Current predictive model design in medical literature is dominated by linear and logistic regression techniques. We seek to show that novel machine learning methods can perform as well as or better than these traditional techniques. Our machine learning focus for this thesis is on pancreatic cancer patients. Classification and regression prediction targets include patient survival, wellbeing scores, and disease characteristics. Information research in oncology is often constrained by type variation, missing attributes, high dimensionality, skewed class distribution, and small data sets. We compensate for these difficulties using preprocessing, meta-learning, and other algorithmic methods during data analysis. The predictive accuracy and regression error of various machine learning models are presented as results, as are t-tests comparing these to the accuracy of traditional regression methods. In most cases, it is shown that the novel machine learning prediction methods offer comparable or superior performance. We conclude with an analysis of results and discussion of future research possibilities.

    Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies.

    OBJECTIVE: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning. MATERIALS AND METHODS: We projected the input diagnoses data in the Cerner HealthFacts database to Unified Medical Language System (UMLS) and 5 other terminologies, including CCS, CCSR, ICD-9, ICD-10, and PheWAS, and evaluated the prediction performances of these terminologies on 2 different tasks: the risk prediction of heart failure in diabetes patients and the risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network. RESULTS: For logistic regression, using UMLS delivered the optimal area under the receiver operating characteristics (AUROC) results in both the diabetes heart failure (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for diabetes heart failure prediction. DISCUSSION/CONCLUSION: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performances. In particular, UMLS is consistently 1 of the best-performing ones. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.