218 research outputs found

    Feature Selection Method using Genetic Algorithm for Medical Dataset

    High-dimensional data are pervasive in the healthcare domain. Interpreting these data remains a challenging problem and an active research area because of their high dimensionality and low sample size. These properties make it difficult for existing classification methods to achieve high accuracy. An effective feature selection method is therefore important for correctly classifying different diseases and, consequently, for helping medical practitioners. The methodology for this paper is adapted from the Knowledge Discovery in Databases (KDD) method. In this work, a wrapper-based feature selection method using the Genetic Algorithm (GA) is proposed, with a classifier based on the Support Vector Machine (SVM). The proposed algorithm was tested on five medical datasets, namely Breast Cancer, Parkinson's, Heart Disease, Statlog (Heart), and Hepatitis. Applying GA as the feature selection method yielded competitive results on most of the datasets. The accuracies on these datasets are as follows: Breast Cancer: 72.71%, Parkinson's: 88.36%, Heart Disease: 86.73%, Statlog (Heart): 85.48%, and Hepatitis: 76.95%. This prediction method with GA as feature selection will help medical practitioners to better diagnose patients' diseases.
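The wrapper idea in this abstract (a GA searches over feature subsets, and each subset is scored by training and evaluating a classifier on it) can be illustrated with a minimal sketch. This is not the authors' implementation. To keep the example dependency-free, a nearest-centroid classifier with leave-one-out accuracy stands in for the SVM, the data are synthetic, and every name and parameter (`make_toy_data`, `loo_accuracy`, `ga_select`, population size, mutation rate) is an assumption made for illustration.

```python
import random


def make_toy_data(n_samples=40, n_informative=3, n_noise=7, seed=0):
    """Synthetic two-class data: informative features shift with the label, noise features do not."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(1.5 * label, 1.0) for _ in range(n_informative)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y


def loo_accuracy(mask, X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    totals = {0: [0.0] * len(cols), 1: [0.0] * len(cols)}
    counts = {0: 0, 1: 0}
    for row, label in zip(X, y):
        counts[label] += 1
        for c, j in enumerate(cols):
            totals[label][c] += row[j]
    correct = 0
    for row, label in zip(X, y):
        best_label, best_dist = None, float("inf")
        for cls in (0, 1):
            held_out = 1 if cls == label else 0  # remove this row from its own centroid
            n_cls = counts[cls] - held_out
            dist = 0.0
            for c, j in enumerate(cols):
                centroid = (totals[cls][c] - held_out * row[j]) / n_cls
                dist += (row[j] - centroid) ** 2
            if dist < best_dist:
                best_label, best_dist = cls, dist
        correct += best_label == label
    return correct / len(X)


def ga_select(X, y, pop_size=12, generations=10, mutation_rate=0.1, seed=1):
    """Search binary feature masks with elitism, tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = loo_accuracy(mask, X, y)
        return cache[key]

    # Seed the population with the all-features mask plus random masks.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = [population[0][:], population[1][:]]  # elitism: keep the two best masks
        while len(next_gen) < pop_size:
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, score = ga_select(X, y)
    print("selected mask:", mask, "LOO accuracy:", round(score, 3))
```

Because the all-features mask is in the initial population and the best masks are carried over each generation, the returned subset never scores worse than using every feature under this fitness function. Scoring subsets on the same data used to pick them is optimistic, so a held-out test set would still be needed for an honest accuracy estimate.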

    Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection

    Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schönhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinformatics. 2019;20(1): 480

    Machine Learning Applied to Raman Spectroscopy to Classify Cancers

    Cancer diagnosis is notoriously difficult, as is evident in the inter-rater variability between histopathologists classifying cancerous sub-types. Although there are many cancer pathologies, they have in common that earlier diagnosis would maximise treatment potential. To reduce this variability and expedite diagnosis, there has been a drive to arm histopathologists with additional tools. One such tool is Raman spectroscopy, which has demonstrated potential in distinguishing between various cancer types. However, Raman data has high dimensionality and often contains artefacts; together with the challenges inherent to medical data, this can frustrate classification attempts. Deep learning has recently emerged with the promise of unlocking many complex datasets, but it is not clear how this modelling paradigm can best exploit Raman data for cancer diagnosis. Three Raman oncology datasets (from ovarian, colonic and oesophageal tissue) were used to examine various methodological challenges to machine learning applied to Raman data, in conjunction with a thorough review of the recent literature. Performance on each dataset is assessed with two traditional models and one deep learning model. A technique is then applied to the deep learning model to aid interpretability and relate biochemical antecedents to disease classes. In addition, a clinical problem for each dataset was addressed, including the transferability of models developed using multi-centre Raman data taken on different spectrometers of the same make. Many subtleties of data processing were found to be important to the realistic assessment of machine learning models. In particular, appropriate cross-validation during hyperparameter selection, splitting data into training and test sets according to the inherent structure of biomedical data, and addressing the number of samples per disease class are all found to be important factors. Additionally, it was found that instrument correction was not needed to ensure system transferability if Raman data is collected with a common protocol on spectrometers of the same make.

    A screened predictive model for esophageal squamous cell carcinoma based on salivary flora data

    Esophageal squamous cell carcinoma (ESCC) is a malignant tumor of the digestive system arising in the esophageal squamous epithelium. Many studies have linked esophageal cancer (EC) to an imbalance of oral microecology. In this work, different machine learning (ML) models, including Random Forest (RF), Gaussian mixture model (GMM), K-nearest neighbor (KNN), logistic regression (LR), support vector machine (SVM) and extreme gradient boosting (XGBoost), based on Genetic Algorithm (GA) optimization were developed to predict the relationship between salivary flora and ESCC by combining the relative abundance data of Bacteroides, Firmicutes, Proteobacteria, Fusobacteria and Actinobacteria in the saliva of patients with ESCC and healthy controls. The results showed that the XGBoost model without parameter optimization performed best on the entire dataset for ESCC diagnosis by cross-validation (Accuracy = 73.50%). Accuracy and the other evaluation indicators, including Precision, Recall, F1-score and the area under the curve (AUC) of the receiver operating characteristic (ROC), revealed that XGBoost optimized by the GA (GA-XGBoost) achieved the best outcome on the testing set (Accuracy = 89.88%, Precision = 89.43%, Recall = 90.75%, F1-score = 90.09%, AUC = 0.97). The predictive ability of GA-XGBoost was validated on phylum-level salivary microbiota data from ESCC patients and controls in an external cohort. The results obtained in this validation (Accuracy = 70.60%, Precision = 46.00%, Recall = 90.55%, F1-score = 61.01%) illustrate the reliability of the predictive performance of the model. The feature importance rankings obtained by XGBoost indicate that Bacteroides and Actinobacteria are the two most important factors in predicting ESCC. Based on these results, GA-XGBoost can predict and diagnose ESCC according to the relative abundance of salivary flora, providing an effective tool for the non-invasive prediction of esophageal malignancies.
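The F1-scores quoted in this abstract can be checked against the quoted precision and recall, since F1 is their harmonic mean. The short sketch below does only that arithmetic. The function name `f1_score` and the dictionary layout are illustrative choices, and the numbers are copied from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in the abstract
reported = {
    "testing set": (89.43, 90.75, 90.09),
    "external cohort": (46.00, 90.55, 61.01),
}

for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}%, reported F1 = {f1_reported:.2f}%")
```

Both reported F1 values agree with the harmonic mean of the reported precision and recall to two decimal places. The drop in F1 on the external cohort comes almost entirely from the lower precision (46.00%), since recall stays near 90%.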

    Depression Episodes Detection in Unipolar and Bipolar Patients: A Methodology with Feature Extraction and Feature Selection with Genetic Algorithms Using Activity Motion Signal 


    Depression is a mental disorder which typically includes recurrent sadness and loss of interest in the enjoyment of the positive aspects of life and, in severe cases, fatigue, causing an inability to perform daily activities and leading to a progressive loss of quality of life. Monitoring the status of depression (in unipolar and bipolar patients) relies on traditional reports from patients; however, bias is commonly present, given the patients' interpretation of their experiences. To overcome this problem, Ecological Momentary Assessment (EMA) reports have been proposed and widely used. These reports include data on behaviour, feelings, and other types of activities recorded almost in real time using different types of portable devices, which nowadays include smartphones and other wearables such as smartwatches. This study proposes a methodology to detect depressive patients from the motion data generated by patient activity, recorded with a smartband and obtained from the "Depresjon" database. Using this signal as the information source, statistical features are extracted from the time and spectral evolution of the signal. Subsequently, feature selection with a genetic algorithm is applied to reduce the amount of information required to give a fast, noninvasive diagnosis. Results show that the feature extraction approach can achieve an area under the curve (AUC) of 0.734, and after applying the feature selection approach, a model comprising two features from the motion signal can achieve an AUC of 0.647. These results allow us to conclude that using the activity signal from a smartband, it is possibl

    Identifying individuals at-risk of developing oesophageal adenocarcinoma through symptom, risk factor and salivary biomarker analysis

    Background: Oesophageal adenocarcinoma (OAC) carries a grave prognosis. Existing early detection strategies are flawed, predominantly because of reliance upon symptoms known to occur late, when the disease is often incurable. Detection of individuals with Barrett's Oesophagus (BO), a known pre-malignant condition, is problematic, and the vast majority will not develop OAC. Aim: To explore novel methods of identifying patients with or at risk of OAC through machine learning (ML) techniques and biomarker identification. Materials and Methods: Initial work utilised novel ML on two existing patient symptom and risk factor questionnaire datasets. Additionally, targeted expression analysis was performed to establish whether transcriptomic biomarkers were present in the blood and saliva of affected patients. Optimal RNA extraction techniques and saliva collection strategies for RNA of sufficient quality and quantity were determined. Whole mRNA sequencing was performed on patient salivary RNA to identify biomarkers for future assessment. Epigenetic analysis was performed on salivary DNA to identify biomarkers. ML techniques analysed these data to derive a risk prediction tool. Results: ML techniques on questionnaire data produced satisfactory sensitivity (90%), but accuracy was not appropriate for population screening (AUC 0.77). Blood and saliva extraction and collection methods were established, and samples were found to contain biomarkers. Targeted transcriptomic expression analysis demonstrated that 12 of 22 tested genes were significantly aberrantly expressed in patients. Five genes, combined with six questionnaire data points, identified those with or at risk of OAC with 93% sensitivity (AUC 0.88). Whole mRNA sequencing identified a further 134 genes implicated in OAC pathogenesis that require future testing. Epigenetic analysis found 25 differentially methylated regions that, when combined, identified those with or at risk of OAC with 99.9% accuracy. Conclusion: Utilisation of salivary biomarkers is a potentially effective means to identify individuals with or at risk of OAC. Further work exploring the transcriptomic and epigenetic data established in this thesis should be performed.

    Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data

    Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. To address this issue, we designed a machine learning-based algorithm to identify these artefacts using data from >1600000 variants from 27 paired FFPE and fresh-frozen breast cancer samples. Using these data, we assembled a series of variant features and evaluated the classification performance of five machine learning algorithms. Using leave-one-sample-out cross-validation, we found that XGBoost (extreme gradient boosting) and random forest obtained AUC (area under the receiver operating characteristic curve) values >0.86. Performance was further tested using two independent datasets, which resulted in AUC values of 0.96, whereas a comparison with previously published tools resulted in a maximum AUC value of 0.92. The most discriminating features were read pair orientation bias, genomic context and variant allele frequency. In summary, our results show a promising future for the use of these samples in molecular testing.
We built the algorithm into an R package called Ideafix (DEAmination FIXing) that is freely available at https://github.com/mmaitenat/ideafix. Funding: Departamento de Educación, Universidades e Investigación of the Basque Government [PRE 2019 2 0211 to M.T.A]; Ikerbasque, Basque Foundation for Science [to C.L.]; Starmer–Smith Memorial Fund [to C.L.]; Ministerio de Economía, Industria y Competitividad (MINECO) of the Spanish Central Government [to C.L., PID2019-104933GB-10 to B.C.]; ISCIII and FEDER Funds [PI12/00663, PIE13/00048, DTS14/00109, PI15/00275 and PI18/01710 to C.L.]; Departamento de Desarrollo Económico y Competitividad and Departamento de Sanidad of the Basque Government [to C.L.]; Asociación Española Contra el Cáncer (AECC) [to C.L.]; Diputación Foral de Guipuzcoa (DFG) [to C.L.]; Departamento de Industria of the Basque Government [ELKARTEK Programme, project code: KK-2018/00038 to C.L., ELKARTEK Programme, project code: KK-2020/00049 to B.C., IT-1244-19 to B.C.
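The leave-one-sample-out cross-validation mentioned in the Ideafix abstract holds out all variants from one biological sample per fold, so that variants from the same sample never appear in both training and test data. The sketch below shows only that splitting logic. The function name `leave_one_sample_out` and the sample identifiers are hypothetical and are not taken from the Ideafix package.

```python
def leave_one_sample_out(sample_ids):
    """Yield (held_out_id, train_indices, test_indices), holding out one sample per fold."""
    for held_out in sorted(set(sample_ids)):
        train = [i for i, s in enumerate(sample_ids) if s != held_out]
        test = [i for i, s in enumerate(sample_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical example: six variants drawn from three samples.
variant_sample_ids = ["S1", "S1", "S2", "S3", "S3", "S3"]

for held_out, train, test in leave_one_sample_out(variant_sample_ids):
    print(f"hold out {held_out}: train on rows {train}, test on rows {test}")
```

Splitting by sample rather than by individual variant avoids leakage from sample-specific effects, which would otherwise inflate the estimated AUC.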