218 research outputs found
Feature Selection Method using Genetic Algorithm for Medical Dataset
The healthcare domain produces a massive amount of high-dimensional data. Interpreting these data remains a challenging problem and an active research area because the data are high-dimensional with low sample sizes. These properties make it difficult for existing classification methods to achieve high accuracy. An effective feature selection method is therefore important for improving the classification of different diseases and, consequently, for helping medical practitioners. The methodology for this paper is adapted from the KDD method. In this work, a wrapper-based feature selection method using the Genetic Algorithm (GA) is proposed, with a classifier based on the Support Vector Machine (SVM). The proposed algorithm was tested on five medical datasets: Breast Cancer, Parkinson's, Heart Disease, Statlog (Heart), and Hepatitis. Applying GA as feature selection yielded competitive results on most of the datasets. The accuracies are as follows: Breast Cancer: 72.71%, Parkinson's: 88.36%, Heart Disease: 86.73%, Statlog (Heart): 85.48%, and Hepatitis: 76.95%. This prediction method with GA as feature selection will help medical practitioners diagnose patients' diseases more accurately.
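As a rough illustration of wrapper-based feature selection with a genetic algorithm, the sketch below evolves binary feature masks and scores each mask by the accuracy of a classifier trained on the selected columns. It is not the paper's implementation: a simple nearest-centroid classifier stands in for the SVM so the example needs only the standard library, and the toy data, function names, and GA settings (population size, mutation rate, elitism) are all assumptions made for illustration.

```python
import random


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for k, (row, label) in enumerate(zip(X, y)):
            if k == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += row[j]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            return sum((X[i][j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        predicted = min(sums, key=dist)
        correct += predicted == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Evolve feature masks; fitness is the wrapped classifier's accuracy."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return nearest_centroid_accuracy(X, y, mask)

    # Seed the population with the all-features mask plus random masks.
    population = [tuple([1] * n)]
    population += [tuple(rng.randint(0, 1) for _ in range(n))
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability mutation_rate.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)


def make_toy_data(seed=1, n_per_class=15, n_noise=5):
    """Hypothetical data: feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = rng.gauss(3.0 * label, 1.0)
            noise = [rng.gauss(0.0, 3.0) for _ in range(n_noise)]
            X.append([informative] + noise)
            y.append(label)
    return X, y


X, y = make_toy_data()
best_mask, best_score = ga_select(X, y)
print("selected mask:", best_mask, "leave-one-out accuracy:", best_score)
```

Because the all-features mask is in the initial population and the best mask is carried over each generation, the returned subset never scores worse than using every feature under this fitness function. Scoring the final subset on data held out from the search would give a less optimistic estimate.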
Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection
Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schönhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinformatics. 2019;20(1): 480
Machine Learning Applied to Raman Spectroscopy to Classify Cancers
Cancer diagnosis is notoriously difficult, evident in the inter-rater variability between
histopathologists classifying cancerous sub-types. Although there are many cancer
pathologies, they have in common that earlier diagnosis would maximise treatment
potential. To reduce this variability and expedite diagnosis, there has been a drive to
arm histopathologists with additional tools. One such tool is Raman spectroscopy,
which has demonstrated potential in distinguishing between various cancer types.
However, Raman data has high dimensionality and often contains artefacts; together
with challenges inherent to medical data, this can frustrate classification attempts.
frustrated. Deep learning has recently emerged with the promise of unlocking many
complex datasets, but it is not clear how this modelling paradigm can best exploit
Raman data for cancer diagnosis.
Three Raman oncology datasets (from ovarian, colonic and oesophageal tissue)
were used to examine various methodological challenges to machine learning applied
to Raman data, in conjunction with a thorough review of the recent literature.
Performance on each dataset is assessed with two traditional models and one deep
learning model. A technique is then applied to the deep learning model to aid interpretability
and relate biochemical antecedents to disease classes. In addition, a clinical problem
for each dataset was addressed, including the transferability of models developed
using multi-centre Raman data taken on different spectrometers of the same make.
Many subtleties of data processing were found to be important to the realistic
assessment of machine learning models. In particular, appropriate cross-validation
during hyperparameter selection, splitting data into training and test sets according
to the inherent structure of biomedical data and addressing the number of samples
per disease class are all found to be important factors. Additionally, it was found that
instrument correction was not needed to ensure system transferability if Raman data
is collected with a common protocol on spectrometers of the same make.
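Splitting according to the inherent structure of biomedical data usually means keeping all spectra from one patient (or one tissue sample) on the same side of the train/test divide. The sketch below shows one way to do that with the standard library only. The function name, the `groups` labels, and the 25% test fraction are assumptions for illustration, not details taken from the thesis.

```python
import random


def grouped_train_test_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no group appears in both."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, never individual measurements.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical patient identifier for each spectrum.
groups = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
train_idx, test_idx = grouped_train_test_split(groups)
print("train:", train_idx, "test:", test_idx)
```

A random split over individual spectra would let spectra from the same patient land in both sets, which tends to inflate the measured test performance.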
A screened predictive model for esophageal squamous cell carcinoma based on salivary flora data
Esophageal squamous cell carcinoma (ESCC) is a malignant tumor of the digestive system that arises in the esophageal squamous epithelium. Many studies have linked esophageal cancer (EC) to an imbalance of oral microecology. In this work, different machine learning (ML) models, including Random Forest (RF), Gaussian mixture model (GMM), K-nearest neighbor (KNN), logistic regression (LR), support vector machine (SVM) and extreme gradient boosting (XGBoost), each optimized with a Genetic Algorithm (GA), were developed to predict the relationship between salivary flora and ESCC by combining the relative abundance data of Bacteroides, Firmicutes, Proteobacteria, Fusobacteria and Actinobacteria in the saliva of patients with ESCC and healthy controls. The results showed that the XGBoost model without parameter optimization performed best on the entire dataset for ESCC diagnosis by cross-validation (Accuracy = 73.50%). Accuracy and the other evaluation indicators, including Precision, Recall, F1-score and the area under curve (AUC) of the receiver operating characteristic (ROC), revealed that XGBoost optimized by the GA (GA-XGBoost) achieved the best outcome on the testing set (Accuracy = 89.88%, Precision = 89.43%, Recall = 90.75%, F1-score = 90.09%, AUC = 0.97). The predictive ability of GA-XGBoost was validated on phylum-level salivary microbiota data from ESCC patients and controls in an external cohort. The results obtained in this validation (Accuracy = 70.60%, Precision = 46.00%, Recall = 90.55%, F1-score = 61.01%) illustrate the reliability of the predictive performance of the model. The feature importance rankings obtained by XGBoost indicate that Bacteroides and Actinobacteria are the two most important factors in predicting ESCC. Based on these results, GA-XGBoost can predict and diagnose ESCC according to the relative abundance of salivary flora, providing an effective tool for the non-invasive prediction of esophageal malignancies.
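The reported F1-scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below recomputes it for both the testing set and the external cohort using only the figures quoted in the abstract. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
reported = {
    "testing set": (0.8943, 0.9075, 0.9009),
    "external cohort": (0.4600, 0.9055, 0.6101),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported F1 = {reported_f1:.4f}")
```

Both recomputed values agree with the reported ones to within rounding. The external-cohort figures also show how a high recall can coexist with a low precision, which is what pulls that F1-score down.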
Depression Episodes Detection in Unipolar and Bipolar Patients: A Methodology with Feature Extraction and Feature Selection with Genetic Algorithms Using Activity Motion Signal …
Depression is a mental disorder which typically includes recurrent sadness and loss of interest in the enjoyment of the positive aspects of life, and in severe cases fatigue, causing inability to perform daily activities and leading to a progressive loss of quality of life. Monitoring depression states (in unipolar and bipolar patients) relies on traditional reports from patients; however, bias is commonly present, given the patients' interpretation of their experiences. To overcome this problem, Ecological Momentary Assessment (EMA) reports have been proposed and widely used. These reports include data on behaviour, feelings, and other types of activity recorded almost in real time using different types of portable devices, which nowadays include smartphones and other wearables such as smartwatches. This study proposes a methodology to detect depressive patients from the motion data generated by patient activity, recorded with a smartband and obtained from the "Depresjon" database. Using this signal as the information source, statistical features are extracted from the signal in the time and spectral domains. Subsequently, feature selection with a genetic algorithm is applied to reduce the amount of information required to give a fast, noninvasive diagnosis. Results show that the feature extraction approach can achieve an area under the curve (AUC) of 0.734, and after applying the feature selection approach, a model comprising two features from the motion signal can achieve an AUC of 0.647. These results allow us to conclude that using the activity signal from a smartband, it is possibl
Identifying individuals at-risk of developing oesophageal adenocarcinoma through symptom, risk factor and salivary biomarker analysis
Background: Oesophageal adenocarcinoma (OAC) carries a grave prognosis. Existing early detection strategies are flawed, predominantly because they rely upon symptoms known to occur late, when the disease is often incurable. Detection of individuals with Barrett's Oesophagus (BO), a known pre-malignant condition, is problematic and the vast majority will not develop OAC. Aim: To explore novel methods of identifying patients with or at risk of OAC through machine learning (ML) techniques and biomarker identification. Materials and Methods: Initial work utilised novel ML on two existing patient symptom and risk factor questionnaire datasets. Additionally, targeted expression analysis was performed to establish whether transcriptomic biomarkers were present in the blood and saliva of affected patients. Optimal RNA extraction techniques and saliva collection strategies for RNA of sufficient quality and quantity were determined. Whole mRNA sequencing was performed on patient salivary RNA to identify biomarkers for future assessment. Epigenetic analysis was performed on salivary DNA to identify biomarkers. ML techniques analysed these data to derive a risk prediction tool. Results: ML techniques on questionnaire data produced satisfactory sensitivity (90%), but accuracy not appropriate for population screening (AUC 0.77). Blood and saliva extraction and collection methods were established and samples were found to contain biomarkers. Targeted transcriptomic expression analysis demonstrated that 12 of 22 tested genes were significantly aberrantly expressed in patients. Five genes, combined with 6 questionnaire data-points, identified those with or at risk of OAC with 93% sensitivity (AUC 0.88). Whole mRNA sequencing identified a further 134 genes implicated in OAC pathogenesis requiring future testing. Epigenetic analysis found 25 differentially methylated regions that, when combined, identified those with or at risk of OAC with 99.9% accuracy.
Conclusion: Utilisation of salivary biomarkers is a potentially effective means to identify individuals with or at risk of OAC. Further work exploring transcriptomic and epigenetic data established in this thesis should be performed.
Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data
Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. In order to address this issue, we designed a machine learning-based algorithm to identify these artefacts using data from >1600000 variants from 27 paired FFPE and fresh-frozen breast cancer samples. Using these data, we assembled a series of variant features and evaluated the classification performance of five machine learning algorithms. Using leave-one-sample-out cross-validation, we found that XGBoost (extreme gradient boosting) and random forest obtained AUC (area under the receiver operating characteristic curve) values >0.86. Performance was further tested using two independent datasets that resulted in AUC values of 0.96, whereas a comparison with previously published tools resulted in a maximum AUC value of 0.92. The most discriminating features were read pair orientation bias, genomic context and variant allele frequency. In summary, our results show a promising future for the use of these samples in molecular testing. We built the algorithm into an R package called Ideafix (DEAmination FIXing) that is freely available at https://github.com/mmaitenat/ideafix.
Departamento de Educación, Universidades e Investigación of the Basque Government [PRE 2019 2 0211 to M.T.A]; Ikerbasque, Basque Foundation for Science [to C.L.]; Starmer-Smith Memorial Fund [to C.L.]; Ministerio de Economía, Industria y Competitividad (MINECO) of the Spanish Central Government [to C.L., PID2019-104933GB-10 to B.C.]; ISCIII and FEDER Funds [PI12/00663, PIE13/00048, DTS14/00109, PI15/00275 and PI18/01710 to C.L.]; Departamento de Desarrollo Económico y Competitividad and Departamento de Sanidad of the Basque Government [to C.L.]; Asociación Española Contra el Cancer (AECC) [to C.L.]; Diputación Foral de Guipuzcoa (DFG) [to C.L.]; Departamento de Industria of the Basque Government [ELKARTEK Programme, project code: KK-2018/00038 to C.L., ELKARTEK Programme, project code: KK-2020/00049 to B.C., IT-1244-19 to B.C.