Search CORE

1,287 research outputs found

Survival Prediction from Imbalance colorectal cancer dataset using hybrid sampling methods and tree-based classifiers

Author: Bahrami Mahsa
Soleimani Sadegh
Vali Mansour
Publication date: 04/09/2023

Background and Objective: Colorectal cancer is a high-mortality cancer. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, utilizing clinical data can be challenging, especially when dealing with imbalanced outcomes. This paper focuses on developing algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients using clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. To address this issue, we propose a method that chains standard balancing techniques into a pipeline to increase the true positive rate. Evaluation is conducted on a colorectal cancer dataset from the SEER database. Methods: The pre-processing step consists of removing records with missing values and merging categories. The minority class makes up 10% of the data for the 1-year survival task and 20% for the 3-year task. Edited Nearest Neighbor (ENN), Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines of SMOTE and RENN were used and compared for balancing the data with tree-based classifiers: Decision Trees, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting (LGBM). Results: The performance evaluation uses 5-fold cross-validation. On the highly imbalanced dataset (1-year survival), our proposed method with LGBM outperforms the other sampling methods, with a sensitivity of 72.30%. On the imbalanced dataset (3-year survival), the combination of RENN and LGBM achieves a sensitivity of 80.81%, indicating that our proposed method works best for highly imbalanced datasets. Conclusions: Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients. Comment: 19 pages, 6 figures, 4 tables.
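The hybrid sampling idea named in this abstract (SMOTE over-sampling combined with RENN cleaning) can be illustrated with a minimal pure-Python sketch. The order of the two steps, the value of k, the toy data, and every function name below are assumptions made for illustration; the abstract does not give the authors' implementation, and this ENN variant removes disagreeing samples from any class.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(i, candidates, X, k):
    """Indices of the k candidates closest to X[i], excluding i itself."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: sq_dist(X[i], X[j]))[:k]

def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority points, each interpolated between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    idx = [i for i, label in enumerate(y) if label == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(idx)
        j = rng.choice(nearest(i, idx, X, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out

def enn(X, y, k=3):
    """Edited Nearest Neighbours: drop every sample whose label disagrees
    with the majority label of its k nearest neighbours."""
    keep = []
    everyone = range(len(X))
    for i in everyone:
        votes = Counter(y[j] for j in nearest(i, everyone, X, k))
        if not votes or votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def renn(X, y, k=3, max_iter=100):
    """Repeated ENN: apply ENN until no further sample is removed."""
    for _ in range(max_iter):
        X_new, y_new = enn(X, y, k)
        if len(X_new) == len(X):
            break
        X, y = X_new, y_new
    return X, y

# Toy data (hypothetical): class 0 is the majority, class 1 the minority,
# and the last class-0 point sits inside the minority cluster as noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (5, 5), (5, 6), (6, 5), (5.2, 5.2)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

Xs, ys = smote(X, y, minority=1, n_new=3)   # over-sample the minority
Xr, yr = renn(Xs, ys)                       # then clean noisy samples
print(Counter(ys), Counter(yr))
```

In this toy run the over-sampling step doubles the minority class and the cleaning step removes the single majority point lying inside the minority cluster. A tree-based classifier would then be trained on the resampled data, with resampling applied only to the training folds during cross-validation.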

arXiv.org e-Print Archive

A machine learning platform to optimize the translation of personalized network models to the clinic

Author: Gallagher William M.
Johnston Patrick G.
Kay Elaine W.
Laurent-Puig Pierre
Lawler Mark
Longley Daniel B.
McNamara Deborah A.
Prehn Jochen H.M.
Rafferty Mairin
Rahman Arman
Rehm Markus
Resler Alexa J.
Salto-Tellez Manuel
Salvucci Manuela
Udupi Girish M.
Van Schaeybroeck Sandra
Wilson Richard
Publication venue: 'American Society of Clinical Oncology (ASCO)'
Publication date: 01/04/2019

PURPOSE Dynamic network models predict clinical prognosis and inform therapeutic intervention by elucidating disease-driven aberrations at the systems level. However, the personalization of model predictions requires the profiling of multiple model inputs, which hampers clinical translation. PATIENTS AND METHODS We applied APOPTO-CELL, a prognostic model of apoptosis signaling, to showcase the establishment of computational platforms that require a reduced set of inputs. We designed two distinct and complementary pipelines: a probabilistic approach to exploit a consistent subpanel of inputs across the whole cohort (Ensemble) and a machine learning approach to identify a reduced protein set tailored for individual patients (Tree). Development was performed on a virtual cohort of 3,200,000 patients, with inputs estimated from clinically relevant protein profiles. Validation was carried out in an in-house stage III colorectal cancer cohort, with inputs profiled in surgical resections by reverse phase protein array (n = 120) and/or immunohistochemistry (n = 117). RESULTS Ensemble and Tree reproduced APOPTO-CELL predictions in the virtual patient cohort with 92% and 99% accuracy while decreasing the number of inputs to a consistent subset of three proteins (40% reduction) or a personalized subset of 2.7 proteins on average (46% reduction), respectively. Ensemble and Tree retained prognostic utility in the in-house colorectal cancer cohort. The association between the Ensemble accuracy and prognostic value (Spearman ρ = 0.43; P = .02) provided a rationale to optimize the input composition for specific clinical settings. Comparison between profiling by reverse phase protein array (gold standard) and immunohistochemistry (clinical routine) revealed that the latter is a suitable technology to quantify model inputs. 
CONCLUSION This study provides a generalizable framework to optimize the development of network-based prognostic assays and, ultimately, to facilitate their integration in the routine clinical workflow.

Queen's University Belfast Research Portal

Enlighten

Institute of Cancer Research Repository

RCSI Repository

Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations

Author: Odeyemi Oluyemi
Publication venue: Chapman University Digital Commons
Publication date: 01/05/2020

Cancer arises from the accumulation of somatic mutations and genetic alterations in cell division checkpoints and apoptosis; this often leads to abnormal tumor proliferation. Proper classification of cancer-linked driver mutations will considerably help our understanding of the molecular dynamics of cancer. In this study, we compared several cancer-specific predictive models for prediction of driver mutations in cancer-linked genes that were validated on canonical data sets of functionally validated mutations and applied to raw cancer genomics data. By analyzing pathogenicity prediction and conservation scores, we have shown that evolutionary conservation scores play a pivotal role in the classification of cancer drivers and were the most informative features in the driver mutation classification. Through extensive comparative analysis with structure-functional experiments and multicenter mutational calling data from PanCancer Atlas studies, we have demonstrated the robustness of our models and addressed the validity of computational predictions. We evaluated the performance of our models using standard diagnostic metrics such as sensitivity, specificity, area under the curve, and F-measure. To address the interpretability of cancer-specific classification models and obtain novel insights about molecular signatures of driver mutations, we have complemented machine learning predictions with structure-functional analysis of cancer driver mutations in several key tumor suppressor genes and oncogenes. Through the experiments carried out in this study, we found that evolutionary-based features have the strongest signal in the machine learning classification of driver mutations and provide orthogonal information to the ensemble-based scores that are prominent in the ranking of feature importance.
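The diagnostic metrics listed in this abstract (sensitivity, specificity, area under the curve, F-measure) can be computed as in the following sketch. The labels, scores, and the 0.5 decision threshold are hypothetical values chosen for illustration, with 1 standing for a driver mutation and 0 for a passenger; none of them come from the study.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """True positive rate: share of real positives that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of real negatives that were rejected."""
    return tn / (tn + fp)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = driver, 0 = passenger) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), f_measure(tp, fp, fn),
      auc(y_true, scores))
```

Sensitivity, specificity, and F-measure depend on the chosen threshold, whereas the area under the curve summarizes ranking quality across all thresholds, which is one reason the four are usually reported together.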

Chapman University Digital Commons

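Prediction of Quality of Life in Cancer Survivors and Survival in Advanced Cancer Patients Using Clustering of Self-Management Strategies and Machine Learning Techniques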

Author: 정주연
Publication venue: 서울대학교 대학원
Publication date: 01/02/2023

Doctoral thesis -- 서울대학교 대학원 (Seoul National University Graduate School): College of Medicine, Department of Biomedical Sciences, 2023. 2. 윤영호.

Background: In cancer care, self-management strategies can help cancer patients improve their health-related quality of life (HRQoL) or survival, irrespective of the cancer stage or their treatment plan. However, there is insufficient research on the clustering of self-management strategies considering cancer stages in natural clinical settings, and prediction models of HRQoL or survival in cancer patients are also under-researched. In addition, research that comprehensively identifies the relationship between self-management strategies, HRQoL, and survival is still lacking. Hence, we investigated their relationship using clustering methods, machine learning techniques (MLT), and path analysis of structural equation modeling (SEM). Methods: In cancer survivors, cluster analyses using principal component analysis with varimax rotation and k-means clustering were conducted to examine the interrelationship among self-management strategies in the Smart Management Strategies for Health Assessment Tool (SAT). Multivariate-adjusted analyses were performed to identify the association of self-management strategies with HRQoL after 6 months. We constructed the HRQoL prediction model and compared the performance of the model across algorithms including decision tree, random forest, gradient boosting, eXtreme Gradient Boost (XGBoost), and LightGBM. Next, we selected the XGBoost model for further analysis. We identified critical features of HRQoL and extracted individual prediction results from the XGBoost model using SHAP. In advanced cancer patients, self-management clustering and multivariate-adjusted analyses examining the association of the strategies with HRQoL were conducted in the same way as in cancer survivors. We performed dimensional multiple Cox proportional hazards regression analyses to determine critical predictors of 1-year survival.
We then built a survival prediction model with XGBoost from the critical predictors identified by the Cox regression model. To examine the causal relationship among SAT strategies, HRQoL, and survival, we used a subgroup analysis and a path analysis of structural equation modeling. Results: Both cancer survivors and advanced cancer patients showed two concurrent clusters of self-management strategies. However, the strategy clusters differed by cancer stage. Advanced-stage cancer patients used core strategies along with preparation and implementation strategies to overcome their crisis. Among all cancer patients, including advanced cancer patients, the self-management strategies had a positive association with improved HRQoL. In the prediction model development, the XGBoost model for HRQoL showed high performance in cancer survivors. The important variables differed for each HRQoL factor. Moreover, we proposed a specific method to provide customized healthcare services by combining the individual prediction method using SHAP with a web-based survey study for cancer survivors. In advanced cancer patients, the univariate dimensional Cox model showed that ECOG performance status, marital status, sex, global QoL, dyspnea, pain, appetite loss, constipation, depression at baseline, and clinically meaningful change of emotional functioning were predictive of worse survival. In the prediction model using MLT, the XGBoost model of survival showed high performance. Performance was best when the model was constructed by combining variables selected by the Cox model and MLT methods: depression, pain, appetite loss, constipation, sex, ECOG performance status, and clinically meaningful change in emotional functioning. We also revealed a causal relationship among SAT strategies, depression, and survival in advanced cancer patients using path analysis.
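Conclusions: This study is the first to examine self-management strategy clusters considering cancer stages and different groups of cancer patients, such as cancer survivors and advanced cancer patients. To our knowledge, this is the first study to have developed and validated HRQoL prediction models, interpreted the models, and suggested how to use these results in a clinical setting for cancer survivors. Additionally, we revealed an association of self-management strategies with HRQoL and survival in advanced cancer patients using MLT methods and path analysis. These study results can increase the understanding of self-management strategies and help healthcare providers with healthcare services for cancer patients in the cancer-care continuum.

Abstract (translated from the Korean): Background: Across the cancer-care continuum, self-management strategies can help improve cancer patients' health-related quality of life or survival, regardless of cancer stage or treatment plan. However, research on how self-management strategies cluster by cancer stage in real clinical settings, and prediction models of HRQoL or survival in cancer patients, are lacking. Moreover, no study has yet comprehensively examined the relationship among self-management strategies, HRQoL, and survival in cancer patients. Therefore, this study aimed to clarify the relationship among self-management strategies, HRQoL, and survival in cancer patients using statistical clustering methods, machine learning techniques, and path analysis of structural equation models. Methods: In cancer survivors, self-management strategies were measured with the newly developed Smart Management Strategies for Health Assessment Tool (SAT), and cluster analysis using principal component analysis and k-means clustering was performed to examine the interrelationships among SAT strategies. Multivariate analysis was also performed to identify the association between SAT strategies and HRQoL after 6 months. To develop and validate an HRQoL prediction model for cancer survivors, prediction models were constructed and their performance was compared using the ensemble algorithms decision tree, random forest, gradient boosting, XGBoost, and LightGBM. After model comparison, the XGBoost model was selected for further analysis, and feature importance and individual prediction analyses were performed using SHAP to find the important variables of the XGBoost HRQoL prediction model. In advanced cancer patients, the clustering and multivariate analysis methods used to examine the association between HRQoL and SAT strategies were the same as those used in cancer survivors. To develop the survival prediction model, dimensional multiple Cox proportional hazards regression analyses were performed using conventional statistical analysis, and a survival prediction model was developed with the XGBoost machine learning method. Prediction models were built separately from the variables selected by traditional statistical methods, the variables selected by machine learning techniques, and the combination of variables selected by both, and their performance was compared. Path analysis using structural equation modeling was also used to examine the causal relationship among SAT strategies, HRQoL, and survival.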
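Results: The clustering of SAT strategies in cancer survivors and advanced cancer patients differed by cancer stage. Compared with early-stage patients, middle- to late-stage cancer patients used core strategies, which are important at every stage regardless of treatment phase or cancer stage, together with preparation and implementation strategies to overcome their crisis. These SAT strategies were positively associated with improved HRQoL in all cancer patients, including advanced cancer patients. The machine learning HRQoL prediction model showed high predictive performance in cancer survivors. However, the important variables differed for each HRQoL factor. By combining a web-based survey study of cancer survivors with the individual prediction method using SHAP, this study also concretely proposed a way to provide personalized healthcare services to cancer survivors. In advanced cancer patients, the dimension-wise univariate Cox models identified ECOG performance status, sex, marital status, lower global quality of life at diagnosis, dyspnea, pain, appetite loss, constipation, depression, and clinically meaningful changes in emotional functioning and social support over 12 weeks as factors associated with worse survival. The machine learning prediction model also showed high survival prediction performance, and BorutaSHAP selected depression, pain, appetite loss, constipation, and sex as important factors associated with survival. The survival prediction model performed best when built from the combination of variables selected by traditional statistical methods and machine learning techniques. Path analysis revealed a causal relationship among SAT strategies, depression, and survival, with SAT strategies having an indirect effect on survival fully mediated by depression. Conclusions: This study is the first to attempt a cluster analysis of self-management strategy use that considers cancer stage and includes both cancer survivors and advanced cancer patients. It is also the first to develop and validate a simple model predicting HRQoL, which is important to cancer survivors, to interpret the model using explainable artificial intelligence algorithms, and to propose ways of using the results in clinical settings for cancer survivors. Using machine learning techniques and path analysis, it further found direct and indirect positive associations between self-management strategies and both HRQoL and survival in advanced cancer patients. These results show that the newly developed SAT self-management strategies can serve as a useful intervention tool for cancer patients in clinical settings. Overall, the study broadens understanding of cancer patients' use of self-management strategies and their effectiveness, and presents comprehensive results and clinical applications showing how healthcare providers can use self-management strategies to deliver helpful healthcare services to cancer patients across the cancer-care continuum.

Contents: Chapter 1. Introduction (1.1. Study Background; 1.2. Literature Review; 1.3. Research Objectives and Hypothesis; 1.4. Definition of cancer survivors and advanced cancer patients in this study). Chapter 2. Methods (2.1. Study Design; 2.2. Study Participants; 2.3. Measurements; 2.4. Statistical Methods). Chapter 3. Results (3.1. Study Participants' characteristics; 3.2. Self-management clustering results; 3.3. The association of self-management clustering with HRQoL; 3.4. HRQoL prediction model development and validation; 3.5. Survival prediction model development and validation; 3.6. Causal relationship among SAT, HRQoL, and Survival). Chapter 4. Discussion. Chapter 5. Conclusion. Bibliography. Abstract in Korean. Supplementary Information.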

SNU Open Repository and Archive

Developing an individualized survival prediction model for rectal cancer

Author: Neves José
Novais Paulo
Oliveira T.
Silva Ana
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

This work presents a survivability prediction model for rectal cancer patients developed through machine learning techniques. The model was based on the most complete worldwide cancer dataset known, the SEER dataset. After preprocessing, the training data consisted of 12,818 records of rectal cancer patients. A feature selection process extracted the six characteristics most relevant to the survivability of rectal cancer. The model constructed with six features was compared with another one with 18 features indicated by a physician. The results show that the performance of the six-feature model is close to that of the model using 18 features, which indicates that the former may be a good compromise between usability and performance. Funding: FCT (SFRH/BD/85291/2012).

Universidade do Minho: RepositoriUM

Crossref

Proc Mach Learn Res


Research in oncology quality of care and health outcomes has been limited by the difficulty of identifying cancer stage in health care claims data. Using linked cancer registry and Medicare claims data, we develop a tool for classifying lung cancer patients receiving chemotherapy into early vs. late stage cancer by (1) deploying ensemble machine learning for prediction, (2) establishing a set of classification rules for the predicted probabilities, and (3) considering an augmented set of administrative claims data. We find our ensemble machine learning algorithm with a classification rule defined by the median substantially outperforms an existing clinical decision tree for this problem, yielding full sample performance of 93% sensitivity, 92% specificity, and 93% accuracy. This work has the potential for broad applicability as provider organizations, payers, and policy makers seek to measure quality and outcomes of cancer care and improve on risk adjustment methods.

Funding: HHSN261201000140C/CA/NCI NIH HHS/United States; HHSN261201000035C/CA/NCI NIH HHS/United States; T32 MH019733/MH/NIMH NIH HHS/United States; HHSN261201000035I/CA/NCI NIH HHS/United States; HHSN261201000034C/CA/NCI NIH HHS/United States; U58 DP003862/DP/NCCDPHP CDC HHS/United States. Date: 2018-12-10. PMID: 30542673. PMCID: PMC6287925.
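A classification rule defined by the median, as mentioned in this abstract, can be sketched as follows. The probabilities are made up, and the choices to treat late stage as the positive class and to use a strict inequality at the cutoff are assumptions; the abstract does not specify either.

```python
from statistics import median

def classify_by_median(probabilities):
    """Label a case late stage (1) when its predicted probability exceeds
    the median of all predicted probabilities, otherwise early stage (0)."""
    cutoff = median(probabilities)
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical predicted probabilities of late-stage cancer from an ensemble.
predicted = [0.10, 0.85, 0.40, 0.95, 0.30, 0.70]
labels = classify_by_median(predicted)
print(labels)
```

Unlike a fixed 0.5 threshold, a median cutoff adapts to the distribution of predicted probabilities, so it always splits the sample roughly in half.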

CDC Stacks

A novel integrative risk index of papillary thyroid cancer progression combining genomic alterations and clinical factors.

Author: Acharya Chaitanya R
Cheng Qing
Hyslop Terry
Li Xuechan
Sosa Julie Ann
Publication venue: eScholarship, University of California
Publication date: 06/02/2017

Although the majority of papillary thyroid cancer (PTC) is indolent, a subset of PTC behaves aggressively despite the best available treatment. A major clinical challenge is to reliably distinguish early on between those patients who need aggressive treatment and those who do not. Using a large cohort of PTC samples obtained from The Cancer Genome Atlas (TCGA), we analyzed the association between disease progression and multiple forms of genomic data, such as transcriptome, somatic mutations, and somatic copy number alterations, and found that genes related to the FOXM1 signaling pathway were significantly associated with PTC progression. Integrative genomic modeling was performed, controlling for demographic and clinical characteristics, which included patient age, gender, TNM stages, histological subtypes, and history of other malignancy, using a leave-one-out elastic net model and 10-fold cross validation. For each subject, the model from the remaining subjects was used to determine the risk index, defined as a linear combination of the clinical and genomic variables from the elastic net model, and the stability of the risk index distribution was assessed through 2,000 bootstrap resamples. We developed a novel approach to combine genomic alterations and patient-related clinical factors that delineates the subset of patients who have more aggressive disease from those whose tumors are indolent and likely will require less aggressive treatment and surveillance (p = 4.62 × 10^-10, log-rank test). Our results suggest that risk index modeling that combines genomic alterations with current staging systems provides an opportunity for more effective anticipation of disease prognosis and therefore enhanced precision management of PTC.
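The leave-one-out risk index described in this abstract can be written out as below. The notation is an assumption introduced here for illustration: x_{ij} is the j-th clinical or genomic variable of subject i, \hat{\beta}^{(-i)} is the elastic net coefficient vector fitted without subject i, L^{(-i)} is the model's loss on the remaining subjects (the abstract does not say which loss was used), and \lambda and \alpha are the usual elastic net tuning parameters, which may be what the 10-fold cross validation selects.

```latex
\[
  \mathrm{RI}_i \;=\; \sum_{j=1}^{p} \hat{\beta}^{(-i)}_{j}\, x_{ij},
  \qquad
  \hat{\beta}^{(-i)} \;=\; \arg\min_{\beta}
  \Big\{ L^{(-i)}(\beta)
    + \lambda \Big( \alpha \lVert \beta \rVert_{1}
    + \tfrac{1-\alpha}{2} \lVert \beta \rVert_{2}^{2} \Big) \Big\}
\]
```

Because each subject's index is computed from coefficients estimated without that subject, the index is an out-of-sample prediction rather than a fitted value.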

PubMed Central

eScholarship - University of California

Efficient Feature Selection and ML Algorithm for Accurate Diagnostics

Author: El-Omari Nidhal Kamel Taha
Nyakina Judith Nyakanga
Nyangaresi Vincent Omollo
Publication venue: 'Bilingual Publishing Co.'
Publication date: 25/01/2022

Machine learning algorithms have been deployed in numerous optimization, prediction and classification problems. This has endeared them for application in fields such as computer networks and medical diagnosis. Although these machine learning algorithms achieve convincing results in these fields, they face numerous challenges when deployed on imbalanced datasets. Consequently, these algorithms are often biased towards the majority class and hence unable to generalize the learning process. In addition, they are unable to deal effectively with high-dimensional datasets. Moreover, conventional feature selection techniques based on attribute significance are ineffective for the majority of diagnosis applications. In this paper, feature selection is executed using the more effective Neighbour Components Analysis (NCA). During the classification process, an ensemble classifier comprising K-Nearest Neighbours (KNN), Naive Bayes (NB), Decision Tree (DT) and Support Vector Machine (SVM) is built, trained and tested. Finally, cross validation is carried out to evaluate the developed ensemble model. The results show that the proposed classifier has the best performance in terms of precision, recall, F-measure and classification accuracy.
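One common way to combine several base classifiers such as KNN, NB, DT and SVM is hard majority voting, sketched below. The abstract does not state how the ensemble combines its members, so the voting rule, the tie-breaking behaviour, and the per-classifier predictions are all assumptions made for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions is a list with one label list per classifier. Ties go to
    the label cast by the earliest classifier in the list.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from four base classifiers on four samples.
knn = [1, 0, 1, 1]
nb = [1, 0, 0, 0]
dt = [0, 0, 1, 1]
svm = [1, 1, 1, 0]

ensemble = majority_vote([knn, nb, dt, svm])
print(ensemble)
```

With an even number of voters ties are possible, as on the last sample here, so a practical ensemble often uses weighted or probability-averaged (soft) voting instead.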

Bilingual Publishing Co. (BPC): E-Journals

A Comparative Study for Methodologies and Algorithms Used In Colon Cancer Diagnoses and Detection

Author: abdelhamid laila mohamed
Nasr Mona Mohamed
Shehata Naglaa
Publication venue: Arab Journals Platform
Publication date: 29/09/2020

Colon cancer, also referred to as colorectal cancer, is a kind of cancer that starts in the large intestine (colon), the last section of the digestive tract. It typically affects elderly people but may occur at any age. It normally starts as small, noncancerous (benign) clumps of cells called polyps that form inside the colon. Over time, some of these polyps can turn into advanced malignant tumors that attack the body, becoming colon cancers. So far, no concrete causes have been identified, and complete treatment of the cancer remains very difficult for doctors. Colon cancer often has no symptoms at an early stage. Cancer detected at this stage is curable, but by the time of diagnosis in the final stage (stage IV) it has had the opportunity to spread to different parts of the body, where it is difficult to treat successfully, and the person's chances of survival become much lower. A false diagnosis of colorectal cancer means the wrong treatment for patients with long-term infections, who go on suffering from colon cancer, which can lead to their death. Cancer treatment also takes considerable time and money. This paper provides a comparative study of the methodologies and algorithms used in colon cancer diagnosis and detection, which can help in proposing a method for predicting colon cancer risk levels using the Convolutional Neural Network (CNN) deep learning algorithm.

Arab Journals Platform