114 research outputs found

    Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification

    Background: Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which benefits not only tumor molecular diagnosis but also drug development. Results: This paper proposes a novel gene selection method with rich biomedical meaning, based on a Heuristic Breadth-first Search Algorithm (HBSA), to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias. To address these potential problems, an HBSA-based ensemble classifier is constructed by majority voting over the individual classifiers built from the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of each gene by its occurrence frequency in the selected gene subsets. The experimental results on nine tumor datasets, including three pairs of cross-platform datasets, indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions: The frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtypes and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. These findings are further supported by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network.
Keywords: Gene expression profiles; Gene selection; Tumor classification; Heuristic breadth-first search; Power-law distribution
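
The two combination steps described in this abstract, ranking genes by how often they occur in the selected subsets and combining per-subset classifiers by majority vote, can be illustrated with a minimal sketch. It does not implement the HBSA search itself; the gene names, the subsets, and the function names `rank_genes_by_frequency` and `majority_vote` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def rank_genes_by_frequency(gene_subsets):
    """Rank genes by how many of the selected subsets contain them."""
    # set(subset) ensures a gene is counted at most once per subset.
    counts = Counter(gene for subset in gene_subsets for gene in set(subset))
    # Sort by descending frequency; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def majority_vote(predictions):
    """Combine one predicted label per classifier into a single label."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Ties go to the label that appears first (an arbitrary choice).
    return next(label for label in predictions if counts[label] == top)


# Hypothetical gene subsets, standing in for the output of a subset search.
subsets = [["TP53", "BRCA1"], ["TP53", "MYC"], ["TP53", "BRCA1", "EGFR"]]
ranking = rank_genes_by_frequency(subsets)

# Hypothetical labels predicted by three per-subset classifiers.
vote = majority_vote(["tumor", "normal", "tumor"])

print(ranking)
print(vote)
```

Under this reading, a gene that recurs in many near-optimal subsets rises to the top of the ranking, which is the quantity the abstract reports as following a power-law distribution.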

    A Machine Learning Decision Support System (DSS) for Neuroendocrine Tumor Patients Treated with Somatostatin Analog (SSA) Therapy

    The application of machine learning (ML) techniques could facilitate the identification of predictive biomarkers of somatostatin analog (SSA) efficacy in patients with neuroendocrine tumors (NETs). We collected data from 74 patients with a pancreatic or gastrointestinal NET who received SSA as first-line therapy. We developed three classification models to predict whether a patient would experience progressive disease (PD) after 12 or 18 months based on clinicopathological factors at baseline. The dataset included 70 samples and 15 features. We initially developed three classification models with accuracy ranging from 55% to 70%. We then compared ten different ML algorithms. In all but one case, the performance of the Multinomial Naive Bayes algorithm (80%) was the highest. The support vector machine classifier (SVC) had a higher performance for the recall metric of the progression-free outcome (97% vs. 94%). Overall, for the first time, we documented that the factors that mainly influenced progression-free survival (PFS) included age, the number of metastatic sites and the primary site. In addition, the following factors were also isolated as important: adverse events G3-G4, sex, Ki67, metastatic site (liver), functioning NET, the primary site and the stage. In patients with advanced NETs, ML provides a predictive model that could potentially be used to differentiate prognostic groups and to identify patients for whom SSA therapy as a single agent may not be sufficient to achieve a long-lasting PFS.

    Credit card fraud detection using a hierarchical behavior-knowledge space model

    With the advancement in machine learning, researchers continue to devise and implement effective intelligent methods for fraud detection in the financial sector. Indeed, credit card fraud leads to billions of dollars in losses for merchants every year. In this paper, a multi-classifier framework is designed to address the challenges of credit card fraud detection. An ensemble model with multiple machine learning classification algorithms is designed, in which the Behavior-Knowledge Space (BKS) is leveraged to combine the predictions from multiple classifiers. To ascertain the effectiveness of the developed ensemble model, publicly available data sets as well as real financial records are employed for performance evaluations. Through statistical tests, the results positively indicate the effectiveness of the developed model as compared with the commonly used majority voting method for combination of predictions from multiple classifiers in tackling noisy data classification as well as credit card fraud detection problems.
Data Availability: All relevant benchmark data are within the manuscript, given in references [24], [25], and [26]. Relevant real data records are available from a public repository: https://doi.org/10.6084/m9.figshare.17030138
Copyright: © 2022 Nandi et al.
Funding: The author(s) received no specific funding for this work.
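
To show how a Behavior-Knowledge Space differs from majority voting, here is a minimal sketch of the plain BKS lookup table. It is not the hierarchical model of the paper. The training data, the function names `fit_bks` and `predict_bks`, and the fallback to majority voting for unseen combinations are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_bks(classifier_outputs, true_labels):
    """Build a BKS table: one cell per observed combination of classifier outputs.

    Each cell counts how often each true label occurred for that combination.
    """
    table = defaultdict(Counter)
    for outputs, label in zip(classifier_outputs, true_labels):
        table[tuple(outputs)][label] += 1
    return dict(table)


def predict_bks(table, outputs):
    """Return the most frequent true label seen for this output combination."""
    cell = table.get(tuple(outputs))
    if cell:
        return cell.most_common(1)[0][0]
    # Assumed fallback: an unseen combination is resolved by majority vote.
    return Counter(outputs).most_common(1)[0][0]


# Hypothetical training records: outputs of three classifiers and the true label.
train_outputs = [
    ("fraud", "legit", "legit"),
    ("fraud", "legit", "legit"),
    ("fraud", "legit", "legit"),
    ("legit", "legit", "legit"),
]
train_labels = ["fraud", "fraud", "legit", "legit"]

bks_table = fit_bks(train_outputs, train_labels)
seen = predict_bks(bks_table, ("fraud", "legit", "legit"))
unseen = predict_bks(bks_table, ("legit", "legit", "fraud"))

print(seen, unseen)
```

In this toy data, majority voting would label the combination `("fraud", "legit", "legit")` as legitimate, while the table records that it was more often fraudulent in training, so the first classifier's minority opinion prevails.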

    Developing genomic models for cancer prevention and treatment stratification

    Malignant tumors remain one of the leading causes of mortality with over 8.2 million deaths worldwide in 2012. Over the last two decades, high-throughput profiling of the human transcriptome has become an essential tool to investigate molecular processes involved in carcinogenesis. In this thesis I explore how gene expression profiling (GEP) can be used in multiple aspects of cancer research, including prevention, patient stratification and subtype discovery. The first part details how GEP could be used to supplement or even replace the current gold standard assay for testing the carcinogenic potential of chemicals. This toxicogenomic approach coupled with a Random Forest algorithm allowed me to build models capable of predicting carcinogenicity with an area under the curve of up to 86.8% and provided valuable insights into the underlying mechanisms that may contribute to cancer development. The second part describes how GEP could be used to stratify heterogeneous populations of lymphoma patients into therapeutically relevant disease sub-classes, with a particular focus on diffuse large B-cell lymphoma (DLBCL). Here, I successfully translated established biomarkers from the Affymetrix platform to the clinically relevant Nanostring nCounter© assay. This translation allowed us to profile custom sets of transcripts from formalin-fixed samples, transforming these biomarkers into clinically relevant diagnostic tools. Finally, I describe my effort to discover tumor samples dependent on altered metabolism driven by oxidative phosphorylation (OxPhos) across multiple tissue types. This work was motivated by previous studies that identified a therapeutically relevant OxPhos sub-type in DLBCL, and by the hypothesis that this stratification might be applicable to other solid tumor types. 
To that end, I carried out a transcriptomics-based pan-cancer analysis, derived a generalized PanOxPhos gene signature, and identified mTOR as a potential regulator in primary tumor samples. High-throughput GEP coupled with statistical machine learning methods represents an important toolbox in modern cancer research. It provides a cost-effective and promising new approach for predicting cancer risk associated with chemical exposure, it can reduce the cost of the ever-increasing drug development process by identifying therapeutically actionable disease subtypes, and it can increase patients' survival by matching them with the most effective drugs.
2016-12-01T00:00:00

    Diagnostic prediction of complex diseases using phase-only correlation based on virtual sample template

    Motivation: Complex diseases induce perturbations to interaction and regulation networks in living systems, resulting in dynamic equilibrium states that differ between diseases and from normal states. Thus identifying gene expression patterns corresponding to different equilibrium states is of great benefit to the diagnosis and treatment of complex diseases. However, it remains a major challenge to deal with the high dimensionality and small size of the complex disease gene expression datasets currently used for discovering gene expression patterns. Results: Here we present a phase-only correlation (POC) based classification method for recognizing the type of complex diseases. First, a virtual sample template is constructed for each subclass by averaging all samples of that subclass in a training dataset. Then the label of a test sample is determined by measuring the similarity between the test sample and each template. This novel method can detect the similarity of overall patterns emerging from the differentially expressed genes or proteins while ignoring small mismatches. Conclusions: The experimental results obtained on seven publicly available complex disease datasets, including microarray and protein array data, demonstrate that the proposed POC-based disease classification method is effective and robust for diagnosing complex diseases with regard to the number of initially selected features, and its recognition accuracy is better than or comparable to other state-of-the-art machine learning methods. In addition, the proposed method does not require parameter tuning and data scaling, which can effectively reduce the occurrence of over-fitting and bias.
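
The template-matching idea in this abstract can be sketched in a few lines. The version below treats each sample as a one-dimensional signal, uses a naive discrete Fourier transform from the standard library, and takes the peak of the phase-only correlation as the similarity score. How the paper arranges expression values before correlation is not stated in the abstract, so the 1D layout, the toy data, and the function names are assumptions.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse divides by n)."""
    n = len(signal)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        result.append(total / n if inverse else total)
    return result


def poc_peak(f, g, eps=1e-12):
    """Peak of the phase-only correlation between two equal-length signals."""
    cross = [a * b.conjugate() for a, b in zip(dft(f), dft(g))]
    # Keep only the phase; bins with near-zero magnitude carry no phase.
    phase = [c / abs(c) if abs(c) > eps else 0j for c in cross]
    return max(value.real for value in dft(phase, inverse=True))


def build_templates(samples, labels):
    """Average all training samples of each class into one virtual template."""
    grouped = {}
    for sample, label in zip(samples, labels):
        grouped.setdefault(label, []).append(sample)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(sample, templates):
    """Assign the label whose template has the highest POC peak."""
    return max(templates, key=lambda label: poc_peak(sample, templates[label]))


# Hypothetical training data: two classes with four features each.
train = [[1, 2, 4, 8], [1, 2, 4, 8], [8, 4, 2, 1]]
labels = ["A", "A", "B"]
templates = build_templates(train, labels)
predicted = classify([2, 4, 8, 16], templates)
print(predicted)
```

Because only the phase of the cross-spectrum is kept, a test sample that is a rescaled copy of a template still scores a peak of about 1, which is consistent with the abstract's remark that the method does not require data scaling.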

    On Random Subspace Optimization-Based Hybrid Computing Models Predicting the California Bearing Ratio of Soils

    The California Bearing Ratio (CBR) is an important index for evaluating the bearing capacity of pavement subgrade materials. In this research, random subspace optimization-based hybrid computing models were trained and developed for the prediction of the CBR of soil. Three models were developed, namely reduced error pruning trees (REPTs), random subspace-based REPT (RSS-REPT), and RSS-based extra tree (RSS-ET). An experimental database was compiled from a total of 214 soil samples, which were classified according to AASHTO M 145, and included 26 samples of A-2-6 (clayey gravel and sand soil), 3 samples of A-4 (silty soil), 89 samples of A-6 (clayey soil), and 96 samples of A-7-6 (clayey soil). All CBR tests were performed in soaked conditions. The input parameters of the models included the particle size distribution, gravel content (G), coarse sand content (CS), fine sand content (FS), silt clay content (SC), organic content (O), liquid limit (LL), plastic limit (PL), plasticity index (PI), optimum moisture content (OMC), and maximum dry density (MDD). The accuracy of the developed models was assessed using numerous performance indexes, such as the coefficient of determination, relative error, MAE, and RMSE. The results show that the highest prediction accuracy was obtained using the RSS-based extra tree optimization technique.
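
Three of the performance indexes named in this abstract (coefficient of determination, MAE, and RMSE) can be computed as in the sketch below. The CBR values are invented for illustration, the function name `regression_metrics` is hypothetical, and the coefficient of determination is taken here as 1 - SS_res / SS_tot, which may differ from the definition used in the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical measured and predicted CBR values (percent).
measured = [2.0, 4.0, 6.0, 8.0]
predicted_cbr = [3.0, 4.0, 5.0, 8.0]
metrics = regression_metrics(measured, predicted_cbr)
print(metrics)
```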