23 research outputs found

    Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.

    Classification of imbalanced datasets has attracted substantial research interest over the past years. This is because imbalanced datasets are common in several domains such as health, finance and security, but learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports showed that class overlap had a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on the learning algorithm and demonstrates how elimination of class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in such a region. Seven methods under the two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs with the majority class accuracy. Moreover, successful application of the methods in predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap. This could effectively address the problem of class overlap while reducing the effect of class imbalance. Second, information loss is minimised as instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to be generalised across different problems. 
It is also worth pointing out that these methods provide different trade-offs, which offer real-world users more alternatives when selecting the best-fit solution to the problem.
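The idea of removing majority class instances that sit in the overlapping region can be illustrated with a small sketch. This is not one of the seven methods from the thesis; it is a generic neighbourhood-based rule chosen for illustration, and the function name `overlap_undersample`, the parameter `k`, and the toy data are all assumptions.

```python
import math

def overlap_undersample(X, y, k=3, majority=0, minority=1):
    """Drop majority points that have a minority point among their k nearest neighbours.

    Hypothetical illustration: a majority instance is treated as lying in the
    overlapping region if any of its k nearest neighbours is a minority instance.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority instances are never removed
            continue
        # Euclidean distances from point i to every other point, nearest first
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        if not any(y[j] == minority for j in neighbours):
            keep.append(i)  # no minority neighbour: outside the overlap, keep it
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: four majority points far from the minority class,
# two majority points inside the overlap, and two minority points.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (1.0, 1.0), (1.1, 1.0),
     (1.05, 1.05), (1.0, 1.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = overlap_undersample(X, y, k=3)
print(y_res)
```

In this sketch the number of removed instances depends only on how many majority points have minority neighbours, not on the class ratio, which mirrors the property described in the abstract that the undersampling amount follows the degree of overlap rather than the degree of imbalance.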

    Investigating biomarkers in Parkinson's disease using machine learning

    Genome-Wide Association Studies (GWAS) identify genetic variations in individuals affected by diseases such as Parkinson's disease (PD), whose allele or genotype frequencies differ significantly between affected individuals and individuals who are free of the disease. GWAS data can be used to identify genetic variations associated with the disease of interest. However, GWAS datasets are extensive and contain many more Single Nucleotide Polymorphisms (SNPs, pronounced “snips”) than individual samples. To address these challenges, we applied Singular-Vectors Feature Selection (SVFS) to PD GWAS datasets. We discovered a group of SNPs that are potentially novel PD biomarkers: we found indirect links between them and PD in the literature, but they have not been directly associated with PD before. A direct association means that current literature directly links a SNP with PD, while an indirect link means that current literature suggests the involvement of a SNP in a disease other than PD that co-occurs with PD in a significant number of PD patients. These indirectly linked SNPs open new potential lines of investigation. Directly linked SNPs identified by our method are rs11248060, rs239748, rs999473, and rs2313982. The full list of identified SNPs is given in Section 4.4.

    Leveraging big data resources and data integration in biology: applying computational systems analyses and machine learning to gain insights into the biology of cancers

    Recently, many "molecular profiling" projects have yielded vast amounts of genetic, epigenetic, transcription, protein expression, metabolic and drug response data for cancerous tumours, healthy tissues, and cell lines. We aim to facilitate a multi-scale understanding of these high-dimensional biological data and the complexity of the relationships between the different data types taken from human tumours. Further, we intend to identify molecular disease subtypes of various cancers, uncover the subtype-specific drug targets and identify sets of therapeutic molecules that could potentially be used to inhibit these targets. We collected data from over 20 publicly available resources. We then leverage integrative computational systems analyses, network analyses and machine learning, to gain insights into the pathophysiology of pancreatic cancer and 32 other human cancer types. Here, we uncover aberrations in multiple cell signalling and metabolic pathways that implicate regulatory kinases and the Warburg effect as the likely drivers of the distinct molecular signatures of three established pancreatic cancer subtypes. Then, we apply an integrative clustering method to four different types of molecular data to reveal that pancreatic tumours can be segregated into two distinct subtypes. We define sets of proteins, mRNAs, miRNAs and DNA methylation patterns that could serve as biomarkers to accurately differentiate between the two pancreatic cancer subtypes. Then we confirm the biological relevance of the identified biomarkers by showing that these can be used together with pattern-recognition algorithms to infer the drug sensitivity of pancreatic cancer cell lines accurately. Further, we evaluate the alterations of metabolic pathway genes across 32 human cancers. We find that while alterations of metabolic genes are pervasive across all human cancers, the extent of these gene alterations varies between them. 
Based on these gene alterations, we define two distinct cancer supertypes that tend to be associated with different clinical outcomes and show that these supertypes are likely to respond differently to anticancer drugs. Overall, we show that the time has already arrived when we can leverage available data resources to potentially elicit more precise and personalised cancer therapies that would yield better clinical outcomes at a much lower cost than is currently being achieved.

    Development of ultrasound to measure deformation of functional spinal units in cervical spine

    Neck pain is a pervasive problem in the general population, especially in those working in vibrating environments, e.g. military troops and truck drivers. Previous studies showed that neck pain was strongly associated with degeneration of the intervertebral disc, which is commonly caused by repetitive loading in the workplace. Currently, there is no method to measure the in-vivo displacement and loading condition of the cervical spine on site. Therefore, there is little knowledge about the alteration of cervical spine functionality and biomechanics in dynamic environments. In this thesis, a portable ultrasound system was explored as a tool to measure vertebral motion and functional spinal unit deformation. It is hypothesized that the time sequences of ultrasound imaging signals can be used to characterize the deformation of cervical spine functional spinal units in response to applied displacements and loading. Specifically, a multi-frame tracking algorithm is developed to measure the dynamic movement of vertebrae, which is validated in ex-vivo models. The planar kinematics of the functional spinal units is derived from a dual ultrasound system, which applies two ultrasound systems to image the C-spine anteriorly and posteriorly. The kinematics is reconstructed from the results of the multi-frame movement tracking algorithm and a method to co-register ultrasound vertebrae images to an MRI scan. Using the dual ultrasound, it is shown that the dynamic deformation of the functional spinal unit is affected by the biomechanical properties of the intervertebral disc ex-vivo and by different applied loading in activities in-vivo. It is concluded that ultrasound is capable of measuring functional spinal unit motion, which allows rapid in-vivo evaluation of the C-spine in dynamic environments where X-ray, CT or MRI cannot be used.
    2020-02-20T00:00:00

    Statistical Learning Methods for Electronic Health Record Data

    In the current era of electronic health records (EHR), use of data to make informed clinical decisions is at an all-time high. Although the collection, upkeep and accessibility of EHR data continue to grow, statistical methodology focused on aiding real-time clinical decision making is lacking. Improved decision making tools generally lead to improved patient outcomes and lower healthcare costs. In this dissertation, we propose three statistical learning methods to improve clinical decision making based on EHR data. In the first chapter we propose a new classifier, SVM-CART, that combines features of Support Vector Machines (SVM) and Classification and Regression Trees (CART) to produce a flexible classifier that outperforms either method in terms of prediction accuracy and ease of use. The method is especially powerful in situations where the disease-exposure mechanisms may be different across subgroups of the population. Through simulation, under settings with high levels of interaction, the SVM-CART classifier resulted in significant prediction accuracy improvements. We illustrate our method by diagnosing neuropathy using various components of the metabolic syndrome. In predicting neuropathy, SVM-CART outperformed CART in terms of prediction accuracy and provided improved interpretability compared to SVM. In the second chapter, we develop regression tree and ensemble methods for multivariate outcomes. We propose two general approaches to develop multivariate regression trees by: (1) maximizing within-node homogeneity, and (2) maximizing between-node separation. Within-node homogeneity is measured using the average Mahalanobis distance and the determinant of the covariance matrix. For between-node separation, we propose using the Mahalanobis and Euclidean distances. The proposed multivariate regression trees are illustrated using two clinical datasets of neuropathy and pediatric cardiac surgery. 
In high variance scenarios or when the dimension of the outcome was large, the Mahalanobis distance split trees had the best prediction performance. The determinant split trees generally had a simple structure and the Euclidean distance metrics performed well in large sample settings. In both applications, the resulting multivariate trees improve usability and validity compared to predictions made using multiple univariate regression trees. In the third chapter we develop a sequential method to make predictions using shallow (large-scale EHR) data in tandem with deep (health system specific) patient data. Specifically, we utilize machine learning based methods to first give a prediction based on a large-scale EHR, then, for a select group of patients, refine the prediction based on the deep EHR data. We develop a novel framework that is time- and cost-effective for identifying patient subgroups that would most benefit from a second-stage prediction refinement. Final tandem prediction is obtained by combining predictions from both the first and second stage classifiers. We apply our tandem approach to predict extubation failure for pediatric patients who have undergone a critical cardiac operation using shallow data from a national registry and deep continuously streamed data captured in the intensive care unit. Using these two EHR data sources in tandem increased our ability to identify extubation failures in terms of the area under the ROC curve (AUC: 0.639) compared to using just the national registry (AUC: 0.607) or physiologic ICU data (AUC: 0.634) alone. Additionally, identifying a specific patient subgroup for second stage prediction refinement resulted in additional prediction improvement, as opposed to giving each patient a deep-data prediction (AUC: 0.682).
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/149829/1/evanlr_1.pd
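As a rough illustration of a determinant-based split for multivariate outcomes, the sketch below scores each candidate split of one covariate by the size-weighted determinant of the outcome covariance matrix in the two child nodes, and keeps the split with the lowest score. It is restricted to bivariate outcomes, and the function names `cov_det` and `best_split`, the `min_leaf` parameter, and the toy data are assumptions made for this example, not the dissertation's implementation.

```python
def cov_det(Y):
    """Determinant of the 2x2 population covariance matrix of bivariate outcomes Y."""
    n = len(Y)
    m0 = sum(a for a, _ in Y) / n
    m1 = sum(b for _, b in Y) / n
    s00 = sum((a - m0) ** 2 for a, _ in Y) / n
    s11 = sum((b - m1) ** 2 for _, b in Y) / n
    s01 = sum((a - m0) * (b - m1) for a, b in Y) / n
    return s00 * s11 - s01 ** 2

def best_split(x, Y, min_leaf=2):
    """Return (threshold, score) minimising the size-weighted covariance determinant."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = (None, float("inf"))
    for cut in range(min_leaf, len(x) - min_leaf + 1):
        lo, hi = x[order[cut - 1]], x[order[cut]]
        if lo == hi:
            continue  # cannot place a threshold between equal covariate values
        left = [Y[i] for i in order[:cut]]
        right = [Y[i] for i in order[cut:]]
        score = (len(left) * cov_det(left) + len(right) * cov_det(right)) / len(x)
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

# Toy data: the bivariate outcome shifts from around (0, 0) to around (5, 5)
# once the covariate x exceeds 3.
x = [1, 2, 3, 4, 5, 6]
Y = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
     (5.0, 5.2), (5.3, 5.0), (5.1, 5.4)]

threshold, score = best_split(x, Y)
print(threshold, score)
```

One caveat of this kind of criterion is that the determinant also shrinks when the outcomes in a node are strongly correlated, and it is exactly zero for a two-point node, so a practical version would likely need a larger minimum node size than the one used here.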

    16th SC@RUG 2019 proceedings 2018-2019
