25 research outputs found

    Primal Estimated Subgradient Solver for SVM for Imbalanced Classification

    We aim to demonstrate experimentally that our cost-sensitive PEGASOS SVM achieves good performance on imbalanced data sets with a majority-to-minority ratio ranging from 8.6:1 to 130:1, and to ascertain whether including an intercept (bias), regularization, and parameter choices affects performance on our selection of datasets. Although many resort to SMOTE methods, we aim for a less computationally intensive method. We evaluate performance by examining the learning curves. These curves diagnose whether the model overfits or underfits, and whether the training/test data are over- or under-representative. We will also examine how the hyperparameters relate to test and training error in validation curves. We benchmark our cost-sensitive PEGASOS SVM's results against Ding's LINEAR SVM DECIDL method. Ding obtained an ROC-AUC of 0.5 on one dataset. Our work will extend the work of Ding by incorporating kernels into the SVM. We will use Python rather than MATLAB because Python has dictionaries for storing mixed data types during multi-parameter cross-validation. Comment: 10 pages, 4 tables, 3 figures
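To make the idea in the abstract above concrete, here is a minimal sketch of a Pegasos-style stochastic subgradient update with per-class costs. It is not the authors' implementation: the abstract does not say how the costs enter the objective, so scaling the hinge-loss step by a hypothetical `class_cost` dictionary is an assumption, and the sketch is linear with no intercept and no kernel.

```python
import random

def cost_sensitive_pegasos(X, y, lam=0.01, n_iters=2000, class_cost=None, seed=0):
    """Linear SVM trained with Pegasos-style stochastic subgradient steps.

    X: list of feature lists; y: labels in {-1, +1}.
    class_cost: assumed per-class misclassification costs, e.g. {1: 8.0, -1: 1.0}.
    """
    rng = random.Random(seed)
    class_cost = class_cost or {1: 1.0, -1: 1.0}
    w = [0.0] * len(X[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(X))            # pick one training example at random
        eta = 1.0 / (lam * t)                # Pegasos step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        # Shrink step coming from the L2 regularizer.
        w = [(1.0 - eta * lam) * wj for wj in w]
        if margin < 1.0:
            # Hinge-loss subgradient step, scaled by the (assumed) class cost.
            step = eta * class_cost[y[i]] * y[i]
            w = [wj + step * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    """Return +1 or -1 from the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

A larger cost on the minority class makes each of its margin violations move the weights further, which is one simple way to counter a skewed class ratio without resampling.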

    Machine Learning Approaches for Healthcare Analysis

    Machine learning (ML) is a division of artificial intelligence that teaches computers how to discover difficult-to-distinguish patterns in huge or complex data sets and to learn from previous cases by using a range of statistical, probabilistic, data processing, and optimization methods. Nowadays, ML plays a vital role in many fields, such as finance, self-driving cars, image processing, medicine, and speech recognition. In healthcare, ML has been used in applications such as the detection, prognosis, diagnosis, and treatment of diseases due to its capability to handle large data. Moreover, ML has exceptional abilities to predict disease by uncovering patterns in medical datasets. Machine learning and deep learning are better suited to analyzing medical datasets than traditional methods because of the nature of these datasets: they are mostly large, complex, heterogeneous data coming from different sources, requiring more efficient computational techniques to handle them. This dissertation presents several machine-learning techniques to tackle medical issues such as data imbalance, classification and upgrading of tumor stages, and multi-omics integration. In the second chapter, we introduce a novel method to handle class imbalance, a common issue in bioinformatics datasets. In class-imbalanced data, the number of samples in each class is unequal. Since most data sets contain usual versus unusual cases, e.g., cancer versus normal or miRNAs versus other noncoding RNA, the minority class with the fewest samples is the interesting class that contains the unusual cases. Learning models based on standard classifiers, such as the support vector machine (SVM), random forest, and k-NN, are usually biased towards the majority class, which means that the classifier is likely to predict samples from the interesting class inaccurately. Thus, handling class-imbalanced datasets has gained researchers’ interest recently.
A combination of proper feature selection, a cost-sensitive classifier, and ensembling based on the random forest method (BCECSC-RF) is proposed to handle the class-imbalanced data. Random class-balanced ensembles are built individually. Then, each ensemble is used as a training pool to classify the remaining out-of-bag samples. Samples in each ensemble are classified using a class-sensitive classifier incorporating random forest. Each sample is assigned the class that received the most votes across all of its appearances in the formed ensembles. A set of performance measurements, including a geometric measurement, suggests that the model can improve the classification of minority class samples. In the third chapter, we introduce a novel study to predict the upgrading of the Gleason score on confirmatory magnetic resonance imaging-guided targeted biopsy (MRI-TB) of the prostate in candidates for active surveillance based on clinical features. MRI of the prostate is not accessible to many patients because of difficulty contacting patients and insurance denials, and African-American patients are disproportionately affected by barriers to MRI of the prostate during active surveillance [6,7]. Modeling clinical variables with advanced methods, such as machine learning, could allow us to manage patients in resource-limited environments with limited technological access. Upgrading to significant prostate cancer on MRI-TB was defined as upgrading to G 3+4 (definition 1 - DF1) and 4+3 (DF2). For upgrading prediction, the AdaBoost model was highly predictive of upgrading DF1 (AUC 0.952), while for prediction of upgrading DF2, the Random Forest model had a lower but still excellent prediction performance (AUC 0.947). In the fourth chapter, we introduce a multi-omics data integration method to analyze multi-omics data for biomedical applications, including disease prediction, disease subtypes, biomarker prediction, and others.
Multi-omics data integration provides a richer understanding than analyzing each omics data type separately. Our method is constructed by combining a gene similarity network (GSN) based on Uniform Manifold Approximation and Projection (UMAP) with convolutional neural networks (CNNs). The method uses UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) into a lower dimension, creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN, and the other omics data are then integrated with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage in breast cancer patients. The results show that UMAP as an embedding technique can better integrate multi-omics maps into the prediction model than SO

    Identification of urban sectors prone to solid waste accumulation: A machine learning approach based on social indicators

    In recent decades, the accumulation of municipal solid waste in urban areas has become a latent concern in our society due to its implications for the exposed population and the health and environmental issues it may cause. This research study contributes to the timely identification of sectors prone to such accumulation according to the anthropogenic characteristics of their residents, as captured by 10 social indicators (e.g., age, education, income) sorted into three assessment categories (sociodemographic, sociocultural, and socioeconomic). The data collected were processed and analyzed using two machine learning algorithms: random forest (RF) and logistic regression (LR). The primary information that fed the machine learning models was collected through field visits and local/national reports. The Puente Piedra and Chaclacayo districts, both located in the province of Lima, Peru, were selected as case studies. Results suggest that the most relevant social indicators for identifying these sectors are monthly income, consumption patterns, age, and household population density. The experiments showed that the RF algorithm has the best performance, since it identified 63% of the possible solid waste accumulation zones. In addition, both models were able to discriminate between classes (AUC: RF = 0.65, LR = 0.71). Finally, the proposed approach is applicable and reproducible in different sectors of the national Peruvian territory.
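For readers unfamiliar with the AUC figures quoted in these abstracts, the following is a minimal sketch of how ROC-AUC can be computed from classifier scores as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. The function name `roc_auc` and the toy scores are illustrative assumptions, not data from any of the listed studies.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this definition, a value of 0.5 corresponds to scores that rank positives above negatives no better than chance, and 1.0 to a perfect ranking.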

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification perform poorly in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, which is a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data, along with addressing real problems in quality control and business analytics.

    Relational data clustering algorithms with biomedical applications


    Kernel Methods for Machine Learning with Life Science Applications


    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention.
The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features have a correlation to biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.
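The oversampling step mentioned in the abstract above can be illustrated with a minimal SMOTE-style sketch: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This only illustrates the interpolation idea; the ENN cleaning step and the bagging ensemble are omitted, and the function name, parameters, and toy points are assumptions rather than the author's code.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])   # one of the k nearest neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy usage: six synthetic points drawn inside the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6, k=2, seed=1)
print(len(new_points))  # 6
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region those samples already span, which is one reason such methods can struggle when the minority class is extremely sparse.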

    Human lower limb activity recognition techniques, databases, challenges and its applications using sEMG signal: an overview

    Human lower limb activity recognition (HLLAR) has grown in popularity over the last decade, mainly because of its applications in the identification and control of neuromuscular disorders, security, robotics, and prosthetics. Surface electromyography (sEMG) sensors provide various advantages over other wearable or visual sensors for HLLAR applications, including quick response, pervasiveness, no need for medical monitoring, and negligible risk of infection. Recognizing lower limb activity from sEMG signals is also challenging owing to the noise in the sEMG signal. Preprocessing of sEMG signals is highly desirable before classification because it allows a more consistent and precise evaluation in the above applications. This article provides a segment-by-segment overview of: (1) techniques for eliminating artifacts from lower limb sEMG signals; (2) existing datasets of lower limb sEMG; and (3) the various techniques for processing and classifying sEMG data for applications involving lower limb activity. Finally, an open discussion is presented, which may help identify a variety of future research possibilities for human lower limb activity recognition. The framework presented in this study is therefore expected to aid the advancement of sEMG-based recognition of human lower limb activity.

    Novel techniques of computational intelligence for analysis of astronomical structures

    Gravitational forces cause the formation and evolution of a variety of cosmological structures. The detailed investigation and study of these structures is a crucial step towards our understanding of the universe. This thesis provides several solutions for the detection and classification of such structures. In the first part of the thesis, we focus on astronomical simulations, and we propose two algorithms to extract stellar structures. Although they follow different strategies (the first is a downsampling method, while the second keeps all samples), both techniques help to build more effective probabilistic models. In the second part, we consider observational data, and the goal is to overcome some of the common challenges in observational data, such as noisy features and imbalanced classes. For instance, when not enough examples are present in the training set, two different strategies are used: a) a nearest neighbor technique and b) an outlier detection technique. In summary, both parts of the thesis show the effectiveness of automated algorithms in extracting valuable information from astronomical databases.