
    Machine learning based data pre-processing for the purpose of medical data mining and decision support

    Building an accurate and reliable prediction model for different application domains is one of the most significant challenges in knowledge discovery and data mining. Sometimes improved data quality is itself the goal of the analysis, usually to improve processes in a production database and the design of decision support. As medicine moves forward, there is a need for sophisticated decision support systems that use data mining to support more orthodox knowledge engineering and health informatics practice. However, real-life medical data rarely comply with the requirements of data mining tools: they are often inconsistent and noisy, contain redundant attributes and missing values, come in unsuitable formats, and are imbalanced with regard to the outcome class label.
    Many real-life data sets are incomplete, and in medical data mining missing values have become a challenging issue. In many clinical trials the medical report pro-forma allows some attributes to be left blank, because they are inappropriate for some classes of illness or because the person providing the information feels it is not appropriate to record their values. The research reported in this thesis explored machine learning techniques as missing value imputation methods and proposed a new way of imputing missing values by supervised learning: a classifier learns the data patterns from a complete data sub-set, and the resulting model is then used to predict the missing values for the full dataset. The proposed machine learning based imputation was applied to the thesis data and compared with traditional mean/mode imputation. Experimental results show that all the machine learning methods explored outperformed the statistical mean/mode method.
    The class imbalance problem has been found to hinder the performance of learning systems, and most medical datasets are highly imbalanced in their class label. The solution is to reduce the gap between the minority and majority class samples. Over-sampling increases the number of minority class samples to balance the data; the alternative is under-sampling, which reduces the number of majority class samples. The thesis proposed a cluster based under-sampling technique to reduce the gap between majority and minority samples, and explored different under-sampling and over-sampling techniques as ways to balance the data. The experimental results show that, for the thesis data, the newly proposed modified cluster based under-sampling technique performed better than the other class balancing techniques.
    Further research found that the class imbalance problem not only hurts classification performance but also has an adverse effect on feature selection. The thesis therefore proposed a new feature selection framework for class-imbalanced datasets. The research found that, using the proposed framework, the classifier needs fewer attributes to reach high accuracy, and that more attributes are needed when the data are highly imbalanced. The research described in the thesis makes the following four novel main contributions:
    a) an improved data mining methodology for mining medical data;
    b) a machine learning based missing value imputation method;
    c) a cluster based semi-supervised class balancing method;
    d) a feature selection framework for class-imbalanced datasets.
    The performance analysis and comparative study show that the proposed missing value imputation, class balancing and feature selection framework provide an effective approach to data preparation for building medical decision support systems.
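    As a concrete illustration of the supervised-learning imputation described above (train a classifier on the complete rows, then predict the blanks), here is a minimal sketch assuming scikit-learn and pandas; the toy records, column names, and choice of a decision tree are hypothetical stand-ins, not the thesis implementation.

```python
# Minimal sketch of supervised-learning imputation of a categorical column.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def impute_with_classifier(df, target_col, feature_cols):
    """Learn target_col from complete rows, then predict it where missing."""
    complete = df.dropna(subset=[target_col] + feature_cols)
    missing = df[df[target_col].isna()].dropna(subset=feature_cols)
    if missing.empty:
        return df
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(complete[feature_cols], complete[target_col])
    df.loc[missing.index, target_col] = clf.predict(missing[feature_cols])
    return df

# Hypothetical medical-style records with a missing categorical attribute
df = pd.DataFrame({
    "age":    [63, 41, 58, 70, 35],
    "bp":     [140, 120, 150, 160, 110],
    "smoker": ["yes", "no", None, "yes", None],
})
print(impute_with_classifier(df, "smoker", ["age", "bp"]))
```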

    EFN-SMOTE: An effective oversampling technique for credit card fraud detection by utilizing noise filtering and fuzzy c-means clustering

    Credit card fraud poses a significant challenge for both consumers and organizations worldwide, particularly with the increasing reliance on credit cards for financial transactions. Therefore, it is crucial to establish effective mechanisms to detect credit card fraud. However, the uneven distribution of instances between the two classes in the credit card dataset hinders traditional machine learning techniques, as they tend to prioritize the majority class, leading to inaccurate fraud predictions. To address this issue, this paper focuses on the use of the Elbow Fuzzy Noise Filtering SMOTE (EFN-SMOTE) technique, an oversampling approach, to handle unbalanced data. EFN-SMOTE partitions the dataset into multiple clusters using the Elbow method, applies noise filtering to each cluster, and then employs SMOTE to synthesize new minority instances based on the nearest majority instance to each minority instance, thereby improving the model’s ability to perceive the decision boundary. EFN-SMOTE’s performance was evaluated using an Artificial Neural Network model with four hidden layers, resulting in significant improvements in classification performance, achieving an accuracy of 0.999, precision of 0.998, sensitivity of 0.999, specificity of 0.998, F-measure of 0.999, and G-Mean of 0.999.
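    As a rough illustration of the cluster-then-oversample idea behind EFN-SMOTE (not the authors' implementation: plain KMeans stands in for fuzzy c-means, the elbow scan and the centroid-distance noise filter are simplified, and scikit-learn plus imbalanced-learn are assumed to be available):

```python
# Simplified cluster / noise-filter / oversample pipeline in the spirit
# of EFN-SMOTE, on synthetic data standing in for a fraud dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

def elbow_k(X, k_max=8):
    """Crude elbow: pick k where the inertia curve bends the most."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(1, k_max + 1)]
    return int(np.argmax(np.diff(inertias, 2)) + 2)  # map index back to k

def drop_noise(X, y, labels):
    """Drop points unusually far from their cluster centroid."""
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        keep[idx[d > d.mean() + 2 * d.std()]] = False
    return X[keep], y[keep]

# Toy stand-in for a fraud dataset (~3% positives)
X, y = make_classification(n_samples=2000, weights=[0.97], random_state=0)
k = elbow_k(X)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
X_f, y_f = drop_noise(X, y, labels)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_f, y_f)
print(np.bincount(y_bal))  # classes are now balanced
```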

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data, due both to the simplicity of its design and to its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has contributed significantly to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
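    For reference, SMOTE's core interpolation step can be sketched in a few lines. This is a toy rendering of the 2002 formulation (pick a minority sample, pick one of its k nearest minority neighbours, interpolate at a random point on the segment between them), not the reference code; in practice one would use a library implementation such as imbalanced-learn's SMOTE.

```python
# Toy sketch of SMOTE's interpolation step, assuming scikit-learn.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: first hit is self
    _, idx = nn.kneighbors(X_min)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]   # one of the k true neighbours
        gap = rng.random()                   # random position on the segment
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synth)

X_min = np.random.default_rng(1).normal(size=(20, 3))  # toy minority class
print(smote_sketch(X_min, n_new=40).shape)             # -> (40, 3)
```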

    Local matching learning of large scale biomedical ontologies

    Large biomedical ontologies generally describe the same domain of interest, but with different modelling choices and vocabularies. Aligning these complex, heterogeneous ontologies is a tedious task: matching systems must deliver high-quality results while coping with the large size of the resources. Ontology matching systems therefore face two problems: (i) handling the large size of the ontologies, and (ii) automating the alignment process. The main difficulties in aligning large biomedical ontologies are conceptual heterogeneity, the huge search space, and the reduced quality of the resulting alignments. Matching systems combine different matchers in order to reduce heterogeneity, and this combination requires choosing which matchers to combine and their weights. Since different matchers handle different kinds of heterogeneity, matcher tuning should be automated by the alignment system in order to obtain good matching quality. We propose an approach called "local matching learning" to address both the large size of ontologies and the automation problem. We divide a large alignment problem into a set of smaller local alignment problems, each of which is aligned independently by a machine learning approach. The huge search space is thereby reduced to a set of smaller local matching tasks, and each local matching task can be aligned efficiently to obtain better matching quality. Our partitioning approach relies on a new multi-cut strategy that generates partitions which are neither oversized nor isolated; as a result, we can overcome the problem of conceptual heterogeneity. The new partitioning algorithm is based on agglomerative hierarchical clustering and generates a set of local matching tasks with a sufficient coverage rate and no isolated partitions. Each local matching task is aligned automatically using machine learning techniques: a local classifier, built on element-level and structure-level features, aligns a single local matching task, and the class attribute of each local training set is labelled automatically using an external knowledge base. We apply a feature selection technique to each local classifier in order to select the appropriate matchers for each local matching task. This reduces alignment complexity and increases overall precision compared with traditional learning methods. We show that the partitioning approach outperforms current approaches in terms of precision, coverage rate, and absence of isolated partitions, and we evaluate the local matching learning approach in various experiments based on the OAEI 2018 datasets. We conclude that it is advantageous to divide a large ontology alignment task into a set of local alignment tasks: the search space is reduced, which lowers the number of false negatives and false positives, and applying feature selection techniques to each local classifier increases the recall of each local alignment task.
    Although a considerable body of research work has addressed the problem of ontology matching, few studies have tackled the large ontologies used in the biomedical domain. We introduce a fully automated local matching learning approach that breaks down a large ontology matching task into a set of independent local sub-matching tasks. This approach integrates a novel partitioning algorithm as well as a set of matching learning techniques. The partitioning method is based on hierarchical clustering and does not generate isolated partitions. The matching learning approach employs different techniques: (i) local matching tasks are independently and automatically aligned using their local classifiers, which are based on local training sets built from element-level and structure-level features; (ii) resampling techniques are used to balance each local training set; and (iii) feature selection techniques are used to automatically select the appropriate tuning parameters for each local matching context. Our local matching learning approach generates a set of combined alignments from each local matching task, and experiments show that a multiple local classifier approach outperforms conventional state-of-the-art approaches, which use a single classifier for the whole ontology matching task. In addition, focusing on context-aware local training sets based on local feature selection and resampling techniques significantly enhances the obtained results.
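    To make the division of labour concrete, the partition-then-local-classifier pattern can be sketched as below; random vectors stand in for the concept embeddings and the element-level and structure-level features, and scikit-learn's AgglomerativeClustering stands in for the thesis's multi-cut partitioning strategy.

```python
# Toy sketch: partition a matching problem, then train one independent
# classifier per local matching task. Not the thesis implementation.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical concept embeddings for the ontology vocabulary
concept_vecs = rng.normal(size=(200, 16))

# Partition into local matching tasks (8 chosen arbitrarily here)
parts = AgglomerativeClustering(n_clusters=8).fit_predict(concept_vecs)

# One independent classifier per local task, trained on toy
# element/structure-level features and match / no-match labels
local_clfs = {}
for p in np.unique(parts):
    n = int((parts == p).sum())
    X_p = rng.normal(size=(n, 5))
    y_p = rng.integers(0, 2, size=n)
    local_clfs[p] = RandomForestClassifier(random_state=0).fit(X_p, y_p)

# Each task is then predicted by its own classifier, and the local
# alignments are unioned into the final mapping set.
print(len(local_clfs))  # -> 8
```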

    Towards Data-centric Graph Machine Learning: Review and Outlook

    Data-centric AI, with its primary focus on the collection, management, and utilization of data to drive AI models and applications, has attracted increasing attention in recent years. In this article, we conduct an in-depth and comprehensive review, offering a forward-looking outlook on the current efforts in data-centric AI pertaining to graph data: the fundamental data structure for representing and capturing intricate dependencies among massive and diverse real-life entities. We introduce a systematic framework, Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of the graph data lifecycle, including graph data collection, exploration, improvement, exploitation, and maintenance. A thorough taxonomy of each stage is presented to answer three critical graph-centric questions: (1) how to enhance graph data availability and quality; (2) how to learn from graph data of limited availability and low quality; (3) how to build graph MLOps systems from the graph data-centric view. Lastly, we pinpoint the future prospects of the DC-GML domain, providing insights to navigate its advancements and applications. Comment: 42 pages, 9 figures

    Searching for Needles in the Cosmic Haystack

    Get PDF
    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with the Synthetic Minority Oversampling Technique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average, with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
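    The SMOTE-plus-RandomForest imbalance treatment highlighted above is easy to reproduce in outline. The sketch below uses synthetic data and assumes scikit-learn and imbalanced-learn; the toy imbalance is far milder than the 0.05% pulsar rarity so that SMOTE's default 5-nearest-neighbour step has enough minority points in every training fold.

```python
# Sketch of a SMOTE + RandomForest pipeline for imbalanced classification.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy imbalanced data: ~1% positives standing in for rare pulsar signals
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.99], random_state=0)

# The imblearn Pipeline oversamples only inside each training fold,
# so the cross-validated recall is not inflated by synthetic test points.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
print(cross_val_score(pipe, X, y, cv=5, scoring="recall").mean())
```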