384 research outputs found

    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Rails Quality Data Modelling via Machine Learning-Based Paradigms


    Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

    With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective method for solving imbalanced classification. One of the main challenges of existing oversampling methods is accurately labeling the new synthetic samples: inaccurate labels would distort the distribution of the dataset and possibly worsen classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples caused by traditional oversampling methods. A graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose cost-sensitive neighborhood components analysis for high-dimensional datasets and a bootstrap-based ensemble framework for highly imbalanced datasets. The proposed method achieves good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high-imbalance and high-dimensionality problems. Its average performance and robustness are better than those of the benchmark methods.
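    As a concrete illustration of the failure mode this abstract targets, here is a minimal pure-Python sketch of SMOTE's core interpolation step (not the paper's graph semi-supervised variant). Every synthetic point is blindly assigned the minority label, which is exactly the labeling inaccuracy the proposed method is designed to correct. The data and function name are illustrative, not from the paper.

    ```python
    import random

    def smote_sketch(minority, n_new, k=2, seed=0):
        """Generate synthetic minority samples by interpolating between
        a sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
        rng = random.Random(seed)
        synthetic = []
        for _ in range(n_new):
            x = rng.choice(minority)
            # k nearest neighbours of x within the minority class (squared Euclidean)
            neighbours = sorted(
                (p for p in minority if p is not x),
                key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
            )[:k]
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            # the new point gets the minority label with no further check --
            # the inaccuracy the weakly supervised step is meant to fix
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
        return synthetic

    minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
    new_samples = smote_sketch(minority, n_new=4)
    print(len(new_samples))  # 4 synthetic points, each on a segment between two real ones
    ```

    Because each synthetic point is a convex combination of two real minority samples, it always lies inside the minority region's bounding box; near a class boundary that assumption can fail, which motivates relabeling approaches like the one above.
    
    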

    Automating the decision making process of Todd’s age estimation method from the pubic symphysis with explainable machine learning

    Age estimation is a fundamental task in forensic anthropology for both the living and the dead. The procedure consists of analyzing properties such as appearance, ossification patterns, and morphology in different skeletonized remains. The pubic symphysis is extensively used to assess adults’ age-at-death due to its reliability. Nevertheless, most methods currently used for skeleton-based age estimation are carried out manually, even though their automation has the potential to lead to a considerable improvement in terms of economic resources, effectiveness, and execution time. In particular, explainable machine learning emerges as a promising means of addressing this challenge by engaging forensic experts to refine and audit the extracted knowledge and discover unknown patterns hidden in the complex and uncertain available data. In this contribution we address the automation of the decision-making process of Todd’s pioneering age assessment method to assist the forensic practitioner in its application. To do so, we make use of the pubic bone database available at the Physical Anthropology lab of the University of Granada. The machine learning task is significantly complex, as it becomes an imbalanced ordinal classification problem with a small sample size and high dimensionality. We tackle it with the combination of an ordinal classification method and oversampling techniques through an extensive experimental setup. Two forensic anthropologists refine and validate the derived rule base according to their own expertise and the knowledge available in the area. The resulting automatic system, finally composed of 34 interpretable rules, outperforms the state-of-the-art accuracy.
In addition, and more importantly, it allows the forensic experts to uncover novel and interesting insights about how Todd’s method works in particular, and about guidelines for estimating age-at-death from pubic symphysis characteristics in general. Funding: Ministry of Science and Innovation (MICINN), Spanish Government; Agencia Estatal de Investigación (AEI), grant PID2021-122916NB-I00; Spanish Government, grant PGC2018-101216-B-I00; Junta de Andalucía and University of Granada, grants P18-FR-4262 and B-TIC-456-UGR20; European Commission; Universidad de Granada/CBU.
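The abstract does not name the specific ordinal classification method used; as one common way such problems are handled, here is a sketch of the Frank-and-Hall-style reduction of a K-class ordinal problem (e.g. Todd-style phases encoded as ordered labels 0..K-1) to K-1 binary "is the class greater than t?" subproblems. The labels and probabilities below are illustrative assumptions, not data from the paper.

```python
def ordinal_decompose(labels, num_classes):
    """Frank & Hall decomposition: an ordinal problem with classes 0..K-1
    becomes K-1 binary targets 'is the class > t?' for each threshold t."""
    return [[int(y > t) for y in labels] for t in range(num_classes - 1)]

def ordinal_recompose(binary_probs):
    """Recover class probabilities from the K-1 binary estimates P(y > t):
    P(y = k) = P(y > k-1) - P(y > k), with P(y > -1) = 1 and P(y > K-1) = 0."""
    p_gt = [1.0] + list(binary_probs) + [0.0]
    return [p_gt[k] - p_gt[k + 1] for k in range(len(binary_probs) + 1)]

# illustrative ordinal phase labels 0..3
targets = ordinal_decompose([0, 1, 2, 3, 1], num_classes=4)
print(targets[0])  # [0, 1, 1, 1, 1]

# assumed outputs of the three binary models for one specimen
probs = ordinal_recompose([0.9, 0.6, 0.2])
print(probs)  # roughly [0.1, 0.3, 0.4, 0.2], up to float rounding
```

A decomposition like this keeps each subproblem binary, which is also where per-threshold oversampling of the minority side can be slotted in, as the abstract's combination of ordinal classification and oversampling suggests.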

    Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection

    Gaining customers' trust and treating them with empathy are critical in the financial domain, and the frequent occurrence of fraudulent activities undermines both. Hence, financial organizations and banks must take the utmost care to mitigate them. Fraudulent ATM transactions in particular are a common problem faced by banks. Fraud datasets pose critical challenges: they are highly imbalanced, the fraud pattern changes over time, etc. Owing to the rarity of fraudulent activities, fraud detection can be formulated either as a binary classification problem or as one-class classification (OCC). In this study, we applied both formulations to an ATM transaction dataset collected from India. In binary classification, we investigated the effectiveness of various oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants, and Generative Adversarial Networks (GANs). Further, we employed various machine learning techniques, viz., Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Multi-layer Perceptron (MLP). GBT outperformed the rest of the models by achieving 0.963 AUC, and DT stands second with 0.958 AUC. DT is the winner if complexity and interpretability are considered. Among all the oversampling approaches, SMOTE and its variants performed better. In OCC, IForest attained 0.959 CR, and OCSVM secured second place with 0.947 CR. Further, we incorporated explainable artificial intelligence (XAI) and causal inference (CI) into the fraud detection framework and studied it through various analyses.
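    As a point of reference for the oversampling comparison above, the simplest baseline that SMOTE and its variants improve on is random duplication of minority samples until the classes are balanced. A minimal pure-Python sketch with an illustrative toy fraud dataset (the data, labels, and function name are assumptions for demonstration):

    ```python
    import random
    from collections import Counter

    def random_oversample(X, y, seed=0):
        """Duplicate minority-class samples at random until both classes
        have equal counts -- the naive baseline for imbalanced classification."""
        rng = random.Random(seed)
        counts = Counter(y)
        minority = min(counts, key=counts.get)
        majority = max(counts, key=counts.get)
        pool = [x for x, label in zip(X, y) if label == minority]
        X_res, y_res = list(X), list(y)
        for _ in range(counts[majority] - counts[minority]):
            X_res.append(rng.choice(pool))  # duplicate an existing minority sample
            y_res.append(minority)
        return X_res, y_res

    # toy transaction amounts: 1 = fraud (rare), 0 = legitimate
    X = [[10.0], [12.0], [11.5], [9.8], [400.0]]
    y = [0, 0, 0, 0, 1]
    X_res, y_res = random_oversample(X, y)
    print(sorted(Counter(y_res).items()))  # [(0, 4), (1, 4)]
    ```

    Because this only repeats existing points, it adds no new information and risks overfitting, which is why interpolation-based methods like SMOTE and generative approaches like GANs are compared against it in studies such as the one above.
    
    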

    Deficient data classification with fuzzy learning

    This thesis first proposes a novel algorithm for handling both missing values and imbalanced data in classification problems. It then proposes algorithms for addressing the class imbalance problem in Twitter spam detection (a network security problem). Finally, the security profile of SVM against deliberate attacks is simulated and analysed.

    Analysing an Imbalanced Stroke Prediction Dataset Using Machine Learning Techniques

    A stroke is a medical condition characterized by the rupture of blood vessels within the brain, which can lead to brain damage. Various symptoms may be exhibited when the brain's supply of blood and essential nutrients is disrupted. The main objective of this study is to forecast the possibility of a brain stroke occurring at an early stage using Machine Learning (ML) and Deep Learning (DL). Timely detection of the various warning signs of a stroke can significantly reduce its severity. This paper performs a comprehensive analysis of features to enhance stroke prediction effectiveness. A reliable dataset for stroke prediction, taken from the Kaggle website, is used to gauge the effectiveness of the proposed algorithm. The dataset has a class imbalance problem: the total number of negative samples is much higher than the total number of positive samples. The results are reported on a balanced dataset created using oversampling techniques; the proposed work uses SMOTE and ADASYN to handle the imbalance problem for better evaluation metrics. The hybrid Neural Network and Random Forest (NN-RF) model trained on the dataset balanced by ADASYN oversampling achieves the highest F1-score of 75% compared to the original unbalanced dataset and other benchmark algorithms, and an accuracy of 84%. Advanced ML techniques coupled with thorough data analysis enhance stroke prediction. This study underscores the significance of data-driven methodologies, resulting in improved accuracy and comprehension of stroke risk factors. Applying these methodologies in medical settings can enhance patient care, public health outcomes, and the efficiency and effectiveness of the public health system.
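    A minimal sketch of the density-weighting idea behind ADASYN, the oversampler that produced the best-performing balanced dataset above (this is not the paper's NN-RF pipeline): minority samples surrounded by more majority-class neighbours are treated as harder to learn and receive a proportionally larger share of the synthetic budget. The toy coordinates and function name are illustrative assumptions.

    ```python
    def adasyn_weights(minority, majority, k=3):
        """ADASYN's key step: weight each minority sample by the fraction of
        majority points among its k nearest neighbours, then normalize so the
        weights say what share of the synthetic samples each point receives."""
        points = minority + majority
        ratios = []
        for x in minority:
            neighbours = sorted(
                (p for p in points if p != x),
                key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
            )[:k]
            ratios.append(sum(1 for p in neighbours if p in majority) / k)
        total = sum(ratios)
        return [r / total for r in ratios]

    # two minority points in a safe cluster, one isolated among majority points
    minority = [(1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
    majority = [(5.1, 5.0), (4.9, 5.1), (0.0, 0.0)]
    weights = adasyn_weights(minority, majority)
    print(weights)  # the isolated point (5.0, 5.0) gets about half the budget
    ```

    ADASYN then generates synthetic points per minority sample in proportion to these weights, using SMOTE-style interpolation, so the decision boundary region gets denser coverage than the safe interior of the minority class.
    
    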