384 research outputs found
A systematic review of data quality issues in knowledge discovery tasks
Data volumes are growing rapidly because organizations continuously capture large collections of data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.
Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification
With the abundance of industrial datasets, imbalanced classification has
become a common problem in several application domains. Oversampling is an
effective method to solve imbalanced classification. One of the main challenges
of the existing oversampling methods is to accurately label the new synthetic
samples. Inaccurate labels of the synthetic samples would distort the
distribution of the dataset and possibly worsen the classification performance.
This paper introduces the idea of weakly supervised learning to handle the
inaccurate labeling of synthetic samples caused by traditional oversampling
methods. Graph semi-supervised SMOTE is developed to improve the credibility of
the synthetic samples' labels. In addition, we propose cost-sensitive
neighborhood components analysis for high dimensional datasets and bootstrap
based ensemble framework for highly imbalanced datasets. The proposed method
achieves good classification performance on 8 synthetic datasets and 3
real-world datasets, especially for high-imbalance, high-dimensionality
problems. Its average performance and robustness are better than those of the
benchmark methods.
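The graph semi-supervised SMOTE described above is not spelled out in the abstract; as a minimal sketch of the classic SMOTE interpolation it builds on (function name and parameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def smote_sample(minority, k=3, n_new=4, rng=None):
    """Classic SMOTE: each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from x to every minority point
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        gap = rng.random()                   # interpolation factor in [0, 1)
        out.append(x + gap * (minority[j] - x))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(minority, k=2, n_new=3, rng=0)
print(synthetic.shape)  # (3, 2)
```

The paper's contribution is precisely that labels assigned this way can be inaccurate; its graph semi-supervised step would re-label such synthetic points by propagating labels over a neighbourhood graph.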
Automating the decision making process of Todd’s age estimation method from the pubic symphysis with explainable machine learning
Age estimation is a fundamental task in forensic anthropology for both the living and the
dead. The procedure consists of analyzing properties such as appearance, ossification patterns,
and morphology in different skeletonized remains. The pubic symphysis is extensively
used to assess adults’ age-at-death due to its reliability. Nevertheless, most
methods currently used for skeleton-based age estimation are carried out manually, even
though their automation has the potential to lead to a considerable improvement in terms
of economic resources, effectiveness, and execution time. In particular, explainable
machine learning emerges as a promising means of addressing this challenge by engaging
forensic experts to refine and audit the extracted knowledge and discover unknown patterns
hidden in the complex and uncertain available data. In this contribution we address
the automation of the decision making process of Todd’s pioneering age assessment
method to assist the forensic practitioner in its application. To do so, we make use of the
pubic bone data base available at the Physical Anthropology lab of the University of
Granada. The machine learning task is significantly complex as it becomes an imbalanced
ordinal classification problem with a small sample size and a high dimension. We tackle it
with the combination of an ordinal classification method and oversampling techniques
through an extensive experimental setup. Two forensic anthropologists refine and validate
the derived rule base according to their own expertise and the knowledge available in the
area. The resulting automatic system, finally composed of 34 interpretable rules, outperforms
the state-of-the-art accuracy. In addition, and more importantly, it allows the forensic
experts to uncover novel and interesting insights about how Todd’s method works, in
particular, and the guidelines to estimate age-at-death from pubic symphysis characteristics,
generally.

Funding: Ministry of Science and Innovation, Spain (MICINN); Spanish Government, Agencia Estatal de Investigacion (AEI) PID2021-122916NB-I00; Spanish Government PGC2018-101216-B-I00; Junta de Andalucia; University of Granada P18-FR-4262, B-TIC-456-UGR20; European Commission; Universidad de Granada/CBU
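The abstract describes combining an ordinal classification method with oversampling but does not name the decomposition used. A minimal sketch of the standard Frank-and-Hall-style ordinal decomposition, with a deliberately trivial 1-D threshold rule standing in for a real base classifier (all names and data here are illustrative):

```python
import numpy as np

class OrdinalDecomposer:
    """An ordinal problem with classes 0..K-1 becomes K-1 binary
    problems of the form 'is y > k?'. Each binary problem here is
    solved by a midpoint threshold rule to keep the sketch
    self-contained; any probabilistic classifier could be plugged in."""
    def fit(self, x, y):
        self.K = int(y.max()) + 1
        self.thresholds = []
        for k in range(self.K - 1):
            pos = x[y > k]
            neg = x[y <= k]
            # midpoint between the two class means as the decision threshold
            self.thresholds.append((pos.mean() + neg.mean()) / 2)
        return self

    def predict(self, x):
        # the number of 'y > k' votes recovers the ordinal label
        return sum((x > t).astype(int) for t in self.thresholds)

x = np.array([0.1, 0.2, 0.9, 1.1, 2.0, 2.2])
y = np.array([0, 0, 1, 1, 2, 2])
model = OrdinalDecomposer().fit(x, y)
print(model.predict(x))  # [0 0 1 1 2 2]
```

The decomposition preserves class ordering, which matters for age phases: predicting phase 3 for a phase-4 bone is a smaller error than predicting phase 1, and the binary subproblems reflect that.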
Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection
Gaining the trust of customers and providing them empathy are very critical
in the financial domain. Frequent occurrence of fraudulent activities affects
these two factors. Hence, financial organizations and banks must take utmost
care to mitigate them. Among them, ATM fraudulent transaction is a common
problem faced by banks. The following critical challenges are involved in
fraud datasets: the dataset is highly imbalanced, the fraud pattern changes
over time, and so on. Owing to the rarity of fraudulent activities, fraud
detection can be formulated as either a binary classification problem or a
one-class classification (OCC) problem. In this study, we applied both
formulations to an ATM transactions dataset collected from India. In binary
classification, we
investigated the effectiveness of various oversampling techniques, such as the
Synthetic Minority Oversampling Technique (SMOTE) and its variants, and
Generative Adversarial Networks (GANs). Further, we employed
various machine learning techniques viz., Naive Bayes (NB), Logistic Regression
(LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF),
Gradient Boosting Tree (GBT), Multi-layer perceptron (MLP). GBT outperformed
the rest of the models by achieving 0.963 AUC, and DT stands second with 0.958
AUC. DT is the winner if the complexity and interpretability aspects are
considered. Among all the oversampling approaches, SMOTE and its variants were
observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM
secured second place with 0.947 CR. Further, we incorporated explainable
artificial intelligence (XAI) and causal inference (CI) in the fraud detection
framework and studied it through various analyses.

Comment: 34 pages; 21 figures; 8 tables
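The OCC formulation above trains only on the majority (legitimate) class and flags deviations. The study uses IForest and OCSVM; as a much simpler stand-in that illustrates the same idea, here is a centroid-distance one-class classifier (entirely an illustrative sketch, not the paper's method):

```python
import numpy as np

class CentroidOCC:
    """Minimal one-class classifier: fit on normal transactions only,
    then flag any point whose distance from the centroid exceeds a
    quantile of the training distances. A toy stand-in for
    IForest / OCSVM, which learn far richer decision boundaries."""
    def fit(self, x, quantile=0.95):
        self.center = x.mean(axis=0)
        d = np.linalg.norm(x - self.center, axis=1)
        self.radius = np.quantile(d, quantile)
        return self

    def predict(self, x):
        d = np.linalg.norm(x - self.center, axis=1)
        return (d > self.radius).astype(int)   # 1 = anomaly / suspected fraud

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))       # legitimate transactions
outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])  # far-off points
occ = CentroidOCC().fit(normal)
print(occ.predict(outliers))  # [1 1]
```

Because no fraud labels are needed at training time, this formulation sidesteps the extreme class imbalance that the binary formulation must handle with oversampling.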
Deficient data classification with fuzzy learning
This thesis first proposes a novel algorithm for handling both missing values and imbalanced data in classification problems. Then, algorithms for addressing the class imbalance problem in Twitter spam detection (a network security problem) are proposed. Finally, the security profile of SVM against deliberate attacks is simulated and analysed.
Analysing an Imbalanced Stroke Prediction Dataset Using Machine Learning Techniques
A stroke is a medical condition characterized by the rupture of blood vessels within the brain, which can lead to brain damage. Various symptoms may be exhibited when the brain's supply of blood and essential nutrients is disrupted. The main objective of this study is to forecast the possibility of a brain stroke occurring at an early stage using Machine Learning (ML) and Deep Learning (DL). Timely detection of the various warning signs of a stroke can significantly reduce its severity. This paper performs a comprehensive analysis of features to enhance stroke prediction effectiveness. A reliable dataset for stroke prediction is taken from the Kaggle website to gauge the effectiveness of the proposed algorithm. The dataset has a class imbalance problem, meaning the total number of negative samples is far higher than the total number of positive samples. The results are reported on a balanced dataset created using oversampling techniques: the proposed work uses SMOTE and ADASYN to handle the imbalance problem and obtain better evaluation metrics. Additionally, the hybrid Neural Network and Random Forest (NN-RF) model, trained on the dataset balanced by ADASYN oversampling, achieves the highest F1-score of 75% compared to the original unbalanced dataset and other benchmark algorithms, and reaches an accuracy of 84%. Advanced ML techniques coupled with thorough data analysis enhance stroke prediction. This study underscores the significance of data-driven methodologies, resulting in improved accuracy and comprehension of stroke risk factors. Applying these methodologies to medical fields can enhance patient care and public health outcomes. By integrating these findings, the efficiency and effectiveness of the public health system can be enhanced.
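The study above reports F1-score rather than accuracy as its headline metric. A minimal sketch of why F1 on the minority class is the right choice for an imbalanced dataset like this one (the toy labels below are illustrative, not the Kaggle data):

```python
def f1_score(y_true, y_pred):
    """Precision, recall, and F1 for the positive (minority) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 1 stroke case in 10 samples: a classifier that always predicts "no stroke"
# scores 90% accuracy yet is useless -- F1 on the minority class exposes this.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(f1_score(y_true, [0] * 10))       # 0.0
print(f1_score(y_true, [0] * 9 + [1]))  # 1.0
```

This is why the paper pairs oversampling (SMOTE/ADASYN) with F1: balancing the classes lets the model learn the minority pattern, and F1 measures whether it actually did.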