
    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    An enhanced resampling technique for imbalanced data sets

    A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampling, oversampling, and combinations of both have been widely used. However, undersampling and oversampling suffer from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, a Fuzzy Distance-based Undersampling (FDUS) technique is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances of the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), known as FDUS+SMOTE, executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into a distance-based undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combinations of SMOTE with Tomek Links and of SMOTE with Edited Nearest Neighbour on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances.
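    The FDUS step itself is specific to this work, but the overall undersample-then-oversample pipeline it describes can be illustrated with standard components. A minimal sketch using scikit-learn and imbalanced-learn, with a plain random undersampler standing in for FDUS (the fuzzy, entropy-based step is not reproduced here) and the synthetic data set being an assumption for demonstration:

```python
# Sketch: undersample the majority class, then apply SMOTE, then classify.
# RandomUnderSampler stands in for the paper's FDUS step (assumption).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic imbalanced binary data (roughly 9:1 majority:minority).
X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),  # FDUS stand-in
    ("over", SMOTE(random_state=0)),                                       # oversample until balanced
    ("clf", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print("F-measure:", f1_score(y_te, y_pred))
print("G-mean   :", geometric_mean_score(y_te, y_pred))
```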

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues in all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Useless models will be obtained when they are built over incorrect or incomplete data; as a consequence, the quality of decisions made with these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. This paper presents a survey of the most popular pre-processing steps required in environmental data analysis, together with a proposal to systematize them. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to a non-expert user, who, after reading them, can decide which technique is the most suitable for his/her problem.
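    A minimal sketch of the kind of pre-processing chain the survey discusses (handling missing values, dropping uninformative columns, putting features on a common scale), using scikit-learn; the toy data and thresholds are illustrative assumptions, not steps taken from the paper:

```python
# Sketch of common pre-processing steps: impute missing values,
# remove constant features, standardise scales.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0, 3.0],
              [np.nan, 210.0, 3.0],     # a missing value
              [2.0, 190.0, 3.0]])       # third column is constant

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("drop_constant", VarianceThreshold(0.0)),      # remove zero-variance columns
    ("scale", StandardScaler()),                    # common scale for all features
])
X_clean = prep.fit_transform(X)
print(X_clean)
```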

    Data selection based on decision tree for SVM classification on large data sets

    Support Vector Machine (SVM) has important properties such as a strong mathematical background and better generalization capability with respect to other classification methods. On the other hand, the major drawback of SVM occurs in its training phase, which is computationally expensive and highly dependent on the size of the input data set. In this study, a new algorithm to speed up the training time of SVM is presented; this method selects a small and representative subset of the data to improve the training time of SVM. The novel method uses an induction tree to reduce the training data set for SVM, producing a very fast and highly accurate algorithm. According to the results, the proposed algorithm produces results with accuracy similar to, and faster than, current SVM implementations. Proyecto UAEM 3771/2014/C.
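    The general idea of tree-guided data selection can be sketched with scikit-learn. The selection rule below (keep the instances that fall in impure leaves, i.e. those likely to lie near class boundaries) is an illustrative simplification under assumed parameters, not the paper's exact algorithm:

```python
# Sketch: use a decision tree to pick a reduced training set for an SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)
leaves = tree.apply(X)                      # leaf index for every instance

keep = np.zeros(len(y), dtype=bool)
for leaf in np.unique(leaves):
    idx = np.where(leaves == leaf)[0]
    if len(np.unique(y[idx])) > 1:          # impure leaf: likely near a class boundary
        keep[idx] = True
if not keep.any():                          # fallback if every leaf is pure
    keep[:] = True

print("selected", keep.sum(), "of", len(y), "instances")
svm = SVC(kernel="rbf").fit(X[keep], y[keep])   # train the SVM on the reduced set
```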

    Uncertainty Analysis for the Classification of Multispectral Satellite Images Using SVMs and SOMs

    Classification of multispectral remotely sensed data with textural features is investigated, with a special focus on uncertainty analysis in the produced land-cover maps. Much effort has already been directed toward research on satisfactory accuracy-assessment techniques in image classification, but a common approach is not yet universally adopted. We look at the relationship between hard accuracy and the uncertainty of the produced answers, introducing two measures based on maximum probability and a quadratic entropy. Their impact differs depending on the type of classifier. In this paper, we deal with two different classification strategies, based on support vector machines (SVMs) and Kohonen's self-organizing maps (SOMs), both suitably modified to give soft answers. Once the multiclass probability answer vector is available for each pixel in the image, we study the behavior of the overall classification accuracy as a function of the uncertainty associated with each vector, given a hard-labeled test set. The experimental results show that the SVM with one-versus-one architecture and a linear kernel clearly outperforms the other supervised approaches in terms of overall accuracy. On the other hand, our analysis reveals that the proposed SOM-based classifier, despite its unsupervised learning procedure, is able to provide soft answers which are the best candidates for fusion with supervised results.
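    The two uncertainty measures named in the abstract can be sketched directly from a per-pixel class-probability vector; the exact normalisation used in the paper is an assumption here:

```python
# Sketch of per-pixel uncertainty measures computed from a soft answer vector.
import numpy as np

def max_probability_uncertainty(p):
    """1 - max(p): 0 for a confident answer, close to 1 for a flat one."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p.max(axis=-1)

def quadratic_entropy(p):
    """Quadratic (Gini-type) entropy 1 - sum(p_i^2), normalised to [0, 1]."""
    p = np.asarray(p, dtype=float)
    k = p.shape[-1]
    return (1.0 - np.sum(p**2, axis=-1)) * k / (k - 1)

probs = np.array([[0.90, 0.05, 0.05],    # confident pixel
                  [0.40, 0.35, 0.25]])   # uncertain pixel
print(max_probability_uncertainty(probs))
print(quadratic_entropy(probs))
```

    Thresholding either measure, given a hard-labeled test set, is what lets overall accuracy be studied as a function of per-pixel uncertainty, as the abstract describes.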

    A real-time data mining technique applied for critical ECG rhythm on handheld device

    Sudden cardiac arrest is often caused by ventricular arrhythmias, and these episodes can lead to death for patients with chronic heart disease. Hence, detection of such arrhythmias is crucial in mobile ECG monitoring. In this research, a systematic study is carried out to investigate the possible limitations that are preventing the realisation of a real-time ECG arrhythmia data-mining algorithm suitable for application on mobile devices. Based on the findings, a computationally lightweight algorithm is devised and tested. Ventricular tachycardia (VT) is the most common type of ventricular arrhythmia and is also the deadliest. A VT episode is due to a disorder of the regular contractions of the heart: it occurs when the ventricles generate a rapid heartbeat which disrupts the regular physiological cycle. The normal sinus rhythm (NSR) of a regular heartbeat has a signature PQRST waveform in a regular pattern, whereas a VT signal is characterised by short R-R intervals, a widened QRS duration and the absence of P-waves. Each of these ECG rhythms has a unique waveform signature that can be exploited as features for an automated ECG analysis application. To extract these known waveform features, a time-domain analysis is proposed. Cross-correlation allows the computation of a coefficient that quantifies the similarity between two time series; hence, by cross-correlating known ECG waveform templates with an unknown ECG signal, the coefficient can indicate their similarity. In previously published work, a preliminary study introduced the cross-correlation coefficient wave (CCW) technique for feature extraction. The outcome of that work presents CCW as a promising feature for differentiating between NSR, VT and Vfib signals; moreover, the cross-correlation computation does not require high computational overhead. Next, an automated detection algorithm requires a classification mechanism to make sense of the extracted features. In a further published study, a fuzzy-set k-NN classifier was introduced for the classification of CCW features extracted from ECG signal segments, using a training set of size 180. The outcome of that study indicates that the computationally lightweight fuzzy k-NN classifier can reliably distinguish between NSR and VT signals, but its detection rate for Vfib signals is low. Hence, a modified algorithm known as the fuzzy hybrid classifier is proposed: by implementing an expert-knowledge-based fuzzy inference system for ECG signal classification, the Vfib detection rate was improved. In the comparison, the hybrid fuzzy classifier achieved a 91.1% correct rate, 100% sensitivity and 100% specificity, outperforming the compared classifiers. The proposed detection and classification algorithm achieves high accuracy in analysing ECG signal features of NSR, VT and Vfib nature. Moreover, the proposed classifier has been successfully implemented on a smart mobile device and is able to perform data mining of the ECG signal with satisfactory results.
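    The core cross-correlation feature idea can be sketched with NumPy: correlate a known waveform template against an unknown segment and use the peak coefficient as a similarity feature. The synthetic templates, sampling rate and signals below are toy assumptions; the paper's CCW feature and fuzzy classifiers are not reproduced exactly:

```python
# Sketch: peak normalised cross-correlation as a template-similarity feature.
import numpy as np

def peak_normalized_xcorr(segment, template):
    """Peak of the normalised cross-correlation between segment and template."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-12)
    tpl = (template - template.mean()) / (template.std() + 1e-12)
    corr = np.correlate(seg, tpl, mode="valid") / len(tpl)
    return corr.max()

fs = 250                                            # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
nsr_template = np.sin(2 * np.pi * 1.2 * t[:fs])     # toy "normal rhythm" template
vt_template = np.sin(2 * np.pi * 3.0 * t[:fs])      # toy "fast rhythm" template

unknown = np.sin(2 * np.pi * 3.0 * t) + 0.1 * np.random.randn(len(t))

features = [peak_normalized_xcorr(unknown, nsr_template),
            peak_normalized_xcorr(unknown, vt_template)]
print("similarity to NSR/VT templates:", features)  # higher means more similar
```

    In the work described above, features of this kind are then fed to a fuzzy k-NN or fuzzy inference classifier rather than being thresholded directly.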

    Twelve numerical, symbolic and hybrid supervised classification methods

    Supervised classification has already been the subject of numerous studies in the fields of Statistics, Pattern Recognition and Artificial Intelligence under various appellations, including discriminant analysis, discrimination and concept learning, and many practical applications relating to this field have been developed. New methods have appeared in recent years, due to developments concerning Neural Networks and Machine Learning. These "hybrid" approaches share one common factor in that they combine symbolic and numerical aspects: the former are characterized by the representation of knowledge, the latter by the introduction of frequencies and probabilistic criteria. In the present study, we present a certain number of hybrid methods, conceived (or improved) by members of the SYMENU research group. These methods issue mainly from Machine Learning and from research on Classification Trees done in Statistics, and they may also be qualified as "rule-based". They are compared with other, more classical approaches. This comparison is based on a detailed description of each of the twelve methods envisaged, and on the results obtained on the "Waveform Recognition Problem" proposed by Breiman et al., which is difficult for rule-based approaches.