
    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    An enhanced resampling technique for imbalanced data sets

    A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampling, oversampling, and combinations of both have been widely used. However, undersampling and oversampling suffer from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, a Fuzzy Distance-based Undersampling (FDUS) technique is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances of the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), known as FDUS+SMOTE, executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into a distance-based undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combinations of SMOTE with Tomek Links and of SMOTE with Edited Nearest Neighbour on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances.
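    The FDUS step itself is specific to this work, but the overall undersample-then-oversample pipeline it describes can be illustrated with standard components. A minimal sketch using scikit-learn and imbalanced-learn, with a plain random undersampler standing in for FDUS (the fuzzy, entropy-based step is not reproduced here) and the synthetic data set being an assumption for demonstration:

```python
# Sketch: undersample the majority class, then apply SMOTE, then classify.
# RandomUnderSampler stands in for the paper's FDUS step (assumption).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic imbalanced binary data (roughly 9:1 majority:minority).
X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),  # FDUS stand-in
    ("over", SMOTE(random_state=0)),                                       # oversample until balanced
    ("clf", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print("F-measure:", f1_score(y_te, y_pred))
print("G-mean   :", geometric_mean_score(y_te, y_pred))
```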

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues in all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Useless models will be obtained when they are built over incorrect or incomplete data; as a consequence, the quality of decisions made with these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. This paper presents a survey of the most popular pre-processing steps required in environmental data analysis, together with a proposal to systematize them. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to a non-expert user, who, after reading them, can decide which technique is the most suitable for his/her problem.
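    A minimal sketch of the kind of pre-processing chain the survey discusses (handling missing values, dropping uninformative columns, putting features on a common scale), using scikit-learn; the toy data and thresholds are illustrative assumptions, not steps taken from the paper:

```python
# Sketch of common pre-processing steps: impute missing values,
# remove constant features, standardise scales.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0, 3.0],
              [np.nan, 210.0, 3.0],     # a missing value
              [2.0, 190.0, 3.0]])       # third column is constant

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("drop_constant", VarianceThreshold(0.0)),      # remove zero-variance columns
    ("scale", StandardScaler()),                    # common scale for all features
])
X_clean = prep.fit_transform(X)
print(X_clean)
```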

    Data selection based on decision tree for SVM classification on large data sets

    Support Vector Machine (SVM) has important properties such as a strong mathematical background and better generalization capability with respect to other classification methods. On the other hand, the major drawback of SVM occurs in its training phase, which is computationally expensive and highly dependent on the size of the input data set. In this study, a new algorithm to speed up the training time of SVM is presented; this method selects a small and representative subset of the data to improve the training time of SVM. The novel method uses an induction tree to reduce the training data set for SVM, producing a very fast and highly accurate algorithm. According to the results, the proposed algorithm produces results with accuracy similar to, and faster than, current SVM implementations. Proyecto UAEM 3771/2014/C.
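    The general idea of tree-guided data selection can be sketched with scikit-learn. The selection rule below (keep the instances that fall in impure leaves, i.e. those likely to lie near class boundaries) is an illustrative simplification under assumed parameters, not the paper's exact algorithm:

```python
# Sketch: use a decision tree to pick a reduced training set for an SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)
leaves = tree.apply(X)                      # leaf index for every instance

keep = np.zeros(len(y), dtype=bool)
for leaf in np.unique(leaves):
    idx = np.where(leaves == leaf)[0]
    if len(np.unique(y[idx])) > 1:          # impure leaf: likely near a class boundary
        keep[idx] = True
if not keep.any():                          # fallback if every leaf is pure
    keep[:] = True

print("selected", keep.sum(), "of", len(y), "instances")
svm = SVC(kernel="rbf").fit(X[keep], y[keep])   # train the SVM on the reduced set
```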

    Uncertainty Analysis for the Classification of Multispectral Satellite Images Using SVMs and SOMs

    Classification of multispectral remotely sensed data with textural features is investigated, with a special focus on uncertainty analysis in the produced land-cover maps. Much effort has already been directed toward research on satisfactory accuracy-assessment techniques in image classification, but a common approach is not yet universally adopted. We look at the relationship between hard accuracy and the uncertainty of the produced answers, introducing two measures based on maximum probability and a quadratic entropy. Their impact differs depending on the type of classifier. In this paper, we deal with two different classification strategies, based on support vector machines (SVMs) and Kohonen's self-organizing maps (SOMs), both suitably modified to give soft answers. Once the multiclass probability answer vector is available for each pixel in the image, we study the behavior of the overall classification accuracy as a function of the uncertainty associated with each vector, given a hard-labeled test set. The experimental results show that the SVM with one-versus-one architecture and a linear kernel clearly outperforms the other supervised approaches in terms of overall accuracy. On the other hand, our analysis reveals that the proposed SOM-based classifier, despite its unsupervised learning procedure, is able to provide soft answers which are the best candidates for fusion with supervised results.
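    The two uncertainty measures named in the abstract can be sketched directly from a per-pixel class-probability vector; the exact normalisation used in the paper is an assumption here:

```python
# Sketch of per-pixel uncertainty measures computed from a soft answer vector.
import numpy as np

def max_probability_uncertainty(p):
    """1 - max(p): 0 for a confident answer, close to 1 for a flat one."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p.max(axis=-1)

def quadratic_entropy(p):
    """Quadratic (Gini-type) entropy 1 - sum(p_i^2), normalised to [0, 1]."""
    p = np.asarray(p, dtype=float)
    k = p.shape[-1]
    return (1.0 - np.sum(p**2, axis=-1)) * k / (k - 1)

probs = np.array([[0.90, 0.05, 0.05],    # confident pixel
                  [0.40, 0.35, 0.25]])   # uncertain pixel
print(max_probability_uncertainty(probs))
print(quadratic_entropy(probs))
```

    Thresholding either measure, given a hard-labeled test set, is what lets overall accuracy be studied as a function of per-pixel uncertainty, as the abstract describes.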

    A real-time data mining technique applied for critical ECG rhythm on handheld device

    Sudden cardiac arrest is often caused by ventricular arrhythmias, and these episodes can lead to death for patients with chronic heart disease. Hence, detection of such arrhythmias is crucial in mobile ECG monitoring. In this research, a systematic study is carried out to investigate the possible limitations that are preventing the realisation of a real-time ECG arrhythmia data-mining algorithm suitable for application on mobile devices. Based on the findings, a computationally lightweight algorithm is devised and tested. Ventricular tachycardia (VT) is the most common type of ventricular arrhythmia and is also the deadliest. A VT episode is due to a disorder of the regular contractions of the heart: it occurs when the ventricles generate a rapid heartbeat which disrupts the regular physiological cycle. The normal sinus rhythm (NSR) of a regular heartbeat has a signature PQRST waveform in a regular pattern, whereas a VT signal is characterised by short R-R intervals, a widened QRS duration and the absence of P-waves. Each of these ECG rhythms has a unique waveform signature that can be exploited as features for an automated ECG analysis application. To extract these known waveform features, a time-domain analysis is proposed. Cross-correlation allows the computation of a coefficient that quantifies the similarity between two time series; hence, by cross-correlating known ECG waveform templates with an unknown ECG signal, the coefficient can indicate their similarity. In previously published work, a preliminary study introduced the cross-correlation coefficient wave (CCW) technique for feature extraction. The outcome of that work presents CCW as a promising feature for differentiating between NSR, VT and Vfib signals; moreover, the cross-correlation computation does not require high computational overhead. Next, an automated detection algorithm requires a classification mechanism to make sense of the extracted features. In a further published study, a fuzzy-set k-NN classifier was introduced for the classification of CCW features extracted from ECG signal segments, using a training set of size 180. The outcome of that study indicates that the computationally lightweight fuzzy k-NN classifier can reliably distinguish between NSR and VT signals, but its detection rate for Vfib signals is low. Hence, a modified algorithm known as the fuzzy hybrid classifier is proposed: by implementing an expert-knowledge-based fuzzy inference system for ECG signal classification, the Vfib detection rate was improved. In the comparison, the hybrid fuzzy classifier achieved a 91.1% correct rate, 100% sensitivity and 100% specificity, outperforming the compared classifiers. The proposed detection and classification algorithm achieves high accuracy in analysing ECG signal features of NSR, VT and Vfib nature. Moreover, the proposed classifier has been successfully implemented on a smart mobile device and is able to perform data mining of the ECG signal with satisfactory results.
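    The core cross-correlation feature idea can be sketched with NumPy: correlate a known waveform template against an unknown segment and use the peak coefficient as a similarity feature. The synthetic templates, sampling rate and signals below are toy assumptions; the paper's CCW feature and fuzzy classifiers are not reproduced exactly:

```python
# Sketch: peak normalised cross-correlation as a template-similarity feature.
import numpy as np

def peak_normalized_xcorr(segment, template):
    """Peak of the normalised cross-correlation between segment and template."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-12)
    tpl = (template - template.mean()) / (template.std() + 1e-12)
    corr = np.correlate(seg, tpl, mode="valid") / len(tpl)
    return corr.max()

fs = 250                                            # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
nsr_template = np.sin(2 * np.pi * 1.2 * t[:fs])     # toy "normal rhythm" template
vt_template = np.sin(2 * np.pi * 3.0 * t[:fs])      # toy "fast rhythm" template

unknown = np.sin(2 * np.pi * 3.0 * t) + 0.1 * np.random.randn(len(t))

features = [peak_normalized_xcorr(unknown, nsr_template),
            peak_normalized_xcorr(unknown, vt_template)]
print("similarity to NSR/VT templates:", features)  # higher means more similar
```

    In the work described above, features of this kind are then fed to a fuzzy k-NN or fuzzy inference classifier rather than being thresholded directly.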

    Twelve numerical, symbolic and hybrid supervised classification methods

    Supervised classification has already been the subject of numerous studies in the fields of Statistics, Pattern Recognition and Artificial Intelligence under various appellations, including discriminant analysis, discrimination and concept learning, and many practical applications relating to this field have been developed. New methods have appeared in recent years, due to developments concerning Neural Networks and Machine Learning. These "hybrid" approaches share one common factor in that they combine symbolic and numerical aspects: the former are characterized by the representation of knowledge, the latter by the introduction of frequencies and probabilistic criteria. In the present study, we present a certain number of hybrid methods, conceived (or improved) by members of the SYMENU research group. These methods issue mainly from Machine Learning and from research on Classification Trees done in Statistics, and they may also be qualified as "rule-based". They are compared with other, more classical approaches. This comparison is based on a detailed description of each of the twelve methods envisaged, and on the results obtained on the "Waveform Recognition Problem" proposed by Breiman et al., which is difficult for rule-based approaches.