2,278 research outputs found

Relative Unsupervised Discretization for Association Rule Mining

Author: G. Widmer
R. Agrawal
T. Pavlidis
W. Dillon
Publication venue: 'Springer Science and Business Media LLC'

Discretization of Continuous Attributes

Author: Muhlenbach Fabrice
Rakotomalala Ricco
Publication venue: Idea Group Reference
Publication date: 01/04/2005

7 pages. In the data mining field, many learning methods, such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001), can handle only discrete attributes. Therefore, before the machine learning process, each continuous attribute must be re-encoded as a discrete attribute made up of a set of intervals. For example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 or more (of age). This process, known as discretization, is an essential data preprocessing task, not only because some learning methods do not handle continuous attributes, but also for other important reasons: data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan & Dash, 2002); computation is faster with a reduced amount of data, particularly when attributes for which no relevant cut can be found are removed from the representation space of the learning problem (Mittal & Cheong, 2002); and discretization can capture non-linear relations, e.g., infants and elderly people are more sensitive to illness
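The age example in this abstract can be sketched in a few lines of Python. This is only a minimal illustration of interval-based re-encoding, assuming half-open intervals; the function name `discretize` and the label strings are hypothetical and do not come from the paper.

```python
from bisect import bisect_right

def discretize(value, cuts, labels):
    """Map a continuous value to the label of the interval it falls in.

    `cuts` must be sorted in increasing order. Values below cuts[0] get
    labels[0]; values at or above cuts[-1] get labels[-1].
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cuts, value)]

# One cut point at 18 yields the two intervals named in the abstract:
# "less than 18" and "18 or more".
AGE_CUTS = [18]
AGE_LABELS = ["minor", "of age"]

ages = [3, 17, 18, 42.5]
encoded = [discretize(a, AGE_CUTS, AGE_LABELS) for a in ages]
print(encoded)
```

Adding more cut points to `AGE_CUTS` (with one extra label each) extends the same idea to any number of intervals.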

HAL-UJM

HAL

On the role of pre and post-processing in environmental data mining

Author: Athanasiadis Ioannis
Comas Joaquim
Gibert Karina
Holmes Geoffrey
Izquierdo Joaquin
Sanchez-Marre Miquel
Publication venue: International Environmental Modelling and Software Society
Publication date: 01/01/2008

The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions based on KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed

Research Commons@Waikato

Mining Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns: Application to Home Health Telecare

Author: Duchene Florence
Garbay Catherine
Rialle Vincent
Publication date: 25/11/2004

In recent years, time-series mining has become a challenging issue for researchers. An important application lies in monitoring, which often requires analyzing large sets of time-series to learn usual patterns. Any deviation from this learned profile is then considered an unexpected situation. Moreover, complex applications may involve the temporal study of several heterogeneous parameters. In this paper, we propose a method for mining heterogeneous multivariate time-series to learn meaningful patterns. The proposed approach allows for mixed time-series (containing both pattern and non-pattern data) as well as for imprecise matches, outliers, and stretching and global translation of pattern instances in time. We present the early results of our approach in the context of monitoring the health status of a person at home. The purpose is to build a behavioral profile of a person by analyzing the time variations of several quantitative or qualitative parameters recorded by a set of sensors installed in the home

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

Improvement of the Accuracy of Prediction Using Unsupervised Discretization Method: Educational Data Set Case Study

Author: Dejan Rančić
Gabrijela Dimić
Ivan Milentijević
Petar Spalević
Publication venue: 'Mechanical Engineering Faculty in Slavonski Brod'
Publication date: 01/01/2018

This paper presents a comparison of the efficacy of unsupervised and supervised discretization methods for educational data from a blended learning environment. A Naïve Bayes classifier was trained for each discretized data set, and a comparative analysis of the prediction models was conducted. The research goal was to transform numeric features into maximally independent discrete values with minimum loss of information and reduced classification error. The proposed unsupervised discretization method was based on the histogram distribution and the implementation of an oversampling technique. The main contribution of this research is the improvement of prediction accuracy using the unsupervised discretization method, which reduces the effect of ignoring the class feature for the educational data set
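As a point of reference for what "unsupervised" means here, the following is a minimal sketch of generic equal-width (histogram-style) binning, which never looks at class labels. It is not the authors' proposed method, whose histogram and oversampling details are not given in the abstract; the function names and sample scores are hypothetical.

```python
def equal_width_cuts(values, n_bins):
    """Return the n_bins - 1 interior cut points of an equal-width histogram."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def assign_bins(values, cuts):
    """Replace each value by the index of its bin (count of cuts it reaches)."""
    return [sum(v >= c for c in cuts) for v in values]

# Hypothetical numeric feature, e.g. a score on a 0-100 scale.
scores = [0.0, 12.5, 40.0, 55.0, 99.0, 100.0]
cuts = equal_width_cuts(scores, 4)
bins = assign_bins(scores, cuts)
print(cuts, bins)
```

A supervised method would instead choose the cut points using the class labels, which is the contrast the abstract draws.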

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

PRESISTANT: Learning based assistant for data pre-processing

Author: Abelló Alberto
Aluja-Banet Tomàs
Bilalli Besim
Wrembel Robert
Publication date: 02/03/2018

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Multi-interval discretization of continuous attributes for label ranking

Author: Azevedo Paulo
De Sá Cláudio Rebelo
Fürnkranz Johannes
Higuchi Tomoyuki
Hüllermeier Eyke
Jorge Alípio Mário
Knobbe Arno
Soares Carlos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

Label Ranking (LR) problems, such as predicting rankings of financial analysts, are becoming increasingly important in data mining. While there has been a significant amount of work on the development of learning algorithms for LR in recent years, pre-processing methods for LR are still very scarce. However, some methods, like Naive Bayes for LR and APRIORI-LR, cannot deal with real-valued data directly. As a make-shift solution, one could consider conventional discretization methods used in classification, by simply treating each unique ranking as a separate class. In this paper, we show that such an approach has several disadvantages. As an alternative, we propose an adaptation of an existing method, MDLP, specifically for LR problems. We illustrate the advantages of the new method using synthetic data. Additionally, we present results obtained on several benchmark datasets. The results clearly indicate that the discretization is performing as expected and in some cases improves the results of the learning algorithms. © 2013 Springer-Verlag. This work was partially supported by Project Best-Case, which is co-financed by the North Portugal Regional Operational Programme (ON.2 - O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF)
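For readers unfamiliar with MDLP, the sketch below shows the entropy-based step that such supervised discretizers build on in ordinary classification: picking the single cut point that minimizes the weighted class entropy of the two resulting intervals. It is a simplification under stated assumptions. It omits the recursive splitting and the minimum description length stopping criterion, and it does not implement the label-ranking adaptation proposed in the paper; the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, weighted_entropy) for the best binary split, or None.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or weighted < best[1]:
            best = (cut, weighted)
    return best

# Toy data: class "a" for small values, class "b" for large ones.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
cut, weighted = best_cut(values, labels)
print(cut, weighted)
```

Treating each unique ranking as its own class would let this code run on label-ranking data, which is the make-shift approach the abstract argues against.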

Universidade do Minho: RepositoriUM

Crossref

University of Twente Research Information

Enhancing operational performance of AHUs through an advanced fault detection and diagnosis process based on temporal association and decision rules

Author: Capozzoli A.
Mazzarelli D. M.
Piscitelli M. S.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

The pervasive monitoring of HVAC systems through Building Energy Management Systems (BEMSs) is enabling the full exploitation of data-driven methodologies for advanced energy management strategies. In this context, Automated Fault Detection and Diagnosis (AFDD) based on collected operational data of Air Handling Units (AHUs) has proved particularly effective in preventing anomalous running modes, which can lead to significant energy waste over time and to discomfort conditions in the built environment. The present work proposes a novel methodology for performing AFDD, based on both unsupervised and supervised data-driven methods tailored to the operation of an AHU during transient and non-transient periods. The whole process is developed and tested on a sample of real data gathered from monitoring campaigns on two identical AHUs in the framework of the Research Project ASHRAE RP-1312. During the start-up period of operation, the methodology exploits the Temporal Association Rules Mining (TARM) algorithm for early detection of faults, while during the non-transient period a number of classification models are developed to identify deviations from normal operation. The proposed methodology, conceived for quasi real-time implementation, proved capable of robustly and promptly identifying the presence of typical faults in AHUs

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)