Propositional Satisfiability Method in Rough Classification Modeling for Data Mining
The fundamental problem in data mining is whether all of the available information is
always necessary to represent an information system (IS). The goal of data mining is to
find rules that model the world sufficiently well. These rules consist of conditions over
attribute-value pairs, called the description, together with a classification by the decision
attribute. However, the set of all decision rules generated from all conditional attributes
can be too large and can contain many chaotic rules that are not appropriate for
classifying unseen objects. The search for the best rules must therefore be performed,
because it is not possible to determine the quality of all rules generated from the
information system. In the rough set approach to data mining, the set of interesting rules
is determined using the notion of a reduct. Rules are generated from reducts by binding
the condition attribute values of the object class from which the reduct originates to the
corresponding attributes. It is important for the reducts to be minimal in size. Minimal
reducts decrease the number of conditional attributes used to generate rules. Shorter
rules are expected to classify new cases more accurately because of their larger support
in the data, and in some sense the most stable and most frequently appearing reducts
give the best decision rules.
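As an illustration of the reduct notion, the following minimal Python sketch builds the discernibility sets of a toy decision table, finds a minimal reduct by brute force, and binds the reduct attributes of an object to form a rule. The table, attribute names, and values are invented for illustration and are not from the thesis.

```python
from itertools import combinations

# Toy decision table: each row is (conditional attribute values, decision).
# The attributes and values are illustrative only.
attrs = ["a", "b", "c"]
table = [
    ({"a": 1, "b": 0, "c": 1}, "yes"),
    ({"a": 1, "b": 1, "c": 0}, "no"),
    ({"a": 0, "b": 0, "c": 1}, "yes"),
    ({"a": 0, "b": 1, "c": 1}, "no"),
]

# Discernibility sets: for each pair of objects with different decisions,
# the attributes on which the two objects differ.
disc_sets = []
for (x, dx), (y, dy) in combinations(table, 2):
    if dx != dy:
        disc_sets.append({a for a in attrs if x[a] != y[a]})

def covers(subset):
    """A subset of attributes preserves discernibility if it intersects
    every discernibility set."""
    return all(subset & s for s in disc_sets)

# Brute-force search for a smallest covering subset (fine for small tables).
reduct = next(set(c) for k in range(1, len(attrs) + 1)
              for c in combinations(attrs, k) if covers(set(c)))

# Generate a rule by binding the reduct attributes of an object class
# to their values in that object.
obj, dec = table[0]
rule = " AND ".join(f"{a}={obj[a]}" for a in sorted(reduct)) + f" => {dec}"
print("minimal reduct:", reduct)   # {'b'} for this toy table
print("rule:", rule)               # b=0 => yes
```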
The main work of the thesis is the generation of a classification model with a smaller
number of rules, shorter rule length, and good accuracy. A propositional satisfiability
method for rough classification modeling is proposed in this thesis. Two models,
Standard Integer Programming (SIP) and Decision Related Integer Programming
(DRIP), were proposed to represent the minimal reduct computation problem. The
models involve a theoretical formalization of the discernibility relation of a decision
system (DS) as an Integer Programming (IP) model. The proposed models were
embedded within the default rules generation framework, yielding a new rough
classification method. An improved branch and bound strategy is proposed to solve the
SIP and DRIP models that prunes a certain amount of the search. The proposed strategy
uses a conflict analysis procedure to remove unnecessary attribute assignments and to
determine the branch level to which the search backtracks in a non-chronological
manner.
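To make the integer programming view concrete: finding a minimal reduct amounts to a 0-1 covering program, minimize Σ x_a subject to Σ_{a∈D} x_a ≥ 1 for every discernibility set D. The sketch below is a simplification reconstructed from the abstract, not the thesis's exact SIP/DRIP models; its infeasibility check is a lightweight stand-in for the conflict analysis described above.

```python
def min_reduct_bnb(attrs, disc_sets):
    """Branch and bound for: minimize sum(x_a) subject to
    sum(x_a for a in D) >= 1 for every discernibility set D, x_a in {0,1}."""
    best = list(attrs)  # trivial upper bound: choose every attribute

    def dfs(chosen, remaining, candidates):
        nonlocal best
        if len(chosen) >= len(best):   # bound: cannot beat the incumbent
            return
        if not remaining:              # every constraint is covered
            best = list(chosen)
            return
        if not candidates:
            return
        a, rest = candidates[0], candidates[1:]
        # Branch x_a = 1: select the attribute and drop the sets it covers.
        dfs(chosen + [a], [s for s in remaining if a not in s], rest)
        # Branch x_a = 0: prune immediately if some set would become
        # uncoverable (a simple stand-in for conflict analysis).
        if all(s & set(rest) for s in remaining):
            dfs(chosen, remaining, rest)

    dfs([], disc_sets, list(attrs))
    return best

# Discernibility sets from the toy table above.
print(min_reduct_bnb(["a", "b", "c"],
                     [{"b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"b"}]))
# -> ['b']
```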
Five data sets from the UCI machine learning repository and domain theories were used
in the experiments. The total number of rules generated for the best classification model
is recorded, where 30% of the data were used for training and 70% were kept as test
data. The classification accuracy, the number of rules, and the maximum rule length
obtained from the SIP/DRIP method were compared with other rough set methods such
as the Genetic Algorithm (GA), Johnson, Holte's 1R, Dynamic, and Exhaustive methods.
Four of the datasets were then chosen for further experiments. The improved search
strategy implemented non-chronological backtracking, which potentially prunes a large
portion of the search space. The experimental results showed that the proposed
SIP/DRIP method is a successful method for rough classification modeling. The
outstanding feature of this method is the reduced number of rules in all classification
models. SIP/DRIP generated shorter rules than the other methods on most datasets. The
proposed search strategy indicated that the best performance can be achieved at lower
levels, i.e., along shorter paths, of the search tree. The SIP/DRIP method also showed
promise against other commonly used classifiers such as neural networks and statistical
methods. This model is expected to represent the knowledge of the system efficiently.
Data Preprocessing: Case Study on Monthly Number of Visitors to Taiwan by Their Residence and Purpose
This paper explains in detail the preliminary data report for a dataset and how data pre-processing, mainly the data cleaning and data reduction processes, is applied to it. The dataset used is the monthly number of visitors to Taiwan by their residence and purpose. The dataset was obtained from Kaggle, scraped from the Taiwan Tourism Bureau. The surveys were carried out using foreign visitor data covering all foreign visitors who arrived directly in Taiwan through airports, ports, and land.
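A minimal pandas sketch of the cleaning and reduction steps the paper describes; the file name and column names (year, month, residence, purpose, visitors) are assumptions about the Kaggle dataset's layout, not its actual schema.

```python
import pandas as pd

# Hypothetical file and column names; the real dataset may differ.
df = pd.read_csv("taiwan_visitors.csv")

# --- Data cleaning ---
df = df.drop_duplicates()                        # remove repeated records
df["visitors"] = pd.to_numeric(df["visitors"], errors="coerce")
df = df.dropna(subset=["residence", "purpose"])  # drop rows missing keys
df["visitors"] = df["visitors"].fillna(df["visitors"].median())

# --- Data reduction ---
# Keep only the attributes needed for the analysis.
df = df[["year", "month", "residence", "purpose", "visitors"]]
# Aggregate to one row per (year, month, residence, purpose).
monthly = (df.groupby(["year", "month", "residence", "purpose"],
                      as_index=False)["visitors"].sum())
print(monthly.head())
```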
Indicator selection based on Rough Set Theory
A method for indicator selection is proposed in this paper. The method, which adopts the General Methodology and Design Research approach, consists of four steps: Problem Identification, Requirement Gathering, Indicator Extraction, and Evaluation. The Rough Set approach has been applied in the Indicator Extraction phase. This phase consists of six steps: Data Selection, Data Preprocessing, Discretization, Split Data, Reduction, and Classification. A dataset of 427 records has been used for experimentation. The dataset, which contains financial information from several companies, consists of 30 dependent indicators and one independent indicator. The selection of indicators is based on rough set theory, where sets of reducts are computed from the dataset. Based on the sets of reducts, indicators are ranked and selected according to a certain set of criteria; indicators are ranked by computing their frequencies in the reduct sets. The major contribution of this work is the extraction method for identifying a reduced set of indicators. The results obtained show competitive accuracies in classifying new cases, demonstrating that the quality of knowledge is maintained through the use of a reduced set of indicators.
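A minimal sketch of the frequency-based ranking step, assuming the reducts have already been computed; the indicator names and the selection criterion below are invented for illustration.

```python
from collections import Counter

# Hypothetical reduct sets computed from the financial dataset;
# indicator names are placeholders.
reducts = [
    {"roe", "debt_ratio", "current_ratio"},
    {"roe", "net_margin"},
    {"roe", "debt_ratio", "net_margin"},
    {"current_ratio", "net_margin"},
]

# Rank indicators by how often they appear across the reduct sets.
freq = Counter(ind for r in reducts for ind in r)
ranked = freq.most_common()
print(ranked)  # e.g. [('roe', 3), ('net_margin', 3), ...]

# Select indicators meeting a frequency criterion (here: at least half
# of the reducts; the paper's actual criteria may differ).
selected = [ind for ind, f in ranked if f >= len(reducts) / 2]
print(selected)
```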
Comparative Analysis of Data Mining Techniques for Malaysian Rainfall Prediction
Climate change prediction analyses the behaviour of weather over a specific period. Rainfall forecasting is a climate change task in which specific features such as humidity and wind are used to predict rainfall at specific locations. Rainfall prediction can be framed as a classification task in data mining. Different techniques lead to different performances depending on the rainfall data representation, including representations of long-term (monthly) patterns and short-term (daily) patterns. Selecting an appropriate technique for a specific rainfall duration is a challenging task. This study analyses multiple classifiers, namely Naïve Bayes, Support Vector Machine, Decision Tree, Neural Network, and Random Forest, for rainfall prediction using Malaysian data. The dataset was collected from multiple stations in Selangor, Malaysia. Several pre-processing tasks were applied in order to resolve missing values and eliminate noise. The experimental results show that, with small training data (10%) from 1581 instances, Random Forest correctly classified 1043 instances. This is the strength of the ensemble of trees in Random Forest, where a group of classifiers can jointly beat a single classifier.
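A minimal scikit-learn sketch of the comparison protocol, using synthetic data in place of the Selangor rainfall set; the 10% training split mirrors the experiment, but the feature count and any printed accuracies are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the rainfall data (1581 instances; features
# such as humidity and wind; binary rain / no-rain labels).
X, y = make_classification(n_samples=1581, n_features=8, random_state=0)

# Mirror the paper's split: only 10% of the data used for training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1,
                                          random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.3f} ({int(acc * len(y_te))} of {len(y_te)} correct)")
```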
A comparative study of deep learning algorithms in univariate and multivariate forecasting of the Malaysian stock market
As part of the financial system, the stock market has been an essential factor in the growth and stability of the national economy. Investment in the stock market is risky because of its price complexity and unpredictable nature. Deep learning is an emerging approach in stock market prediction modeling that can learn the non-linearity and complexity of stock market data. To date, few studies on stock market prediction in Malaysia employ deep learning prediction models, especially for handling univariate and multivariate data. This study aims to develop univariate and multivariate stock market forecasting models using three deep learning algorithms and to compare the performance of those models. The models predict the close price of the Malaysian stock market using the Axiata Group Berhad and Petronas Gas Berhad datasets from Bursa Malaysia, listed in the Kuala Lumpur Composite Index (KLCI). Three deep learning algorithms, Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM), are used to develop the prediction models. The deep learning models achieved the highest accuracy and outperformed the baseline models in both short- and long-term forecasts. The results also show that LSTM is the best deep learning model for the Malaysian stock market, achieving the lowest prediction error among the models. Deep learning demonstrates the ability to handle univariate and multivariate data while preserving important information for forecasting the stock market. This finding is significant, as deep learning works well with high-dimensional datasets.
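A minimal Keras sketch of the univariate LSTM variant, assuming a sliding-window framing of the close-price series; the synthetic series, window length, and network size are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, lag):
    """Turn a univariate series into (lag-window, next-value) pairs."""
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X[..., None], y   # LSTM expects (samples, timesteps, features)

# Synthetic random-walk stand-in for a KLCI close-price series.
prices = np.cumsum(np.random.default_rng(0).normal(size=500)) + 100.0
X, y = make_windows(prices, lag=20)
split = int(0.8 * len(X))    # earlier windows train, later windows test

model = Sequential([
    LSTM(32, input_shape=(20, 1)),
    Dense(1),                # predicted next close price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:split], y[:split], epochs=10, verbose=0)

mse = model.evaluate(X[split:], y[split:], verbose=0)
print("test MSE:", mse)
```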
Multi layer perceptron modelling in the housing market
The study examines the use of a multilayer perceptron (MLP) network in predicting the price of terrace houses in Kuala Lumpur (KL). Nine factors that significantly influence the price were used. Housing data from 1994 to 1996 were presented to the network for training. Results from the model for the various tested years were compared using regression analysis. The study demonstrates the predictive ability of the trained MLP model, which can be used as an alternative predictor in real estate analysis.
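A minimal sketch of this kind of model using scikit-learn's MLPRegressor, with synthetic data standing in for the nine KL housing factors; the architecture and evaluation metric are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in: nine influencing factors and a price target,
# in place of the 1994-1996 KL terrace house data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
price = X @ rng.normal(size=9) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                 random_state=0))
mlp.fit(X_tr, y_tr)

# Compare predictions against held-out prices, echoing the study's
# regression analysis on the tested years.
print("R^2 on test data:", r2_score(y_te, mlp.predict(X_te)))
```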
An improved artificial dendrite cell algorithm for abnormal signal detection
In the dendrite cell algorithm (DCA), the abnormality of a data point is determined by comparing the multi-context antigen value (MCAV) with an anomaly threshold. The limitation of the existing threshold is that its value must be determined before mining, based on previous information, and the existing MCAV is inefficient when exposed to extreme values. This causes the DCA to fail to detect new data points whose patterns behave differently from previous information, and it affects detection accuracy. This paper proposes an improved anomaly threshold for the DCA using the statistical cumulative sum (CUSUM), with the aim of improving its detection capability. In the proposed approach, the MCAV is normalized with an upper CUSUM, and the new anomaly threshold is calculated at run time by considering the acceptance value and the minimum MCAV. Experiments on 12 benchmark datasets and two outbreak datasets show that the improved DCA achieves better detection results than its previous version in terms of sensitivity, specificity, false detection rate, and accuracy.
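A minimal sketch of an upper CUSUM applied to an MCAV stream; the target, slack, acceptance value, and threshold formula are assumptions reconstructed from the abstract's ingredients, not the paper's exact equations.

```python
import numpy as np

def upper_cusum(values, target, slack):
    """One-sided (upper) CUSUM: accumulates positive drift above target."""
    s = np.zeros(len(values))
    for i, x in enumerate(values):
        prev = s[i - 1] if i > 0 else 0.0
        s[i] = max(0.0, prev + (x - target - slack))
    return s

# Hypothetical MCAV stream from a DCA run; values near 1 suggest anomaly.
mcav = np.array([0.1, 0.2, 0.15, 0.9, 0.95, 0.2, 0.85, 0.1])

# Normalize the MCAV with an upper CUSUM, then derive a run-time
# threshold from an acceptance value and the minimum MCAV.
s = upper_cusum(mcav, target=mcav.mean(), slack=0.05)
acceptance = 0.5                       # assumed acceptance value
threshold = acceptance + mcav.min()    # assumed combination rule
anomalous = s > threshold
print(list(zip(mcav, s.round(2), anomalous)))
```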
An Affective Decision Making Engine Framework for Practical Software Agents
The framework of the Affective Decision Making Engine outlined here provides a blueprint for creating software agents that emulate psychological affect when making decisions in complex and dynamic problem environments. The influence of affect on the agent's decisions is mimicked by measuring the correlation of feature values, possessed by objects and/or events in the environment, against the outcome of goals that are set for measuring the agent's overall performance. The use of correlation in the Affective Decision Making Engine provides a statistical justification for preference when prioritizing goals, particularly when it is not possible to realize all agent goals. The simplification of the agent algorithm retains the function of affect for summarizing feature-rich dynamic environments during decision making.
Keywords: Affective decision making, correlative adaptation, affective agent
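A minimal sketch of correlation-driven goal prioritization in the spirit described; the features, goals, and scoring rule are invented for illustration and are not the framework's actual algorithm.

```python
import numpy as np

# Hypothetical history: feature values observed for objects/events in the
# environment, and the outcome of each agent goal on those episodes.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 4))   # 4 environment features
goal_outcomes = {
    "collect_resource": features[:, 0] * 0.8 + rng.normal(scale=0.3, size=100),
    "avoid_threat":     features[:, 2] * -0.6 + rng.normal(scale=0.5, size=100),
}

def goal_priority(current_features):
    """Rank goals by correlating each feature with each goal's outcome,
    then weighting the current observation by those correlations."""
    scores = {}
    for goal, outcome in goal_outcomes.items():
        corrs = [np.corrcoef(features[:, j], outcome)[0, 1]
                 for j in range(features.shape[1])]
        scores[goal] = float(np.dot(corrs, current_features))
    return sorted(scores, key=scores.get, reverse=True)

print(goal_priority(rng.normal(size=4)))  # goals ranked by expected payoff
```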
Nonlinear regression in tax evasion with uncertainty: a variational approach
One of the major problems in today's economy is the phenomenon of tax evasion. The linear regression method is one way to find a formula that captures the effect of each variable on the final tax evasion rate. Since the tax evasion data in this study have a high degree of uncertainty and the relationships between variables are nonlinear, a Bayesian method is used to address the uncertainty, together with six nonlinear basis functions to tackle the nonlinearity. Furthermore, a variational method is applied to Bayesian linear regression on the tax evasion data to approximate the model evidence. The dataset covers tax evasion in Malaysia for the period 1963 to 2013, with eight input variables. Results from the variational method are compared with the Maximum Likelihood Estimation technique for Bayesian linear regression, and the variational method provides more accurate predictions. This study suggests that, in order to reduce the final tax evasion rate relative to the current situation, the Malaysian government should decrease the direct tax and taxpayer income variables and increase the indirect tax and government regulation variables by 5% for small amounts of change (10%-30%), and should decrease direct tax and taxpayer income and increase indirect tax and government regulation by 90% for large amounts of change (70%-90%).
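A minimal sketch of Bayesian linear regression over nonlinear basis functions; for self-containedness it uses the exact conjugate posterior (Bishop, ch. 3) rather than the paper's variational approximation, and the basis choice, prior/noise precisions, and synthetic data are assumptions.

```python
import numpy as np

# Synthetic stand-in for the 1963-2013 tax series (51 years, 8 variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(51, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=51)

def design(X):
    """Nonlinear basis expansion (illustrative choice, not the paper's six)."""
    return np.hstack([np.ones((len(X), 1)), X, np.sin(X), X ** 2])

Phi = design(X)
alpha, beta = 1.0, 100.0   # prior precision, noise precision (assumed)

# Closed-form posterior of Bayesian linear regression:
# S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T y
S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
m = beta * np.linalg.solve(S_inv, Phi.T @ y)

pred = design(X) @ m       # posterior-mean predictions
print("train RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```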