Propositional Satisfiability Method in Rough Classification Modeling for Data Mining
The fundamental problem in data mining is whether the whole information available is
always necessary to represent the information system (IS). The goal of data mining is to
find rules that model the world sufficiently well. These rules consist of conditions over
attribute-value pairs, called the description, and a classification of the decision attribute. However,
the set of all decision rules generated from all conditional attributes can be too large and
can contain many chaotic rules that are not appropriate for unseen object classification.
Therefore the search for the best rules must be performed because it is not possible to
determine the quality of all rules generated from the information system. In the rough set
approach to data mining, the set of interesting rules is determined using the notion of a
reduct. Rules are generated from reducts by binding the condition attribute values
of the object class from which the reduct originated to the corresponding attributes. It is
important for the reducts to be minimal in size. Minimal reducts decrease the
number of conditional attributes used to generate rules. Shorter rules are
expected to classify new cases more accurately because of their larger support in the data, and in
some sense the most stable and frequently appearing reducts give the best decision
rules.
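As a rough illustration of how rules can be read off a reduct, the sketch below binds each object's values on the reduct attributes to its decision and counts the support of each resulting rule. The function name `rules_from_reduct` and the toy decision table are hypothetical and are not taken from the thesis; the sketch only assumes that a reduct is given as a set of attribute indices.

```python
from collections import Counter

def rules_from_reduct(rows, decisions, reduct, names):
    """Bind each object's values on the reduct attributes to a decision.

    Returns a Counter mapping (condition, decision) -> support, where
    condition is a tuple of (attribute name, value) pairs.
    """
    rules = Counter()
    for row, decision in zip(rows, decisions):
        condition = tuple((names[i], row[i]) for i in sorted(reduct))
        rules[(condition, decision)] += 1
    return rules

# Hypothetical toy decision table: two condition attributes plus a decision.
names = ["outlook", "humidity"]
rows = [("sunny", "high"), ("sunny", "low"), ("rain", "high"), ("sunny", "high")]
decisions = ["no", "yes", "yes", "no"]

rules = rules_from_reduct(rows, decisions, reduct={0, 1}, names=names)
for (condition, decision), support in rules.items():
    lhs = " AND ".join(f"{name}={value}" for name, value in condition)
    print(f"IF {lhs} THEN decision={decision} (support {support})")
```

A smaller reduct yields fewer attributes per condition, so more objects tend to share the same condition and each rule tends to have larger support.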
The main work of the thesis is the generation of a classification model that contains
a smaller number of rules, shorter rules and good accuracy. A propositional
satisfiability method for rough classification modeling is proposed in this thesis. Two
models, Standard Integer Programming (SIP) and Decision Related Integer
Programming (DRIP), are proposed to represent the minimal reduct computation
problem. The models formalize the discernibility relation of a
decision system (DS) as an Integer Programming (IP) model. The proposed models
were embedded within the default rules generation framework to obtain a new rough
classification method. An improved branch and bound strategy is proposed
to solve the SIP and DRIP models while pruning part of the search. The proposed
strategy uses a conflict analysis procedure to remove unnecessary attribute
assignments and to determine the branch level to which the search backtracks in a
non-chronological manner.
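The idea of casting minimal reduct computation as a 0-1 program can be sketched as follows. Every pair of objects with different decisions contributes one constraint: at least one of the attributes on which the pair differs must be selected. The objective is to select as few attributes as possible. The code below is a minimal sketch under these assumptions and uses plain enumeration by increasing subset size; it is not the thesis's SIP/DRIP formulation or its branch and bound strategy with conflict analysis, and all function names and data are hypothetical.

```python
from itertools import combinations

def discernibility_constraints(rows, decisions):
    """One constraint per pair of objects with different decisions:
    the set of condition-attribute indices on which the pair differs."""
    constraints = set()
    for (r1, d1), (r2, d2) in combinations(zip(rows, decisions), 2):
        if d1 != d2:
            diff = frozenset(i for i, (a, b) in enumerate(zip(r1, r2)) if a != b)
            if diff:  # pairs that no attribute can discern are skipped
                constraints.add(diff)
    return constraints

def minimal_reduct(rows, decisions):
    """Smallest attribute subset that satisfies every constraint.

    Equivalent to: minimise sum(x_a) subject to sum(x_a for a in c) >= 1
    for every constraint c, with each x_a in {0, 1}.
    """
    n_attrs = len(rows[0])
    constraints = discernibility_constraints(rows, decisions)
    for size in range(n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            chosen = set(subset)
            if all(chosen & c for c in constraints):
                return chosen
    return set(range(n_attrs))

# Hypothetical toy decision system with three binary condition attributes.
rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
decisions = ["yes", "yes", "no", "no"]
reduct = minimal_reduct(rows, decisions)
print(sorted(reduct))
```

Enumeration is exponential in the number of attributes, which is why a pruned search such as branch and bound matters on realistic data.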
Five data sets from the UCI machine learning repositories and domain theories were
used in the experiments. The total number of rules generated for the best classification model was
recorded, where 30% of the data were used for training and 70% were kept as test data. The
classification accuracy, the number of rules and the maximum length of rules obtained
from the SIP/DRIP method were compared with those of other rough set methods such as the Genetic
Algorithm (GA), Johnson, Holte 1R, Dynamic and Exhaustive methods. Four of the
data sets were then chosen for further experiments. The improved search strategy
implements a non-chronological backtracking search that can prune a large
portion of the search space. The experimental results showed that the proposed SIP/DRIP
method is a successful method in rough classification modeling. The outstanding feature
of this method is the reduced number of rules in all classification models. SIP/DRIP
generated shorter rules than the other methods on most data sets. The proposed search
strategy indicated that the best performance can be achieved at the lower levels, or shorter
paths, of the search tree. The SIP/DRIP method also showed promising results compared with other
commonly used classifiers such as neural networks and statistical methods. The model is
expected to be able to represent the knowledge of the system efficiently.
Problem Restructuring in Integer Programming for Reduct Searching
Standard Integer Programming / Decision Related Integer Programming (SIP/DRIP) is a
reduct searching system that finds the reducts in an information system. These reducts
are the minimal sets of attributes of the information system that are useful in classification tasks.
They can describe the whole information system for the purpose of discernment. In
effect, they are very useful for generating rules when solving the classification problem
that is inherent in data mining.
The thesis focuses mainly on improving the performance of the original SIP/DRIP algorithm.
By using problem restructuring, the search time and memory
usage are reduced. At the same time, the approach adheres to an essential criterion of the
original method, namely to improve performance without sacrificing the quality of the
reduct.
Problem restructuring changes the input of SIP/DRIP without losing any of the input's
essential properties. In other words, no loss of knowledge occurs with problem
restructuring. Only the structural order changes, with the contents kept intact. This
hypothetically ensures that no adverse distortion occurs within SIP/DRIP.
Restructuring is done by speculating a promising structure for the input to SIP/DRIP
based on the potential of the attributes in the information system. It uses an inexpensive
approach to predict this potential, based simply on the total covering of each
attribute within the information system. Although this measurement is just an
approximation, the experiments show that it works.
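One way to picture this kind of restructuring is sketched below. It assumes that the "total covering" of an attribute means the number of discernibility constraints in which the attribute appears, and that restructuring means reordering the attributes so that those with the highest covering come first. Both are assumptions made for illustration and may differ from the thesis's actual measure; the function names and the example constraints are hypothetical.

```python
from collections import Counter

def coverage_order(constraints, n_attrs):
    """Attribute indices sorted by how many constraints each one covers,
    highest first; ties are broken by the original index."""
    counts = Counter(a for c in constraints for a in c)
    return sorted(range(n_attrs), key=lambda a: (-counts[a], a))

def restructure(constraints, n_attrs):
    """Renumber attributes so that high-coverage attributes come first.

    Only the order changes: each constraint keeps the same attributes
    under new indices, so no information is lost.
    """
    order = coverage_order(constraints, n_attrs)
    new_index = {old: new for new, old in enumerate(order)}
    renamed = [frozenset(new_index[a] for a in c) for c in constraints]
    return order, renamed

# Hypothetical constraints over four attributes (indices 0..3).
constraints = [
    frozenset({0, 2}),
    frozenset({1, 2}),
    frozenset({2, 3}),
    frozenset({1, 3}),
]
order, renamed = restructure(constraints, n_attrs=4)
print(order)
```

Here `order[k]` gives the original index of the attribute now placed at position `k`, so a reduct found on the restructured input can be mapped back to the original attributes.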
For the experiments, five data sets were taken from the
UCI machine learning repositories. Results are measured by comparing the
performance of SIP/DRIP with and without problem restructuring. In addition, the lengths
of the reducts generated by both approaches are compared to ensure that no quality
deterioration occurred along the way.
Experimental results show that problem restructuring generally improves
SIP/DRIP on all the data sets, meaning that on average it enhances the
performance of SIP/DRIP. The consumption of time and space was reduced quite
significantly. Furthermore, the quality of the solutions was maintained:
there was no increase in reduct length when using it.
The concept offered by the approach is an additional component to SIP/DRIP. It
complements the search process. By giving more consideration to the initial
problem space and not just to the search for the solution, the performance of SIP/DRIP
has been modestly improved.