Contributions to Ensembles of Models for Predictive Toxicology Applications. On the Representation, Comparison and Combination of Models in Ensembles.
The increasing variety of data mining tools offers a large palette
of types and representation formats for predictive models. Managing
the models then becomes a big challenge, as well as reusing the
models and keeping the consistency of model and data repositories.
Sustainable access to these models and assessment of their quality
become limited for researchers. The Data and Model Governance
(DMG) approach makes it easier to process and support complex solutions.
In this thesis, contributions are proposed towards ensembles
of models with a focus on model representation, comparison and
usage.
Predictive Toxicology was chosen as an application field to demonstrate
the proposed approach to represent predictive models linked
to data for DMG. Further analysis methods, such as predictive model
comparison and predictive model combination for reusing models
from a collection, were also studied. An original structure for the
pool of models, called Predictive Toxicology Markup Language (PTML),
was proposed to represent predictive toxicology models. PTML offers a
representation scheme for
predictive toxicology data and models generated by data mining tools.
In this research, the proposed representation offers possibilities
to compare models and select the relevant models based on different
performance measures using proposed similarity measuring techniques.
The relevant models were selected using a proposed cost
function which is a composite of performance measures such as
Accuracy (Acc), False Negative Rate (FNR) and False Positive Rate
(FPR). The cost function ensures that only high-quality models are
selected as candidate models for an ensemble.
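The abstract does not give the form of the cost function, so the following is only a minimal sketch of how a composite of Acc, FNR and FPR could be used to filter a pool of models. The weighted sum, the weights `w_acc`, `w_fnr`, `w_fpr`, the threshold `max_cost` and the names `ModelScore`, `cost` and `select_candidates` are all assumptions made for illustration, not the thesis's definition.

```python
from dataclasses import dataclass


@dataclass
class ModelScore:
    """Performance of one model from the pool (hypothetical container)."""
    name: str
    acc: float  # Accuracy
    fnr: float  # False Negative Rate
    fpr: float  # False Positive Rate


def cost(score, w_acc=1.0, w_fnr=1.0, w_fpr=1.0):
    # Assumed composite: the error rate (1 - Acc) plus weighted FNR and FPR.
    # A lower cost means a better candidate.
    return w_acc * (1.0 - score.acc) + w_fnr * score.fnr + w_fpr * score.fpr


def select_candidates(scores, max_cost):
    # Keep only models whose cost does not exceed the (assumed) threshold,
    # ordered from best to worst.
    kept = [s for s in scores if cost(s) <= max_cost]
    return sorted(kept, key=cost)


pool = [
    ModelScore("tree", acc=0.9, fnr=0.1, fpr=0.1),
    ModelScore("knn", acc=0.6, fnr=0.5, fpr=0.3),
    ModelScore("svm", acc=0.8, fnr=0.2, fpr=0.1),
]
candidates = [s.name for s in select_candidates(pool, max_cost=0.6)]
```

With these made-up scores, the weakest model is filtered out and the remaining ones are ordered by increasing cost.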
The proposed algorithm for optimising and combining the Acc,
FNR and FPR of ensemble models, using the double-fault measure as
the diversity measure, improves Acc by between 0.01 and 0.30 for all toxicology
data sets compared to other ensemble methods such as Bagging,
Stacking, Bayes and Boosting. The highest improvements for
Acc were for data sets Bee (0.30), Oral Quail (0.13) and Daphnia
(0.10). A small improvement (of about 0.01) in Acc was achieved
for Dietary Quail and Trout. Combining all three performance
measures also reduced the
distance between FNR and FPR for the Bee, Daphnia, Oral Quail and
Trout data sets by about 0.17 to 0.28. For the Dietary Quail data set
the improvement was only about 0.01, but this data set is well
known as a difficult learning exercise. For the five UCI data sets tested,
similar results were achieved, with Acc improving by between 0.10 and
0.11 and the gap between FNR and FPR narrowing further.
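For readers unfamiliar with the double-fault measure, it is commonly defined for a pair of classifiers as the fraction of instances that both misclassify, so lower values indicate a more diverse pair. The sketch below follows that common definition. How the thesis embeds it in its optimisation algorithm is not described in the abstract, and the function name and toy data are illustrative.

```python
def double_fault(pred_a, pred_b, truth):
    """Fraction of instances misclassified by both classifiers."""
    if not truth or not (len(pred_a) == len(pred_b) == len(truth)):
        raise ValueError("inputs must be non-empty and of equal length")
    both_wrong = sum(
        1 for a, b, y in zip(pred_a, pred_b, truth) if a != y and b != y
    )
    return both_wrong / len(truth)


# Toy labels and predictions (made up for illustration).
truth = [1, 0, 1, 1]
pred_a = [1, 1, 0, 1]  # wrong on instances 1 and 2
pred_b = [0, 1, 1, 1]  # wrong on instances 0 and 1
df = double_fault(pred_a, pred_b, truth)  # only instance 1 is a shared error
```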
In conclusion, the results show that by combining performance
measures (Acc, FNR and FPR), as proposed within this thesis,
Acc increased and the distance between FNR and FPR decreased.
Unsupervised classifier selection based on two-sample test
We propose a well-founded method of ranking a pool of m trained classifiers by their suitability for the current input of n instances. It can be used when dynamically selecting a single classifier as well as when weighting the base classifiers in an ensemble. No classifiers are executed during the process. Thus, the n instances on which the selection is based can just as well be unlabeled. This is rare in previous work. The method works by comparing the training distributions of the classifiers with the input distribution. Hence, the feasibility of unsupervised classification comes at the price of maintaining a small sample of the training data for each classifier in the pool. In the general case our method takes time O(m(t + n)^2) and space O(mt + n), where t is the size of the stored sample from the training distribution for each classifier. However, for the commonly used Gaussian and polynomial kernel functions we can execute the method more efficiently. In our experiments the proposed method was found to be accurate. Peer reviewed
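The abstract does not name the two-sample statistic it uses. As a hedged illustration of the general idea, the sketch below ranks classifiers by a biased estimate of the squared maximum mean discrepancy (MMD) between each classifier's stored training sample and the unlabeled input, using a Gaussian kernel. The choice of MMD, the `gamma` value and all function names are assumptions, not the paper's method. The naive evaluation costs O((t + n)^2) kernel calls per classifier, which matches the general-case bound quoted above.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed setting.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(sample_x, sample_y, kernel=gaussian_kernel):
    """Biased estimate of the squared MMD between two samples."""
    def mean_k(u, v):
        return sum(kernel(a, b) for a in u for b in v) / (len(u) * len(v))

    return (mean_k(sample_x, sample_x)
            + mean_k(sample_y, sample_y)
            - 2 * mean_k(sample_x, sample_y))


def rank_classifiers(stored_samples, inputs):
    # stored_samples maps a classifier name to its stored training sample.
    # No classifier is executed and no labels are needed; names are returned
    # from the closest training distribution to the farthest.
    scores = {name: mmd_squared(sample, inputs)
              for name, sample in stored_samples.items()}
    return sorted(scores, key=scores.get)


stored = {
    "far": [(5.0,), (5.1,), (5.2,)],
    "near": [(0.0,), (0.1,), (0.2,)],
}
unlabeled_inputs = [(0.05,), (0.15,)]
ranking = rank_classifiers(stored, unlabeled_inputs)
```

The ranking could be used either to pick the top classifier or, after converting the scores to weights, to weight the base classifiers of an ensemble.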