Search CORE

3 research outputs found

Recommended from our members

Contributions to Ensembles of Models for Predictive Toxicology Applications. On the Representation, Comparison and Combination of Models in Ensembles.

Author: Makhtar Mokhairi
Publication venue: School of Computing, Informatics and Media
Publication date: 01/01/2012
Field of study

The increasing variety of data mining tools offers a large palette of types and representation formats for predictive models. Managing the models then becomes a big challenge, as well as reusing the models and keeping the consistency of model and data repositories. Sustainable access and quality assessment of these models become limited to researchers. The approach for the Data and Model Governance (DMG) makes easier to process and support complex solutions. In this thesis, contributions are proposed towards ensembles of models with a focus on model representation, comparison and usage. Predictive Toxicology was chosen as an application field to demonstrate the proposed approach to represent predictive models linked to data for DMG. Further analysing methods such as predictive models comparison and predictive models combination for reusing the models from a collection of models were studied. Thus in this thesis, an original structure of the pool of models was proposed to represent predictive toxicology models called Predictive Toxicology Markup Language (PTML). PTML offers a representation scheme for predictive toxicology data and models generated by data mining tools. In this research, the proposed representation offers possibilities to compare models and select the relevant models based on different performance measures using proposed similarity measuring techniques. The relevant models were selected using a proposed cost function which is a composite of performance measures such as Accuracy (Acc), False Negative Rate (FNR) and False Positive Rate (FPR). The cost function will ensure that only quality models be selected as the candidate models for an ensemble. The proposed algorithm for optimisation and combination of Acc, FNR and FPR of ensemble models using double fault measure as the diversity measure improves Acc between 0.01 to 0.30 for all toxicology data sets compared to other ensemble methods such as Bagging, Stacking, Bayes and Boosting. The highest improvements for Acc were for data sets Bee (0.30), Oral Quail (0.13) and Daphnia (0.10). A small improvement (of about 0.01) in Acc was achieved for Dietary Quail and Trout. Important results by combining all the three performance measures are also related to reducing the distance between FNR and FPR for Bee, Daphnia, Oral Quail and Trout data sets for about 0.17 to 0.28. For Dietary Quail data set the improvement was about 0.01 though, but this data set is well known as a difficult learning exercise. For five UCI data sets tested, similar results were achieved with Acc improvement between 0.10 to 0.11, closing more the gaps between FNR and FPR. As a conclusion, the results show that by combining performance measures (Acc, FNR and FPR), as proposed within this thesis, the Acc increased and the distance between FNR and FPR decreased

Bradford Scholars

Unsupervised classifier selection based on two-sample test

Author: A. Andoni
A. Gretton
A. Gretton
A.H. Ko
A.J. Smola
C. Yang
C.J. Merz
D. Lee
D.H. Wolpert
F. Janssen
H. Wang
I. Steinwart
J. Gama
J. Huang
J.Z. Kolter
K.M. Borgwardt
M. Herbster
M.S. Charikar
N. Cesa-Bianchi
O. Watanabe
R. Klinkenberg
S. Ali
T.M. Chan
V.C. Raykar
X. Wu
X. Zhu
Y. Bengio
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008
Field of study

We propose a well-founded method of ranking a pool of m trained classifiers by their suitability for the current input of n instances. It can be used when dynamically selecting a single classifier as well as in weighting the base classifiers in an ensemble. No classifiers are executed during the process. Thus, the n instances, based on which we select the classifier, can as well be unlabeled. This is rare in previous work. The method works by comparing the training distributions of classifiers with the input distribution. Hence, the feasibility for unsupervised classification comes with a price of maintaining a small sample of the training data for each classifier in the pool. In the general case our method takes time O (m(t + n)2) and space O(mt + n), where t is the size of the stored sample from the training distribution for each classifier. However, for commonly used Gaussian and polynomial kernel functions we can execute the method more efficiently. In our experiments the proposed method was found to be accurate.Peer reviewe

Crossref

Trepo - Institutional Repository of Tampere University