Evaluation Measures for Models Assessment over Imbalanced Data Sets
Imbalanced data learning is one of the challenging problems in data mining; within this area, finding the right model assessment measures is a primary research issue. A skewed class distribution causes common evaluation measures to be misread and leads to biased classification. This article presents a set of alternatives for assessing imbalanced data learning, using combined measures (G-means, likelihood ratios, discriminant power, F-measure, balanced accuracy, Youden index, Matthews correlation coefficient) and graphical performance assessment (ROC curve, area under curve, partial AUC, weighted AUC, cumulative gains curve and lift chart, area under lift (AUL)), which aim to provide a more credible evaluation. We analyze the application of these measures to the evaluation of churn prediction models, a well-known application of imbalanced data.
Keywords: imbalanced data, model assessment, accuracy, G-means, likelihood ratios, F-measure, Youden index, Matthews correlation coefficient, ROC, AUC, P-AUC, W-AUC, Lift, AU
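As a rough illustration of why the combined measures listed in this abstract can be more informative than plain accuracy, the sketch below computes several of them from a binary confusion matrix using their standard textbook definitions. The function name `imbalance_metrics`, the example counts, and the assumption that every denominator is non-zero are illustrative choices, not taken from the article.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Threshold-based measures from a binary confusion matrix.

    Assumes every denominator is non-zero (hypothetical simplification).
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "youden_index": sensitivity + specificity - 1,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "lr_plus": sensitivity / (1 - specificity),    # positive likelihood ratio
        "lr_minus": (1 - sensitivity) / specificity,   # negative likelihood ratio
    }

# Made-up example: 50 positives among 1000 instances, most positives missed.
metrics = imbalance_metrics(tp=10, fn=40, fp=5, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the classifier reaches an accuracy of 0.955 while its balanced accuracy is only about 0.60, which is the kind of misreading that a skewed class distribution produces.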
AGm: A new performance measure for class imbalance learning. Application to Bioinformatics problems.
A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage
This thesis presents the development of a machine learning system, called mycoSORT, for supporting the first step of the manual curation process for biological literature, called triage. The manual triage of documents is very demanding, as researchers usually face the time-consuming and error-prone task of screening a large amount of data to identify relevant information. After querying scientific databases for keywords related to a specific subject, researchers generally find a long list of retrieved results that has to be carefully analysed to identify only a few documents that are potentially relevant to the topic. Such an analysis represents a severe bottleneck in the knowledge discovery and decision-making processes in scientific research. Hence, biocurators could greatly benefit from automatic support when performing the triage task. In order to support the triage of scientific documents, we have used a corpus of document instances manually labeled by biocurators as “selected” or “rejected”, with regard to their potential to indicate relevant information about fungal enzymes. This document collection is large, since many results are retrieved and analysed to finally identify potential candidate documents, and also highly imbalanced in the distribution of instances per relevance: the great majority of documents are labeled as rejected, while only a very small portion are labeled as selected. Using this dataset, we studied the design of a classification model to identify the most discriminative features to automate the triage of scientific literature and to tackle the imbalance between the two classes of documents. To identify the most suitable model, we performed a study of 324 classification models, which demonstrated the results of using 9 different data undersampling factors, 4 sets of features, and the evaluation of 2 feature selection methods as well as 3 machine learning algorithms. Our results demonstrated that the use of an undersampling technique is effective in handling imbalanced datasets and also helps manage large document collections. We also found that the combination of undersampling and feature selection using Odds Ratio can improve the performance of our classification model. Finally, our results demonstrated that the best-fitting model to support the triage of scientific documents is composed of domain-relevant features, filtered by Odds Ratio scores, the use of dataset undersampling, and the Logistic Model Trees algorithm.
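The undersampling step described in this abstract can be sketched as follows. This is a minimal illustration of random majority-class undersampling, not the thesis's implementation: the function name `undersample`, the reading of the undersampling factor as the fraction of majority instances retained, and the toy corpus are all assumptions.

```python
import random

def undersample(instances, labels, majority_label, factor, seed=0):
    """Keep every minority instance and a random share of the majority class.

    `factor` is read here as the fraction of majority instances to keep
    (an assumption; the thesis may define its factors differently).
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept_majority = rng.sample(majority, round(len(majority) * factor))
    kept = sorted(minority + kept_majority)   # preserve original document order
    return [instances[i] for i in kept], [labels[i] for i in kept]

# Toy corpus: 10 "selected" and 90 "rejected" documents (made-up data).
docs = [f"doc{i}" for i in range(100)]
labels = ["selected"] * 10 + ["rejected"] * 90
sampled_docs, sampled_labels = undersample(docs, labels, "rejected", factor=0.5)
print(sampled_labels.count("selected"), sampled_labels.count("rejected"))
```

Fixing the seed keeps the sample reproducible, which matters when the same reduced corpus has to be reused across many model configurations.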
Personalized program guides for digital television.
The development of digital television has significantly increased the amount of TV content available to users, but has made it harder to choose the content of interest. Until the arrival of personalized program guides capable of learning user preferences and recommending suitable content, no solution addressed this problem adequately. Earlier solutions, such as printed and electronic program guides, mostly just converted the information-overload problem into another form. Advances in technology and society place ever higher demands on personalized program guides for digital television, which therefore require careful planning and design. Guides must be able to model the different decision-making approaches of individual users, work in real time on mobile devices with limited hardware resources, take into account the characteristics of the collected data, consider the context in which TV content is accessed, and protect the privacy of all users, since some of them are not aware of the possible risks. With a careful choice of architecture and learning algorithm, a locally implemented guide based on neural networks can fulfil all of these requirements. Because users provide information about content they like much more often than about content they dislike, in extreme cases only positive interactions are collected. To overcome this problem, a system with two operating modes is proposed. In the first mode, the system learns and gives recommendations based only on the TV content the user likes, while the second mode equalizes the influence of liked and disliked content on the recommendation process. An increased influence of positive interactions degrades the prediction of content the viewer does not want to watch, so, because of classification errors, unwanted content often appears in the recommendation list and reduces user satisfaction. Using a series of simulations, we showed that the achieved neural network training time is short, even on devices with limited hardware resources. We conclude that the proposed guide is well suited for implementation on mobile devices, which are expected to become the dominant way of accessing TV content in the future.
Predictive Modelling Approach to Data-Driven Computational Preventive Medicine
This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. The early part of the research proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus.
Midway through the research, the thesis advances work in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance.
The experimental results show that predictive models built with the methods guided by the new framework (Octopus) achieved performance that domain experts approved as reliable. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to favour predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than traditional imbalanced data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking the performance of several classifiers while offering an indication of class bias, unlike existing performance metrics.
The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework for producing new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, we contribute the newly developed methods within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and toxicity side effects of advanced breast cancer radiotherapy. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.