7 research outputs found

    Evaluation Measures for Models Assessment over Imbalanced Data Sets

    Imbalanced data learning is one of the challenging problems in data mining, and finding the right model assessment measures is a primary research issue within it. A skewed class distribution causes common evaluation measures to be misread and also leads to biased classification. This article presents a set of alternatives for assessing imbalanced data learning: combined measures (G-means, likelihood ratios, discriminant power, F-measure, balanced accuracy, Youden index, Matthews correlation coefficient) and graphical performance assessments (ROC curve, area under the curve, partial AUC, weighted AUC, cumulative gains curve and lift chart, area under lift (AUL)), which aim to provide a more credible evaluation. We analyze the application of these measures to the evaluation of churn prediction models, a well-known application of imbalanced data.
    Keywords: imbalanced data, model assessment, accuracy, G-means, likelihood ratios, F-measure, Youden index, Matthews correlation coefficient, ROC, AUC, P-AUC, W-AUC, Lift, AU
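
As a rough illustration of why plain accuracy misleads on skewed data, the sketch below computes several of the combined measures named in this abstract from the four cells of a binary confusion matrix. The function name `imbalance_metrics` and the example counts are hypothetical and not taken from the article; the formulas are the standard textbook definitions, and the sketch assumes every denominator is nonzero.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Return imbalance-aware measures from confusion-matrix counts.

    Assumes all denominators are nonzero (illustrative sketch only).
    """
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    precision = tp / (tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "youden_index": sensitivity + specificity - 1,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "lr_plus": sensitivity / (1 - specificity),   # positive likelihood ratio
        "lr_minus": (1 - sensitivity) / specificity,  # negative likelihood ratio
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


# Made-up churn example: 50 churners among 1000 customers, only 10 detected.
m = imbalance_metrics(tp=10, fn=40, fp=10, tn=940)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the accuracy is high even though four out of five churners are missed, while the G-mean, balanced accuracy and Youden index all fall well below it.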

    Equilibrating the Recognition of the Minority Class in the Imbalance Context

    A New Performance Measure for Class Imbalance Learning. Application to Bioinformatics Problems

    A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage

    This thesis presents the development of a machine learning system, called mycoSORT, for supporting triage, the first step of the manual curation process for biological literature. Manual triage of documents is very demanding, as researchers usually face the time-consuming and error-prone task of screening a large amount of data to identify relevant information. After querying scientific databases for keywords related to a specific subject, researchers generally obtain a long list of retrieved results that has to be carefully analysed to identify the few documents that are potentially relevant to the topic. Such an analysis represents a severe bottleneck in the knowledge discovery and decision-making processes in scientific research. Hence, biocurators could greatly benefit from automatic support when performing the triage task. To support the triage of scientific documents, we used a corpus of document instances manually labeled by biocurators as “selected” or “rejected” with regard to their potential to indicate relevant information about fungal enzymes. This document collection is large, since many results are retrieved and analysed to finally identify potential candidate documents, and also highly imbalanced in the distribution of instances by relevance: the great majority of documents are labeled as rejected, while only a very small portion are labeled as selected. Using this dataset, we studied the design of a classification model to identify the most discriminative features, to automate the triage of scientific literature and to tackle the imbalance between the two classes of documents. To identify the most suitable model, we performed a study of 324 classification models, which covered 9 different data undersampling factors, 4 sets of features, 2 feature selection methods and 3 machine learning algorithms.
    Our results demonstrated that undersampling is effective for handling imbalanced datasets and also helps manage large document collections. We also found that combining undersampling with feature selection using Odds Ratio can improve the performance of our classification model. Finally, our results demonstrated that the best-fitting model to support the triage of scientific documents combines domain-relevant features filtered by Odds Ratio scores, dataset undersampling and the Logistic Model Trees algorithm.
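
The undersampling step mentioned in this abstract can be sketched roughly as follows. This is not the thesis's implementation: the function `undersample`, its `ratio` parameter (majority instances kept per minority instance) and the toy data are assumptions made for illustration, and the thesis's own definition of an "undersampling factor" may differ.

```python
import random


def undersample(instances, labels, ratio=1.0, minority="selected", seed=0):
    """Randomly drop majority-class instances (hypothetical helper).

    Keeps every minority instance and at most `ratio` majority instances
    per minority instance, preserving the original order.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept = sorted(minority_idx + rng.sample(majority_idx, keep))
    return [instances[i] for i in kept], [labels[i] for i in kept]


# Toy corpus: 5 "selected" documents among 100 (invented numbers).
docs = [f"doc{i}" for i in range(100)]
labels = ["selected"] * 5 + ["rejected"] * 95
sub_docs, sub_labels = undersample(docs, labels, ratio=2.0)
print(len(sub_docs), sub_labels.count("selected"), sub_labels.count("rejected"))
```

Only the training portion of a dataset would normally be resampled this way, so that evaluation still reflects the original class distribution.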

    Personalized program guides for digital television.

    The development of digital television has significantly increased the amount of media content available to users, but has made it harder to choose the content of interest. Before the advent of personalized program guides capable of learning user preferences and recommending suitable content, there was no means of properly addressing this problem. Earlier solutions, such as printed or electronic program guides, mostly converted the problem of information overload into another form. Advances in both technology and society place ever higher demands on personalized program guides for digital TV, which therefore require careful planning and design. Guides must be able to model the different decision-making approaches of individual users, work in real time on mobile devices with limited hardware resources, take into account the characteristics of the collected data, consider the context in which TV content is accessed, and protect the privacy of all users, since some of them are not aware of the possible risks. With a careful choice of architecture and learning algorithm, a locally implemented guide based on neural networks can fulfil all of these requirements. Because users provide information about content they like much more often than about content they dislike, in extreme cases only positive interactions are collected. To overcome this problem, a system with two operating modes is proposed. In the first mode the system learns and gives recommendations based only on the TV content the user likes, while in the second it equalizes the influence of liked and disliked content on the recommendation process. An increased influence of positive interactions degrades the prediction of content the viewer does not want to watch, so that, owing to classification errors, unwanted content will often appear in the recommendation list and reduce user satisfaction. Using a series of simulations, we showed that the achieved neural network training time is short, even on devices with limited hardware resources. We conclude that the proposed guide is well suited to implementation on mobile devices, which are expected to become the dominant way of accessing TV content in the future.

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through the research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), led by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that predictive models built with the methods guided by the new framework (Octopus) achieve reliable performance that domain experts approve. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to weigh predictive reliability above interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than traditional imbalanced data resampling techniques.
    Finally, the XDistance performance ranking metric was found to be more effective in ranking the performance of several classifiers while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework for producing new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and the toxicity side effects of advanced breast cancer radiotherapy. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.