Search CORE

12 research outputs found

Recommended from our members

Bootstrap methods for the cost-sensitive evaluation of classifiers

Author: Dietterich Thomas Glen
Margineantu Dragos D. (Dragos Dorin)
Oregon State University. Dept. of Computer Science
Publication venue: Corvallis, OR : Oregon State University, Dept. of Computer Science

Many machine learning applications require classifiers that minimize an asymmetric cost function rather than the misclassification rate, and several recent papers have addressed this problem. However, these papers have either applied no statistical testing or have applied statistical methods that are not appropriate for the cost-sensitive setting. Without good statistical methods, it is difficult to tell whether these new cost-sensitive methods are better than existing methods that ignore costs, and it is also difficult to tell whether one cost-sensitive method is better than another. To rectify this problem, this paper presents two statistical methods for the cost-sensitive setting. The first constructs a confidence interval for the expected cost of a single classifier. The second constructs a confidence interval for the expected difference in costs of two classifiers. In both cases, the basic idea is to separate the problem of estimating the probabilities of each cell in the confusion matrix (which is independent of the cost matrix) from the problem of computing the expected cost. We show experimentally that these bootstrap tests work better than applying standard z-tests based on the normal distribution.
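The idea in the abstract above, estimating the confusion-matrix cell probabilities first and pricing them with the cost matrix afterwards, can be illustrated with a plain percentile bootstrap. This is only a minimal sketch and not the authors' procedure: the function name, the matrix orientation (rows are true classes, columns are predictions), the example counts and costs, and the percentile interval are all assumptions made for illustration.

```python
import random

def bootstrap_expected_cost_ci(confusion, cost, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the expected cost per example.

    confusion[i][j]: number of test examples of true class i predicted as j
    (orientation is an assumption). cost[i][j]: cost of that outcome.
    """
    rng = random.Random(seed)
    k = len(confusion)
    cells = [(i, j) for i in range(k) for j in range(k)]
    counts = [confusion[i][j] for i, j in cells]
    cell_costs = [cost[i][j] for i, j in cells]
    n = sum(counts)
    estimates = []
    for _ in range(n_boot):
        # Step 1: resample n outcomes from the empirical cell probabilities.
        # This step does not look at the costs at all.
        draws = rng.choices(range(len(cells)), weights=counts, k=n)
        # Step 2: price the resampled confusion matrix with the cost matrix.
        estimates.append(sum(cell_costs[d] for d in draws) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical two-class example: a false negative costs ten times a false positive.
confusion = [[40, 10], [5, 45]]
cost = [[0.0, 1.0], [10.0, 0.0]]
n_total = sum(sum(row) for row in confusion)
point = sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2)) / n_total
lower, upper = bootstrap_expected_cost_ci(confusion, cost)
print(f"point estimate {point:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Because the resampling step never touches the costs, the same resamples could be re-priced under a different cost matrix without redrawing them.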

ScholarsArchive@OSU

Confidence Bands for ROC Curves: Methods and an Empirical Study

Author: Macskassy Sofus
Provost Foster
Publication venue: Proceedings of the First Workshop on ROC Analysis in AI. August 2004.
Publication date: 01/08/2004

In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning, although evaluating ROC curves has thus far been limited to studying the area under the curve (AUC) or generating one-dimensional confidence intervals by freezing one variable: the false-positive rate, or the threshold on the classification scoring function. Researchers in the medical field have long been using ROC curves and have many well-studied methods for analyzing such curves, including generating confidence intervals as well as simultaneous confidence bands. In this paper we introduce these techniques to the machine learning community and show their empirical fitness on the Covertype data set, a standard machine learning benchmark from the UCI repository. We show that some of these methods work remarkably well, that others are too loose, and that existing machine learning methods for generating one-dimensional confidence intervals do not translate well to generating simultaneous bands: their bands are too tight.

NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

New York University Faculty Digital Archive

Cost-Sensitive Boosting

Author: H Masnadi-Shirazi
N Vasconcelos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Application of Cost Matrices and Cost Curves to Enhance Diagnostic Health Management Metrics for Gas Turbine Engines

Author: Chris Drummond
Craig R. Davison
Publication venue: 'ASME International'

Crossref

Tumor classification based on gene expression profiles

Author: Görner Melanie
Publication date: 01/01/2010

In this thesis, we aim to predict whether a breast cancer tumor will develop distant metastases by classifying the tumor’s gene expression data. This data is obtained from microarrays, a technology that provides a fast and efficient way of extracting gene expressions, thereby enabling such classification. The binary classification problem studied here is to decide whether a tumor will develop distant metastases within five years. In contrast to classical studies in this field, we are not interested in maximizing overall classification performance, but focus on keeping the type II error (misclassification of patients who develop metastases) low and only secondarily on minimizing the type I error. We define different nearest centroid classifiers, where the centroids are given by gene expression profiles consisting of average gene expression values for each outcome group. We then compare their performance and analyze the influence of feature selection methods on classification accuracy. We show that the performance of nearest centroid classification varies considerably depending on the specific definition of the classifier. Furthermore, we demonstrate that the feature set on which the classification is based has a large influence on the classifier’s accuracy, so choosing an appropriate feature selection method can substantially improve performance. The best classification result is observed when combining a specific nearest centroid classifier with an AdaBoost feature selection algorithm: 5-fold cross-validation showed 89% sensitivity and 86% specificity.
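The nearest centroid idea described in the abstract above, where each centroid is the average expression profile of one outcome group, can be sketched as follows. The thesis compares several classifier definitions that are not given here, so this sketch assumes plain Euclidean distance; the function names, group labels, and tiny two-gene data set are hypothetical.

```python
import math

def fit_centroids(samples, labels):
    """Return one centroid per label: the per-feature mean of that group's samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the label whose centroid is closest (Euclidean distance assumed)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

# Hypothetical expression values for two genes across four tumors.
samples = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
labels = ["no_metastasis", "no_metastasis", "metastasis", "metastasis"]
centroids = fit_centroids(samples, labels)
prediction = predict(centroids, [2.9, 3.9])
print(prediction)
```

Shifting the decision toward fewer type II errors, as the thesis prioritizes, would require an additional bias or threshold on the distance comparison, which this sketch does not attempt.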

OTHES

Cost curves: An improved method for visualizing classifier performance

Author: Chris Drummond
Robert C. Holte
Publication venue: 'Springer Science and Business Media LLC'

Crossref

A framework for smart traffic management using heterogeneous data sources

Author: Jones Angelica Salas
Publication venue: University of Wolverhampton
Publication date: 31/03/2020

A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Traffic congestion constitutes a social, economic and environmental issue for modern cities, as it can negatively impact travel times, fuel consumption and carbon emissions. Traffic forecasting and incident detection systems are fundamental areas of Intelligent Transportation Systems (ITS) that have been widely researched in the last decade. These systems provide real-time information about traffic congestion and other unexpected incidents, which helps traffic management agencies activate strategies and notify users accordingly. However, existing techniques suffer from high false alarm rates and incorrect traffic measurements. In recent years, there has been increasing interest in integrating different types of data sources to achieve higher precision in traffic forecasting and incident detection techniques. In fact, a considerable amount of literature has grown around the influence of integrating data from heterogeneous data sources into existing traffic management systems. This thesis presents a Smart Traffic Management framework for future cities. The proposed framework fuses different data sources and technologies to improve traffic prediction and incident detection systems. It is composed of two components: a social media component and a simulator component. The social media component consists of a text classification algorithm that identifies traffic-related tweets. These traffic messages are then geolocated using Natural Language Processing (NLP) techniques. Finally, to further analyse user emotions within the tweets, stress and relaxation strength detection is performed. The proposed text classification algorithm outperformed similar studies in the literature and proved more accurate than other machine learning algorithms on the same dataset. Results from the stress and relaxation analysis detected a significant amount of stress in 40% of the tweets, while the remaining tweets showed no associated emotion. This information can potentially be used for policy making in transportation, to understand the users' perception of the transportation network. The simulator component proposes an optimisation procedure for determining the missing flow distribution of roundabouts and urban roads using constrained optimisation. Existing imputation methodologies have been developed on straight sections of highways, and their applicability to more complex networks has not been validated. This task presented a solution for the unavailability of roadway sensors in specific parts of the network and was able to predict the missing values with very low percentage error. The proposed imputation methodology can serve as an aid for existing traffic forecasting and incident detection methodologies, as well as for the development of more realistic simulation networks.

Wolverhampton Intellectual Repository and E-theses

Methods for cost-sensitive learning

Author: Margineantu Dragos D. (Dragos Dorin)
Publication venue: 'Oregon State University'

Many approaches for achieving intelligent behavior of automated (computer) systems involve components that learn from past experience. This dissertation studies computational methods for learning from examples, for classification and for decision making, when the decisions have different non-zero costs associated with them. Many practical applications of learning algorithms, including transaction monitoring, fraud detection, intrusion detection, and medical diagnosis, have such non-uniform costs, and there is a great need for new methods that can handle them. This dissertation discusses two approaches to cost-sensitive classification: input data weighting and conditional density estimation. The first method assigns a weight to each training example in order to force the learning algorithm (which is otherwise unchanged) to pay more attention to examples with higher misclassification costs. The dissertation discusses several different weighting methods and concludes that a method that gives higher weight to examples from rarer classes works quite well. Another algorithm that gave good results was a wrapper method that applies Powell's gradient-free algorithm to optimize the input weights. The second approach to cost-sensitive classification is conditional density estimation. In this approach, the output of the learning algorithm is a classifier that estimates, for a new data point, the probability that it belongs to each of the classes. These probability estimates can be combined with a cost matrix to make decisions that minimize the expected cost. The dissertation presents a new algorithm, bagged lazy option trees (B-LOTs), that gives better probability estimates than any previous method based on decision trees. In order to evaluate cost-sensitive classification methods, appropriate statistical methods are needed. The dissertation presents two new statistical procedures: BCOST provides a confidence interval on the expected cost of a classifier, and BDELTACOST provides a confidence interval on the difference in expected costs of two classifiers. These methods are applied to a large set of experimental studies to evaluate and compare the cost-sensitive methods presented in this dissertation. Finally, the dissertation describes the application of B-LOTs to a problem of predicting the stability of river channels. In this study, B-LOTs were shown to be superior to other methods in cases where the classes have very different frequencies, a situation that arises frequently in cost-sensitive classification problems.
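The step in the abstract above where class-probability estimates are combined with a cost matrix to minimize expected cost can be sketched as a small decision rule. This is an illustration only, not code from the dissertation: the function name, the matrix orientation (rows are true classes, columns are predicted classes), and the example numbers are assumptions.

```python
def min_expected_cost_decision(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs[i]: estimated probability that the example belongs to class i.
    cost[i][j]: cost of predicting class j when the true class is i
    (orientation is an assumption).
    Returns (best_class, expected_costs).
    """
    k = len(probs)
    # Expected cost of predicting j, averaged over the possible true classes.
    expected = [sum(probs[i] * cost[i][j] for i in range(k)) for j in range(k)]
    best = min(range(k), key=expected.__getitem__)
    return best, expected

# Hypothetical example: class 1 is unlikely, but missing it is ten times
# as costly as a false alarm.
probs = [0.8, 0.2]
cost = [[0.0, 1.0], [10.0, 0.0]]
best, expected = min_expected_cost_decision(probs, cost)
print(best, expected)
```

In this example the rule predicts class 1 even though class 0 is four times more probable, because predicting class 0 carries an expected cost of 2.0 against 0.8 for class 1.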

ScholarsArchive@OSU

Modelos de previsão de valores extremos e raros

Author: Luis Torgo
Rita P. Ribeiro
Publication date: 01/01/2010

Repositório Aberto da Universidade do Porto