Bootstrap methods for the cost-sensitive evaluation of classifiers
Many machine learning applications require
classifiers that minimize an asymmetric cost
function rather than the misclassification
rate, and several recent papers have addressed
this problem. However, these papers
have either applied no statistical testing
or have applied statistical methods that are
not appropriate for the cost-sensitive setting.
Without good statistical methods, it is difficult to tell whether these new cost-sensitive
methods are better than existing methods
that ignore costs, and it is also difficult to tell
whether one cost-sensitive method is better
than another. To rectify this problem, this
paper presents two statistical methods for the
cost-sensitive setting. The first constructs a
confidence interval for the expected cost of a
single classifier. The second constructs a confidence interval for the expected difference in
costs of two classifiers. In both cases, the
basic idea is to separate the problem of estimating
the probabilities of each cell in the
confusion matrix (which is independent of the
cost matrix) from the problem of computing
the expected cost. We show experimentally
that these bootstrap tests work better than
applying standard z tests based on the normal
distribution.
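
The separation described in this abstract can be illustrated with a minimal sketch: estimate the confusion-matrix cell probabilities first, then apply the cost matrix to each resampled matrix. The function name `bootstrap_expected_cost_ci`, the multinomial resampling, and the percentile interval are assumptions made for illustration; they are not necessarily the exact procedure of the paper.

```python
import numpy as np

def bootstrap_expected_cost_ci(y_true, y_pred, cost, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the expected cost of one classifier.

    cost[i, j] is the (assumed) cost of predicting class j when the true class is i.
    Labels are expected to be integers in 0..k-1.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cost = np.asarray(cost, dtype=float)
    k = cost.shape[0]
    n = len(y_true)

    # Step 1: estimate the probability of each confusion-matrix cell.
    # This step does not depend on the cost matrix.
    counts = np.zeros((k, k))
    np.add.at(counts, (y_true, y_pred), 1)
    probs = (counts / n).ravel()

    # Step 2: resample confusion matrices and compute the cost of each one.
    rng = np.random.default_rng(seed)
    resampled = rng.multinomial(n, probs, size=n_boot) / n  # shape (n_boot, k*k)
    costs = resampled @ cost.ravel()

    lower, upper = np.quantile(costs, [alpha / 2, 1 - alpha / 2])
    point = float(probs @ cost.ravel())
    return point, float(lower), float(upper)

if __name__ == "__main__":
    y_true = [0, 0, 0, 1, 1, 1, 1, 0]
    y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
    cost = [[0, 1], [5, 0]]  # a false negative is assumed to cost five times a false positive
    print(bootstrap_expected_cost_ci(y_true, y_pred, cost))
```

Because the cell probabilities are estimated once, the same resampled matrices could be reused with a different cost matrix without touching the classifier's predictions.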
Confidence Bands for ROC Curves: Methods and an Empirical Study
In this paper we study techniques for generating
and evaluating confidence bands on ROC curves. ROC
curve evaluation is rapidly becoming a commonly used evaluation
metric in machine learning, although evaluating ROC
curves has thus far been limited to studying the area under
the curve (AUC) or generation of one-dimensional confidence
intervals by freezing one variable—the false-positive rate, or
threshold on the classification scoring function. Researchers in
the medical field have long been using ROC curves and have
many well-studied methods for analyzing such curves, including
generating confidence intervals as well as simultaneous
confidence bands. In this paper we introduce these techniques
to the machine learning community and show their empirical
fitness on the Covertype data set—a standard machine learning
benchmark from the UCI repository. We show that some
of these methods work remarkably well, that others are too loose,
and that existing machine learning methods for generating
one-dimensional confidence intervals do not translate well to
generating simultaneous bands: their bands are too tight.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
Tumor classification based on gene expression profiles
In this thesis, we aim to predict whether a breast cancer tumor will develop distant metastases by classifying the tumor’s gene expression data. This data is obtained from microarrays, a technology that provides a fast and efficient way of extracting gene expressions, thereby enabling such classification.
The binary classification problem studied here is to decide whether a tumor will develop distant metastases within five years. In contrast to classical studies in this field, we are not interested in maximizing overall classification performance; instead, we focus first on keeping the type II error (misclassifying a patient who develops metastases) low and only second on minimizing the type I error.
We define different nearest centroid classifiers, where the centroids are given by gene expression profiles consisting of average gene expression values for each outcome group. We then compare their performance and analyze the influence of feature selection methods on classification accuracy.
We show that the performance of nearest centroid classification varies considerably depending on the specific definition of the classifier. Furthermore, we demonstrate that the feature set on which the classification is based has a large influence on the classifier’s accuracy, so choosing an appropriate feature selection method can substantially improve performance. The best classification result is observed when combining a specific nearest centroid classifier with an AdaBoost feature selection algorithm: 5-fold cross-validation showed 89% sensitivity and 86% specificity.
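
A nearest centroid classifier of the kind described above can be sketched in a few lines. The thesis compares several definitions; the version below assumes plain class means as centroids and Euclidean distance, and the function names (`fit_centroids`, `predict`, `sensitivity_specificity`) are hypothetical helpers, not taken from the thesis.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the class labels and one centroid (mean expression profile) per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each sample to the class of its nearest centroid (Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    sensitivity = float(np.mean(y_pred[pos] == positive))
    specificity = float(np.mean(y_pred[~pos] != positive))
    return sensitivity, specificity

if __name__ == "__main__":
    # Toy data: two features per sample, class 1 standing in for "develops metastases".
    X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
    y_train = [0, 0, 1, 1]
    classes, centroids = fit_centroids(X_train, y_train)
    print(predict([[0.2, 0.3], [5.5, 5.1]], classes, centroids))
```

Keeping the type II error low corresponds to favoring high sensitivity for the positive class, which is why the sketch reports sensitivity and specificity separately rather than a single accuracy figure.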
A framework for smart traffic management using heterogeneous data sources
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
Traffic congestion is a social, economic and environmental issue for modern cities, as it can negatively impact travel times, fuel consumption and carbon emissions. Traffic forecasting and incident detection systems are fundamental areas of Intelligent Transportation Systems (ITS) that have been widely researched in the last decade. These systems provide real-time information about traffic congestion and other unexpected incidents, which can help traffic management agencies activate strategies and notify users accordingly. However, existing techniques suffer from high false alarm rates and incorrect traffic measurements. In recent years, there has been increasing interest in integrating different types of data sources to achieve higher precision in traffic forecasting and incident detection techniques. In fact, a considerable amount of literature has grown around the influence of integrating data from heterogeneous data sources into existing traffic management systems.
This thesis presents a Smart Traffic Management framework for future cities. The proposed framework fuses different data sources and technologies to improve traffic prediction and incident detection systems. It comprises two components: a social media component and a simulator component. The social media component consists of a text classification algorithm that identifies traffic-related tweets. These traffic messages are then geolocated using Natural Language Processing (NLP) techniques. Finally, to further analyse user emotions within each tweet, stress and relaxation strength detection is performed. The proposed text classification algorithm outperformed similar studies in the literature and proved more accurate than other machine learning algorithms on the same dataset. The stress and relaxation analysis detected a significant amount of stress in 40% of the tweets, while the remaining tweets showed no associated emotions. This information can potentially be used for policy making in transportation, to understand the users' perception of the transportation network. The simulator component proposes an optimisation procedure, based on constrained optimisation, for determining missing flow distributions on roundabouts and urban roads. Existing imputation methodologies have been developed on straight sections of highway, and their applicability to more complex networks has not been validated. This work provides a solution for the unavailability of roadway sensors in specific parts of the network and successfully predicted the missing values with very low percentage error. The proposed imputation methodology can serve as an aid for existing traffic forecasting and incident detection methodologies, as well as for the development of more realistic simulation networks.
Methods for cost-sensitive learning
Many approaches for achieving intelligent behavior of automated (computer) systems involve components that learn from past experience. This dissertation studies computational methods for learning from examples, for classification and for decision making, when the decisions have different non-zero costs associated with them. Many practical applications of learning algorithms, including transaction monitoring, fraud detection, intrusion detection, and medical diagnosis, have such non-uniform costs, and there is a great need for new methods that can handle them. This dissertation discusses two approaches to cost-sensitive classification: input data weighting and conditional density estimation. The first method assigns a weight to each training example in order to force the learning algorithm (which is otherwise unchanged) to pay more attention to examples with higher misclassification costs. The dissertation discusses several different weighting methods and concludes that a method that gives higher weight to examples from rarer classes works quite well. Another algorithm that gave good results was a wrapper method that applies Powell's gradient-free algorithm to optimize the input weights. The second approach to cost-sensitive classification is conditional density estimation. In this approach, the output of the learning algorithm is a classifier that estimates, for a new data point, the probability that it belongs to each of the classes. These probability estimates can be combined with a cost matrix to make decisions that minimize the expected cost. The dissertation presents a new algorithm, bagged lazy option trees (B-LOTs), that gives better probability estimates than any previous method based on decision trees. In order to evaluate cost-sensitive classification methods, appropriate statistical methods are needed. The dissertation presents two new statistical procedures: BCOST provides a confidence interval on the expected cost of a classifier, and BDELTACOST provides a confidence interval on the difference in expected costs of two classifiers. These methods are applied to a large set of experimental studies to evaluate and compare the cost-sensitive methods presented in this dissertation. Finally, the dissertation describes the application of B-LOTs to a problem of predicting the stability of river channels. In this study, B-LOTs were shown to be superior to other methods in cases where the classes have very different frequencies, a situation that arises frequently in cost-sensitive classification problems.
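
The step of combining probability estimates with a cost matrix can be sketched as follows. This is a generic minimum-expected-cost decision rule, not code from the dissertation; the function name `min_expected_cost_decision` and the convention that `cost[i, j]` is the cost of predicting class j when the true class is i are assumptions.

```python
import numpy as np

def min_expected_cost_decision(probs, cost):
    """Pick, for each sample, the prediction with the lowest expected cost.

    probs[n, i] is the estimated probability that sample n belongs to class i.
    cost[i, j] is the (assumed) cost of predicting class j when the true class is i.
    """
    probs = np.asarray(probs, dtype=float)
    cost = np.asarray(cost, dtype=float)
    # expected[n, j] = sum_i probs[n, i] * cost[i, j]
    expected = probs @ cost
    return expected.argmin(axis=1), expected

if __name__ == "__main__":
    cost = [[0, 1],    # true class 0: a false alarm costs 1
            [10, 0]]   # true class 1: a miss costs 10
    probs = [[0.80, 0.20],
             [0.95, 0.05]]
    decisions, expected = min_expected_cost_decision(probs, cost)
    print(decisions)  # the first sample is assigned class 1 despite p(class 1) = 0.2
    print(expected)
```

With asymmetric costs, the decision can differ from the most probable class, which is why accurate probability estimates matter more here than the ranking of classes alone.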