An Empirical Comparison of Learning Algorithms for Nonparametric Scoring
The TreeRank algorithm was recently proposed as a scoring method based on recursive partitioning of the input space. This tree induction algorithm builds orderings by recursively optimizing the Receiver Operating Characteristic (ROC) curve through a one-step optimization procedure called LeafRank. One of the aims of this paper is an in-depth analysis of the empirical performance of the variants of the TreeRank/LeafRank method. Numerical experiments based on both artificial and real data sets are provided. Further experiments using resampling and randomization, in the spirit of bagging and random forests, are developed, and we show how they increase both stability and accuracy in bipartite ranking. Moreover, an empirical comparison with other efficient scoring algorithms such as RankBoost and RankSVM is presented on UCI benchmark data sets.
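As an illustration of how a scoring function can be evaluated in bipartite ranking, the following minimal sketch computes the empirical area under the ROC curve (AUC). The abstract does not state which summary of the ROC curve the paper reports, so the use of AUC here is an assumption, and the function name `empirical_auc` is hypothetical.

```python
def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2.

    scores: real-valued outputs of a scoring function.
    labels: 1 for positive examples, 0 for negative examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    concordant = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                concordant += 1.0   # positive ranked above negative
            elif sp == sn:
                concordant += 0.5   # tie counts as half a correct pair
    return concordant / (len(pos) * len(neg))
```

This quadratic pairwise count is chosen for clarity; a rank-based computation would be preferable for large samples.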
Interpretability and Explainability: A Machine Learning Zoo Mini-tour
In this review, we examine the problem of designing interpretable and
explainable machine learning models. Interpretability and explainability lie at
the core of many machine learning and statistical applications in medicine,
economics, law, and natural sciences. Although interpretability and
explainability have escaped a clear universal definition, many techniques
motivated by these properties have been developed over the past 30 years, with
the focus currently shifting towards deep learning methods. In this review, we
emphasise the divide between interpretability and explainability and illustrate
these two different research directions with concrete examples of the
state-of-the-art. The review is intended for a general machine learning
audience with interest in exploring the problems of interpretation and
explanation beyond logistic regression or random forest variable importance.
This work is not an exhaustive literature survey, but rather a primer focusing
selectively on certain lines of research which the authors found interesting or
informative
A Cheat Sheet for Bayesian Prediction
This paper reviews the growing field of Bayesian prediction. Bayes point and
interval prediction are defined and exemplified and situated in statistical
prediction more generally. Then, four general approaches to Bayes prediction
are defined, and we turn to predictor selection. This can be done predictively
or non-predictively, and predictors can be based on single models or multiple
models. We call these latter cases unitary predictors and model average
predictors, respectively. Then we turn to the most recent aspect of prediction
to emerge, namely prediction in the context of large observational data sets,
and discuss three further classes of techniques. We conclude with a summary and
statement of several current open problems.
Comment: 33 pages
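To make the notions of Bayes point and interval prediction concrete, here is a minimal sketch for one textbook case: normally distributed data with known standard deviation and a normal prior on the mean. The model choice, the function name `normal_predictive` and all parameter names are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import NormalDist

def normal_predictive(y, sigma, prior_mean, prior_sd, level=0.95):
    """Point and interval prediction for a new observation.

    Assumed model: y_i ~ N(mu, sigma^2) with sigma known, mu ~ N(prior_mean, prior_sd^2).
    Returns (point prediction, (lower, upper)) from the posterior predictive.
    """
    n = len(y)
    # Conjugate update: precisions add, means are precision-weighted.
    post_precision = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + sum(y) / sigma**2) / post_precision
    post_var = 1.0 / post_precision
    # Predictive variance = posterior uncertainty about mu + observation noise.
    pred = NormalDist(post_mean, (post_var + sigma**2) ** 0.5)
    tail = (1.0 - level) / 2.0
    return post_mean, (pred.inv_cdf(tail), pred.inv_cdf(1.0 - tail))
```

The point predictor here is the predictive mean, and the interval is the central region of the predictive distribution at the requested level.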
Physical Activity Classification with Conditional Random Fields
In this thesis we develop methods for classifying physical activity using accelerometer recordings. We cast this as a problem of classification in time series with moderate to high dimensional observations at each time point. Specifically, we observe a vector of summary statistics of the accelerometer signal at each point in time, and we wish to use these observations to estimate the type and intensity of physical activity the individual engaged in as it changes over time.
Our methods are based on Conditional Random Fields, which allow us to capture temporal dependence in an individual’s physical activity type without requiring us to model the distribution of the observed features at each point in time. We develop three novel estimation strategies for Conditional Random Fields, evaluate their performance on classification tasks through simulation studies and demonstrate their use in applications with real physical activity data sets
Artificial Intelligence Based Classification for Urban Surface Water Modelling
Estimations and predictions of surface water runoff can provide very useful insights regarding flood risks in urban areas. To automatically predict the flow behaviour of rainfall-runoff water in real-world satellite images, it is important to precisely identify permeable and impermeable areas. This identification helps to calculate the amount of surface water by taking into account the amount of water absorbed in a permeable area and what remains on the impermeable area. In this research, a model of surface water has been established to predict the flow behaviour of rainfall-runoff water. This study employs a combination of image processing, artificial intelligence and machine learning techniques for automatic segmentation and classification of permeable and impermeable areas in satellite images. These techniques are used to classify three land-use categories (roofs, roads, and pervious areas) commonly found in satellite images of the earth’s surface. Three different classification scenarios are investigated to select the best classification model. The first scenario involves pixel-by-pixel classification of images using Classification Tree and Random Forest techniques, in two different settings: sequential and parallel execution of the algorithms. In the second classification scenario, the image is divided into objects using the Superpixels (SLIC) segmentation method, and three kinds of feature sets are extracted from the segmented objects. The performance of eight different supervised machine learning classifiers is evaluated using 5-fold cross-validation for multiple SLIC values, and detailed performance comparisons lead to conclusions about classification into the different classes under Object-based and Pixel-based classification schemes.
Pareto analysis and knee point selection are used to select the SLIC value and the more suitable of the two classification schemes. Furthermore, a new diversity and weighted sum-based ensemble classification model, called ParetoEnsemble, is proposed in this classification scenario. Weights are applied to selected component classifiers of the ensemble, creating a strong classifier in which classification is based on multiple votes from the candidate classifiers, as opposed to an individual classifier, where classification is based on a single vote. Classification results on unbalanced and balanced data are also evaluated to determine the most suitable mode for satellite image classification in this study. Convolutional Neural Networks based on semantic segmentation are also employed in the classification phase, as a third scenario, to evaluate the strength of the deep learning model SegNet in the classification of satellite images. The best results from the three classification scenarios are compared, and the best classification method is used in the next phase of water modelling with the InfoWorks ICM software, to explore the potential of the modelling process for a partially automated surface water network. Using the parameter settings with a specified amount of simulated rain falling onto the imaged area, the amount of surface water flow is estimated to obtain predictions about runoff in urban areas, since runoff in such a situation can be high enough to pose a dangerous flood risk. The area of Feock, in Cornwall, is used as the simulation study area in this research, where some promising results have been derived regarding classification and modelling of runoff. The estimated correlation coefficient between classification and runoff accuracy provides useful insight into the dependence of runoff performance on classification performance.
The trained system was also tested on images of some unknown areas, demonstrating reasonable performance given the training and classification limitations and conditions. Furthermore, reasonable estimates of surface water runoff were derived for these unknown area images. An analysis of classification and runoff estimations on unbalanced and balanced data, for multiple parameter configurations, aids the selection of classification and modelling parameter values to be used in future predictions on unknown data. This research is founded on the incorporation of satellite imaging into water modelling, using selected images for analysis and assessment of results. The system can be further improved, and higher-precision runoff predictions achieved, by adding more high-resolution images to the classifiers' training. The added variety in the trained model can lead to an even better classification of any unknown image, which could eventually provide better modelling and better insights into surface water modelling. Moreover, the modelling phase can be extended in future research to deal with real-time parameters, by calibrating the model after the classification phase in order to observe the impact of classification on the actual calibration.
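The weighted voting idea described for the ensemble can be sketched as follows. This is only a generic weighted-sum vote under assumed inputs: how ParetoEnsemble selects component classifiers and derives their weights is not given in the abstract, so the function name `weighted_vote` and the example weights are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per component classifier.
    weights: one non-negative weight per component classifier (assumed given).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w  # accumulate the weight behind each candidate label
    return max(totals, key=totals.get)
```

For example, with predictions `["roof", "road", "road"]` and weights `[0.6, 0.25, 0.25]`, the single high-weight classifier outvotes the two lower-weight ones.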
A study on the prediction of flight delays of a private aviation airline
Delay is a crucial performance indicator of any transportation system, and flight delays
have financial and economic consequences for passengers and airlines. Hence, recognizing
them through prediction may improve marketing decisions. The goal is to use machine learning
techniques to address an aviation challenge: predicting flight delays above 15 minutes on
departure for a private airline. Business and data understanding of this particular segment
of aviation are reviewed against the literature, and data preparation, modelling and
evaluation are addressed to lead towards a model that may support decision-making in a
private aviation environment. The results show which algorithms performed better and which
variables contribute the most to the model, and hence to delay on departure.
Flight delay is a key indicator across the air transport industry, and these delays
have economic and financial consequences for passengers and airlines. Recognizing
them through prediction may improve strategic and operational decisions. The objective is
to use machine learning techniques to predict a perennial challenge
of aviation: flight delay on departure, using data from a private airline. The
understanding of the business context and of the acquired data, in a singular segment of
aviation, is reviewed in light of the current literature, and data preparation, modelling and
the respective evaluation are conducted so as to contribute to a decision-support tool
in the context of private aviation. The results obtained reveal which of the algorithms
used performs best and which variables in the data contribute
most to the model and consequently to delay on departure.
Random projection ensemble classification
We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
Both authors are supported by an Engineering and Physical Sciences Research Council Fellowship EP/J017213/1; the second author is also supported by a Philip Leverhulme prize.
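The group-select-aggregate procedure described in this abstract can be sketched roughly as below. This is not the authors' implementation. Several simplifications are assumptions made for illustration: Gaussian projection matrices, a nearest-centroid base classifier, training (resubstitution) error as the test error estimate, and a fixed voting threshold `alpha` in place of a data-driven one. All function and parameter names are hypothetical.

```python
import random

def project(A, x):
    """Apply a d-by-p projection matrix A (list of d rows) to a p-vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def fit_centroids(Z, y):
    """Assumed base classifier: store the mean of each class (labels 0 and 1)."""
    cents = {}
    for label in (0, 1):
        pts = [z for z, t in zip(Z, y) if t == label]
        cents[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def predict_centroid(cents, z):
    """Assign z to the class with the nearer centroid (ties go to class 0)."""
    dist = {c: sum((a - b) ** 2 for a, b in zip(m, z)) for c, m in cents.items()}
    return 0 if dist[0] <= dist[1] else 1

def rp_ensemble_fit(X, y, d=2, n_groups=10, group_size=5, seed=0):
    """Within each group of random projections, keep the one with the lowest error."""
    rng = random.Random(seed)
    p = len(X[0])
    selected = []
    for _ in range(n_groups):
        best = None
        for _ in range(group_size):
            # Assumption: i.i.d. Gaussian entries stand in for the projection law.
            A = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)]
            Z = [project(A, x) for x in X]
            cents = fit_centroids(Z, y)
            # Assumption: training error stands in for the test error estimate.
            err = sum(predict_centroid(cents, z) != t for z, t in zip(Z, y)) / len(y)
            if best is None or err < best[0]:
                best = (err, A, cents)
        selected.append((best[1], best[2]))
    return selected

def rp_ensemble_predict(selected, x, alpha=0.5):
    """Vote over the selected projections; assign class 1 if the vote share reaches alpha."""
    votes = [predict_centroid(cents, project(A, x)) for A, cents in selected]
    return 1 if sum(votes) / len(votes) >= alpha else 0
```

A typical call would fit on labelled feature vectors with `rp_ensemble_fit(X, y)` and classify a new point with `rp_ensemble_predict(model, x)`. The threshold `alpha` is exposed as a parameter because the abstract says it is chosen from the data.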