Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization
We aimed to evaluate a computer-aided diagnosis (CADx) system for lung nodule
classification focusing on (i) usefulness of gradient tree boosting (XGBoost)
and (ii) effectiveness of parameter optimization using Bayesian optimization
(Tree Parzen Estimator, TPE) and random search. 99 lung nodules (62 lung
cancers and 37 benign lung nodules) were included from public databases of CT
images. A variant of local binary pattern was used for calculating feature
vectors. Support vector machine (SVM) or XGBoost was trained using the feature
vectors and their labels. TPE or random search was used for parameter
optimization of SVM and XGBoost. Leave-one-out cross-validation was used for
optimizing and evaluating the performance of our CADx system. Performance was
evaluated using area under the curve (AUC) of receiver operating characteristic
analysis. AUC was calculated 10 times, and its average was obtained. The best
averaged AUCs of SVM and XGBoost were 0.850 and 0.896, respectively; both were
obtained using TPE. XGBoost was generally superior to SVM. Optimal parameters
for achieving high AUC were obtained with fewer trials when using TPE than
with random search. In conclusion, XGBoost was better than SVM for classifying
lung nodules. TPE was more efficient than random search for parameter
optimization.
Comment: 29 pages, 4 figures
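The evaluation described above (AUC computed several times, then averaged) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `roc_auc` and `averaged_auc` and the toy scores are hypothetical, and the AUC is computed with the standard pairwise-ranking definition rather than any specific library.

```python
from statistics import mean

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def averaged_auc(labels, runs):
    """Average the AUC over several repeated runs of held-out scores."""
    return mean(roc_auc(labels, scores) for scores in runs)

# Hypothetical toy data: 1 = lung cancer, 0 = benign nodule.
labels = [1, 1, 1, 0, 0]
# Each inner list stands for the held-out scores from one repeated run.
runs = [
    [0.9, 0.8, 0.4, 0.5, 0.1],
    [0.7, 0.6, 0.9, 0.2, 0.3],
]
print(averaged_auc(labels, runs))
```

In a leave-one-out setting, each entry of a run would be the score produced for one nodule by a model trained on all the other nodules; how the scores were pooled in the study is an assumption here.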
A Bayesian Perspective of Statistical Machine Learning for Big Data
Statistical Machine Learning (SML) refers to a body of algorithms and methods
by which computers are allowed to discover important features of input data
sets which are often very large in size. The very task of feature discovery
from data is essentially the meaning of the keyword 'learning' in SML.
Theoretical justifications for the effectiveness of the SML algorithms are
underpinned by sound principles from different disciplines, such as Computer
Science and Statistics. The theoretical underpinnings particularly justified by
statistical inference methods are together termed statistical learning
theory.
This paper provides a review of SML from a Bayesian decision theoretic point
of view -- where we argue that many SML techniques are closely connected to
making inference by using the so-called Bayesian paradigm. We discuss many
important SML techniques such as supervised and unsupervised learning, deep
learning, online learning and Gaussian processes especially in the context of
very large data sets where these are often employed. We present a dictionary
which maps the key concepts of SML from Computer Science and Statistics. We
illustrate the SML techniques with three moderately large data sets where we
also discuss many practical implementation issues. Thus the review is
especially targeted at statisticians and computer scientists who are aspiring
to understand and apply SML for moderately large to big data sets.
Comment: 26 pages, 3 figures, Review paper
Improved credit scoring model using XGBoost with Bayesian hyper-parameter optimization
Several credit-scoring models have been developed using ensemble classifiers in order to improve the accuracy of assessment. However, among ensemble models, little attention has been paid to tuning the hyper-parameters of the base learners, although these are crucial to constructing ensemble models. This study proposes an improved credit scoring model based on the extreme gradient boosting (XGB) classifier with Bayesian hyper-parameter optimization (XGB-BO). The model comprises two steps. First, data pre-processing handles missing values and scales the data. Second, Bayesian hyper-parameter optimization tunes the hyper-parameters of the XGB classifier, and the tuned values are used to train the model. The model is evaluated on four widely used public datasets: the German, Australian, Lending Club, and Polish datasets. Several state-of-the-art classification algorithms are implemented for predictive comparison with the proposed method. The proposed model showed promising results, with an improvement in accuracy of 4.10%, 3.03%, and 2.76% on the German, Lending Club, and Australian datasets, respectively. According to the evaluation results, it outperformed commonly used techniques such as decision tree, support vector machine, neural network, logistic regression, random forest, and bagging. The experimental results confirmed that the XGB-BO model is suitable for assessing the creditworthiness of applicants.
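The first of the two steps above (handling missing values, then scaling) can be sketched as follows. The abstract does not say which imputation or scaling method was used, so mean imputation and min-max scaling are assumptions chosen for illustration, and the helper names `impute_mean`, `min_max_scale`, and `preprocess` are hypothetical.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

def min_max_scale(column):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def preprocess(rows):
    """Impute and scale every feature column of a row-major table."""
    columns = [list(col) for col in zip(*rows)]
    cleaned = [min_max_scale(impute_mean(col)) for col in columns]
    return [list(row) for row in zip(*cleaned)]

# Hypothetical applicants: [age, loan amount], with two missing values.
rows = [
    [20.0, 1000.0],
    [None, 3000.0],
    [40.0, None],
]
processed = preprocess(rows)
print(processed)
```

In practice the imputation and scaling statistics would be fitted on the training split only and then reused on the test split, so that no information leaks from the held-out data.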
High-Resolution Road Vehicle Collision Prediction for the City of Montreal
Road accidents are a major issue in modern societies, responsible for
millions of deaths and injuries worldwide every year. In Quebec alone, road
accidents were responsible for 359 deaths and 33 thousand injuries in 2018.
In this paper, we show how one can leverage open datasets of a city
like Montreal, Canada, to create high-resolution accident prediction models,
using big data analytics. Compared to other studies in road accident
prediction, we have a much higher prediction resolution, i.e., our models
predict the occurrence of an accident within an hour, on road segments defined
by intersections. Such models could be used in the context of road accident
prevention, but also to identify key factors that can lead to a road accident,
and consequently, help elaborate new policies.
We tested various machine learning methods to deal with the severe class
imbalance inherent to accident prediction problems. In particular, we
implemented the Balanced Random Forest algorithm, a variant of the Random
Forest machine learning algorithm, in Apache Spark. Interestingly, we found that
in our case, Balanced Random Forest does not perform significantly better than
Random Forest.
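The key difference between Balanced Random Forest and plain Random Forest is how each tree's training sample is drawn: a bootstrap sample of the minority class is paired with an equally sized sample, drawn with replacement, from the majority class. The sketch below shows only that sampling step in plain Python, not the Spark implementation the abstract refers to; the function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng, minority_label=1):
    """Return row indices for one tree: equal draws from each class.

    Both classes are sampled with replacement, and the sample size per
    class equals the number of minority examples.
    """
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or not majority:
        raise ValueError("both classes must be present")
    n = len(minority)
    sample = rng.choices(minority, k=n) + rng.choices(majority, k=n)
    rng.shuffle(sample)
    return sample

# Hypothetical imbalanced data: 5 accident rows (1) among 100 rows.
rng = random.Random(0)
labels = [1] * 5 + [0] * 95
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

Each tree in the forest would call `balanced_bootstrap` once, so different trees see different majority-class rows while every tree trains on a balanced sample.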
Experimental results show that 85% of road vehicle collisions are detected by
our model with a false positive rate of 13%. The examples identified as
positive are likely to correspond to high-risk situations. In addition, we
identify the most important predictors of vehicle collisions for the area of
Montreal: the count of accidents on the same road segment during previous
years, the temperature, the day of the year, the hour and the visibility
- …
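The detection rate and false positive rate quoted in this abstract follow from the four cells of a confusion matrix. The sketch below computes them, along with precision, which is worth checking whenever the classes are severely imbalanced. The counts are hypothetical and were chosen only to reproduce the reported 85% and 13% rates; they are not the paper's data, and the function name `detection_rates` is likewise an assumption.

```python
def detection_rates(tp, fn, fp, tn):
    """Return detection rate (recall), false positive rate and precision."""
    return {
        "tpr": tp / (tp + fn),        # share of real collisions detected
        "fpr": fp / (fp + tn),        # share of non-collisions flagged
        "precision": tp / (tp + fp),  # share of flagged cases that are real
    }

# Hypothetical counts: 100 collisions among 10,100 examples.
rates = detection_rates(tp=85, fn=15, fp=1300, tn=8700)
print(rates)
```

With these illustrative counts, most flagged examples are not collisions even though the false positive rate is low, which is the usual consequence of a rare positive class.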