Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization
We aimed to evaluate a computer-aided diagnosis (CADx) system for lung nodule
classification, focusing on (i) the usefulness of gradient tree boosting (XGBoost)
and (ii) the effectiveness of parameter optimization using Bayesian optimization
(Tree Parzen Estimator, TPE) and random search. 99 lung nodules (62 lung
cancers and 37 benign lung nodules) were included from public databases of CT
images. A variant of local binary patterns was used to compute feature
vectors. A support vector machine (SVM) or XGBoost classifier was trained on the
feature vectors and their labels. TPE or random search was used for parameter
optimization of SVM and XGBoost. Leave-one-out cross-validation was used for
optimizing and evaluating the performance of our CADx system. Performance was
evaluated using area under the curve (AUC) of receiver operating characteristic
analysis. The AUC was calculated 10 times and averaged. The best average AUCs
of SVM and XGBoost were 0.850 and 0.896, respectively; both were obtained using
TPE. XGBoost was generally superior to SVM. Optimal parameters for achieving a
high AUC were obtained with fewer trials when using TPE than with random
search. In conclusion, XGBoost was better than SVM
for classifying lung nodules. TPE was more efficient than random search for
parameter optimization.
Comment: 29 pages, 4 figures
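A minimal sketch of the pipeline this abstract describes, using hyperopt's TPE implementation to tune XGBoost under leave-one-out cross-validation with ROC AUC as the objective. The feature extraction step (local binary patterns) is assumed to be done already; X, y, and the search space ranges are illustrative placeholders, not the authors' exact setup.

```python
# Sketch: XGBoost tuned with the Tree Parzen Estimator (hyperopt),
# scored by leave-one-out cross-validation and ROC AUC.
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def loocv_auc(params, X, y):
    """ROC AUC of leave-one-out predicted probabilities for one configuration."""
    model = XGBClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        eval_metric="logloss",
    )
    proba = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

# Illustrative search space (not the ranges used in the paper).
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-3), np.log(0.3)),
}

def run_tpe(X, y, max_evals=50):
    """Run TPE for a fixed budget of trials and return the best parameters."""
    trials = Trials()
    best = fmin(
        fn=lambda p: -loocv_auc(p, X, y),  # hyperopt minimizes, so negate AUC
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
    return best, trials
```

Swapping `algo=tpe.suggest` for `hyperopt.rand.suggest` gives the random-search baseline the abstract compares against.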
High-Resolution Road Vehicle Collision Prediction for the City of Montreal
Road accidents are a major issue in modern societies, responsible for
millions of deaths and injuries every year worldwide. In Quebec alone, in
2018, road accidents were responsible for 359 deaths and 33,000
injuries. In this paper, we show how one can leverage the open datasets of a city
like Montreal, Canada, to create high-resolution accident prediction models,
using big data analytics. Compared to other studies in road accident
prediction, we have a much higher prediction resolution, i.e., our models
predict the occurrence of an accident within an hour, on road segments defined
by intersections. Such models could be used in the context of road accident
prevention, but also to identify key factors that can lead to a road accident,
and consequently, help elaborate new policies.
We tested various machine learning methods to deal with the severe class
imbalance inherent to accident prediction problems. In particular, we
implemented the Balanced Random Forest algorithm, a variant of the Random
Forest machine learning algorithm in Apache Spark. Interestingly, we found that
in our case, Balanced Random Forest does not perform significantly better than
Random Forest.
Experimental results show that 85% of road vehicle collisions are detected by
our model with a false positive rate of 13%. The examples identified as
positive are likely to correspond to high-risk situations. In addition, we
identify the most important predictors of vehicle collisions for the area of
Montreal: the count of accidents on the same road segment during previous
years, the temperature, the day of the year, the hour, and the visibility.
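A minimal single-machine sketch of the comparison described above, using imbalanced-learn's BalancedRandomForestClassifier in place of the authors' Apache Spark implementation. X and y (1 = collision on a road segment in a given hour, 0 = no collision), the train/test split, and the metrics helper are illustrative assumptions, not the paper's evaluation protocol.

```python
# Sketch: Random Forest vs. Balanced Random Forest on an imbalanced
# collision-prediction dataset, reporting detection rate (recall) and
# false positive rate.
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def detection_and_false_positive_rate(model, X, y):
    """Fit on a stratified split and return (recall, false positive rate)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    return tp / (tp + fn), fp / (fp + tn)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    "balanced_random_forest": BalancedRandomForestClassifier(n_estimators=100),
}
# Usage (with real features X and labels y):
# for name, m in models.items():
#     print(name, detection_and_false_positive_rate(m, X, y))
```

The Balanced Random Forest variant undersamples the majority class within each bootstrap, which is why it is a natural candidate for the severe class imbalance the abstract mentions, even though the authors found it did not significantly outperform a plain Random Forest in their case.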
Learning Multiple Defaults for Machine Learning Algorithms
The performance of modern machine learning methods depends heavily on their
hyperparameter configurations. One simple way of selecting a configuration is
to use default settings, often proposed along with the publication and
implementation of a new algorithm. Those default values are usually chosen in
an ad hoc manner to work well enough on a wide variety of datasets. To address
this problem, different automatic hyperparameter configuration algorithms have
been proposed, which select an optimal configuration per dataset. This
principled approach usually improves performance, but adds additional
algorithmic complexity and computational costs to the training procedure. As an
alternative to this, we propose learning a set of complementary default values
from a large database of prior empirical results. Selecting an appropriate
configuration on a new dataset then requires only a simple, efficient and
embarrassingly parallel search over this set. We demonstrate the effectiveness
and efficiency of the approach we propose in comparison to random search and
Bayesian optimization.
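A minimal sketch of the usage this abstract proposes: given a pre-learned set of complementary default configurations, choose the best one on a new dataset by a simple, embarrassingly parallel search. The candidate list, the gradient-boosting learner, and the cross-validation scoring below are illustrative assumptions, not the defaults learned in the paper.

```python
# Sketch: evaluating a small set of learned default configurations in
# parallel and picking the best one for a new dataset.
from concurrent.futures import ProcessPoolExecutor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical complementary defaults; in the paper these would be learned
# from a large database of prior empirical results.
DEFAULTS = [
    {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3},
    {"n_estimators": 500, "learning_rate": 0.05, "max_depth": 2},
    {"n_estimators": 200, "learning_rate": 0.2, "max_depth": 5},
]

def score_default(args):
    """Cross-validated ROC AUC of one candidate configuration."""
    params, X, y = args
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

def select_default(X, y):
    """Evaluate all defaults in parallel and return the best-scoring one."""
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(score_default, [(p, X, y) for p in DEFAULTS]))
    return DEFAULTS[max(range(len(DEFAULTS)), key=scores.__getitem__)]
```

Because each candidate is evaluated independently, the search parallelizes trivially across configurations, which is the efficiency argument the abstract makes relative to sequential methods such as Bayesian optimization.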
