415 research outputs found
An investigation into machine learning approaches for forecasting spatio-temporal demand in ride-hailing service
In this paper, we present machine learning approaches for characterizing and
forecasting the short-term demand for on-demand ride-hailing services. We
propose the spatio-temporal estimation of the demand that is a function of
variable effects related to traffic, pricing and weather conditions. With
respect to the methodology, a single decision tree, bootstrap-aggregated
(bagged) decision trees, random forest, boosted decision trees, and artificial
neural network for regression have been adapted and systematically compared
using various statistics, e.g. R-square, Root Mean Square Error (RMSE), and
slope. To better assess the quality of the models, they have been tested on a
real case study using the data of DiDi Chuxing, the main on-demand ride-hailing
service provider in China. In the current study, 199,584 time-slots describing
the spatio-temporal ride-hailing demand have been extracted with an
aggregated-time interval of 10 minutes. All the methods are trained and validated
on the basis of two independent samples from this dataset. The results revealed
that boosted decision trees provide the best prediction accuracy (RMSE=16.41),
while avoiding the risk of over-fitting, followed by artificial neural network
(20.09), random forest (23.50), bagged decision trees (24.29) and single
decision tree (33.55). Comment: Currently under review for journal publication
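The comparison statistics named in this abstract (R-square, RMSE, and slope) can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes R-square means the coefficient of determination and that "slope" means the least-squares slope of predicted values regressed on observed values, which the abstract does not specify. The function name `regression_stats` and the toy numbers are hypothetical and are not taken from the DiDi Chuxing data.

```python
import math


def regression_stats(observed, predicted):
    """Return (rmse, r_squared, slope) for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # RMSE: square root of the mean squared prediction error.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    rmse = math.sqrt(sse / n)

    # R-square, assumed here to be 1 - SSE / SST.
    sst = sum((o - mean_obs) ** 2 for o in observed)
    r_squared = 1.0 - sse / sst

    # Slope of the least-squares line of predicted regressed on observed
    # (an assumption; a value near 1 means little systematic scaling bias).
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    slope = cov / sst
    return rmse, r_squared, slope


# Made-up demand counts for four time-slots, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
rmse, r2, slope = regression_stats(observed, predicted)
```

Computing all three on a held-out sample, as the abstract describes, lets a lower RMSE be read alongside the slope, which would expose a model that systematically under- or over-predicts large demand values.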
Feature selection for multi-label learning
Feature selection plays an important role in machine learning and data mining, and it is often applied as a data pre-processing step. This task can speed up learning algorithms and sometimes improve their performance. In multi-label learning, label dependence is considered another aspect that can contribute to improving learning performance. A replicable, wide-ranging systematic review that we performed corroborates this idea. Based on this information, it is believed that considering label dependence during feature selection can lead to better learning performance. The hypothesis of this work is that multi-label feature selection algorithms that consider label dependence will perform better than the ones that disregard it. To this end, we propose multi-label feature selection algorithms that take into account label relations. These algorithms were experimentally compared to the standard approach for feature selection, showing good performance in terms of feature reduction and predictability of the classifiers built using the selected features. São Paulo Research Foundation (FAPESP) (grant 2011/02393-4)
Are screening methods useful in feature selection? An empirical study
Filter or screening methods are often used as a preprocessing step for
reducing the number of variables used by a learning algorithm in obtaining a
classification or regression model. While there are many such filter methods,
there is a need for an objective evaluation of these methods. Such an
evaluation is needed to compare them with each other and also to answer whether
they are at all useful, or a learning algorithm could do a better job without
them. For this purpose, many popular screening methods are partnered in this
paper with three regression learners and five classification learners and
evaluated on ten real datasets to obtain accuracy criteria such as R-square and
area under the ROC curve (AUC). The obtained results are compared through curve
plots and comparison tables in order to find out whether screening methods help
improve the performance of learning algorithms and how they compare with each
other. Our findings revealed that the screening methods were useful in
improving the prediction of the best learner on two regression and two
classification datasets out of the ten datasets evaluated. Comment: 29 pages, 4 figures, 21 tables
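As a rough illustration of what a filter (screening) method does before any learner is trained, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k. The correlation criterion is only one common choice and is an assumption here; the abstract does not list which screening methods were evaluated. The names `pearson` and `screen_features` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)


def screen_features(rows, target, k):
    """Return the sorted indices of the k features most correlated with target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Toy data: feature 0 tracks the target, feature 1 is constant,
# feature 2 is close to noise.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.7],
]
target = [2.1, 3.9, 6.2, 7.8]
selected = screen_features(rows, target, k=1)
```

The learner would then be trained only on the selected columns, and its accuracy compared against the same learner trained on all features, which is the kind of comparison the study reports.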
Filter-wrapper combination and embedded feature selection for gene expression data
Biomedical and bioinformatics datasets are generally large in terms of their number of features, and they include redundant and irrelevant features, which affect the effectiveness and efficiency of classification of these datasets. Several different feature selection methods have been utilised in various fields, including bioinformatics, to reduce the number of features. This study utilised Filter-Wrapper combination and embedded (LASSO) feature selection methods on both high and low dimensional datasets before classification was performed. The results illustrate that the combination of filter and wrapper feature selection to create a hybrid form of feature selection provides better performance than using filter only. In addition, LASSO performed better on high dimensional data.
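To make the idea of embedded selection concrete, here is a minimal sketch of LASSO fitted by cyclic coordinate descent, where features whose weight is shrunk exactly to zero are treated as discarded. It assumes the objective (1/(2n))·||y − Xw||² + λ·||w||₁ with no intercept and already centred data; the study's actual LASSO implementation and settings are not given in the abstract. The function names and the toy design matrix are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(rows, target, lam, n_iter=100):
    """Fit LASSO weights by cyclic coordinate descent (no intercept term)."""
    n = len(rows)
    p = len(rows[0])
    weights = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for row, y in zip(rows, target):
                partial = sum(row[k] * weights[k] for k in range(p) if k != j)
                rho += row[j] * (y - partial)
                z += row[j] ** 2
            rho /= n
            z /= n
            weights[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return weights


# Toy centred data: the target depends strongly on feature 0 (coefficient 2)
# and only weakly on feature 1 (coefficient 0.1).
rows = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
target = [2.1, -1.9, 1.9, -2.1]
weights = lasso_coordinate_descent(rows, target, lam=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
```

Because selection happens while the model is fitted, no separate filter or wrapper pass is needed; the penalty strength λ controls how many features survive.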
New Hybrid Learning Models for Multi-Label Classification and Ranking (Nuevos Modelos de Aprendizaje Híbrido para Clasificación y Ordenamiento Multi-Etiqueta)
In the last decade, multi-label learning has become an important area of research
due to the large number of real-world problems that contain multi-label data. This
doctoral thesis focuses on the multi-label learning paradigm. Two problems were
studied: first, improving the performance of algorithms on complex multi-label
data, and second, improving performance by means of unlabeled data.
The first problem was addressed by means of feature estimation methods. The effectiveness
of the proposed feature estimation methods was evaluated by improving
the performance of multi-label lazy algorithms. Parametrizing the distance
functions with a weight vector made it possible to retrieve examples with relevant
label sets for classification. The effectiveness of the feature
estimation methods in the feature selection task was also demonstrated. In addition, a lazy
algorithm based on a data gravitation model was proposed. This lazy algorithm
achieves a good trade-off between effectiveness and efficiency in
multi-label lazy learning.
The second problem was solved by means of active learning techniques. The active
learning methods reduced the costs of the data labeling process and of
training an accurate model. Two active learning strategies were proposed. The
first strategy effectively solves the multi-label active learning problem. In this
strategy, two measures that represent the utility of an unlabeled example were
defined and combined. The second active learning strategy
addresses the batch-mode active learning problem, where the aim is to select a
batch of unlabeled examples that are informative and whose information redundancy
is minimal. Batch-mode active learning was formulated as a multi-objective
problem in which three measures were optimized. The multi-objective problem was
solved through an evolutionary algorithm.
This thesis also led to the creation of a computational framework for developing
active learning methods and supporting experimentation in the active
learning area. In addition, a methodology based on non-parametric
tests that allows a more adequate evaluation of active learning performance was
proposed. All proposed methods were evaluated by means of extensive
experimental studies. Several multi-label datasets from different domains were used, and
the methods were compared to the most significant state-of-the-art algorithms. The
results were validated using non-parametric statistical tests. The evidence showed
the effectiveness of the proposed methods, confirming the hypotheses formulated at
the beginning of this thesis.
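The idea of parametrizing a distance function with a feature weight vector for a multi-label lazy (nearest-neighbour) learner can be sketched as follows. This is only an illustration under stated assumptions: the weights are supplied by hand rather than estimated, the distance is a weighted Euclidean distance, and a label is predicted when more than half of the k neighbours carry it. None of these choices is confirmed by the abstract, and the names `weighted_distance` and `predict_labels` and the toy examples are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance in which each feature is scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def predict_labels(query, examples, weights, k):
    """Predict the labels held by more than half of the k nearest examples."""
    ranked = sorted(
        examples,
        key=lambda ex: weighted_distance(query, ex[0], weights),
    )
    votes = {}
    for _, labels in ranked[:k]:
        for label in labels:
            votes[label] = votes.get(label, 0) + 1
    return {label for label, count in votes.items() if count > k / 2}


# Toy multi-label data: feature 0 separates the groups, feature 1 is noise.
examples = [
    ((0.0, 9.0), {"sports"}),
    ((0.1, 1.0), {"sports", "news"}),
    ((0.2, 5.0), {"sports"}),
    ((1.0, 0.0), {"music"}),
    ((0.9, 0.2), {"music", "news"}),
    ((1.1, 0.1), {"music"}),
]
query = (0.05, 0.1)

unweighted = predict_labels(query, examples, (1.0, 1.0), k=3)
weighted = predict_labels(query, examples, (1.0, 0.0), k=3)
```

With equal weights the noisy feature dominates the distance and pulls in neighbours from the wrong group; setting its weight to zero retrieves neighbours whose label sets match the query's region, which is the effect the abstract attributes to feature estimation.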