682 research outputs found
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML: the ML community usually focuses on developing ML
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
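The false discovery rate control mentioned in the abstract above can be illustrated with a minimal sketch of the Benjamini-Yekutieli step-up procedure. The function name, the default `alpha`, and the example p-values are assumptions chosen for illustration; this is not the CleanML authors' code.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Illustrative sketch of the BY step-up procedure, which controls the
    false discovery rate under arbitrary dependence between tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from five paired comparisons.
decisions = benjamini_yekutieli([0.6, 0.001, 0.041, 0.008, 0.039], alpha=0.05)
```

Because the procedure is step-up, a p-value that misses its own threshold is still rejected when a larger p-value at a higher rank meets its threshold.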
Study of Methods for Building Classifier Ensembles and Applications
Artificial intelligence is concerned with creating computer systems that behave intelligently. Within this area, machine learning studies the creation of systems that learn by themselves.
One type of machine learning is supervised learning, in which the system is given both the inputs and the expected output and learns from these data. A system of this kind is called a classifier.
Sometimes, in the set of examples that the system uses to learn, the number of examples of one type is much larger than the number of examples of another type. When this happens, the set is said to be imbalanced.
The combination of several classifiers is called an "ensemble", and it often gives better results than any of its individual members. One of the keys to the good performance of ensembles is diversity.
This thesis focuses on the development of new ensemble construction algorithms, centered on diversity-increasing techniques and on imbalanced problems. Additionally, these techniques are applied to the solution of several industrial problems.
Ministerio de Economía y Competitividad, project TIN-2011-2404
Doctor of Philosophy dissertation
Scene labeling is the problem of assigning an object label to each pixel of a given image. It is the primary step towards image understanding and unifies object recognition and image segmentation in a single framework. A perfect scene labeling framework detects and densely labels every region and every object that exists in an image. This task is of substantial importance in a wide range of applications in computer vision. Contextual information plays an important role in scene labeling frameworks. A contextual model utilizes the relationships among the objects in a scene to facilitate object detection and image segmentation. Using contextual information in an effective way is one of the main questions that should be answered in any scene labeling framework. In this dissertation, we develop two scene labeling frameworks that rely heavily on contextual information to improve the performance over state-of-the-art methods. The first model, called the multiclass multiscale contextual model (MCMS), uses contextual information from multiple objects and at different scales for learning discriminative models in a supervised setting. The MCMS model incorporates crossobject and interobject information into one probabilistic framework, and thus is able to capture geometrical relationships and dependencies among multiple objects in addition to local information from each single object present in an image. The second model, called the contextual hierarchical model (CHM), learns contextual information in a hierarchy for scene labeling. At each level of the hierarchy, a classifier is trained based on downsampled input images and outputs of previous levels. The CHM then incorporates the resulting multiresolution contextual information into a classifier to segment the input image at original resolution. This training strategy allows for optimization of a joint posterior probability at multiple resolutions through the hierarchy.
We demonstrate the performance of CHM on different challenging tasks such as outdoor scene labeling and edge detection in natural images and membrane detection in electron microscopy images. We also introduce two novel classification methods. WNS-AdaBoost speeds up the training of AdaBoost by providing a compact representation of a training set. Disjunctive normal random forest (DNRF) is an ensemble method that is able to learn complex decision boundaries and achieves low generalization error by optimizing a single objective function for each weak classifier in the ensemble. Finally, a segmentation framework is introduced that exploits both shape information and regional statistics to segment irregularly shaped intracellular structures such as mitochondria in electron microscopy images.
Comparative Study of Different Methods in Vibration-Based Terrain Classification for Wheeled Robots with Shock Absorbers
Open access article
Autonomous robots that operate in the field can enhance their security and efficiency by
accurate terrain classification, which can be realized by means of robot-terrain interaction-generated
vibration signals. In this paper, we explore the vibration-based terrain classification (VTC),
in particular for a wheeled robot with shock absorbers. Because the vibration sensors are
usually mounted on the main body of the robot, the vibration signals are dampened significantly,
which results in the vibration signals collected on different terrains being more difficult to
discriminate. Hence, the existing VTC methods applied to a robot with shock absorbers may degrade.
The contributions are two-fold: (1) Several experiments are conducted to exhibit the performance of
the existing feature-engineering and feature-learning classification methods; and (2) Building on
the long short-term memory (LSTM) network, we propose a one-dimensional convolutional LSTM
(1DCL)-based VTC method to learn both spatial and temporal characteristics of the dampened
vibration signals. The experiment results demonstrate that: (1) The feature-engineering methods,
which are efficient in VTC of the robot without shock absorbers, are not so accurate in our project;
meanwhile, the feature-learning methods are better choices; and (2) The 1DCL-based VTC method
outperforms the conventional methods with an accuracy of 80.18%, which exceeds the second-best method
(LSTM) by 8.23%.
Wind Power Forecasting Methods Based on Deep Learning: A Survey
Accurate wind power forecasting for wind farms can effectively reduce the enormous impact on grid operation safety when high-penetration intermittent power sources are connected to the power grid. Aiming to provide reference strategies for researchers as well as practical applications, this paper provides a literature review and methods analysis of deep learning, reinforcement learning and transfer learning in wind speed and wind power forecasting modeling. Usually, wind speed and wind power forecasting around a wind farm requires calculating the definite state at the next moment, which is usually done based on the state of the atmosphere, encompassing nearby atmospheric pressure, temperature, roughness, and obstacles. As an effective method of high-dimensional feature extraction, a deep neural network can theoretically handle arbitrary nonlinear transformations through proper structural design, such as adding noise to outputs, using evolutionary learning to optimize hidden layer weights, or optimizing the objective function so as to retain information that improves output accuracy while filtering out information that is irrelevant or has little effect on forecasting. Establishing high-precision wind speed and wind power forecasting models is always a challenge due to the randomness, instantaneity and seasonal characteristics.
Structural Data Recognition with Graph Model Boosting
This paper presents a novel method for structural data recognition using a
large number of graph models. In general, prevalent methods for structural data
recognition have two shortcomings: 1) Only a single model is used to capture
structural variation. 2) Naive recognition methods are used, such as the
nearest neighbor method. In this paper, we propose strengthening the
recognition performance of these models as well as their ability to capture
structural variation. The proposed method constructs a large number of graph
models and trains decision trees using the models. This paper makes two main
contributions. The first is a novel graph model that can quickly perform
calculations, which allows us to construct several models in a feasible amount
of time. The second contribution is a novel approach to structural data
recognition: graph model boosting. Comprehensive structural variations can be
captured with a large number of graph models constructed in a boosting
framework, and a sophisticated classifier can be formed by aggregating the
decision trees. Consequently, we can carry out structural data recognition with
powerful recognition capability in the face of comprehensive structural
variation. The experiments show that the proposed method achieves impressive
results and outperforms existing methods on datasets from the IAM graph database
repository.
Comment: 8 pages
Automatic robust estimation for exponential smoothing: Perspectives from statistics and machine learning
A major challenge in automating the production of a large number of forecasts, as often required in many business applications, is the need for robust and reliable predictions. Increased noise, outliers and structural changes in the series, all too common in practice, can severely affect the quality of forecasting. We investigate ways to increase the reliability of exponential smoothing forecasts, the most widely used family of forecasting models in business forecasting. We consider two alternative sets of approaches, one stemming from statistics and one from machine learning. To this end, we adapt M-estimators, boosting and inverse boosting to parameter estimation for exponential smoothing. We propose appropriate modifications that are necessary for time series forecasting while aiming to obtain scalable algorithms. We evaluate the various estimation methods using multiple real datasets and find that several approaches outperform the widely used maximum likelihood estimation. The novelty of this work lies in (1) demonstrating the usefulness of M-estimators, (2) demonstrating the usefulness of inverse boosting, which outperforms standard boosting approaches, and (3) providing a comparative look at statistics-inspired versus machine-learning-inspired approaches.
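To make the idea of M-estimation for exponential smoothing concrete, the following is a minimal sketch that fits the smoothing parameter of simple exponential smoothing by minimizing a Huber loss over one-step-ahead errors instead of a squared loss. The choice of the Huber loss, the grid search, the initialization of the level at the first observation, and all function names are assumptions for illustration; the paper's actual estimators and modifications may differ.

```python
def ses_one_step_errors(series, alpha):
    """One-step-ahead forecast errors of simple exponential smoothing."""
    level = series[0]  # assumed initialization: first observation
    errors = []
    for y in series[1:]:
        errors.append(y - level)                  # forecast is the current level
        level = alpha * y + (1 - alpha) * level   # update the level
    return errors


def squared_loss(e):
    return e * e


def huber_loss(e, delta=1.0):
    """Quadratic near zero, linear in the tails, so outliers weigh less."""
    a = abs(e)
    return 0.5 * e * e if a <= delta else delta * (a - 0.5 * delta)


def fit_alpha(series, loss, grid_size=99):
    """Grid search for the alpha in (0, 1) that minimizes the total loss."""
    candidates = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(
        candidates,
        key=lambda a: sum(loss(e) for e in ses_one_step_errors(series, a)),
    )


# Hypothetical series with a single outlier at position 5.
demo = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8, 10.0, 10.2]
alpha_squared = fit_alpha(demo, squared_loss)
alpha_huber = fit_alpha(demo, huber_loss)
```

Comparing `alpha_squared` with `alpha_huber` on a series of interest shows how much a few extreme observations drive the fitted parameter under each loss.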
Essays On Random Forest Ensembles
A random forest is a popular machine learning ensemble method that has proven successful in solving a wide range of classification problems. While other successful classifiers, such as boosting algorithms or neural networks, admit natural interpretations as maximum likelihood, a suitable statistical interpretation is much more elusive for a random forest. In the first part of this thesis, we demonstrate that a random forest is a fruitful framework in which to study AdaBoost and deep neural networks. We explore the concept and utility of interpolation, the ability of a classifier to perfectly fit its training data. In the second part of this thesis, we place a random forest on more sound statistical footing by framing it as kernel regression with the proximity kernel. We then analyze the parameters that control the bandwidth of this kernel and discuss useful generalizations.
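The kernel regression view described above can be sketched as follows, assuming the proximity between two points is the fraction of trees in which they fall into the same leaf, and that the prediction is a proximity-weighted average of training responses. The leaf assignments below are hypothetical inputs (in practice they would come from a fitted forest), and the thesis's exact formulation may differ.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two points land in the same leaf."""
    shared = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return shared / len(leaves_a)


def proximity_kernel_predict(train_leaves, train_y, query_leaves):
    """Proximity-weighted average of the training responses."""
    weights = [proximity(query_leaves, leaves) for leaves in train_leaves]
    total = sum(weights)
    if total == 0:
        # No training point shares a leaf with the query: fall back to the mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical leaf indices for three training points across three trees.
train_leaves = [[0, 1, 2], [0, 1, 3], [4, 5, 6]]
train_y = [1.0, 3.0, 10.0]
prediction = proximity_kernel_predict(train_leaves, train_y, [0, 1, 3])
```

In this view, deeper trees produce smaller leaves, so fewer training points share a leaf with the query; that is one way to read the bandwidth role the abstract attributes to the forest's parameters.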