Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
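The recall figure quoted above can be checked from the counts given in the abstract. The following is a minimal sketch: the function names are illustrative and do not come from the paper. The abstract does not say how many of the 16 mistakes were false alarms, so only recall is recomputed here.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of raised alarms that correspond to real wrapper changes."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real wrapper changes that were detected."""
    return true_pos / (true_pos + false_neg)


# From the abstract: 35 of the 37 wrapper changes were detected, so 2 were missed.
verification_recall = recall(true_pos=35, false_neg=37 - 35)
print(round(verification_recall, 2))  # 0.95
```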
GENESIM: genetic extraction of a single, interpretable model
Models obtained by decision tree induction techniques excel in being
interpretable. However, they can be prone to overfitting, which results in a low
predictive performance. Ensemble techniques are able to achieve a higher
accuracy. However, this comes at a cost of losing interpretability of the
resulting model. This makes ensemble techniques impractical in applications
where decision support, instead of decision making, is crucial.
To bridge this gap, we present the GENESIM algorithm that transforms an
ensemble of decision trees to a single decision tree with an enhanced
predictive performance by using a genetic algorithm. We compared GENESIM to
prevalent decision tree induction and ensemble techniques using twelve publicly
available data sets. The results show that GENESIM achieves a better predictive
performance on most of these data sets than decision tree induction techniques
and a predictive performance in the same order of magnitude as the ensemble
techniques. Moreover, the resulting model of GENESIM has a very low complexity,
making it very interpretable, in contrast to ensemble techniques.
Comment: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in
Complex Systems
Forecasting with time series imaging
Feature-based time series representations have attracted substantial
attention in a wide range of time series analysis methods. Recently, the use of
time series features for forecast model averaging has been an emerging research
focus in the forecasting community. Nonetheless, most of the existing
approaches depend on the manual choice of an appropriate set of features.
Exploiting machine learning methods to extract features from time series
automatically becomes crucial in state-of-the-art time series analysis. In this
paper, we introduce an automated approach to extract time series features based
on time series imaging. We first transform time series into recurrence plots,
from which local features can be extracted using computer vision algorithms.
The extracted features are used for forecast model averaging. Our experiments
show that forecasting based on automatically extracted features, with less
human intervention and a more comprehensive view of the raw time series data,
yields highly comparable performances with the best methods in the largest
forecasting competition dataset (M4) and outperforms the top methods in the
Tourism forecasting competition dataset.
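The imaging step described above, turning a time series into a recurrence plot, can be sketched as follows. This is a simplified illustration and not the authors' implementation. It assumes a scalar series with no time-delay embedding, uses absolute difference as the distance, and the name `recurrence_plot` and the threshold parameter `eps` are hypothetical.

```python
import numpy as np


def recurrence_plot(series, eps=None):
    """Return a recurrence plot of a 1-D series.

    Entry (i, j) compares the values at times i and j. With eps=None the raw
    pairwise distances are returned; otherwise the result is a binary image
    with 1 where the two values are within eps of each other.
    """
    x = np.asarray(series, dtype=float)
    # Pairwise absolute differences via broadcasting: shape (n, n).
    dist = np.abs(x[:, None] - x[None, :])
    if eps is None:
        return dist
    return (dist <= eps).astype(np.uint8)


# A strictly alternating series produces a checkerboard image.
rp = recurrence_plot([0, 1, 0, 1], eps=0.5)
print(rp)
```

The resulting matrix is symmetric with ones on the diagonal, and can be treated as a grayscale or binary image from which an image feature extractor derives the inputs for forecast model averaging.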