thesis

Reliability estimation of regressional predictions on data streams

Abstract

With today's technology it is easy to collect data continuously. Still, how to extract knowledge from potentially infinite data streams remains an open problem. Because of the constraints this setting imposes, stream processing methods have to be well designed, space-efficient, computationally simple and fast. Typically, data analysis is done on a fixed history of the data stream defined by a sliding window. The quality of predictions is usually defined by their average accuracy. However, when dealing with real-time data it can also be important to know the reliability of the models’ output values. In this thesis we deal with online reliability estimation of individual predictions on data streams. We consider different interval reliability estimators, based on maximum likelihood, bootstrap and local neighborhood approaches, for working on continuous dynamic data. We implement these methods on different regression models and test them on several real and artificial regression problems with various sizes of the sliding window. The performance of the interval estimates is evaluated using estimates of the prediction interval coverage probability, the relative mean prediction interval and the combined statistic. We compare the execution times of learning algorithms with and without the reliability estimates, as well as their prediction accuracy when given the same time constraint. We also analyze the results visually.
