With today's technology it is easy to collect data continuously. Still, extracting knowledge from potentially infinite data streams remains an open problem. Because of the constraints that streams impose, stream processing methods have to be well designed, space-efficient, computationally simple and fast. Typically, data analysis is done on a fixed history of the data stream defined by a sliding window. The quality of predictions is usually measured by their average accuracy. However, when dealing with real-time data it can also be important to know the reliability of the models' individual output values.



In this thesis we deal with online reliability estimation of individual predictions on data streams. We consider different interval reliability estimators, based on the maximum likelihood, bootstrap and local neighborhood approaches, for working on continuous dynamic data.

 

We implement these methods on different regression models and test them on several real and artificial regression problems with various sizes of the sliding window. The performance of the interval estimates is evaluated using estimates of the prediction interval coverage probability, the relative mean prediction interval and the combined statistic. We compare the execution times of the learning algorithms with and without the reliability estimates, as well as their prediction accuracy when given the same time constraint. We also analyze the results visually.
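As a rough illustration of the first two evaluation measures, the sketch below computes a coverage probability and a relative mean interval width for a batch of interval predictions. The function names and the normalization by the range of the observed target values are assumptions made for this example; the thesis may define the relative mean prediction interval differently, and the combined statistic is not sketched here because the abstract does not define it.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def coverage_probability(y_true: Sequence[float],
                         intervals: Sequence[Interval]) -> float:
    """Fraction of true values that fall inside their predicted interval."""
    if not y_true or len(y_true) != len(intervals):
        raise ValueError("y_true and intervals must be non-empty and equally long")
    hits = sum(1 for y, (lower, upper) in zip(y_true, intervals)
               if lower <= y <= upper)
    return hits / len(y_true)


def relative_mean_interval(y_true: Sequence[float],
                           intervals: Sequence[Interval]) -> float:
    """Mean interval width divided by the range of the observed targets.

    The normalization by the target range is an assumption of this sketch.
    """
    if not y_true or len(y_true) != len(intervals):
        raise ValueError("y_true and intervals must be non-empty and equally long")
    target_range = max(y_true) - min(y_true)
    if target_range == 0:
        raise ValueError("target range is zero; relative width is undefined")
    mean_width = sum(upper - lower for lower, upper in intervals) / len(intervals)
    return mean_width / target_range


# Hypothetical values from one sliding window.
y_true = [1.0, 2.0, 3.0, 5.0]
intervals = [(0.5, 1.5), (2.5, 3.5), (2.0, 4.0), (4.0, 6.0)]
print(coverage_probability(y_true, intervals))    # 0.75
print(relative_mean_interval(y_true, intervals))  # 0.375
```

A good interval estimator keeps the coverage close to the nominal confidence level while keeping the relative width small, which is why the two measures are reported together.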

Hren, Boštjan

With today's technology, continuous data collection is straightforward. Nevertheless, extracting knowledge from potentially infinite data streams remains an open problem. Because of certain constraints, data processing methods must be well designed, space-efficient, computationally simple and fast. The analysis is often performed on a fixed history of the data stream, determined by a sliding window. The quality of an algorithm's predictions is usually assessed by their average accuracy. However, when dealing with real-time data, the reliability of the output values can be just as important.

In this thesis we address reliability estimation of individual predictions when learning on data streams. We investigated various methods that produce interval reliability estimates using the maximum likelihood, bootstrap and local neighborhood approaches, for working on continuous dynamic data.

We implemented the methods with various machine learning algorithms and tested them on several real and artificial regression problems with different sliding window sizes. We evaluated the performance of the interval estimators using estimates of the coverage probability, the relative mean prediction interval and the combined statistic. We compared the execution times of the learning algorithms with and without reliability estimates, as well as the prediction accuracy achieved under the same time constraints. We also analyze the results visually.


Reliability estimation of regressional predictions on data streams

Repository of the University of Ljubljana

University of Ljubljana Computer and Information Science ePrints.fri
