Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrive continuously and need to be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, and therefore should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets with temporal dependence. We formulate decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure of classification performance that accounts for temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.
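The contrast between the standard Kappa statistic and a temporal-dependence-aware variant can be sketched as follows. This is our illustrative reading of the idea, not necessarily the paper's exact combined measure: `kappa_temporal` scores the classifier against a "no-change" baseline that always predicts the previous label, a baseline that is hard to beat when the stream is temporally dependent (all function names here are ours).

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa_statistic(y_true, y_pred):
    # Standard Kappa: the chance baseline predicts labels at random
    # according to the classifier's own label frequencies.
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p0 = accuracy(y_true, y_pred)
    pe = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p0 - pe) / (1 - pe)

def kappa_temporal(y_true, y_pred):
    # Temporal variant: the baseline is the persistent ("no-change")
    # classifier that always predicts the previous true label.
    # (Undefined when that baseline is perfect, i.e. a constant stream.)
    p0 = accuracy(y_true, y_pred)
    pe = accuracy(y_true[1:], y_true[:-1])
    return (p0 - pe) / (1 - pe)
```

On a temporally dependent stream, a classifier can score a high accuracy and a decent Kappa while barely beating the persistent baseline, which the temporal variant exposes.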
Learning from Data Streams: An Overview and Update
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning from data streams in academic and industrial settings.
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task. Choosing which instances to label is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist for the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering step allows us to cover the whole data space, avoiding oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
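A minimal sketch of the cluster-then-query idea described above, under our own simplifying assumptions (plain k-means with deterministic farthest-point initialization and one query per cluster; the paper's actual selection strategy may differ):

```python
def dist2(a, b):
    # Squared Euclidean distance between two points (tuples of floats).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    # Deterministic farthest-point initialization so seeds spread out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=20):
    # A bare-bones Lloyd's k-means; enough to illustrate the idea.
    centroids = init_centroids(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids

def select_queries(batch, budget):
    # Cluster the unlabeled batch, then query the label of the instance
    # nearest each centroid, so the queries cover the whole data space.
    centroids = kmeans(batch, budget)
    return [min(batch, key=lambda p: dist2(p, c)) for c in centroids]
```

On a batch with two well-separated groups and a budget of two, this picks one representative from each group instead of two from the denser one.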
On calibrating the completometer for the mammalian fossil record
We know that the fossil record is incomplete. But how incomplete? Here we very coarsely estimate the completeness of the mammalian record in the Miocene, assuming that the duration of a mammalian species is about 1 Myr and that species diversity has stayed constant and is structurally comparable to the taxonomic diversity today. The overall completeness under these assumptions appears to be around 4%, but there are large differences across taxonomic groups. We find that the fossil record of proboscideans and perissodactyls as we know it for the Miocene must be close to complete, while we might know less than 15% of the artiodactyl or carnivore species and only about 1% of the primate species of the Miocene. The record of small mammals appears much less complete than that of large mammals.
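The back-of-the-envelope logic can be illustrated as follows, with entirely hypothetical numbers (the abstract does not give the actual diversity figures): if species last about 1 Myr and standing diversity is constant, an 18-Myr interval turns over roughly diversity × 18 species in total, and completeness is the known fraction of that.

```python
def expected_species(standing_diversity, interval_myr, species_duration_myr=1.0):
    # Crude turnover model: with constant standing diversity, roughly one
    # full cohort of species is replaced per species-duration, so the
    # interval produces about diversity * interval / duration species.
    return standing_diversity * interval_myr / species_duration_myr

def completeness(known_fossil_species, standing_diversity, interval_myr):
    # Fraction of the species expected to have existed that we actually
    # know from fossils.
    return known_fossil_species / expected_species(standing_diversity, interval_myr)
```

For example, a hypothetical standing diversity of 100 species over an 18-Myr Miocene implies about 1,800 species total; knowing 72 of them would give the 4% completeness figure cited above.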
The NOW Database of Fossil Mammals
This chapter was completed and accepted after revision in August 2021. NOW does not have dedicated institutional funding. The database and data development are funded from regular research projects of the NOW Community members. Current and recent (last 5 years) funding sources include the Ella and Georg Ehrnrooth Foundation and the Academy of Finland. ICP researchers acknowledge funding from the Generalitat de Catalunya (CERCA Programme), R+D+I projects PID2020-117289GB-I00 and PID2020-116908GB-I00 (MCIN/AEI/10.13039/501100011033/), and the consolidated research group of the Generalitat de Catalunya 2022 SGR 00620. This is Bernor's NSF FuTRES publication 35. L. K. Säilä acknowledges an Academy of Finland Postdoctoral grant (275551). We thank three reviewers for helpful suggestions regarding the manuscript text. Contributions from the Valio Armas Korvenkontio Unit of Dental Anatomy in Relation to Evolutionary Theory are acknowledged.

NOW (New and Old Worlds) is a global database of fossil mammal occurrences, currently containing around 68,000 locality-species entries. The database spans the last 66 million years, with its primary focus on the last 23 million years. Whereas the database contains records from all continents, its main focus and coverage have historically been on Eurasia. The database includes primarily, but not exclusively, terrestrial mammals. It covers a large part of the currently known mammalian fossil record, focusing on classical and actively researched fossil localities. The database is managed in collaboration with an international advisory board of experts. Rather than being a static archive, it emphasizes the continuous integration of the community's new knowledge, data curation, and consistency of scientific interpretations. The database records species occurrences at localities worldwide, as well as ecological characteristics of fossil species, geological contexts of localities, and more. The NOW database is primarily used for two purposes: (1) queries about occurrences of particular taxa, their characteristics, and properties of localities, in the spirit of an encyclopedia; and (2) large-scale research and quantitative analyses of evolutionary processes and patterns, reconstructing past environments, and interpreting evolutionary contexts. The data are fully open; no login or community membership is necessary to use the data for any purpose.
Multi-output regression with structurally incomplete target labels: A case study of modelling global vegetation cover
Weakly-supervised learning has recently emerged in the classification context, where true labels are often scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We further define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when the multi-output regression task defines a distribution such that the outputs must sum to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, which involves building a model to predict the distribution of vegetation cover from climatic conditions based on global remote sensing data. We compare the performance of the proposed approach to several incomplete-target baselines. The results indicate that the error in the targets can be reduced by our proposed partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels, instead of using only complete observations for training, helps to better capture global associations between vegetation and climate.
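As a toy illustration of the sum-to-unity structure (not the paper's partial-imputation algorithm, which is more involved), missing cover fractions can be filled with the leftover mass, split in proportion to a model's current predictions:

```python
def impute_missing_fractions(target, prediction):
    # target: list of floats or None (missing). Known entries are cover
    # fractions that, together with the missing ones, must sum to 1.
    # Missing entries receive the leftover mass, split in proportion to
    # the model's current predictions (a hypothetical scheme chosen
    # purely for illustration).
    known_mass = sum(t for t in target if t is not None)
    leftover = max(0.0, 1.0 - known_mass)
    missing = [i for i, t in enumerate(target) if t is None]
    pred_mass = sum(prediction[i] for i in missing)
    out = list(target)
    for i in missing:
        share = prediction[i] / pred_mass if pred_mass > 0 else 1 / len(missing)
        out[i] = leftover * share
    return out
```

For a target `[0.5, None, None]` and predictions `[0.4, 0.3, 0.1]`, the missing 0.5 of mass is split 3:1, yielding a complete vector that sums to one.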
Efficient estimation of AUC in a sliding window
In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of a given length. More specifically, we propose an algorithm that, given an approximation parameter, estimates AUC within the corresponding error bound, and can maintain this estimate efficiently, per update, as the window slides. This provides a speed-up over the exact computation of AUC, whose per-update cost grows with the window size; the speed-up becomes more significant as the window grows. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
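The grouping idea can be illustrated with a simple bucketed AUC estimate: scores are binned, and AUC is computed from positive/negative counts per bucket, with same-bucket pairs contributing one half. This is only a fixed-width sketch of the principle; the paper's data structure maintains its groups adaptively as the window slides.

```python
from collections import Counter

def grouped_auc(scores_labels, n_buckets=10):
    # Approximate AUC = P(score_pos > score_neg) + 0.5 * P(tie) by
    # grouping scores into equal-width buckets: a positive "beats" every
    # negative in a strictly lower bucket, and counts half against
    # negatives in the same bucket.
    lo = min(s for s, _ in scores_labels)
    hi = max(s for s, _ in scores_labels)
    width = (hi - lo) / n_buckets or 1.0  # guard against all-equal scores
    pos, neg = Counter(), Counter()
    for s, y in scores_labels:
        b = min(int((s - lo) / width), n_buckets - 1)
        (pos if y == 1 else neg)[b] += 1
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    wins = 0.0
    neg_below = 0  # negatives seen in strictly lower buckets so far
    for b in range(n_buckets):
        wins += pos[b] * (neg_below + 0.5 * neg[b])
        neg_below += neg[b]
    return wins / (n_pos * n_neg)
```

Because only per-bucket counts are stored, an update touches one bucket rather than the whole window, which is the source of the speed-up over pairwise exact computation.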
From Sensor Readings to Predictions: On the Process of Developing Practical Soft Sensors.
Automatic data acquisition systems provide large amounts of streaming data generated by physical sensors. This data forms an input to computational models (soft sensors) routinely used for monitoring and control of industrial processes, traffic patterns, the environment and natural hazards, and much more. The majority of these models assume that the data comes in a cleaned and pre-processed form, ready to be fed directly into a predictive model. In practice, to ensure appropriate data quality, most of the modelling effort concentrates on preparing raw sensor readings for use as model inputs. This study analyzes the process of data preparation for predictive models with streaming sensor data. We present data preparation as a four-step process, identify the key challenges in each step, and provide recommendations for handling them. The discussion focuses on approaches that are less commonly used but, in our experience, may contribute particularly well to solving practical soft sensor tasks. Our arguments are illustrated with a case study in the chemical production industry.
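As an illustration of such a preparation pipeline (the steps and thresholds below are our own generic choices, not the paper's case study):

```python
def prepare(readings, lags=3, clip=(0.0, 100.0)):
    # A minimal sensor-data preparation sketch in four steps:
    # 1) fill missing readings by carrying the last value forward,
    # 2) clip physically impossible values to the sensor's valid range,
    # 3) smooth with a short moving average (window of up to 3 readings),
    # 4) build lagged feature vectors for the predictive model.
    filled, last = [], None
    for r in readings:
        last = r if r is not None else last
        filled.append(last)
    clipped = [min(max(v, clip[0]), clip[1]) for v in filled]
    smoothed = [sum(clipped[max(0, i - 2):i + 1]) / len(clipped[max(0, i - 2):i + 1])
                for i in range(len(clipped))]
    return [smoothed[i - lags:i] for i in range(lags, len(smoothed))]
```

Each feature vector collects the `lags` most recent smoothed readings, so a downstream model sees a fixed-length input regardless of gaps or spikes in the raw stream.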
A taxonomic look at instance-based stream classifiers
Large numbers of data streams are generated today in many fields. A key challenge when learning from such streams is the problem of concept drift. Many methods, including many prototype methods, have been proposed in recent years to address this problem. This paper presents a refined taxonomy of instance selection and generation methods for the classification of data streams subject to concept drift. The taxonomy allows discrimination among a large number of methods which pre-existing taxonomies for offline instance selection methods did not distinguish. This makes possible a valuable new perspective on experimental results, and provides a framework for discussing the concepts behind different algorithm-design approaches. We review a selection of modern algorithms to illustrate the distinctions made by the taxonomy. We present the results of a numerical experiment which examined the performance of a number of representative methods on both synthetic and real-world data sets, with and without concept drift, and discuss the implications for future research directions in light of the taxonomy. On the basis of the experimental results, we are able to give recommendations for the experimental evaluation of algorithms that may be proposed in the future.

This work was supported by project RPG-2015-188, funded by The Leverhulme Trust, UK, and by TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731593.
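One family in such a taxonomy, instance selection with purely time-based forgetting, can be sketched as a k-nearest-neighbour classifier over a fixed-size sliding window (a generic illustration of the approach, not one of the paper's reviewed algorithms):

```python
from collections import deque

class WindowedKNN:
    # Instance-based stream classifier that keeps only the most recent
    # instances in a fixed-size window, so instances from an outdated
    # concept are forgotten automatically as the stream drifts.
    def __init__(self, window_size=100, k=3):
        self.window = deque(maxlen=window_size)  # (features, label) pairs
        self.k = k

    def learn(self, x, y):
        # Appending beyond maxlen silently evicts the oldest instance.
        self.window.append((x, y))

    def predict(self, x):
        if not self.window:
            return None
        nearest = sorted(
            self.window,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
        )[:self.k]
        votes = {}
        for _, y in nearest:
            votes[y] = votes.get(y, 0) + 1
        return max(votes, key=votes.get)
```

After a drift, the window fills with instances of the new concept and predictions follow it, at the cost of also discarding still-valid old instances; more refined members of this family select which instances to keep rather than forgetting by age alone.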