Evaluation methods and decision theory for classification of streaming data with temporal dependence
Predictive modeling on data streams plays an important role in modern data analysis, where data arrive continuously and need to be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that the commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, and therefore should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets with temporal dependence. We formulate decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure of classification performance that accounts for temporal dependence, and we recommend using it as the main performance measure in classification of streaming data.
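The contrast between the standard Kappa statistic and a temporal-dependence-aware variant can be sketched as follows. This is our illustrative reading of the idea, not necessarily the paper's exact combined measure: `kappa_temporal` scores the classifier against a "no-change" baseline that always predicts the previous label, a baseline that is hard to beat when the stream is temporally dependent (all function names here are ours).

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa_statistic(y_true, y_pred):
    # Standard Kappa: the chance baseline predicts labels at random
    # according to the classifier's own label frequencies.
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p0 = accuracy(y_true, y_pred)
    pe = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p0 - pe) / (1 - pe)

def kappa_temporal(y_true, y_pred):
    # Temporal variant: the baseline is the persistent ("no-change")
    # classifier that always predicts the previous true label.
    # (Undefined when that baseline is perfect, i.e. a constant stream.)
    p0 = accuracy(y_true, y_pred)
    pe = accuracy(y_true[1:], y_true[:-1])
    return (p0 - pe) / (1 - pe)
```

On a temporally dependent stream, a classifier can score a high accuracy and a decent Kappa while barely beating the persistent baseline, which the temporal variant exposes.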
Learning from Data Streams: An Overview and Update
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning from data streams in academic and industrial settings.
Clustering based active learning for evolving data streams
Data labeling is an expensive and time-consuming task. Choosing which instances to label is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist for the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering step allows us to cover the whole data space, avoiding oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
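A minimal sketch of the cluster-then-query idea described above, under our own simplifying assumptions (plain k-means with deterministic farthest-point initialization and one query per cluster; the paper's actual selection strategy may differ):

```python
def dist2(a, b):
    # Squared Euclidean distance between two points (tuples of floats).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    # Deterministic farthest-point initialization so seeds spread out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=20):
    # A bare-bones Lloyd's k-means; enough to illustrate the idea.
    centroids = init_centroids(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids

def select_queries(batch, budget):
    # Cluster the unlabeled batch, then query the label of the instance
    # nearest each centroid, so the queries cover the whole data space.
    centroids = kmeans(batch, budget)
    return [min(batch, key=lambda p: dist2(p, c)) for c in centroids]
```

On a batch with two well-separated groups and a budget of two, this picks one representative from each group instead of two from the denser one.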
On calibrating the completometer for the mammalian fossil record
We know that the fossil record is incomplete. But how incomplete? Here we very coarsely estimate the completeness of the mammalian record in the Miocene, assuming that the duration of a mammalian species is about 1 Myr and that species diversity has stayed constant and is structurally comparable to the taxonomic diversity today. The overall completeness under these assumptions appears to be around 4%, but there are large differences across taxonomic groups. We find that the fossil record of proboscideans and perissodactyls as we know it for the Miocene must be close to complete, while we might know less than 15% of the artiodactyl or carnivore species and only about 1% of the primate species of the Miocene. The record of small mammals appears much less complete than that of large mammals.
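The back-of-the-envelope logic can be illustrated as follows, with entirely hypothetical numbers (the abstract does not give the actual diversity figures): if species last about 1 Myr and standing diversity is constant, an 18-Myr interval turns over roughly diversity × 18 species in total, and completeness is the known fraction of that.

```python
def expected_species(standing_diversity, interval_myr, species_duration_myr=1.0):
    # Crude turnover model: with constant standing diversity, roughly one
    # full cohort of species is replaced per species-duration, so the
    # interval produces about diversity * interval / duration species.
    return standing_diversity * interval_myr / species_duration_myr

def completeness(known_fossil_species, standing_diversity, interval_myr):
    # Fraction of the species expected to have existed that we actually
    # know from fossils.
    return known_fossil_species / expected_species(standing_diversity, interval_myr)
```

For example, a hypothetical standing diversity of 100 species over an 18-Myr Miocene implies about 1,800 species total; knowing 72 of them would give the 4% completeness figure cited above.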
The NOW Database of Fossil Mammals
This chapter was completed and accepted after revision in August 2021. NOW does not have dedicated institutional funding. The database and data development are funded from regular research projects of the NOW Community members. Current and recent (last 5 years) funding sources include the Ella and Georg Ehrnrooth Foundation and the Academy of Finland. ICP researchers acknowledge funding from the Generalitat de Catalunya (CERCA Programme), R+D+I projects PID2020-117289GB-I00 and PID2020-116908GB-I00 (MCIN/AEI/10.13039/501100011033/), and the consolidated research group of the Generalitat de Catalunya 2022 SGR 00620. This is Bernor's NSF FuTRES publication 35. L. K. Säilä acknowledges an Academy of Finland Postdoctoral grant (275551). We thank three reviewers for helpful suggestions regarding the manuscript text. Contributions from the Valio Armas Korvenkontio Unit of Dental Anatomy in Relation to Evolutionary Theory are acknowledged.

NOW (New and Old Worlds) is a global database of fossil mammal occurrences, currently containing around 68,000 locality-species entries. The database spans the last 66 million years, with its primary focus on the last 23 million years. Whereas the database contains records from all continents, its main focus and coverage have historically been on Eurasia. The database includes primarily, but not exclusively, terrestrial mammals. It covers a large part of the currently known mammalian fossil record, focusing on classical and actively researched fossil localities. The database is managed in collaboration with an international advisory board of experts. Rather than being a static archive, it emphasizes the continuous integration of the community's new knowledge, data curation, and consistency of scientific interpretations. The database records species occurrences at localities worldwide, as well as ecological characteristics of fossil species, geological contexts of localities, and more. The NOW database is primarily used for two purposes: (1) queries about occurrences of particular taxa, their characteristics, and properties of localities, in the spirit of an encyclopedia; and (2) large-scale research and quantitative analyses of evolutionary processes and patterns, reconstructing past environments, and interpreting evolutionary contexts. The data are fully open; no login or community membership is necessary to use the data for any purpose.
Multi-output regression with structurally incomplete target labels: A case study of modelling global vegetation cover
Weakly-supervised learning has recently emerged in the classification context, where true labels are often scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We further define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when the multi-output regression task defines a distribution such that the outputs must sum to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, which involves building a model to predict the distribution of vegetation cover from climatic conditions based on global remote sensing data. We compare the performance of the proposed approach to several incomplete-target baselines. The results indicate that the error in the targets can be reduced by our proposed partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels, instead of using only complete observations for training, helps to better capture global associations between vegetation and climate.
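As a toy illustration of the sum-to-unity structure (not the paper's partial-imputation algorithm, which is more involved), missing cover fractions can be filled with the leftover mass, split in proportion to a model's current predictions:

```python
def impute_missing_fractions(target, prediction):
    # target: list of floats or None (missing). Known entries are cover
    # fractions that, together with the missing ones, must sum to 1.
    # Missing entries receive the leftover mass, split in proportion to
    # the model's current predictions (a hypothetical scheme chosen
    # purely for illustration).
    known_mass = sum(t for t in target if t is not None)
    leftover = max(0.0, 1.0 - known_mass)
    missing = [i for i, t in enumerate(target) if t is None]
    pred_mass = sum(prediction[i] for i in missing)
    out = list(target)
    for i in missing:
        share = prediction[i] / pred_mass if pred_mass > 0 else 1 / len(missing)
        out[i] = leftover * share
    return out
```

For a target `[0.5, None, None]` and predictions `[0.4, 0.3, 0.1]`, the missing 0.5 of mass is split 3:1, yielding a complete vector that sums to one.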
Efficient estimation of AUC in a sliding window
In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of a given length. More specifically, we propose an algorithm that, given an approximation parameter, estimates AUC within the corresponding error bound, and can maintain this estimate efficiently, per update, as the window slides. This provides a speed-up over the exact computation of AUC, whose per-update cost grows with the window size; the speed-up becomes more significant as the window grows. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
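The grouping idea can be illustrated with a simple bucketed AUC estimate: scores are binned, and AUC is computed from positive/negative counts per bucket, with same-bucket pairs contributing one half. This is only a fixed-width sketch of the principle; the paper's data structure maintains its groups adaptively as the window slides.

```python
from collections import Counter

def grouped_auc(scores_labels, n_buckets=10):
    # Approximate AUC = P(score_pos > score_neg) + 0.5 * P(tie) by
    # grouping scores into equal-width buckets: a positive "beats" every
    # negative in a strictly lower bucket, and counts half against
    # negatives in the same bucket.
    lo = min(s for s, _ in scores_labels)
    hi = max(s for s, _ in scores_labels)
    width = (hi - lo) / n_buckets or 1.0  # guard against all-equal scores
    pos, neg = Counter(), Counter()
    for s, y in scores_labels:
        b = min(int((s - lo) / width), n_buckets - 1)
        (pos if y == 1 else neg)[b] += 1
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    wins = 0.0
    neg_below = 0  # negatives seen in strictly lower buckets so far
    for b in range(n_buckets):
        wins += pos[b] * (neg_below + 0.5 * neg[b])
        neg_below += neg[b]
    return wins / (n_pos * n_neg)
```

Because only per-bucket counts are stored, an update touches one bucket rather than the whole window, which is the source of the speed-up over pairwise exact computation.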
From Sensor Readings to Predictions: On the Process of Developing Practical Soft Sensors.
Automatic data acquisition systems provide large amounts of streaming data generated by physical sensors. This data forms an input to computational models (soft sensors) routinely used for monitoring and control of industrial processes, traffic patterns, the environment and natural hazards, and much more. The majority of these models assume that the data comes in a cleaned and pre-processed form, ready to be fed directly into a predictive model. In practice, to ensure appropriate data quality, most of the modelling effort concentrates on preparing raw sensor readings for use as model inputs. This study analyzes the process of data preparation for predictive models with streaming sensor data. We present data preparation as a four-step process, identify the key challenges in each step, and provide recommendations for handling them. The discussion focuses on approaches that are less commonly used but, in our experience, may contribute particularly well to solving practical soft sensor tasks. Our arguments are illustrated with a case study in the chemical production industry.
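As an illustration of such a preparation pipeline (the steps and thresholds below are our own generic choices, not the paper's case study):

```python
def prepare(readings, lags=3, clip=(0.0, 100.0)):
    # A minimal sensor-data preparation sketch in four steps:
    # 1) fill missing readings by carrying the last value forward,
    # 2) clip physically impossible values to the sensor's valid range,
    # 3) smooth with a short moving average (window of up to 3 readings),
    # 4) build lagged feature vectors for the predictive model.
    filled, last = [], None
    for r in readings:
        last = r if r is not None else last
        filled.append(last)
    clipped = [min(max(v, clip[0]), clip[1]) for v in filled]
    smoothed = [sum(clipped[max(0, i - 2):i + 1]) / len(clipped[max(0, i - 2):i + 1])
                for i in range(len(clipped))]
    return [smoothed[i - lags:i] for i in range(lags, len(smoothed))]
```

Each feature vector collects the `lags` most recent smoothed readings, so a downstream model sees a fixed-length input regardless of gaps or spikes in the raw stream.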
A taxonomic look at instance-based stream classifiers
Large numbers of data streams are generated today in many fields. A key challenge when learning from such streams is the problem of concept drift. Many methods, including many prototype methods, have been proposed in recent years to address this problem. This paper presents a refined taxonomy of instance selection and generation methods for the classification of data streams subject to concept drift. The taxonomy allows discrimination among a large number of methods which pre-existing taxonomies for offline instance selection methods did not distinguish. This makes possible a valuable new perspective on experimental results, and provides a framework for discussing the concepts behind different algorithm-design approaches. We review a selection of modern algorithms to illustrate the distinctions made by the taxonomy. We present the results of a numerical experiment which examined the performance of a number of representative methods on both synthetic and real-world data sets, with and without concept drift, and discuss the implications for future research directions in light of the taxonomy. On the basis of the experimental results, we are able to give recommendations for the experimental evaluation of algorithms that may be proposed in the future.

This work was supported by project RPG-2015-188, funded by The Leverhulme Trust, UK, and by TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731593.
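One family in such a taxonomy, instance selection with purely time-based forgetting, can be sketched as a k-nearest-neighbour classifier over a fixed-size sliding window (a generic illustration of the approach, not one of the paper's reviewed algorithms):

```python
from collections import deque

class WindowedKNN:
    # Instance-based stream classifier that keeps only the most recent
    # instances in a fixed-size window, so instances from an outdated
    # concept are forgotten automatically as the stream drifts.
    def __init__(self, window_size=100, k=3):
        self.window = deque(maxlen=window_size)  # (features, label) pairs
        self.k = k

    def learn(self, x, y):
        # Appending beyond maxlen silently evicts the oldest instance.
        self.window.append((x, y))

    def predict(self, x):
        if not self.window:
            return None
        nearest = sorted(
            self.window,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
        )[:self.k]
        votes = {}
        for _, y in nearest:
            votes[y] = votes.get(y, 0) + 1
        return max(votes, key=votes.get)
```

After a drift, the window fills with instances of the new concept and predictions follow it, at the cost of also discarding still-valid old instances; more refined members of this family select which instances to keep rather than forgetting by age alone.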