
    Evaluation methods and decision theory for classification of streaming data with temporal dependence

    Predictive modeling on data streams plays an important role in modern data analysis, where data arrives continuously and needs to be mined in real time. In the stream setting the data distribution often evolves over time, and models that update themselves during operation are becoming the state of the art. This paper formalizes a learning and evaluation scheme for such predictive models. We theoretically analyze the evaluation of classifiers on streaming data with temporal dependence. Our findings suggest that commonly accepted data stream classification measures, such as classification accuracy and the Kappa statistic, fail to diagnose cases of poor performance when temporal dependence is present, and therefore should not be used as sole performance indicators. Moreover, classification accuracy can be misleading if used as a proxy for evaluating change detectors on datasets with temporal dependence. We formulate the decision theory for streaming data classification with temporal dependence and develop a new evaluation methodology for data stream classification that takes temporal dependence into account. We propose a combined measure of classification performance that accounts for temporal dependence, and we recommend using it as the main performance measure in the classification of streaming data.
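
    As a rough illustration of the issue, the sketch below scores a classifier's accuracy against both a chance baseline (the standard Kappa statistic) and a persistent no-change baseline that simply repeats the previous label, which is already hard to beat when the stream is temporally dependent. Combining the two statistics by taking their minimum is an illustrative assumption, not necessarily the combined measure proposed in the paper.

```python
# Hedged sketch: evaluating a stream classifier against both a chance baseline
# (Kappa statistic) and a persistent "predict the previous label" baseline,
# which matters when the stream has temporal dependence. Combining the two
# statistics by taking their minimum is an illustrative assumption, not
# necessarily the paper's exact combined measure.

from collections import Counter

def kappa_statistics(y_true, y_pred):
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed accuracy

    # Chance baseline: agreement expected from the label/prediction marginals.
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    p_chance = sum(true_freq[c] * pred_freq.get(c, 0) for c in true_freq) / n**2

    # Persistent baseline: accuracy of always predicting the previous label,
    # which is already strong when the stream is temporally dependent.
    p_persist = sum(y_true[i] == y_true[i - 1] for i in range(1, n)) / (n - 1)

    kappa = (p0 - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    kappa_temporal = (p0 - p_persist) / (1 - p_persist) if p_persist < 1 else 0.0
    return kappa, kappa_temporal

# Example: a stream with long runs of identical labels and a classifier that
# lags behind the changes; Kappa looks fine while the temporal variant does not.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
kappa, kappa_t = kappa_statistics(y_true, y_pred)
print(f"kappa={kappa:.2f}, kappa_temporal={kappa_t:.2f}, combined={min(kappa, kappa_t):.2f}")
```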

    Learning from Data Streams: An Overview and Update

    The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory, such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and we reconsider the algorithms that may be applied to tackle such tasks. Reflecting on this formulation and overview, and aided by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.

    Clustering based active learning for evolving data streams

    Data labeling is an expensive and time-consuming task, and choosing which labels to acquire is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few address the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step that selects the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering step allows us to cover the whole data space while avoiding oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
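
    A minimal sketch of the batch-incremental loop described above follows, assuming k-means as the pre-clustering step and picking the instance nearest each centroid for labeling; the paper's actual clusterer and selection criterion may differ.

```python
# Hedged sketch of batch-incremental active learning with a pre-clustering step:
# when a batch arrives, cluster it, query labels for one representative per
# cluster, and update the learner. The clusterer (k-means) and the selection
# rule (instance nearest to each centroid) are illustrative assumptions; the
# paper's strategy may choose instances differently.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
learner = SGDClassifier()
classes = np.array([0, 1])

def process_batch(X_batch, oracle_labels, budget=5):
    """Cluster the batch, ask for labels of one representative per cluster, update."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(X_batch)
    picked = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        dists = np.linalg.norm(X_batch[members] - km.cluster_centers_[c], axis=1)
        picked.append(members[np.argmin(dists)])   # instance closest to the centroid
    picked = np.array(picked)
    learner.partial_fit(X_batch[picked], oracle_labels[picked], classes=classes)

# Simulated stream: batches of 100 points whose class depends on the first feature.
for _ in range(20):
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] > 0).astype(int)      # the "oracle" labels we pay for
    process_batch(X, y)

X_test = rng.normal(size=(500, 2))
print("accuracy:", learner.score(X_test, (X_test[:, 0] > 0).astype(int)))
```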

    On calibrating the completometer for the mammalian fossil record

    We know that the fossil record is incomplete. But how incomplete? Here we very coarsely estimate the completeness of the mammalian record in the Miocene, assuming that the duration of a mammalian species is about 1 Myr and that species diversity has stayed constant and structurally comparable to today's taxonomic diversity. The overall completeness under these assumptions appears to be around 4%, but there are large differences across taxonomic groups. We find that the fossil record of proboscideans and perissodactyls as we know it for the Miocene must be close to complete, while we might know less than 15% of the artiodactyl or carnivore species of the Miocene and only about 1% of the primate species. The record of small mammals appears much less complete than that of large mammals.
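
    The back-of-envelope logic can be made explicit: under the stated assumptions, the number of species expected to have existed over an interval is roughly the standing diversity times the interval length divided by the typical species duration, and completeness is the known fossil species count divided by that expectation. The numbers below are placeholders for illustration, not values from the paper.

```python
# Hedged sketch of the completeness estimate described above, under the stated
# assumptions: species duration of about 1 Myr and constant standing diversity.
# All input numbers here are illustrative placeholders, not the paper's values.

def expected_species(standing_diversity, interval_myr, species_duration_myr=1.0):
    # With constant diversity and ~1 Myr species durations, roughly one full
    # cohort of species is replaced every species_duration_myr.
    return standing_diversity * interval_myr / species_duration_myr

def completeness(known_fossil_species, standing_diversity, interval_myr):
    return known_fossil_species / expected_species(standing_diversity, interval_myr)

# Placeholder example: a group with 100 living species, over the ~18 Myr Miocene,
# of which 70 fossil species are known for that interval.
print(f"completeness = {completeness(70, 100, 18.0):.1%}")
```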

    The NOW Database of Fossil Mammals

    This chapter was completed and accepted after revision in August 2021. NOW does not have dedicated institutional funding; the database and data development are funded from the regular research projects of the NOW Community members. Current and recent (last 5 years) funding sources include the Ella and Georg Ehrnrooth Foundation and the Academy of Finland. ICP researchers acknowledge funding from the Generalitat de Catalunya (CERCA Programme), R+D+I projects PID2020-117289GB-I00 and PID2020-116908GB-I00 (MCIN/AEI/10.13039/501100011033/), and the consolidated research group from the Generalitat de Catalunya 2022 SGR 00620. This is Bernor's NSF FuTRES publication 35. L. K. Säilä acknowledges an Academy of Finland Postdoctoral grant (275551). We thank three reviewers for helpful suggestions regarding the manuscript text. Contributions from the Valio Armas Korvenkontio Unit of Dental Anatomy in Relation to Evolutionary Theory are acknowledged.

    NOW (New and Old Worlds) is a global database of fossil mammal occurrences, currently containing around 68,000 locality-species entries. The database spans the last 66 million years, with its primary focus on the last 23 million years. Whereas the database contains records from all continents, its main focus and coverage have historically been on Eurasia. The database includes primarily, but not exclusively, terrestrial mammals. It covers a large part of the currently known mammalian fossil record, focusing on classical and actively researched fossil localities. The database is managed in collaboration with an international advisory board of experts. Rather than a static archive, it emphasizes the continuous integration of the community's new knowledge, data curation, and consistency of scientific interpretations. The database records species occurrences at localities worldwide, as well as ecological characteristics of fossil species, geological contexts of localities, and more. The NOW database is primarily used for two purposes: (1) queries about occurrences of particular taxa, their characteristics, and properties of localities, in the spirit of an encyclopedia; and (2) large-scale research and quantitative analyses of evolutionary processes and patterns, reconstruction of past environments, and interpretation of evolutionary contexts. The data are fully open; no login or community membership is necessary for using the data for any purpose.
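
    For the first, encyclopedic kind of use, a query against the downloadable locality-species table might look like the sketch below; the file name and column names (ORDER, GENUS, MAX_AGE, MIN_AGE, and so on) are assumptions about the export format rather than a documented schema.

```python
# Hedged sketch of the "encyclopedic" use of the NOW data: filtering the openly
# downloadable locality-species table for a taxon of interest. The file name and
# column names below are assumptions about the export format, not a documented
# schema; adjust them to the actual export.

import pandas as pd

occ = pd.read_csv("now_export.tsv", sep="\t", low_memory=False)  # hypothetical export file

# All proboscidean occurrences within the Miocene (~23-5.3 Ma), counting a record
# as Miocene if its age range overlaps the epoch.
proboscideans = occ[
    (occ["ORDER"].str.casefold() == "proboscidea")
    & (occ["MAX_AGE"] >= 5.3)
    & (occ["MIN_AGE"] <= 23.0)
]
print(proboscideans[["GENUS", "SPECIES", "NAME", "MAX_AGE", "MIN_AGE"]].head())
```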

    Multi-output regression with structurally incomplete target labels : A case study of modelling global vegetation cover

    Weakly-supervised learning has recently emerged in the classification context, where true labels are often scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We further define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when the multi-output regression task defines a distribution such that the outputs must sum to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, which involves building a model to predict the distribution of vegetation cover from climatic conditions based on global remote sensing data. We compare the performance of the proposed approach to several incomplete-target baselines. The results indicate that the error in the targets can be reduced by our proposed partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels, instead of using only complete observations for training, helps to better capture global associations between vegetation and climate.
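
    A minimal sketch of the structural idea follows: when the target outputs are fractions that must sum to unity and some components are unobserved, the missing part of the distribution is constrained to the residual one minus the observed sum, which can be used to fill the gaps before training an ordinary multi-output regressor. Splitting the residual evenly among the missing components, as done below, is an illustrative assumption and not necessarily the authors' partial-imputation algorithm.

```python
# Hedged sketch of imputing structurally incomplete multi-output targets that
# must sum to unity (e.g. fractional vegetation cover per class). Missing
# components are filled from the residual 1 - sum(observed); spreading that
# residual evenly is an illustrative assumption, not the paper's algorithm.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_sum_to_unity(Y):
    """Y: (n_samples, n_outputs) array with NaN for unobserved fractions."""
    Y = Y.copy()
    for row in Y:
        missing = np.isnan(row)
        if missing.any():
            residual = max(0.0, 1.0 - np.nansum(row))
            row[missing] = residual / missing.sum()   # spread the leftover mass evenly
    return Y

# Toy data: three cover fractions driven by two climate variables, with gaps in the targets.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = np.column_stack([X[:, 0] * 0.5, X[:, 1] * 0.3, 1 - X[:, 0] * 0.5 - X[:, 1] * 0.3])
Y[rng.uniform(size=Y.shape) < 0.2] = np.nan        # structurally incomplete labels

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, impute_sum_to_unity(Y))
print(model.predict(X[:3]).round(2))               # predicted rows approximately sum to 1
```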

    Efficient estimation of AUC in a sliding window

    In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length k. More specifically, we propose an algorithm that, given ε, estimates AUC within ε/2, and can maintain this estimate in O((log k)/ε) time per update as the window slides. This provides a speed-up over the exact computation of AUC, which requires O(k) time per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough so that the error stays small, (ii) the number of groups is small so that enumerating them is not expensive, and (iii) the definition is flexible enough so that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee ε/2, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
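
    To make the baseline concrete, the sketch below recomputes AUC exactly over the current window after every update, which is what the proposed estimator speeds up; the grouping-based approximation itself is not reproduced here.

```python
# Hedged sketch of the baseline that the proposed estimator speeds up:
# recomputing AUC exactly over the current sliding window after every update.
# This naive version costs O(k log k) per update because of the sort; the
# paper's grouped estimator instead maintains an ε/2-approximation in
# O((log k)/ε) time per update.

import random
from collections import deque

def window_auc(window):
    """Exact AUC of (score, label) pairs via the rank-sum (Mann-Whitney) formulation."""
    pos = sum(label for _, label in window)
    neg = len(window) - pos
    if pos == 0 or neg == 0:
        return float("nan")
    ranked = sorted(window, key=lambda sl: sl[0])
    rank_sum = sum(rank for rank, (_, label) in enumerate(ranked, start=1) if label == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

k = 500
window = deque(maxlen=k)              # the sliding window of length k

random.seed(0)
for t in range(2000):
    label = int(random.random() < 0.3)
    score = random.gauss(1.0 if label else 0.0, 1.0)   # positives tend to score higher
    window.append((score, label))
    auc = window_auc(window)          # O(k log k) per slide in this naive baseline
print(f"AUC over the last {len(window)} points: {auc:.3f}")
```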

    From Sensor Readings to Predictions: On the Process of Developing Practical Soft Sensors.

    Automatic data acquisition systems provide large amounts of streaming data generated by physical sensors. This data forms the input to computational models (soft sensors) routinely used for monitoring and control of industrial processes, traffic patterns, the environment, natural hazards, and more. The majority of these models assume that the data comes in a cleaned and pre-processed form, ready to be fed directly into a predictive model. In practice, to ensure appropriate data quality, most of the modelling effort concentrates on preparing raw sensor readings to be used as model inputs. This study analyzes the process of data preparation for predictive models with streaming sensor data. We present the challenges of data preparation as a four-step process, identify the key challenges in each step, and provide recommendations for handling these issues. The discussion focuses on approaches that are less commonly used but which, in our experience, may contribute particularly well to solving practical soft sensor tasks. Our arguments are illustrated with a case study in the chemical production industry.
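
    The abstract does not list its four steps, so the sketch below only illustrates common preparation operations for streaming sensor data (alignment to a common time grid, masking of impossible values, short-gap interpolation, and lagged feature construction); these are generic assumptions, not the authors' recipe.

```python
# Hedged sketch of typical sensor-data preparation before a soft-sensor model.
# The four operations below are generic assumptions about common practice,
# not the four steps described in the paper.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {
        "temperature": rng.normal(80, 5, 500),   # irregular raw readings
        "pressure": rng.normal(2.0, 0.1, 500),
    },
    index=pd.date_range("2024-01-01", periods=500, freq="37s"),
)

prepared = raw.resample("1min").mean()            # 1) align readings to a common time grid
prepared = prepared.where(prepared > 0)           # 2) mask physically impossible values
prepared = prepared.interpolate(limit=3)          # 3) fill short gaps only
for lag in (1, 2, 3):                             # 4) lagged inputs for the predictive model
    prepared[f"temperature_lag{lag}"] = prepared["temperature"].shift(lag)
prepared = prepared.dropna()
print(prepared.head())
```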

    A taxonomic look at instance-based stream classifiers

    Large numbers of data streams are generated today in many fields. A key challenge when learning from such streams is the problem of concept drift. Many methods, including many prototype methods, have been proposed in recent years to address this problem. This paper presents a refined taxonomy of instance selection and generation methods for the classification of data streams subject to concept drift. The taxonomy discriminates among a large number of methods which pre-existing taxonomies for offline instance selection did not distinguish. This enables a valuable new perspective on experimental results and provides a framework for discussing the concepts behind different algorithm-design approaches. We review a selection of modern algorithms to illustrate the distinctions made by the taxonomy. We present the results of a numerical experiment which examined the performance of a number of representative methods on both synthetic and real-world data sets, with and without concept drift, and discuss the implications for the directions of future research in light of the taxonomy. On the basis of the experimental results, we are able to give recommendations for the experimental evaluation of algorithms that may be proposed in the future. This work was supported by project RPG-2015-188, funded by The Leverhulme Trust, UK, and by TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731593.
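
    One of the simplest instance-selection strategies that such a taxonomy covers is a fixed-size sliding window of stored instances queried by a kNN classifier, which forgets old examples and thereby adapts to drift; the sketch below shows only that generic baseline, not any particular method reviewed in the paper.

```python
# Hedged sketch of the simplest instance-selection approach to drifting streams:
# a kNN classifier over a fixed-size sliding window of stored instances, so that
# old examples are forgotten as the concept changes. This is a generic baseline,
# not a specific method from the paper.

from collections import deque
import numpy as np

class SlidingWindowKNN:
    def __init__(self, window_size=500, k=5):
        self.window = deque(maxlen=window_size)   # instance store: (x, y) pairs
        self.k = k

    def predict(self, x):
        if not self.window:
            return None
        X = np.array([w[0] for w in self.window])
        y = np.array([w[1] for w in self.window])
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[: self.k]
        return np.bincount(y[nearest]).argmax()   # majority vote among neighbours

    def learn(self, x, y):
        self.window.append((x, y))                # the oldest instance is dropped automatically

# Test-then-train evaluation on a stream with an abrupt concept drift halfway through.
rng = np.random.default_rng(0)
clf, correct = SlidingWindowKNN(window_size=200), 0
for t in range(4000):
    x = rng.normal(size=2)
    y = int(x[0] > 0) if t < 2000 else int(x[1] > 0)   # the concept changes at t=2000
    pred = clf.predict(x)
    correct += int(pred == y)
    clf.learn(x, y)
print("prequential accuracy:", correct / 4000)
```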