    Process-Oriented Stream Classification Pipeline: A Literature Review

    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis) and video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology and give a broad overview of the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification centers on developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, efficient training and testing procedures, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured along the stream classification process to facilitate orientation within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.
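    The prequential, test-then-train loop with drift handling around which the surveyed pipeline is organized can be sketched in a few lines. The toy stream, the majority-class learner, and the fixed error threshold below are illustrative assumptions, not techniques taken from the survey itself:

```python
# Minimal sketch of a prequential (test-then-train) stream classification loop
# with a naive drift check on the recent error rate. Everything here is a toy
# illustration of the pattern, not a method from the survey.
import random
from collections import deque

class OnlineMajorityClassifier:
    """Trivial incremental learner: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def synthetic_stream(n=2000, drift_at=1000):
    """Label distribution flips halfway through -- an abrupt concept drift."""
    for i in range(n):
        p = 0.9 if i < drift_at else 0.1   # P(label == 1) changes at drift_at
        yield i, (1 if random.random() < p else 0)

model = OnlineMajorityClassifier()
recent_errors = deque(maxlen=100)          # sliding window of 0/1 errors

for x, y in synthetic_stream():
    y_hat = model.predict(x)               # 1) test on the incoming point
    recent_errors.append(int(y_hat != y))
    model.learn(x, y)                      # 2) then train on it
    # Naive drift signal: windowed error rate exceeds a fixed threshold.
    if len(recent_errors) == recent_errors.maxlen and \
            sum(recent_errors) / len(recent_errors) > 0.5:
        model = OnlineMajorityClassifier() # react by resetting the model
        recent_errors.clear()
```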

    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation

    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than that of the other classes. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and evaluating different combinations would enable a better understanding and provide guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) not only perform better on the selected CI problems, but are also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.
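    As a rough illustration of the kind of combination the study evaluates, the sketch below oversamples the minority class with SMOTE and then fits a random-forest ensemble. It assumes scikit-learn and imbalanced-learn are available; the synthetic dataset and hyperparameters are placeholders, not the paper's experimental setup:

```python
# Sketch: data augmentation (SMOTE) + ensemble learning (random forest)
# on an imbalanced binary problem. Illustrative, not the paper's framework.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import SMOTE

# Roughly 9:1 imbalanced binary problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Augment only the training split so the test set stays untouched.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```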

    Memory Models for Incremental Learning Architectures

    Losing V. Memory Models for Incremental Learning Architectures. Bielefeld: Universität Bielefeld; 2019. Technological advancement constantly leads to an exponential growth of generated data in basically every domain, drastically increasing the burden of data storage and maintenance. Most of the data is instantaneously extracted and available in the form of endless streams that contain the most current information. Machine learning methods constitute one fundamental way of processing such data automatically, as they generate models that capture the processes behind the data. They are omnipresent in our everyday life, as their applications include personalized advertising, recommendations, fraud detection, surveillance, credit ratings, high-speed trading, and smart-home devices. Thereby, batch learning, denoting the offline construction of a static model based on large datasets, is the predominant scheme. However, it is increasingly unfit to deal with the accumulating masses of data within the given time, and in particular its static nature cannot handle changing patterns. In contrast, incremental learning constitutes an attractive alternative that is a very natural fit for the current demands. Its dynamic adaptation allows continuous processing of data streams without the necessity of storing all data from the past, and results in always up-to-date models that are even able to perform in non-stationary environments. In this thesis, we tackle crucial research questions in the domain of incremental learning by contributing new algorithms or significantly extending existing ones. Thereby, we consider stationary and non-stationary environments and present multiple real-world applications that showcase the merits of the methods as well as their versatility. The main contributions are the following: a novel approach that addresses the question of how to extend a model for prototype-based algorithms based on cost minimization; local split-time prediction for incremental decision trees, which mitigates the trade-off between adaptation speed on the one hand and model complexity and run time on the other; an extensive survey of the strengths and weaknesses of state-of-the-art methods that provides guidance for choosing a suitable algorithm for a given task; a new approach to extract valuable information about the type of change in a dataset; a biologically inspired architecture able to handle different types of drift using dedicated memories that are kept consistent; the application of the novel methods within three diverse real-world tasks, highlighting their robustness and versatility; and an investigation of personalized online models in the context of two real-world applications.
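    A much-simplified sketch of the prototype-based incremental setting the thesis works in: classify by nearest prototype, adapt the winner on every sample, and grow the model on errors. The misclassification-triggered insertion rule below is a generic heuristic for illustration, not the cost-minimization placement strategy the thesis actually proposes:

```python
# Toy incremental nearest-prototype classifier. Growing on misclassification
# is a generic heuristic here, NOT the thesis's cost-based insertion criterion.
import math

class IncrementalPrototypeClassifier:
    def __init__(self, lr=0.1):
        self.prototypes = []          # list of (vector, label)
        self.lr = lr

    def _nearest(self, x):
        return min(self.prototypes,
                   key=lambda p: math.dist(p[0], x), default=None)

    def predict(self, x):
        p = self._nearest(x)
        return p[1] if p else None

    def learn(self, x, y):
        p = self._nearest(x)
        if p is None or p[1] != y:
            self.prototypes.append((list(x), y))   # grow the model on error
        else:
            w, _ = p                               # pull the winner toward x
            for i in range(len(w)):
                w[i] += self.lr * (x[i] - w[i])

model = IncrementalPrototypeClassifier()
stream = [([0.0, 0.0], "a"), ([1.0, 1.0], "b"), ([0.1, 0.2], "a")]
for x, y in stream:
    print(model.predict(x), "->", y)
    model.learn(x, y)
```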

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
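    One way to picture a first-order Markovian propagation of a spatial model is to warm-start a mixture model at each frame from the previous frame's fit, so the model at time t depends only on the data at t and the model at t-1. The sketch below uses scikit-learn's GaussianMixture on toy 3-D point clouds; both the data and the choice of model are assumptions for illustration, not the paper's exact probabilistic model:

```python
# Sketch: propagate a spatial mixture model through time by initializing each
# frame's fit from the previous frame's parameters. Illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy "snapshots": a 3-D point cloud whose structure drifts between frames.
snapshots = [rng.normal(loc=t * 0.2, scale=1.0, size=(500, 3)) for t in range(5)]

params = None
for t, points in enumerate(snapshots):
    if params is None:
        gmm = GaussianMixture(n_components=3, random_state=0)
    else:
        # Warm start from the previous frame's fit (first-order dependence).
        gmm = GaussianMixture(n_components=3, random_state=0,
                              weights_init=params[0], means_init=params[1],
                              precisions_init=params[2])
    gmm.fit(points)
    params = (gmm.weights_, gmm.means_, gmm.precisions_)
    print(f"frame {t}: mean log-likelihood = {gmm.score(points):.3f}")
```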

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling for quality control during manufacturing processes; in traffic and logistics for smart cities; and for mobile communications.

    Deep Learning in Medical Image Analysis

    The accelerating power of deep learning in diagnosing diseases will empower physicians and speed up decision making in clinical environments. Applications of modern medical instruments and the digitalization of medical care have generated enormous amounts of medical images in recent years. In this big data arena, new deep learning methods and computational models for efficient data processing, analysis, and modeling of the generated data are crucially important for clinical applications and for understanding the underlying biological process. This book presents and highlights novel algorithms, architectures, techniques, and applications of deep learning for medical image analysis.

    The detection of fraudulent financial statements using textual and financial data

    Trust in the correctness of published financial statements is a cornerstone of functioning capital markets. Prominent accounting scandals repeatedly shake market participants' confidence in the credibility of published information and thereby lead to an inefficient allocation of resources. Reliable, automated fraud detection systems based on publicly available data can help allocate audit resources more efficiently and strengthen the resilience of capital markets by better protecting market participants against accounting fraud. This study examines how quantitative data (financials) and corporate narratives can both be used to identify accounting fraud (proxied by the SEC's AAERs). The detection models rest on a sound foundation from fraud theory, highlighting how accounting fraud is carried out and discussing the causes that lead companies to fraudulently alter their financial records. The study follows a comprehensive methodological approach: the design process is divided into eight design questions and three enhancing questions, shedding light on important issues during model creation, improvement, and testing. The corporate narratives are analysed using multi-word phrases, including an extensive language standardisation that captures narrative peculiarities more precisely and partly addresses context. These narrative clues are enriched with financial predictors that proved successful in previous studies, so that the financial statements, including the balance sheet and the income statement, are captured in their full breadth and as many fraud signals as possible are identified.
    The results indicate a reliable and robust detection performance over a timeframe of 15 years. Furthermore, they suggest that text-based predictors are superior to financial ratios and that a combination of both is required to achieve the best possible results. Moreover, text-based predictors vary considerably over time, which underscores the importance of updating fraud detection systems frequently. The achieved detection performance was, on average, slightly higher than that of comparable approaches.
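    A minimal sketch of the study's central design idea, combining multi-word narrative phrases (approximated here by TF-IDF n-grams) with numeric financial predictors in a single classifier. It is built on scikit-learn; the toy reports, the two financial ratios, and the logistic-regression head are illustrative assumptions, not the study's actual detection model:

```python
# Sketch: one classifier over both textual (multi-word phrase) and financial
# features. All data and feature names below are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

reports = pd.DataFrame({
    "narrative": ["revenue recognition was accelerated materially",
                  "operations performed in line with expectations",
                  "related party transactions were not disclosed",
                  "cash flow from operations remained stable"],
    "accruals": [0.31, 0.04, 0.27, 0.02],     # hypothetical financial ratios
    "leverage": [0.8, 0.3, 0.7, 0.4],
    "fraud":    [1, 0, 1, 0],                 # AAER-style label
})

features = ColumnTransformer([
    # (2,3)-grams stand in for the multi-word phrases used in the study.
    ("text", TfidfVectorizer(ngram_range=(2, 3)), "narrative"),
    ("fin", StandardScaler(), ["accruals", "leverage"]),
])
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(reports.drop(columns="fraud"), reports["fraud"])
print(model.predict(reports.drop(columns="fraud")))
```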