    Advanced Methods in Business Process Deviance Mining

    Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to the process's expected or desirable outcomes. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. In this thesis, the problem of explaining deviations in business processes is first investigated using features based on sequential and declarative patterns, and combinations thereof. The explanations are further improved by leveraging the data payload of events and traces in event logs, through features based on raw data attribute values and on data-aware Declare constraints. The explanations characterizing the deviances are then extracted by direct and indirect methods for rule induction. Using synthetic and real-life logs from multiple domains, a range of feature types and different forms of decision rules are evaluated in terms of their ability to accurately discriminate between non-deviant and deviant executions of a process, as well as in terms of the final outcome returned to the users.
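
    As a rough illustration of the approach this abstract describes, the sketch below extracts simple sequential features (activity bigrams, standing in for the sequence and Declare patterns) from a made-up labelled event log and fits a shallow decision tree as a stand-in for the thesis's rule-induction methods; all data and parameters are illustrative assumptions.

```python
# A minimal sketch of pattern-based deviance mining, assuming traces are
# given as activity sequences with a deviant/non-deviant label; the toy
# log and the bigram features are illustrative, not the thesis's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy event log: each trace is a sequence of activity names plus a label
# (0 = non-deviant, 1 = deviant).
traces = [
    (["register", "check", "approve", "notify"], 0),
    (["register", "check", "reject", "notify"], 1),
    (["register", "check", "check", "approve"], 0),
    (["register", "reject", "notify"], 1),
]

# Sequential features: activity bigrams, a simple stand-in for the
# sequence-based and Declare-pattern features used in the thesis.
docs = [" ".join(activities) for activities, _ in traces]
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)
y = [label for _, label in traces]

# Rule induction is approximated by a shallow decision tree, whose
# branches read as interpretable rules over pattern occurrence counts.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=vectorizer.get_feature_names_out().tolist()))
```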

    Predictive Process Monitoring Methods: Which One Suits Me Best?

    Predictive process monitoring has recently gained traction in academia and is also maturing in industry. However, with the growing body of research, it can be daunting for companies to navigate this domain and to determine, given certain data, what can be predicted and which methods to use. The main objective of this paper is to develop a value-driven framework for classifying existing work on predictive process monitoring. This objective is achieved by systematically identifying, categorizing, and analyzing existing approaches for predictive process monitoring. The review is then used to develop a value-driven framework that can support organizations in navigating the predictive process monitoring field and help them find value and exploit the opportunities enabled by these analysis techniques.

    Performance Evaluation of Network Anomaly Detection Systems

    Nowadays, there is a huge and growing concern about security in information and communication technology (ICT) among the scientific community, because any attack or anomaly in the network can greatly affect many domains such as national security, private data storage, social welfare, and economic issues. The anomaly detection domain is therefore a broad research area, and many different techniques and approaches for this purpose have emerged over the years. Attacks, problems, and internal failures, when not detected early, may badly harm an entire network system. Thus, this thesis presents an autonomous profile-based anomaly detection system based on the statistical method Principal Component Analysis (PCADS-AD). This approach creates a network profile, called Digital Signature of Network Segment using Flow Analysis (DSNSF), that denotes the predicted normal behavior of network traffic activity through historical data analysis. That digital signature is used as a threshold for volume anomaly detection, flagging disparities from the normal traffic trend. The proposed system uses seven traffic flow attributes: bits, packets, and number of flows to detect problems, and source and destination IP addresses and ports to provide the network administrator with the information necessary to solve them. Through evaluation techniques, the addition of a complementary anomaly detection approach, and comparisons with other methods, all performed in this thesis using real network traffic data, the results show good traffic prediction by the DSNSF and encouraging false-alarm rates and detection accuracy for the detection scheme. These results contribute to advancing the state of the art in methods and strategies for anomaly detection, aiming to surpass some of the challenges that emerge from the constant growth in complexity, speed, and size of today's large-scale networks; moreover, the low complexity and speed of the proposed system make it applicable to detection in real time.
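
    To make the profile-and-threshold idea concrete, the sketch below builds a PCA-based baseline from synthetic historical traffic and flags a volume spike whose reconstruction residual exceeds a 3-sigma threshold; the data, the number of retained components, and the threshold rule are all illustrative assumptions, not the thesis's actual DSNSF construction.

```python
# A minimal sketch of PCA-based traffic profiling in the spirit of DSNSF:
# reconstruct each day's traffic curve from its principal components and
# flag large residuals as volume anomalies. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
# Historical matrix: 30 days x 24 hourly bit counts (synthetic diurnal curve).
hours = np.arange(24)
base = 1e6 * (1.0 + 0.5 * np.sin((hours - 6) * np.pi / 12))
history = base + rng.normal(0, 5e4, size=(30, 24))

# PCA via SVD on mean-centered data; the top components form the profile.
mean = history.mean(axis=0)
U, S, Vt = np.linalg.svd(history - mean, full_matrices=False)
components = Vt[:3]                      # retained principal directions

def reconstruct(day):
    """Project a day's traffic onto the profile and reconstruct it."""
    coeffs = (day - mean) @ components.T
    return mean + coeffs @ components

# Residual threshold learned from history (assumed 3-sigma rule).
residuals = np.linalg.norm(history - np.array([reconstruct(d) for d in history]), axis=1)
threshold = residuals.mean() + 3 * residuals.std()

# A new day with an injected traffic spike should exceed the threshold.
today = base + rng.normal(0, 5e4, size=24)
today[14] += 1.2e6                       # anomalous volume at 14:00
print(np.linalg.norm(today - reconstruct(today)) > threshold)  # expected: True
```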

    Statistical Foundations of Actuarial Learning and its Applications

    This open access book discusses the statistical modeling of insurance problems, a process which comprises data collection, data analysis and statistical model building to forecast insured events that may happen in the future. It presents the mathematical foundations behind these fundamental statistical concepts and how they can be applied in daily actuarial practice. Statistical modeling has a wide range of applications, and, depending on the application, the theoretical aspects may be weighted differently: here the main focus is on prediction rather than explanation. Starting with a presentation of state-of-the-art actuarial models, such as generalized linear models, the book then dives into modern machine learning tools such as neural networks and text recognition to improve predictive modeling with complex features. Providing practitioners with detailed guidance on how to apply machine learning methods to real-world data sets, and how to interpret the results without losing sight of the mathematical assumptions on which these methods are based, the book can serve as a modern basis for an actuarial education syllabus.
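
    As a pointer to the kind of model the book starts from, here is a minimal, hypothetical Poisson GLM for claim frequency on simulated data; the features, coefficients, and scikit-learn-based fitting are illustrative assumptions, not the book's own material.

```python
# A minimal sketch of an actuarial GLM: Poisson regression of claim
# counts on simple risk features, with exposure handled via sample
# weights. All data and feature names are made up for illustration.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(18, 80, n)              # driver age (hypothetical feature)
urban = rng.integers(0, 2, n)             # 1 = urban policyholder
exposure = rng.uniform(0.5, 1.0, n)       # policy years in force

# Simulated claim frequency: higher for younger, urban drivers.
lam = exposure * np.exp(-2.0 + 0.8 * urban - 0.01 * (age - 40))
claims = rng.poisson(lam)

X = np.column_stack([age, urban])
# Model the rate per unit exposure: fit counts/exposure, weighted by exposure.
glm = PoissonRegressor(alpha=0.0).fit(X, claims / exposure, sample_weight=exposure)
print(glm.coef_, glm.intercept_)          # roughly recovers (-0.01, 0.8), -2.0
```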

    On semiparametric regression and data mining

    Semiparametric regression is playing an increasingly large role in the analysis of datasets exhibiting various complications (Ruppert, Wand & Carroll, 2003). In particular, semiparametric regression plays a prominent role in the area of data mining, where such complications are numerous (Hastie, Tibshirani & Friedman, 2001). In this thesis we develop fast, interpretable methods addressing many of the difficulties associated with data mining applications, including model selection, missing value analysis, outliers, and heteroscedastic noise. We focus on function estimation using penalised splines via mixed model methodology (Wahba, 1990; Speed, 1991; Ruppert et al., 2003). In dealing with the difficulties associated with data mining applications, many of the models we consider deviate from typical normality assumptions. These models lead to likelihoods involving analytically intractable integrals. Thus, in keeping with the aim of speed, we seek analytic approximations to such integrals, which are typically faster than numeric alternatives. These analytic approximations include not only popular penalised quasi-likelihood (PQL) approximations (Breslow & Clayton, 1993) but also variational approximations. Originating in physics, variational approximations are a class of approximations relatively new to statistics that are simple, fast, flexible and effective. They have recently been applied to statistical problems in machine learning, where they are rapidly gaining popularity (Jordan, Ghahramani, Jaakkola & Saul, 1999; Corduneanu & Bishop, 2001; Ueda & Ghahramani, 2002; Bishop & Winn, 2003; Winn & Bishop, 2005). We develop variational approximations to generalized linear mixed models (GLMMs), Bayesian GLMMs, simple missing values models, and outlier and heteroscedastic noise models, which are, to the best of our knowledge, new. These methods are quite effective and extremely fast, with fitting taking minutes if not seconds on a typical 2008 computer. We also make a contribution to variational methods themselves. Variational approximations often underestimate the variance of posterior densities in Bayesian models (Humphreys & Titterington, 2000; Consonni & Marin, 2004; Wang & Titterington, 2005). We develop grid-based variational posterior approximations. These approximations combine a sequence of variational posterior approximations, can be extremely accurate, and are reasonably fast.
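
    A minimal sketch of the penalised-spline machinery the thesis builds on, assuming a truncated-line basis, evenly spaced knots, and a fixed smoothing parameter (in the mixed-model view, lambda corresponds to the ratio of error variance to random-effect variance); knot placement and lambda are illustrative choices:

```python
# Penalised-spline regression with a truncated-line basis and a ridge
# penalty on the knot coefficients, mirroring the fixed/random-effects
# split of the mixed-model formulation. Simulated data throughout.
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

knots = np.linspace(0, 1, 20)[1:-1]                  # interior knots
X = np.column_stack([np.ones_like(x), x])            # fixed effects: 1, x
Z = np.maximum(x[:, None] - knots[None, :], 0.0)     # random effects: (x - k)+

# Penalised least squares: only the knot coefficients are shrunk,
# matching the mixed model where lambda = sigma_eps^2 / sigma_u^2.
lam = 1.0
C = np.hstack([X, Z])
D = np.diag([0.0, 0.0] + [lam] * Z.shape[1])
beta = np.linalg.solve(C.T @ C + D, C.T @ y)
fit = C @ beta                                       # the penalised-spline estimate
print(float(np.mean((fit - np.sin(2 * np.pi * x)) ** 2)))  # small in-sample MSE
```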

    Bayesian Multi-Model Frameworks - Properly Addressing Conceptual Uncertainty in Applied Modelling

    We use models to understand or predict a system. Often, there are multiple plausible but competing model concepts. Hence, modelling is associated with conceptual uncertainty, i.e., the question of how to properly handle such model alternatives. For mathematical models, it is possible to quantify their plausibility based on data and to rate them accordingly. Bayesian probability calculus offers several formal multi-model frameworks to rate the models in a finite set and to quantify their conceptual uncertainty as model weights: Bayesian model selection and averaging (BMS/BMA), Pseudo-BMS/BMA, and Bayesian Stacking. The goal of this dissertation is to facilitate the proper utilization of these Bayesian multi-model frameworks. They follow different principles in model rating, which is why the derived model weights have to be interpreted differently, too. These principles always concern the model setting, i.e., how the models in the set relate to one another and to the true model of the system that generated the observed data. This relation is formalized in model scores that are used for model weighting within each framework. The scores represent framework-specific compromises between the ability of a model to fit the data and the model complexity required to do so. Hence, the scores are first investigated systematically regarding their respective treatment of model complexity and are organized in a classification scheme developed for this purpose. This shows that BMS/BMA always aims to identify the true model in the set, that Pseudo-BMS/BMA searches for the model with the largest predictive power even though none of the models is the true one, and that, under that condition, Bayesian Stacking seeks reliability in prediction by combining the predictive distributions of multiple models. An application example with numerical models illustrates these behaviours and demonstrates which misinterpretations of model weights can arise if a framework is applied to a model setting it does not suit. Regarding applied modelling, first, a new setting is proposed that allows a "quasi-true" model in a set to be identified. Second, Bayesian Bootstrapping is employed to account for the fact that the rating of predictive capability is based on only limited data. To ensure that the Bayesian multi-model frameworks are employed properly and in a goal-oriented manner, a guideline is set up. Given a clearly defined modelling goal and the allocation of the available models to the respective setting, it leads to the suitable multi-model framework. Besides the three investigated frameworks, this guideline contains an additional one that allows a (quasi-)true model to be identified if it is composed of a linear combination of the model alternatives in the set. The gained insights enable a broad range of users in science and practice to properly employ Bayesian multi-model frameworks in order to quantify and handle conceptual uncertainty. Thus, maximum reliability in system understanding and prediction with multiple models can be achieved. Furthermore, the insights pave the way for systematic model development and improvement.
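
    To illustrate the mechanics these frameworks share, the sketch below turns log marginal likelihoods (the BMS/BMA score) into posterior model weights and a BMA-style mixed prediction; the scores, predictions, and uniform model prior are made-up numbers for illustration.

```python
# BMS/BMA weighting in miniature: posterior model weights follow from
# Bayes' theorem applied to prior model probabilities and marginal
# likelihoods, computed in log space for numerical stability.
import numpy as np

log_ml = np.array([-104.2, -101.7, -108.9])   # log p(data | M_k), assumed values
prior = np.array([1 / 3, 1 / 3, 1 / 3])       # uniform prior over the model set

# Posterior weights: normalise prior * marginal likelihood.
log_post = np.log(prior) + log_ml
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()
print(weights)                                # mass concentrates on the best-scoring model

# BMA prediction: mix the models' predictive means by these weights.
preds = np.array([3.1, 2.8, 3.4])             # hypothetical per-model predictions
print(weights @ preds)
```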

    Big data clustering: Data preprocessing, variable selection, and dimension reduction

    [no abstract available]