120 research outputs found
Advanced Methods in Business Process Deviance Mining
Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviates, in a negative or positive way, from its expected or desirable outcomes. Deviant executions of a business process include those that violate compliance rules, as well as executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. In this thesis, the problem of explaining deviations in business processes is first investigated by using features based on sequential and declarative patterns, and a combination of them. 
The explanations are further improved by leveraging the data payload of events and traces in event logs through features based on pure data attribute values and data-aware Declare constraints. The explanations characterizing the deviances are then extracted by direct and indirect methods for rule induction. Using synthetic and real-life logs from multiple domains, a range of feature types and different forms of decision rules are evaluated in terms of their ability to accurately discriminate between non-deviant and deviant executions of a process, as well as in terms of the final outcome returned to the users.
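The feature construction described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy log, the activity names, and the helper functions `activity_counts`, `satisfies_response`, and `encode` are all hypothetical, and only one Declare-style template (response) is shown. Each trace is encoded as activity frequencies plus a 0/1 flag per declarative constraint, which is the kind of vector a rule-induction method could then consume.

```python
from collections import Counter


def activity_counts(trace):
    """Count how often each activity occurs in one trace (a list of activity names)."""
    return Counter(trace)


def satisfies_response(trace, a, b):
    """Declare-style response(a, b): every occurrence of a is eventually followed by b."""
    pending = False
    for activity in trace:
        if activity == a:
            pending = True   # an 'a' is waiting for a later 'b'
        elif activity == b:
            pending = False  # all earlier 'a' occurrences are now answered
    return not pending


def encode(trace, activities, constraints):
    """Encode a trace as activity frequencies followed by 0/1 constraint flags."""
    counts = activity_counts(trace)
    features = [counts[name] for name in activities]
    features += [int(satisfies_response(trace, a, b)) for a, b in constraints]
    return features


# Hypothetical toy log: (trace, label) pairs.
log = [
    (["register", "check", "approve", "pay"], "normal"),
    (["register", "check", "check", "reject"], "deviant"),
]
activities = ["register", "check", "approve", "reject", "pay"]
constraints = [("check", "approve")]

encoded = [(encode(trace, activities, constraints), label) for trace, label in log]
for features, label in encoded:
    print(label, features)
```

In this sketch the last feature separates the two toy traces, which is the sort of pattern a decision rule could pick up; real logs would need many more templates and data-aware conditions.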
Predictive Process Monitoring Methods: Which One Suits Me Best?
Predictive process monitoring has recently gained traction in academia and is
also maturing in companies. However, with the growing body of research, it
can be daunting for companies to navigate this domain and determine, given
certain data, what can be predicted and which methods to use. The main
objective of this paper is to develop a value-driven framework for classifying
existing work on predictive process monitoring. This objective is achieved by
systematically identifying, categorizing, and analyzing existing approaches to
predictive process monitoring. The review is then used to develop a
value-driven framework that can help organizations navigate the
predictive process monitoring field, find value, and exploit the
opportunities enabled by these analysis techniques.
Performance Evaluation of Network Anomaly Detection Systems
Nowadays, there is a huge and growing concern about security in information and communication
technology (ICT) among the scientific community because any attack or anomaly in
the network can greatly affect many domains such as national security, private data storage,
social welfare, economic issues, and so on. Therefore, the anomaly detection domain is a broad
research area, and many different techniques and approaches for this purpose have emerged
through the years.
Attacks, problems, and internal failures, when not detected early, may badly harm an
entire network system. Thus, this thesis presents an autonomous profile-based anomaly detection
system based on the statistical method Principal Component Analysis (PCADS-AD). This
approach creates a network profile called Digital Signature of Network Segment using Flow Analysis
(DSNSF) that denotes the predicted normal behavior of network traffic activity through
historical data analysis. That digital signature is used as a threshold for volume anomaly detection
to detect disparities in the normal traffic trend. The proposed system uses seven traffic flow
attributes: bits, packets, and number of flows to detect problems, and source and destination IP
addresses and ports to provide the network administrator with the information necessary to solve them.
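The idea of comparing observed traffic against a profile of normal behavior can be sketched as follows. This is an illustration under stated assumptions, not the PCADS-AD method: the profile here is a per-interval median over historical days, swapped in for the PCA-derived DSNSF, and the function names, the 50% relative tolerance, and the traffic figures are all hypothetical.

```python
from statistics import median


def build_profile(history):
    """Per-interval median across historical days.

    Assumption: this simple median is a stand-in for the PCA-derived DSNSF.
    """
    return [median(interval) for interval in zip(*history)]


def detect_anomalies(profile, observed, tolerance=0.5):
    """Return indices of intervals whose volume deviates from the profile
    by more than `tolerance` (relative to the expected value)."""
    flagged = []
    for i, (expected, actual) in enumerate(zip(profile, observed)):
        if expected == 0:
            deviates = actual != 0
        else:
            deviates = abs(actual - expected) / expected > tolerance
        if deviates:
            flagged.append(i)
    return flagged


# Hypothetical bit counts per time interval for three historical days.
history = [
    [100, 120, 300, 280],
    [110, 115, 310, 290],
    [90, 125, 290, 270],
]
today = [105, 118, 620, 275]

profile = build_profile(history)
anomalies = detect_anomalies(profile, today)
print(profile, anomalies)
```

The same comparison could be run separately for bits, packets, and number of flows; the IP and port attributes would then be consulted only for the flagged intervals.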
Through evaluation techniques, the addition of a different anomaly detection approach, and
comparisons with other methods, all performed in this thesis using real network traffic data, the results
showed good traffic prediction by the DSNSF and encouraging results for false alarm generation and detection
accuracy in the detection scheme.
With these results, the thesis seeks to advance the state of the art in methods
and strategies for anomaly detection, aiming to overcome some of the challenges that emerge from
the constant growth in complexity, speed, and size of today's large-scale networks, while also providing
high-value results for better detection in real time. In addition, the low complexity and agility of the
proposed system make it applicable to real-time detection.
Statistical Foundations of Actuarial Learning and its Applications
This open access book discusses the statistical modeling of insurance problems, a process which comprises data collection, data analysis and statistical model building to forecast insured events that may happen in the future. It presents the mathematical foundations behind these fundamental statistical concepts and how they can be applied in daily actuarial practice. Statistical modeling has a wide range of applications, and, depending on the application, the theoretical aspects may be weighted differently: here the main focus is on prediction rather than explanation. Starting with a presentation of state-of-the-art actuarial models, such as generalized linear models, the book then dives into modern machine learning tools such as neural networks and text recognition to improve predictive modeling with complex features. Providing practitioners with detailed guidance on how to apply machine learning methods to real-world data sets, and how to interpret the results without losing sight of the mathematical assumptions on which these methods are based, the book can serve as a modern basis for an actuarial education syllabus.
On semiparametric regression and data mining
Semiparametric regression is playing an increasingly large role in the analysis of datasets
exhibiting various complications (Ruppert, Wand & Carroll, 2003). In particular, semiparametric
regression plays a prominent role in the area of data mining, where such
complications are numerous (Hastie, Tibshirani & Friedman, 2001). In this thesis we
develop fast, interpretable methods addressing many of the difficulties associated with
data mining applications including: model selection, missing value analysis, outliers and
heteroscedastic noise.
We focus on function estimation using penalised splines via mixed model methodology
(Wahba 1990; Speed 1991; Ruppert et al. 2003). In dealing with the difficulties
associated with data mining applications many of the models we consider deviate from
typical normality assumptions. These models lead to likelihoods involving analytically
intractable integrals. Thus, in keeping with the aim of speed, we seek analytic approximations
to such integrals which are typically faster than numeric alternatives.
These analytic approximations include not only the popular penalised quasi-likelihood
(PQL) approximations (Breslow & Clayton, 1993) but also variational approximations. Originating
in physics, variational approximations are a relatively new class of approximations
(to statistics) which are simple, fast, flexible and effective. They have recently been
applied to statistical problems in machine learning where they are rapidly gaining popularity
(Jordan, Ghahramani, Jaakkola & Saul, 1999; Corduneanu & Bishop, 2001; Ueda &
Ghahramani, 2002; Bishop & Winn, 2003; Winn & Bishop, 2005).
We develop variational approximations to: generalized linear mixed models
(GLMMs); Bayesian GLMMs; simple missing values models; and for outlier and heteroscedastic
noise models, which are, to the best of our knowledge, new. These methods
are quite effective and extremely fast, with fitting taking minutes if not seconds on a
typical 2008 computer.
We also make a contribution to variational methods themselves. Variational approximations
often underestimate the variance of posterior densities in Bayesian models
(Humphreys & Titterington, 2000; Consonni & Marin, 2004; Wang & Titterington, 2005).
We develop grid-based variational posterior approximations. These approximations combine
a sequence of variational posterior approximations, can be extremely accurate and are
reasonably fast.
Bayesian Multi-Model Frameworks - Properly Addressing Conceptual Uncertainty in Applied Modelling
We use models to understand or predict a system. Often, there are multiple plausible but competing model concepts. Hence, modelling is associated with conceptual uncertainty, i.e., the question about proper handling of such model alternatives. For mathematical models, it is possible to quantify their plausibility based on data and rate them accordingly. Bayesian probability calculus offers several formal multi-model frameworks to rate models in a finite set and to quantify their conceptual uncertainty as model weights. These frameworks are Bayesian model selection and averaging (BMS/BMA), Pseudo-BMS/BMA and Bayesian Stacking.
The goal of this dissertation is to facilitate proper utilization of these Bayesian multi-model frameworks. They follow different principles in model rating, which is why the derived model weights have to be interpreted differently, too. These principles always concern the model setting, i.e., how the models in the set relate to one another and to the true model of the system that generated the observed data. This relation is formalized in model scores that are used for model weighting within each framework. The scores represent framework-specific compromises between the ability of a model to fit the data and the model complexity required to do so.
Hence, the scores are first investigated systematically regarding their respective take on model complexity and are placed in a classification scheme developed for this purpose. This shows that BMS/BMA always aims to identify the true model in the set, that Pseudo-BMS/BMA searches for the model with the largest predictive power even though none of the models is the true one, and that, under the same condition, Bayesian Stacking seeks reliability in prediction by combining the predictive distributions of multiple models.
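How two of these frameworks turn model scores into weights can be sketched as follows. This is a hedged illustration using the standard textbook definitions, not the dissertation's code: BMA weights are taken as proportional to the marginal likelihood times the prior model probability, and Pseudo-BMA weights as proportional to the exponentiated sum of leave-one-out log predictive densities. The function names and all numbers are hypothetical. Bayesian Stacking is not shown because it requires a numerical optimisation over the weights.

```python
import math


def softmax_weights(log_scores):
    """Normalise exp(log_scores) to sum to one, subtracting the maximum for stability."""
    top = max(log_scores)
    unnormalised = [math.exp(score - top) for score in log_scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def bma_weights(log_evidences, prior=None):
    """Posterior model weights from log marginal likelihoods (log model evidence).

    A uniform prior over the models is assumed when none is given.
    """
    n = len(log_evidences)
    prior = prior or [1.0 / n] * n
    return softmax_weights([le + math.log(p) for le, p in zip(log_evidences, prior)])


def pseudo_bma_weights(pointwise_log_pred):
    """Weights from leave-one-out log predictive densities, one list of values per model."""
    return softmax_weights([sum(model) for model in pointwise_log_pred])


# Hypothetical log evidences for three competing models.
w_bma = bma_weights([-10.0, -12.0, -15.0])

# Hypothetical leave-one-out log predictive densities (three data points per model).
w_pseudo = pseudo_bma_weights([
    [-1.0, -1.2, -0.9],
    [-1.1, -1.0, -1.3],
])
print(w_bma, w_pseudo)
```

Both functions share the same normalisation; what differs is the score fed into it, which is why the resulting weights answer different questions and should not be interpreted interchangeably.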
An application example with numerical models illustrates these behaviours and demonstrates which misinterpretations of model weights are likely if a certain framework is applied despite being unsuitable for the underlying model setting. Regarding applied modelling, first, a new setting is proposed that allows a "quasi-true" model in a set to be identified. Second, Bayesian Bootstrapping is employed to take into account that the rating of predictive capability is based on only limited data.
To ensure that the Bayesian multi-model frameworks are employed properly and in a goal-oriented manner, a guideline is set up. Based on a clearly defined modelling goal and the allocation of the available models to the respective setting, it leads to the suitable multi-model framework. Aside from the three investigated frameworks, this guideline contains an additional one that allows a (quasi-)true model to be identified if it is composed of a linear combination of the model alternatives in the set.
The insights gained enable a broad range of users in science and practice to properly employ Bayesian multi-model frameworks in order to quantify and handle conceptual uncertainty. Thus, maximum reliability in system understanding and prediction with multiple models can be achieved. Further, the insights pave the way for systematic model development and improvement.
Big data clustering: Data preprocessing, variable selection, and dimension reduction
[no abstract available]