    HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks

    The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical sequences. However, we often have access to time series data that represent paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce HYPA, a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph. HYPA provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order. Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM Data Mining (SDM 2020).

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods that rely crucially on data compression techniques to define a measure of remoteness and distance between pairs of character sequences (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of Artificial Text, and we show how these new tools can be used for information extraction. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind them. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution, and self-consistent classification. Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures.
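The idea of a compression-based distance between two texts can be illustrated with a minimal sketch. Note that this uses the normalized compression distance (NCD), a well-known related measure, and not the specific remoteness measure defined by the authors; the function names, the use of zlib, and the sample strings are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Illustrative stand-in only: the abstract above does not specify its
    measure, so this is NOT the authors' definition.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample texts: one English, one Italian.
english = "the quick brown fox jumps over the lazy dog and runs through the quiet forest " * 5
italian = "la volpe veloce salta sopra il cane pigro e corre attraverso la foresta silenziosa " * 5

same_language = ncd(english, english)
cross_language = ncd(english, italian)
```

A text concatenated with itself compresses almost as well as the text alone, so `same_language` should come out smaller than `cross_language`; that ordering is what makes such a quantity usable for tasks like language recognition.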

    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, are now widely used and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection, and engineering. I hope that readers will find this book useful and helpful for their own research.

    Sequence Analysis and Related Approaches

    This open access book provides innovative methods and original applications of sequence analysis (SA) and related methods for analysing longitudinal data describing life trajectories such as professional careers, family paths, the succession of health statuses, or time use. The applications as well as the methodological contributions proposed in this book pay special attention to the combined use of SA and other methods for longitudinal data such as event history analysis, Markov modelling, and sequence networks. The methodological contributions include, among others, original propositions for measuring the precarity of work trajectories, Markov-based methods for clustering sequences, fuzzy and monothetic clustering of sequences, network-based SA, and the joint use of SA with hidden Markov models and with survival models. The applications cover the comparison of gendered occupational trajectories in Germany, the study of changes in women's market participation in Denmark, the study of the typical day of dual-earner couples in Italy, of mobility patterns in Togo, of internet addiction in Switzerland, and of the quality of employment careers after a first unemployment spell. As such, this book provides a wealth of information for social scientists interested in quantitative life course analysis, and for all those working in sociology, demography, economics, health, psychology, social policy, and statistics. The book provides new perspectives and methods for sequence analysis; focusses on the link between sequence analysis and other methods for longitudinal data, especially event history analysis and Markov models; stresses the complementarity of sequence analysis and other models for longitudinal data; and presents applications of sequence analysis in a whole range of different domains.

    Statistical and network-based methods for the analysis of chromatin accessibility maps in single cells

    In this work, methods from physics, statistics, and graph theory were used to characterize and analyze chromatin opening and accessibility profiles obtained with the ATAC-seq technique in single cells, specifically B lymphocytes from three patients with chronic lymphocytic leukemia. A bioinformatic pipeline was developed to process the sequencing data and obtain the accessible positions of the genome for each cell. The number of open regions and their spatial distribution along the DNA were characterized. Finally, the simultaneous opening of regulatory regions in the same single cells was used as a metric to assess functional relationships; in this way, graphs between enhancers and promoters were constructed and their properties analyzed. The spatial distribution along the genome of consecutive open regions recapitulates structural properties such as nucleosome arrays and chromatin loop structures. Moreover, the accessibility profiles of regulatory regions are significantly conserved across single cells. The networks between enhancers and promoters provide a way to characterize the relevance of each regulatory region in terms of centrality. Statistics on the connectivity between enhancers and promoters confirm the one-to-one relationship model as the most frequent, in which a promoter is regulated by the enhancer closest to it. Finally, the functioning of super-enhancers was also investigated. In conclusion, ATAC-seq proves to be an effective technique for investigating chromatin opening in single cells, whose accessibility profiles recapitulate structural and functional features of chromatin. To investigate the mechanisms of the disease, the accessibility landscape of tumor lymphocytes can be compared with that of healthy cells and of cells treated with epigenetic drugs.

    On the Expected Number of Limited Length Binary Strings Derived by Certain Urn Models

    Reconstructing Dynamical Systems From Stochastic Differential Equations to Machine Learning

    Modeling complex systems with large numbers of degrees of freedom has become a grand challenge over the past decades. Typically, only a few variables of complex systems are observed in terms of measured time series, while the majority of them, which potentially interact with the observed ones, remain hidden. Throughout this thesis, we tackle the problem of reconstructing and predicting the underlying dynamics of complex systems using different data-driven approaches. In the first part, we address the inverse problem of inferring an unknown network structure of complex systems, reflecting spreading phenomena, from observed event series. We study the pairwise statistical similarity between the sequences of event timings at all nodes through event synchronization (ES) and event coincidence analysis (ECA), relying on the idea that functional connectivity can serve as a proxy for structural connectivity. In the second part, we focus on reconstructing the underlying dynamics of complex systems from their dominant macroscopic variables using different Stochastic Differential Equations (SDEs). We investigate the performance of three different SDEs: the Langevin Equation (LE), the Generalized Langevin Equation (GLE), and the Empirical Model Reduction (EMR) approach. Our results reveal that the LE performs better for systems with weak memory, while it fails to reconstruct the underlying dynamics of systems with memory effects and colored-noise forcing. In these situations, the GLE and EMR are more suitable candidates, since the interactions between observed and unobserved variables are considered in terms of memory effects. In the last part of this thesis, we develop a model based on the Echo State Network (ESN), combined with the past noise forecasting (PNF) method, to predict real-world complex systems. Our results show that the proposed model captures the crucial features of the underlying dynamics of climate variability.
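To make the notion of reconstructing dynamics with a Langevin equation concrete, the following is a minimal sketch, not the method of the thesis. It assumes the simplest memoryless case, a linear drift with white noise, dx = -theta * x * dt + sigma * dW, simulates it with the Euler-Maruyama scheme, and then recovers the drift coefficient from the simulated series by least squares. The function names and all parameter values are hypothetical.

```python
import math
import random


def simulate_langevin(theta, sigma, x0, dt, n_steps, seed=0):
    """Euler-Maruyama integration of dx = -theta * x * dt + sigma * dW."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        noise = rng.gauss(0.0, 1.0)  # standard normal increment
        xs.append(x - theta * x * dt + sigma * math.sqrt(dt) * noise)
    return xs


def estimate_theta(xs, dt):
    """Least-squares fit of the increments to the model dx = -theta * x * dt."""
    num = sum(x * (x_next - x) for x, x_next in zip(xs, xs[1:]))
    den = sum(x * x for x in xs[:-1])
    return -num / (den * dt)


# Assumed illustrative parameters (not taken from the thesis).
path = simulate_langevin(theta=1.0, sigma=0.5, x0=1.0, dt=0.01, n_steps=20000)
theta_hat = estimate_theta(path, dt=0.01)
```

For a long enough series, `theta_hat` should lie close to the true value of 1.0. A fit of this kind presupposes white noise and no memory, which is consistent with the abstract's remark that the LE is suited to weak-memory systems and that memory effects call for the GLE or EMR instead.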
