
    The Bayesian Context Trees State Space Model for time series modelling and forecasting

    A hierarchical Bayesian framework is introduced for developing rich mixture models for real-valued time series, along with a collection of effective tools for learning and inference. At the top level, meaningful discrete states are identified as appropriately quantised values of some of the most recent samples. This collection of observable states is described by a discrete context-tree model. Then, at the bottom level, a different, arbitrary model for real-valued time series (a base model) is associated with each state. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. We call this the Bayesian Context Trees State Space Model, or the BCT-X framework. Efficient algorithms are introduced that allow for effective, exact Bayesian inference; in particular, the maximum a posteriori probability (MAP) context-tree model can be identified. These algorithms can be updated sequentially, facilitating efficient online forecasting. The utility of the general framework is illustrated in two particular instances: when autoregressive (AR) models are used as base models, resulting in a nonlinear AR mixture model; and when conditional heteroscedastic (ARCH) models are used, resulting in a mixture model that offers a powerful and systematic way of modelling the well-known volatility asymmetries in financial data. In forecasting, the BCT-X methods are found to outperform state-of-the-art techniques on simulated and real-world data, both in terms of accuracy and computational requirements. In modelling, the BCT-X framework identifies structure naturally present in the data. In particular, the BCT-ARCH model reveals a novel, important feature of stock market index data, in the form of an enhanced leverage effect.
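
    To make the state-dependent structure concrete, here is a minimal sketch of the AR-based instance of the framework; the notation below is assumed for illustration and is not taken verbatim from the paper. Writing s_t for the discrete state (the context-tree leaf) reached by quantising the d most recent samples, each state carries its own AR(p) base model:

        X_t = \phi_{s_t,0} + \sum_{i=1}^{p} \phi_{s_t,i} X_{t-i} + \sigma_{s_t} \epsilon_t, \quad \epsilon_t \sim N(0,1), \quad s_t = Q(X_{t-1}, \dots, X_{t-d}),

    so that Bayesian inference over the context tree determines which quantised contexts receive their own AR parameters.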

    Inverse clustering of Gibbs Partitions via independent fragmentation and dual dependent coagulation operators

    Gibbs partitions of the integers generated by stable subordinators of index α ∈ (0,1) form remarkable classes of random partitions for which, in principle, much is known about their properties, including the practically effortless derivation of otherwise complex asymptotic results potentially relevant to applications in general combinatorial stochastic processes, random tree/graph growth models and Bayesian statistics. This class includes the well-known models based on the two-parameter Poisson-Dirichlet distribution, which form the bulk of explicit applications. This work continues efforts to provide interpretations for larger classes of Gibbs partitions by embedding important operations within this framework. Here we address the formidable problem of extending the dual, infinite-block, coagulation/fragmentation results of Jim Pitman (1999, Annals of Probability), which in terms of coagulation are based on independent two-parameter Poisson-Dirichlet distributions, to all such Gibbs (stable Poisson-Kingman) models. Our results create nested families of Gibbs partitions, and corresponding mass partitions, over any 0 < β < α < 1. We primarily focus on the fragmentation operations, which remain independent in this setting, and the corresponding remarkable calculations for Gibbs partitions derived from that operation. We also present definitive results for the dual coagulation operations, now based on our construction of dependent processes, and demonstrate their relatively simple application in terms of Mittag-Leffler and generalized gamma models. The latter demonstrates another approach to recovering the duality results in Pitman (1999).
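
    For orientation, the Gibbs (stable Poisson-Kingman) class referred to above is characterised by an exchangeable partition probability function of product form; the following is standard background rather than a result specific to this paper. A partition of [n] into k blocks of sizes n_1, ..., n_k has probability

        p(n_1, \dots, n_k) = V_{n,k} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1},

    where (x)_m = x(x+1)\cdots(x+m-1) is the rising factorial and the weights V_{n,k} satisfy the backward recursion V_{n,k} = (n - k\alpha) V_{n+1,k} + V_{n+1,k+1}; the two-parameter Poisson-Dirichlet models correspond to one particular choice of the V_{n,k}.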

    Security analyses for detecting deserialisation vulnerabilities: a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand

    An important task in software security is to identify potential vulnerabilities. Attackers exploit security vulnerabilities in systems to obtain confidential information, to breach system integrity, and to make systems unavailable to legitimate users. In recent years, particularly in 2012, there has been a rise in reported Java vulnerabilities. One type of vulnerability involves (de)serialisation, a commonly used feature for storing objects or data structures in an external format and restoring them. In 2015, a deserialisation vulnerability was reported involving Apache Commons Collections, a popular Java library, which affected numerous Java applications. Another major deserialisation-related vulnerability, affecting 55% of Android devices, was also reported in 2015. Both of these vulnerabilities allowed arbitrary code execution on vulnerable systems by malicious users, a serious risk, and prompted the Java community to issue patches fixing serialisation-related vulnerabilities in both the Java Development Kit and libraries. Despite attention to coding guidelines and defensive strategies, deserialisation remains a risky feature and a potential weakness in object-oriented applications. In fact, deserialisation-related vulnerabilities (both denial-of-service and remote code execution) continue to be reported for Java applications. Further, deserialisation is a form of parsing, in which external data is converted from its external representation into a program's internal data structures; hence, similar vulnerabilities can potentially be present in parsers for file formats and serialisation languages. The problem is, given a software package, to detect injection or denial-of-service vulnerabilities and to propose strategies to prevent attacks that exploit them. The research reported in this thesis casts detecting deserialisation-related vulnerabilities as a program analysis task. The goal is to automatically discover this class of vulnerabilities using program analysis techniques, and to experimentally evaluate the efficiency and effectiveness of the proposed methods on real-world software. We use multiple techniques to detect reachability of sensitive methods, and taint analysis to detect whether untrusted user input can result in security violations. Challenges in using program analysis for detecting deserialisation vulnerabilities include addressing soundness issues in analysing dynamic features in Java (e.g., native code). Another hurdle is that available techniques mostly target the analysis of applications rather than library code. In this thesis, we develop techniques to address soundness issues related to analysing Java code that uses serialisation, and we adapt dynamic techniques such as fuzzing to address precision issues in the results of our analysis. We also use the results from our analysis to study libraries in other languages, and check whether they are vulnerable to deserialisation-type attacks. We then discuss mitigation measures that engineers can use to protect their software against such vulnerabilities. In our experiments, we show that we can find unreported vulnerabilities in Java code, and that similar vulnerabilities are also present in widely used serialisers for popular languages such as JavaScript, PHP and Rust. In our study, we discovered previously unknown denial-of-service security bugs in applications and libraries that parse external data formats such as YAML, PDF and SVG.
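
    As a concrete illustration of the defensive strategies mentioned above, the sketch below shows a minimal Python analogue (the thesis targets Java; this is not its code). Unrestricted pickle deserialisation allows attacker-controlled object construction, and the standard mitigation is to restrict which classes the unpickler may resolve, similar in spirit to Java's ObjectInputFilter:

        import io
        import pickle

        class RestrictedUnpickler(pickle.Unpickler):
            # Allow-list of (module, name) pairs that may be deserialised.
            ALLOWED = {("builtins", "dict"), ("builtins", "list"),
                       ("builtins", "str"), ("builtins", "int")}

            def find_class(self, module, name):
                if (module, name) in self.ALLOWED:
                    return super().find_class(module, name)
                # Anything else (e.g. an os.system gadget) is rejected outright.
                raise pickle.UnpicklingError(f"blocked {module}.{name}")

        def safe_loads(data: bytes):
            """Deserialise untrusted bytes with the allow-list enforced."""
            return RestrictedUnpickler(io.BytesIO(data)).load()

        # Plain containers round-trip; payloads resolving other classes are refused.
        print(safe_loads(pickle.dumps({"ok": [1, 2, 3]})))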

    Mining International Political Norms from the GDELT Database

    Researchers have long been interested in the role that norms can play in governing agent actions in multi-agent systems. Much work has been done on formalising normative concepts from human society and adapting them for the governance of open software systems, and on the simulation of normative processes in human and artificial societies. However, there has been comparatively little work on applying normative MAS mechanisms to understanding the norms in human society. This work investigates this issue in the context of international politics. Using the GDELT dataset, which contains machine-coded records of international events extracted from news reports, we extracted bilateral sequences of inter-country events and applied a Bayesian norm-mining mechanism to identify the norms that best explained the observed behaviour. A statistical evaluation showed that the normative model fitted the data significantly better than a probabilistic discrete event model.
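
    As a rough illustration of the kind of model comparison described in the last sentence, the sketch below compares two Dirichlet-multinomial models of event counts: a pooled model and one whose event distribution depends on a hypothesised norm context. The counts, prior and model structure are assumptions made for illustration, not the paper's actual norm-mining mechanism:

        from math import lgamma

        def dm_logml(counts, alpha=1.0):
            """Log marginal likelihood of categorical observations under a symmetric Dirichlet(alpha) prior."""
            n, k = sum(counts), len(counts)
            return (lgamma(k * alpha) - lgamma(k * alpha + n)
                    + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

        # Hypothetical event-type counts for one country dyad (e.g. cooperate, consult, threaten).
        pooled = [30, 12, 8]                    # one distribution for all contexts
        by_context = [[25, 3, 1], [5, 9, 7]]    # distribution differs by norm-relevant context

        log_bf = sum(dm_logml(c) for c in by_context) - dm_logml(pooled)
        print(f"log Bayes factor favouring the context-dependent model: {log_bf:.2f}")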

    Hierarchical Bayesian Nonparametric Models for Power-Law Sequences

    Sequence data that exhibits power-law behavior in its marginal and conditional distributions arises frequently from natural processes, with natural language text being a prominent example. We study probabilistic models for such sequences based on a hierarchical non-parametric Bayesian prior, develop inference and learning procedures for making these models useful in practice and applicable to large, real-world data sets, and empirically demonstrate their excellent predictive performance. In particular, we consider models based on the infinite-depth variant of the hierarchical Pitman-Yor process (HPYP) language model [Teh, 2006b] known as the Sequence Memoizer, as well as Sequence Memoizer-based cache language models and hybrid models combining the HPYP with neural language models. We empirically demonstrate that these models perform well on language modelling and data compression tasks.
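
    For reference, these models are built from the hierarchical Pitman-Yor predictive rule; the formula below is the standard statement from the HPYP language-modelling literature (notation assumed here, not quoted from the thesis). The probability of word w after context u interpolates smoothed counts at u with the prediction for the shortened context \pi(u):

        P(w \mid u) = \frac{c_{uw} - d_{|u|} t_{uw}}{\theta_{|u|} + c_{u\cdot}} + \frac{\theta_{|u|} + d_{|u|} t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}} \, P(w \mid \pi(u)),

    where c_{uw} and t_{uw} are the customer and table counts for w in the restaurant for context u, and d and \theta are the discount and concentration parameters; the Sequence Memoizer lets the context length grow without bound.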

    Analyse de la sécurité de systèmes critiques embarqués à forte composante logicielle par interprétation abstraite (Security analysis of software-intensive critical embedded systems by abstract interpretation)

    This thesis is dedicated to the analysis of low-level software, such as operating systems, by abstract interpretation. Analyzing OSes is a crucial issue for guaranteeing the safety of software systems, since they are the layer immediately above the hardware and all application tasks rely on them. For critical applications, we want to prove that the OS does not crash and that it ensures the isolation of programs, so that an untrusted program cannot disrupt a trusted one. The analysis of this kind of program raises specific issues, because OSes must control hardware using instructions that are meaningless in ordinary programs. In addition, because hardware features are outside the scope of C, the source code includes assembly blocks mixed with C code. These are the two main axes of this thesis: handling mixed C and assembly, and precisely abstracting instructions that are specific to low-level software. This work is motivated by the analysis of a case study from an industrial partner, which required implementing the proposed methods in the static analyzer Astrée. The first part formalizes a language mixing simplified models of C and assembly, from syntax to semantics. This specification is crucial for defining what is legal and what is a bug, while taking into account the intricacy of the interactions between C and assembly, in terms of both data flow and control flow. The second part is a short introduction to abstract interpretation, focusing on what is needed in the rest of the thesis. The third part proposes an abstraction of the semantics of mixed C and assembly; it consists of a series of parametric abstractions, each handling one aspect of the semantics. The fourth part addresses the abstraction of instructions specific to low-level software. The properties of interest can easily be proven using ghost variables, but for technical reasons it is difficult to design a reduced product of abstract domains that handles ghost variables satisfactorily. This part builds such a framework in full generality, together with domains that allow us to solve our problem and many others. The final part details properties that would need to be proven to guarantee the isolation of programs but were not treated, as they raise many complicated questions. We also give some suggestions for improving the product of domains with ghost variables introduced in the previous part, in terms of both features and performance.
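
    To give a flavour of the abstract-interpretation machinery discussed above, here is a minimal sketch of an interval abstract domain with widening; it is a generic textbook construction written in Python, not one of the domains developed in the thesis:

        import math

        class Interval:
            """Abstract value [lo, hi] over-approximating a set of concrete values."""
            def __init__(self, lo, hi):
                self.lo, self.hi = lo, hi

            def join(self, other):
                # Least upper bound: the smallest interval containing both operands.
                return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

            def widen(self, other):
                # Widening: jump to infinity on unstable bounds to force termination.
                lo = self.lo if other.lo >= self.lo else -math.inf
                hi = self.hi if other.hi <= self.hi else math.inf
                return Interval(lo, hi)

            def add(self, other):
                return Interval(self.lo + other.lo, self.hi + other.hi)

            def __repr__(self):
                return f"[{self.lo}, {self.hi}]"

        # Abstractly analyse the loop "x = 0; while (...) x = x + 1;".
        x = Interval(0, 0)
        while True:
            body = x.add(Interval(1, 1))      # effect of one loop iteration
            new = x.widen(x.join(body))       # accumulate and widen at the loop head
            if (new.lo, new.hi) == (x.lo, x.hi):
                break                         # fixpoint reached
            x = new
        print(x)  # [0, inf]: a sound over-approximation of every reachable value of x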

    Generative Non-Markov Models for Information Extraction

    Learning from unlabeled data is a long-standing challenge in machine learning. A principled solution involves modeling the full joint distribution over inputs and the latent structure of interest, and imputing the missing data via marginalization. Unfortunately, such marginalization is expensive for most non-trivial problems, which places practical limits on the expressiveness of generative models. As a result, joint models often encode strict assumptions about the underlying process, such as fixed-order Markovian assumptions, and employ simple count-based features of the inputs. In contrast, conditional models, which do not directly model the observed data, are free to incorporate rich overlapping features of the input in order to predict the latent structure of interest. It would be desirable to develop expressive generative models that retain tractable inference; this is the topic of this thesis. In particular, we explore joint models which relax fixed-order Markov assumptions, and investigate the use of recurrent neural networks for automatic feature induction in the generative process. We focus on two structured prediction problems: (1) imputing labeled segmentations of input character sequences, and (2) imputing directed spanning trees relating strings in text corpora. These problems arise in many applications of practical interest, but in this work we are primarily concerned with named-entity recognition and cross-document coreference resolution. For named-entity recognition, we propose a generative model in which the observed characters originate from a latent non-Markov process over words, and where the characters are themselves produced via a non-Markov process: a recurrent neural network (RNN). We propose a sampler for the model in which sequential Monte Carlo is used as a transition kernel for a Gibbs sampler. The kernel is amenable to a fast parallel implementation and results in fast mixing in practice. For cross-document coreference resolution, we move beyond sequence modeling to consider string-to-string transduction. We stipulate a generative process for a corpus of documents in which entity names arise by copying, and optionally transforming, previous names of the same entity. Our proposed model is sensitive both to the context in which the names occur and to their spelling. The string-to-string transformations correspond to systematic linguistic processes such as abbreviation, typos, and nicknaming, and, by analogy to biology, we think of them as mutations along the edges of a phylogeny. We propose a novel block Gibbs sampler for this problem that alternates between sampling an ordering of the mentions and a spanning tree relating all mentions in the corpus.
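
    To illustrate the sequential Monte Carlo building block used as a Gibbs transition kernel above, here is a minimal bootstrap particle filter for a toy Gaussian state-space model. It is an illustrative sketch only; the thesis applies SMC to a character-level segmentation model, not to this model, and all parameters below are made up:

        import numpy as np

        rng = np.random.default_rng(0)

        def bootstrap_filter(obs, n_particles=500, phi=0.9, q=1.0, r=0.5):
            """Bootstrap particle filter for x_t = phi*x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
            particles = rng.normal(0.0, 1.0, n_particles)
            log_evidence = 0.0
            filtering_means = []
            for y in obs:
                # Propagate particles through the transition model (the bootstrap proposal).
                particles = phi * particles + rng.normal(0.0, np.sqrt(q), n_particles)
                # Weight each particle by the observation likelihood N(y; x, r).
                logw = -0.5 * ((y - particles) ** 2 / r + np.log(2 * np.pi * r))
                w = np.exp(logw - logw.max())
                log_evidence += logw.max() + np.log(w.mean())
                w /= w.sum()
                filtering_means.append(float(np.sum(w * particles)))
                # Multinomial resampling keeps the particle set from degenerating.
                particles = rng.choice(particles, size=n_particles, p=w)
            return filtering_means, log_evidence

        means, logz = bootstrap_filter([0.3, 0.5, 1.1, 0.9, 1.4])
        print(means, logz)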

    Improving object management in HPC workflows

    Object management accounts for a substantial fraction of the total computing time in any distributed application, and it also adds complexity to the source code. This project proposes and implements a set of features aimed at improving both the usability and performance of distributed applications.