The Bayesian Context Trees State Space Model for time series modelling and forecasting
A hierarchical Bayesian framework is introduced for developing rich mixture
models for real-valued time series, along with a collection of effective tools
for learning and inference. At the top level, meaningful discrete states are
identified as appropriately quantised values of some of the most recent
samples. This collection of observable states is described as a discrete
context-tree model. Then, at the bottom level, a different, arbitrary model for
real-valued time series - a base model - is associated with each state. This
defines a very general framework that can be used in conjunction with any
existing model class to build flexible and interpretable mixture models. We
call this the Bayesian Context Trees State Space Model, or the BCT-X framework.
Efficient algorithms are introduced that allow for effective, exact Bayesian
inference; in particular, the maximum a posteriori probability (MAP)
context-tree model can be identified. These algorithms can be updated
sequentially, facilitating efficient online forecasting. The utility of the
general framework is illustrated in two particular instances: when
autoregressive (AR) models are used as base models, resulting in a nonlinear AR
mixture model, and when conditional heteroscedastic (ARCH) models are used,
resulting in a mixture model that offers a powerful and systematic way of
modelling the well-known volatility asymmetries in financial data. In
forecasting, the BCT-X methods are found to outperform state-of-the-art
techniques on simulated and real-world data, both in terms of accuracy and
computational requirements. In modelling, the BCT-X methods uncover natural
structure present in the data. In particular, the BCT-ARCH model reveals a
novel, important feature of stock market index data, in the form of an enhanced
leverage effect.
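The top-level idea of attaching a base model to each quantised context can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: the binary quantiser, fixed context depth, and least-squares AR(1) fitting below are all assumptions made for brevity, and no Bayesian model selection over context trees is performed.

```python
import random

# Toy sketch: each discrete "context" state is a tuple of quantised recent
# samples, and each state owns its own AR(1) base model fitted by least squares.

def quantise(x, threshold=0.0):
    """Map a real sample to a binary symbol: below/above the threshold."""
    return 0 if x < threshold else 1

def fit_bct_ar(series, depth=2):
    """Group samples by their quantised context, then fit one AR(1) per state."""
    by_state = {}
    for t in range(depth, len(series)):
        state = tuple(quantise(s) for s in series[t - depth:t])
        by_state.setdefault(state, []).append((series[t - 1], series[t]))
    coeffs = {}
    for state, pairs in by_state.items():
        sxx = sum(x * x for x, _ in pairs)
        sxy = sum(x * y for x, y in pairs)
        coeffs[state] = sxy / sxx          # least-squares AR(1) coefficient
    return coeffs

rng = random.Random(0)
series, x = [], 0.0
for _ in range(500):
    x = 0.8 * x + rng.gauss(0.0, 1.0)      # simulate a stable AR(1) process
    series.append(x)

coeffs = fit_bct_ar(series)
print(sorted(coeffs))
```

With depth 2 and a binary quantiser there are at most four states; the full framework instead learns the context-tree depth and structure from the data.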
Inverse clustering of Gibbs Partitions via independent fragmentation and dual dependent coagulation operators
Gibbs partitions of the integers generated by stable subordinators of index
α ∈ (0, 1) form remarkable classes of random partitions where in
principle much is known about their properties, including practically
effortless obtainment of otherwise complex asymptotic results potentially
relevant to applications in general combinatorial stochastic processes, random
tree/graph growth models and Bayesian statistics. This class includes the
well-known models based on the two-parameter Poisson-Dirichlet distribution
which forms the bulk of explicit applications. This work continues efforts to
provide interpretations for larger classes of Gibbs partitions by embedding
important operations within this framework. Here we address the formidable
problem of extending the dual, infinite-block, coagulation/fragmentation
results of Jim Pitman (1999, Annals of Probability), where in terms of
coagulation they are based on independent two-parameter Poisson-Dirichlet
distributions, to all such Gibbs (stable Poisson-Kingman) models. Our results
create nested families of Gibbs partitions, and corresponding mass partitions,
over this entire class. We primarily focus on the fragmentation
operations, which remain independent in this setting, and corresponding
remarkable calculations for Gibbs partitions derived from that operation. We
also present definitive results for the dual coagulation operations, now based
on our construction of dependent processes, and demonstrate their relatively
simple application in terms of Mittag-Leffler and generalized gamma models. The
latter demonstrates another approach to recovering the duality results in Pitman
(1999).
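The two-parameter Poisson-Dirichlet family that anchors these results has a concrete sequential description, the two-parameter Chinese restaurant process. A minimal simulation sketch (the parameter values below are chosen purely for illustration):

```python
import random

def crp_two_param(n, alpha, theta, seed=0):
    """Sample a partition of {1..n} from the two-parameter (alpha, theta)
    Chinese restaurant process: customer t+1 joins an existing block of size
    n_j with probability (n_j - alpha)/(t + theta), or opens a new block with
    probability (theta + alpha*k)/(t + theta), where k is the current number
    of blocks. Requires 0 <= alpha < 1 and theta > -alpha."""
    rng = random.Random(seed)
    blocks = []
    for t in range(n):
        if not blocks:
            blocks.append(1)
            continue
        k = len(blocks)
        weights = [nj - alpha for nj in blocks] + [theta + alpha * k]
        r = rng.uniform(0.0, sum(weights))
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if j == len(blocks):
            blocks.append(1)       # new block opened
        else:
            blocks[j] += 1         # joined an existing block
    return blocks

blocks = crp_two_param(1000, alpha=0.5, theta=1.0)
print(len(blocks), "blocks;", sum(blocks), "elements")
```

For alpha > 0 the number of blocks grows like a power of n, which is the heavy-tailed behaviour that makes this family central to the asymptotic results discussed above.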
Security analyses for detecting deserialisation vulnerabilities : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand
An important task in software security is to identify potential vulnerabilities. Attackers exploit security vulnerabilities in systems to obtain confidential information, to breach system integrity, and to make systems unavailable to legitimate users. In recent years, particularly since 2012, there has been a rise in reported Java vulnerabilities. One type of vulnerability involves (de)serialisation, a commonly used feature for storing objects or data structures in an external format and restoring them. In 2015, a deserialisation vulnerability was reported involving Apache Commons Collections, a popular Java library, which affected numerous Java applications. Another major deserialisation-related vulnerability, affecting 55% of Android devices, was also reported in 2015. Both of these vulnerabilities allowed malicious users to execute arbitrary code on vulnerable systems, a serious risk, and they prompted the Java community to issue patches fixing serialisation-related vulnerabilities in both the Java Development Kit and libraries.
Despite attention to coding guidelines and defensive strategies, deserialisation remains a risky feature and a potential weakness in object-oriented applications. In fact, deserialisation-related vulnerabilities (both denial-of-service and remote code execution) continue to be reported for Java applications. Further, deserialisation is a case of parsing, where external data is converted from its external representation into a program's internal data structures; hence, similar vulnerabilities can be present in parsers for file formats and serialisation languages.
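The mechanics behind such attacks can be illustrated in Python rather than Java (a substitution made here purely for brevity): many serialisation formats let an object dictate its own reconstruction, so deserialising untrusted bytes can end up invoking an attacker-chosen callable. A deliberately harmless sketch using the standard `pickle` module:

```python
import pickle

# Illustrative sketch: pickle lets a class dictate its own reconstruction via
# __reduce__, so deserialising untrusted bytes can invoke an arbitrary
# callable. The "payload" here is harmless (str.upper), but a call such as
# os.system would be delivered the same way.

class Gadget:
    def __reduce__(self):
        # (callable, args): on load, pickle calls str.upper("pwned").
        return (str.upper, ("pwned",))

payload = pickle.dumps(Gadget())
result = pickle.loads(payload)   # no Gadget instance comes back...
print(result)                    # ...only whatever the callable returned
```

The Java attacks mentioned above follow the same pattern, chaining "gadget" classes on the classpath whose deserialisation hooks ultimately reach a sensitive method.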
The problem is, given a software package, to detect either injection or denial-of-service vulnerabilities and to propose strategies to prevent attacks that exploit them. The research reported in this thesis casts detecting deserialisation-related vulnerabilities as a program analysis task. The goal is to automatically discover this class of vulnerabilities using program analysis techniques, and to experimentally evaluate the efficiency and effectiveness of the proposed methods on real-world software. We use multiple techniques to detect reachability to sensitive methods, and taint analysis to determine whether untrusted user input can result in security violations.
Challenges in using program analysis for detecting deserialisation vulnerabilities include addressing soundness issues in analysing dynamic features in Java (e.g., native code). Another hurdle is that available techniques mostly target the analysis of applications rather than library code.
In this thesis, we develop techniques to address soundness issues related to analysing Java code that uses serialisation, and we adapt dynamic techniques such as fuzzing to address precision issues in the results of our analysis. We also use the results from our analysis to study libraries in other languages, and check if they are vulnerable to deserialisation-type attacks. We then provide a discussion on mitigation measures for engineers to protect their software against such vulnerabilities.
In our experiments, we show that we can find unreported vulnerabilities in Java code; and how these vulnerabilities are also present in widely-used serialisers for popular languages such as JavaScript, PHP and Rust. In our study, we discovered previously unknown denial-of-service security bugs in applications/libraries that parse external data formats such as YAML, PDF and SVG
Mining International Political Norms from the GDELT Database
Researchers have long been interested in the role that norms can play in
governing agent actions in multi-agent systems (MAS). Much work has been done on
formalising normative concepts from human society and adapting them for the
governance of open software systems, and on the simulation of normative
processes in human and artificial societies. However, there has been
comparatively little work on applying normative MAS mechanisms to understanding
the norms in human society.
This work investigates this issue in the context of international politics.
Using the GDELT dataset, containing machine-encoded records of international
events extracted from news reports, we extracted bilateral sequences of
inter-country events and applied a Bayesian norm mining mechanism to identify
norms that best explained the observed behaviour. A statistical evaluation
showed that the normative model fitted the data significantly better than a
probabilistic discrete event model.
Comment: 16 pages, 2 figures, pre-print for the International Workshop on
Coordination, Organizations, Institutions, Norms and Ethics for Governance of
Multi-Agent Systems (COINE), co-located with AAMAS 202
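The flavour of such a statistical comparison can be sketched as follows. This is a toy illustration, not the paper's Bayesian norm mining mechanism: the event codes are hypothetical, and each model is scored by its log-likelihood on the observed sequence.

```python
import math
from collections import Counter, defaultdict

# Toy sketch: score an observed event sequence under (a) an unconditional
# discrete event model and (b) a "normative" model conditioning each event on
# the previous one, then compare log-likelihoods.

events = ["consult", "consult", "threaten", "sanction",
          "consult", "threaten", "sanction", "consult"]

def loglik_unconditional(seq):
    """Log-likelihood under i.i.d. event frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return sum(math.log(counts[e] / n) for e in seq)

def loglik_conditional(seq):
    """Log-likelihood of transitions under a first-order conditional model."""
    trans = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        trans[prev][cur] += 1
    ll = 0.0
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur] / sum(trans[prev].values()))
    return ll

print(loglik_unconditional(events), loglik_conditional(events))
```

When behaviour is genuinely norm-governed, conditioning on context explains the data far better than the unconditional model, which is the shape of the result reported above.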
Hierarchical Bayesian Nonparametric Models for Power-Law Sequences
Sequence data that exhibits power-law behavior in its marginal and conditional distributions arises frequently from natural processes, with natural language text being a prominent example. We study probabilistic models for such sequences based on a hierarchical non-parametric Bayesian prior, develop inference and learning procedures for making these models useful in practice and applicable to large, real-world data sets, and empirically demonstrate their excellent predictive performance. In particular, we consider models based on the infinite-depth variant of the hierarchical Pitman-Yor process (HPYP) language model [Teh, 2006b] known as the Sequence Memoizer, as well as Sequence Memoizer-based cache language models and hybrid models combining the HPYP with neural language models. We empirically demonstrate that these models perform well on language modelling and data compression tasks.
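The building block of the HPYP is the Pitman-Yor predictive rule. A minimal sketch of the single-restaurant case follows; the toy seating arrangement and uniform base measure are assumptions for illustration, and in the hierarchical model the base probability is supplied by a parent restaurant rather than a fixed distribution.

```python
def py_predictive(word, counts, tables, d, theta, base_prob):
    """Predictive probability of `word` in a single Pitman-Yor restaurant:
    counts[w] = customers (observations) of w, tables[w] = tables serving w,
    d = discount, theta = concentration, base_prob = base-measure probability.
    The discount d is what produces power-law behaviour."""
    c = sum(counts.values())
    T = sum(tables.values())
    p_reuse = max(counts.get(word, 0) - d * tables.get(word, 0), 0.0) / (theta + c)
    p_base = (theta + d * T) / (theta + c) * base_prob
    return p_reuse + p_base

# Toy seating arrangement; uniform base measure over a 5-word vocabulary.
counts = {"the": 3, "cat": 1}
tables = {"the": 2, "cat": 1}
vocab = ["the", "cat", "dog", "mat", "sat"]
probs = {w: py_predictive(w, counts, tables, d=0.5, theta=1.0, base_prob=0.2)
         for w in vocab}
print(probs)
```

The two terms implement the seating rule: reuse mass discounted by d for seen words, plus escape mass routed to the base measure; the probabilities sum to one over the vocabulary.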
Security analysis of safety-critical embedded systems with a large software component by abstract interpretation
This thesis is dedicated to the analysis of low-level software, such as operating systems, by abstract interpretation. Analysing OSes is a crucial issue for guaranteeing the safety of software systems, since they are the layer immediately above the hardware and all applicative tasks rely on them. For critical applications, we want to prove that the OS does not crash, and that it ensures the isolation of programs, so that an untrusted program cannot disrupt a trusted one. The analysis of this kind of program raises specific issues, because OSes must control hardware using instructions that are meaningless in ordinary programs. In addition, because hardware features are outside the scope of C, source code includes assembly blocks mixed with C code. These are the two main axes of this thesis: handling mixed C and assembly, and precisely abstracting instructions that are specific to low-level software. This work is motivated by the analysis of a case study from an industrial partner, which required implementing the proposed methods in the static analyzer Astrée. The first part concerns the formalization of a language mixing simplified models of C and assembly, from syntax to semantics. This specification is crucial to define what is legal and what is a bug, while taking into account the intricacy of the interactions between C and assembly, in terms of both data flow and control flow. The second part is a short introduction to abstract interpretation, focusing on what is useful thereafter. The third part proposes an abstraction of the semantics of mixed C and assembly; this is in fact a series of parametric abstractions, each handling one aspect of the semantics. The fourth part addresses the abstraction of instructions specific to low-level software.
Properties of interest can easily be proven using ghost variables, but for technical reasons it is difficult to design a reduced product of abstract domains that allows a satisfactory handling of ghost variables. This part builds such a framework in full generality, together with domains that allow us to solve our problem and many others. The final part details properties that remain to be proven in order to guarantee the isolation of programs; these have not been treated, as they raise many complicated new questions. We also give some suggestions for improving the product of domains with ghost variables introduced in the previous part, in terms of both features and performance.
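The core idea of abstract interpretation can be sketched with the classic interval domain. This toy is far removed from Astrée's domains and is only an illustration: the program is executed over intervals instead of concrete values, soundly over-approximating the set of reachable states, with a join operator used where control-flow paths merge.

```python
# Toy interval domain: abstract values are ranges [lo, hi]; arithmetic and
# joins over-approximate every concrete execution they represent.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Abstract addition: sound for every pair of concrete values.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join(self, other):
        # Least upper bound: used when two control-flow paths merge.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(0, 10)
y = Interval(-3, 3)
z = x + y                # abstract addition
branch = x.join(y)       # merging two branches of an if-statement
print(z, branch)
```

An analyzer iterates such transfers over the program until a fixpoint, then checks the resulting invariants (for instance, that an index interval stays within array bounds).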
Generative Non-Markov Models for Information Extraction
Learning from unlabeled data is a long-standing challenge in machine learning. A
principled solution involves modeling the full joint distribution over inputs
and the latent structure of interest, and imputing the missing data via
marginalization. Unfortunately, such marginalization is expensive for most
non-trivial problems, which places practical limits on the expressiveness of
generative models. As a result, joint models often encode strict assumptions
about the underlying process such as fixed-order Markovian assumptions and
employ simple count-based features of the inputs. In contrast, conditional
models, which do not directly model the observed data, are free to incorporate
rich overlapping features of the input in order to predict the latent structure
of interest. It would be desirable to develop expressive generative models that
retain tractable inference. This is the topic of this thesis. In particular, we
explore joint models which relax fixed-order Markov assumptions, and investigate
the use of recurrent neural networks for automatic feature induction in the
generative process.
We focus on two structured prediction problems: (1) imputing labeled segmentations
of input character sequences, and (2) imputing directed spanning trees relating
strings in text corpora. These problems arise in many applications of practical
interest, but we are primarily concerned with named-entity recognition and
cross-document coreference resolution in this work.
For named-entity recognition, we propose a generative model in which the
observed characters originate from a latent non-Markov process over words, and
where the characters are themselves produced via a non-Markov process: a
recurrent neural network (RNN). We propose a sampler for the proposed model in
which sequential Monte Carlo is used as a transition kernel for a Gibbs sampler.
The kernel is amenable to a fast parallel implementation, and results in fast
mixing in practice.
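A minimal bootstrap particle filter conveys the flavour of sequential Monte Carlo. This toy Gaussian random-walk state space model is an assumption made for illustration; it shows the propagate-weight-resample cycle but does not reproduce the thesis's use of SMC as a transition kernel inside a Gibbs sampler.

```python
import math
import random

# Toy bootstrap particle filter for a Gaussian random-walk state space model:
# latent x_t = x_{t-1} + N(0, trans_std^2), observation y_t = x_t + N(0, obs_std^2).

def particle_filter(obs, n_particles=500, obs_std=1.0, trans_std=1.0, seed=0):
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    for y in obs:
        # Propagate each particle through the transition model.
        particles = [x + rng.gauss(0.0, trans_std) for x in particles]
        # Weight particles by the observation likelihood.
        weights = [math.exp(-0.5 * ((y - x) / obs_std) ** 2) for x in particles]
        # Multinomial resampling towards high-likelihood particles.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    # Posterior mean of the final latent state.
    return sum(particles) / n_particles

est = particle_filter([0.5, 1.0, 1.5, 2.0])
print(round(est, 2))
```

Using one sweep of such a filter as a proposal for a block of latent variables, and accepting or rejecting within a Gibbs sweep, is the general shape of SMC-based transition kernels.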
For cross-document coreference resolution, we move beyond sequence modeling to
consider string-to-string transduction. We stipulate a generative process for a
corpus of documents in which entity names arise from copying---and optionally
transforming---previous names of the same entity. Our proposed model is
sensitive to both the context in which the names occur as well as their
spelling. The string-to-string transformations correspond to systematic
linguistic processes such as abbreviation, typos, and nicknaming, and by analogy
to biology, we think of them as mutations along the edges of a phylogeny. We
propose a novel block Gibbs sampler for this problem that alternates between
sampling an ordering of the mentions and a spanning tree relating all mentions
in the corpus.
Improving object management in HPC workflows
Object management represents a substantial fraction of the total computing time in any distributed application, and it also adds complexity in terms of source code. This project proposes and implements a set of features aimed at improving both the usability and performance of distributed applications.