443 research outputs found

    Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package

    This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the social sciences, as it allows VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features such as functions for pattern mining and artificial sequence generation are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST.
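    The idea of learning a variable-length context model and using it to score whole sequences can be illustrated with a minimal Python sketch. This is not the PST package's API or its learning algorithm: the function names (`learn_contexts`, `next_prob`, `sequence_log_prob`), the count threshold `min_count`, and the additive smoothing are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def learn_contexts(sequences, max_depth=2, min_count=2):
    """Count next-symbol frequencies for every context of length 0..max_depth.

    Contexts seen fewer than min_count times are dropped (a crude stand-in
    for pruning); the empty context () is always kept.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(max_depth + 1):
                if i - depth < 0:
                    break
                context = tuple(seq[i - depth:i])
                counts[context][symbol] += 1
    return {ctx: dict(nxt) for ctx, nxt in counts.items()
            if ctx == () or sum(nxt.values()) >= min_count}


def next_prob(tree, alphabet, history, symbol, smoothing=1.0):
    """P(symbol | longest stored suffix of history), with additive smoothing."""
    history = tuple(history)
    longest = max((len(ctx) for ctx in tree), default=0)
    for depth in range(min(len(history), longest), -1, -1):
        context = history[len(history) - depth:]
        if context in tree:
            nxt = tree[context]
            total = sum(nxt.values()) + smoothing * len(alphabet)
            return (nxt.get(symbol, 0) + smoothing) / total
    # Nothing was learned at all: fall back to a uniform guess.
    return 1.0 / len(alphabet)


def sequence_log_prob(tree, alphabet, seq):
    """Log-probability of a whole sequence as a sum of conditional log-probs."""
    return sum(math.log(next_prob(tree, alphabet, seq[:i], seq[i]))
               for i in range(len(seq)))
```

    Under these assumptions, a model learned from strictly alternating sequences assigns a higher probability to another alternating sequence than to one that breaks the pattern, which is the basis for flagging typical patterns and outliers.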

    Sequential Learning and Variable Length Markov Chains

    Sequential learning is a framework created for statistical learning problems in which the sequence of states (Y_t) is dependent, and more specifically in which its dependence structure can be represented as a first-order Markov chain. It works by first taking nonsequential probability estimates P(Y_t | X_t) and then modifying these with the sequential part to produce P(Y_t | X_{1:T}). However, not all sequential models on a discrete space admit such a representation, at least not easily. As such, our first task is to extend Variable Length Markov Chains (VLMCs), which belie their name and are not Markovian, to be used in the sequential learning framework. This extension greatly broadens the scope of sequential learning, as using VLMCs permits sequential learning with far fewer assumptions about the underlying dependence of states. After developing the VLMC extension, we provide an overview of sequential learning in general and investigate the probability estimates it produces, both theoretically and with a simulation study, to assess model performance as a function of the complexity of the underlying sequential model and the quality of the initial probability estimates. Next, we apply VLMC sequential learning to the original dataset and problem that inspired sequential learning: that of scoring sleep in mice using video data. We find that VLMCs perform at the same level, tying and sometimes beating the previous best sequential method, which required many assumptions about the sequence of sleep states and a much more rigid model of sequential dependence. Finally, we turn our attention to the problem of modifying predictors when marginal class probabilities are known. This is inspired by the fact that in sequential learning problems, the marginal class distribution can vary substantially from sample to sample, in contrast to i.i.d. problems. 
We provide a general method of marginal probability reweighting, show it to be equivalent to several extant methods used on similar problems, and provide a proof that our method improves probability estimates under log loss. We conclude with simulations assessing our method as a function of loss type and classifier used.
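    One standard way to turn per-observation estimates P(Y_t | X_t) into sequence-aware estimates P(Y_t | X_{1:T}) under a first-order Markov chain is forward-backward smoothing. The Python sketch below shows that technique only; it is not claimed to be the procedure used in the work above. The function name `smooth_posteriors`, the use of the class prior as the initial state distribution, and the conversion of posteriors to scaled likelihoods by dividing by the prior are assumptions.

```python
def smooth_posteriors(frame_probs, transition, prior):
    """Forward-backward smoothing over a first-order Markov chain.

    frame_probs[t][k] : nonsequential estimate P(Y_t = k | X_t)
    transition[j][k]  : P(Y_{t+1} = k | Y_t = j)
    prior[k]          : marginal P(Y = k), also used as the initial
                        distribution (an assumption of this sketch)
    Returns a list of rows with P(Y_t = k | X_{1:T}).
    """
    T, K = len(frame_probs), len(prior)
    # Scaled likelihoods: P(X_t | Y_t = k) is proportional to
    # P(Y_t = k | X_t) / P(Y_t = k).
    lik = [[frame_probs[t][k] / prior[k] for k in range(K)] for t in range(T)]

    # Forward pass, normalized at each step for numerical stability.
    alpha, prev = [], None
    for t in range(T):
        if t == 0:
            a = [prior[k] * lik[0][k] for k in range(K)]
        else:
            a = [lik[t][k] * sum(prev[j] * transition[j][k] for j in range(K))
                 for k in range(K)]
        total = sum(a)
        prev = [v / total for v in a]
        alpha.append(prev)

    # Backward pass, also normalized.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(transition[j][k] * lik[t + 1][k] * beta[t + 1][k]
                 for k in range(K)) for j in range(K)]
        total = sum(b)
        beta[t] = [v / total for v in b]

    # Combine and normalize to get the smoothed posteriors.
    smoothed = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(K)]
        total = sum(g)
        smoothed.append([v / total for v in g])
    return smoothed
```

    With a "sticky" transition matrix, an ambiguous observation surrounded by confident ones is pulled towards its neighbours' class; when every row of the transition matrix equals the prior (no sequential dependence), the output equals the input estimates.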

    On Prediction Using Variable Order Markov Models

    This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a "decomposed" CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
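    The average log-loss used as the comparison criterion can be sketched in a few lines of Python. The two toy predictors below (`uniform_predictor` and `adaptive_order0`) are hypothetical baselines added for illustration and are not among the six algorithms the paper compares; the base-2 logarithm (bits per symbol) is likewise an assumption.

```python
import math


def average_log_loss(predict, sequence):
    """Mean of -log2 P(x_t | x_1..x_{t-1}) over the sequence (bits/symbol).

    `predict(history, symbol)` must return a strictly positive probability.
    """
    if len(sequence) == 0:
        raise ValueError("sequence must be non-empty")
    total = 0.0
    for t, symbol in enumerate(sequence):
        total += -math.log2(predict(sequence[:t], symbol))
    return total / len(sequence)


def uniform_predictor(alphabet):
    """Baseline that ignores the history entirely."""
    def predict(history, symbol):
        return 1.0 / len(alphabet)
    return predict


def adaptive_order0(alphabet):
    """Order-0 predictor: add-one smoothed symbol frequencies of the history."""
    def predict(history, symbol):
        return (history.count(symbol) + 1) / (len(history) + len(alphabet))
    return predict
```

    A lower value means better prediction. Because the loss is the per-symbol code length an ideal coder would achieve with the same probabilities, any lossless compressor induces a predictor that can be scored this way.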

    Modeling Dependencies in Natural Languages with Latent Variables

    In this thesis, we investigate the use of latent variables to model complex dependencies in natural languages. Traditional models, which have a fixed parameterization, often make strong independence assumptions that lead to poor performance. This problem is often addressed by incorporating additional dependencies into the model (e.g., using higher order N-grams for language modeling). These added dependencies can increase data sparsity and/or require expert knowledge, together with trial and error, in order to identify and incorporate the most important dependencies (as in lexicalized parsing models). Traditional models, when developed for a particular genre, domain, or language, are also often difficult to adapt to another. In contrast, previous work has shown that latent variable models, which automatically learn dependencies in a data-driven way, are able to flexibly adjust the number of parameters based on the type and the amount of training data available. We have created several different types of latent variable models for a diverse set of natural language processing applications, including novel models for part-of-speech tagging, language modeling, and machine translation, and an improved model for parsing. These models perform significantly better than traditional models. We have also created and evaluated three different methods for improving the performance of latent variable models. While these methods can be applied to any of our applications, we focus our experiments on parsing. The first method involves self-training, i.e., we train models using a combination of gold standard training data and a large amount of automatically labeled training data. We conclude from a series of experiments that the latent variable models benefit much more from self-training than conventional models, apparently due to their flexibility to adjust their model parameterization to learn more accurate models from the additional automatically labeled training data. 
The second method takes advantage of the variability among latent variable models to combine multiple models for enhanced performance. We investigate several different training protocols to combine self-training with model combination. We conclude that these two techniques are complementary to each other and can be effectively combined to train very high quality parsing models. The third method replaces the generative multinomial lexical model of latent variable grammars with a feature-rich log-linear lexical model to provide a principled solution to address data sparsity, handle out-of-vocabulary words, and exploit overlapping features during model induction. We conclude from experiments that the resulting grammars are able to effectively parse three different languages. This work contributes to natural language processing by creating flexible and effective latent variable models for several different languages. Our investigation of self-training, model combination, and log-linear models also provides insights into the effective application of these machine learning techniques to other disciplines.

    Hierarchical Bayesian Nonparametric Models for Power-Law Sequences

    Sequence data that exhibits power-law behavior in its marginal and conditional distributions arises frequently from natural processes, with natural language text being a prominent example. We study probabilistic models for such sequences based on a hierarchical non-parametric Bayesian prior, develop inference and learning procedures for making these models useful in practice and applicable to large, real-world data sets, and empirically demonstrate their excellent predictive performance. In particular, we consider models based on the infinite-depth variant of the hierarchical Pitman-Yor process (HPYP) language model [Teh, 2006b] known as the Sequence Memoizer, as well as Sequence Memoizer-based cache language models and hybrid models combining the HPYP with neural language models. We empirically demonstrate that these models perform well on language modelling and data compression tasks.

    Probabilistic Modelling of Morphologically Rich Languages

    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes. Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c

    Kernel methods in machine learning

    We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data. Comment: Published at http://dx.doi.org/10.1214/009053607000000677 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Creative Support Musical Composition System: a study on Multiple Viewpoints Representations in Variable Markov Oracle

    The mid-20th century witnessed the emergence of an area of study focused on the automatic generation of musical content by computational means. Early examples focused on offline processing of musical data; more recently, the community has moved towards interactive, real-time musical systems. Furthermore, a recent trend stresses the importance of assistive technology, which promotes a user-in-the-loop approach by offering multiple suggestions for a given creative problem. In this context, my research aims to foster new software tools for creative support systems, where algorithms can collaboratively participate in the composition flow. In greater detail, I seek a tool that learns from variable-length musical data to provide real-time feedback during the composition process. In light of the multidimensional and hierarchical structure of music, I aim to study the representations that abstract its temporal patterns, to foster the generation of multiple ranked solutions for a given musical context. Ultimately, the subjective choice is left to the user, who is given a limited number of 'optimal' solutions. A symbolic music representation in the form of Multiple Viewpoint Models, combined with the Variable Markov Oracle (VMO) automaton, is used to test the optimal interaction between the multidimensionality of the representation and the optimality of the VMO model in providing style-coherent, novel, and diverse solutions. To evaluate the system, an experiment was conducted to validate the tool in an expert-based scenario with composition students, using the creativity support index test.

    A computational framework for unsupervised analysis of everyday human activities

    In order to make computers proactive and assistive, we must enable them to perceive, learn, and predict what is happening in their surroundings. This presents us with the challenge of formalizing computational models of everyday human activities. For a majority of environments, the structure of the in situ activities is generally not known a priori. This thesis therefore investigates knowledge representations and manipulation techniques that can facilitate learning of such everyday human activities in a minimally supervised manner. A key step towards this end is finding appropriate representations for human activities. We posit that if we choose to describe activities as finite sequences of an appropriate set of events, then the global structure of these activities can be uniquely encoded using their local event sub-sequences. With this perspective at hand, we particularly investigate representations that characterize activities in terms of their fixed and variable length event subsequences. We comparatively analyze these representations in terms of their representational scope, feature cardinality and noise sensitivity. Exploiting such representations, we propose a computational framework to discover the various activity-classes taking place in an environment. We model these activity-classes as maximally similar activity-cliques in a completely connected graph of activities, and describe how to discover them efficiently. Moreover, we propose methods for finding concise characterizations of these discovered activity-classes, both from a holistic and a by-parts perspective. Using such characterizations, we present an incremental method to classify a new activity instance into one of the discovered activity-classes, and to automatically detect if it is anomalous with respect to the general characteristics of its membership class. 
Our results show the efficacy of our framework in a variety of everyday environments. Ph.D. Committee Chair: Aaron Bobick; Committee Member: Charles Isbell; Committee Member: David Hogg; Committee Member: Irfan Essa; Committee Member: James Reh