
    Context Tree Selection: A Unifying View

    The present paper investigates non-asymptotic properties of two popular procedures for context tree (or Variable Length Markov Chain) estimation: Rissanen's algorithm Context and the Penalized Maximum Likelihood criterion. After first showing how they are related, we prove finite-horizon bounds on the probability of over- and under-estimation. For overestimation, no boundedness or loss-of-memory conditions are required: the proof relies on new deviation inequalities for empirical probabilities that are of independent interest. The underestimation bounds rely on loss-of-memory and separation conditions on the process. These results improve and generalize previously obtained bounds. Context tree models were introduced by Rissanen as a parsimonious generalization of Markov models, and have since been widely used in applied probability and statistics.
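To make the penalized-likelihood idea concrete, here is a minimal sketch of BIC-style context tree estimation: count the symbols following each candidate context, then keep a context's children only when the likelihood gain from splitting exceeds a penalty. This is a toy illustration, not the paper's procedure; the penalty form and the greedy top-down pruning are simplifying assumptions.

```python
import math
import random
from collections import Counter

def count_contexts(seq, max_depth):
    """For each context up to max_depth (stored oldest-symbol-first),
    count the symbols that follow it in the sequence."""
    counts = {}
    for i in range(max_depth, len(seq)):
        for d in range(max_depth + 1):
            ctx = tuple(seq[i - d:i])
            counts.setdefault(ctx, Counter())[seq[i]] += 1
    return counts

def log_lik(counter):
    """Maximized log-likelihood of the symbols following one context."""
    n = sum(counter.values())
    return sum(c * math.log(c / n) for c in counter.values())

def bic_prune(counts, alphabet, penalty, ctx=()):
    """Keep a context's children only if the likelihood gain from
    splitting exceeds a BIC-style penalty; otherwise keep the context."""
    children = [(a,) + ctx for a in alphabet if (a,) + ctx in counts]
    if not children:
        return {ctx}
    gain = sum(log_lik(counts[c]) for c in children) - log_lik(counts[ctx])
    if gain > penalty * len(children):
        tree = set()
        for c in children:
            tree |= bic_prune(counts, alphabet, penalty, c)
        return tree
    return {ctx}

# Toy usage: an order-1 binary chain, which the pruning should cut at depth 1.
random.seed(0)
seq, s = [], 0
for _ in range(5000):
    s = 1 if random.random() < (0.9 if s == 0 else 0.1) else 0
    seq.append(s)
counts = count_contexts(seq, max_depth=3)
tree = bic_prune(counts, alphabet=(0, 1), penalty=0.5 * math.log(len(seq)))
```

With a strongly order-1 chain, the root is split (the gain is large) while deeper splits fail the penalty test, so the estimated tree consists of short contexts.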

    Identification of nonlinear time-varying systems using an online sliding-window and common model structure selection (CMSS) approach with applications to EEG

    The identification of nonlinear time-varying systems using linear-in-the-parameters models is investigated. A new, efficient Common Model Structure Selection (CMSS) algorithm is proposed to select a common model structure. The key procedure is as follows: first, generate K + 1 data sets using an online sliding-window method (the first K data sets are used for training and the (K + 1)-th for testing); then detect significant model terms to form a common model structure that fits all K training data sets using the proposed CMSS approach; finally, estimate and refine the time-varying parameters of the identified common-structured model using Recursive Least Squares (RLS) parameter estimation. The new method can effectively detect and adaptively track the transient variation of nonstationary signals. Two examples, including an application to an EEG data set, illustrate the effectiveness of the new approach.
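The final RLS step can be sketched as follows: a standard recursive least squares update with a forgetting factor, which is what lets the parameter estimates track slow time variation. The specific regressors, forgetting factor, and drift model below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rls(phi_seq, y_seq, lam=0.98, delta=100.0):
    """Recursive least squares with forgetting factor lam.
    phi_seq: (N, p) regressor matrix; y_seq: (N,) outputs.
    Returns the (N, p) trajectory of parameter estimates."""
    n, p = phi_seq.shape
    theta = np.zeros(p)
    P = delta * np.eye(p)          # large initial covariance
    history = np.empty((n, p))
    for t in range(n):
        phi = phi_seq[t]
        k = P @ phi / (lam + phi @ P @ phi)        # gain vector
        theta = theta + k * (y_seq[t] - phi @ theta)
        P = (P - np.outer(k, phi @ P)) / lam       # covariance update
        history[t] = theta
    return history

# Demo: track a coefficient that drifts slowly from 1.0 to 2.0.
np.random.seed(0)
n = 2000
x = np.random.randn(n)
a = np.linspace(1.0, 2.0, n)
y = a * x + 0.05 * np.random.randn(n)
hist = rls(x[:, None], y, lam=0.98)
```

The forgetting factor trades tracking speed against noise sensitivity: values near 1 average over a long window, smaller values follow fast variation at the cost of noisier estimates.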

    Statistical Methods of Data Integration, Model Fusion, and Heterogeneity Detection in Big Biomedical Data Analysis

    Interesting and challenging methodological questions arise from the analysis of Big Biomedical Data, where viable solutions are sought with the help of modern computational tools. In this dissertation, I look at problems in biomedical studies related to data integration, data heterogeneity, and related statistical learning algorithms. The overarching strategy throughout the dissertation is to treat individual datasets, rather than individual subjects, as the elements of focus. I therefore generalize some traditional subject-level methods so that they are tailored to the development of Big Data methodologies. Following an introductory overview in the first chapter, Chapter II concerns the development of fusion learning of model heterogeneity in data integration via a regression coefficient clustering method. The statistical learning procedure is built for generalized linear models and enforces an adjacent fusion penalty on ordered parameters (Wang et al., 2016). This is an adaptation of the fused lasso (Tibshirani et al., 2005) and an extension of the homogeneity pursuit (Ke et al., 2015), which only considers a single data set. Using this method, we can identify regression coefficient heterogeneity across sub-datasets and fuse homogeneous subsets, greatly simplifying the regression model and improving statistical power. The proposed fusion learning algorithm (published as Tang and Song (2016)) allows the integration of a large number of sub-datasets, a clear advantage over traditional methods based on stratum-covariate interactions or random effects. The method is useful for clustering treatment effects, so that outlying studies may be detected. We demonstrate it with datasets from the Panel Study of Income Dynamics and from the Early Life Exposures in Mexico to Environmental Toxicants study. The method has also been extended to the Cox proportional hazards model to handle time-to-event responses.
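The effect of an adjacent fusion penalty on ordered coefficients can be illustrated with a crude stand-in: sort the per-dataset estimates, merge neighbors whose gap falls below a threshold, and replace each cluster by its mean. This greedy sketch is not the dissertation's penalized solver; the data and threshold are invented for illustration.

```python
import numpy as np

def fuse_coefficients(beta, tol):
    """Greedy adjacent fusion: sort the estimates, merge neighbors whose
    gap is below tol, and return cluster labels plus fused values."""
    order = np.argsort(beta)
    labels = np.empty(len(beta), dtype=int)
    cluster = 0
    labels[order[0]] = 0
    for prev, cur in zip(order, order[1:]):
        if beta[cur] - beta[prev] > tol:   # gap too large: start a new cluster
            cluster += 1
        labels[cur] = cluster
    fused = np.array([beta[labels == k].mean() for k in range(cluster + 1)])
    return labels, fused

# Five sub-dataset coefficient estimates collapsing to two fused values.
beta = np.array([0.11, 0.09, 0.52, 0.48, 0.10])
labels, fused = fuse_coefficients(beta, tol=0.2)
```

The fused model has one parameter per cluster instead of one per sub-dataset, which is exactly the source of the power gain described above.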
Chapter III, under the assumption of a homogeneous generalized linear model, focuses on the development of a divide-and-combine method for extremely large data sets that may be stored on distributed file systems. Using the means of confidence distributions (Fisher, 1956; Efron, 1993), I develop a procedure to combine results from different sub-datasets, where the lasso is used to reduce model size in order to achieve numerical stability. The algorithm fits into the MapReduce paradigm and may be perfectly parallelized. To deal with the estimation bias incurred by lasso regularization, a de-biasing step is invoked so that the proposed method enjoys valid inference. The method is conceptually simple, computationally scalable and fast, as illustrated by numerical comparisons with the benchmark maximum likelihood estimator based on the full data and with other competing divide-and-combine methods. We apply the method to a large public dataset from the National Highway Traffic Safety Administration to identify risk factors for accident injury. In Chapter IV, I generalize the fusion learning algorithm of Chapter II and develop a coefficient clustering method for correlated data in the context of generalized estimating equations. The motivation for this generalization is to assess model heterogeneity in the pattern mixture modeling approach (Little, 1993), where models are stratified by missing data patterns; this is one of the primary strategies in the literature for dealing with informative missing data mechanisms. My method aims to simplify the pattern mixture model by fusing homogeneous parameters under the generalized estimating equations (GEE, Liang and Zeger (1986)) framework. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145885/1/lutang_1.pd
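The combine step of a divide-and-combine scheme can be sketched in its simplest form: an inverse-variance weighted average of per-block estimates, which is the flavor of combination that confidence-distribution methods generalize. This is background intuition only, not the chapter's actual estimator; the demo data and known-variance assumption are illustrative.

```python
import numpy as np

def combine_estimates(betas, variances):
    """Inverse-variance weighted combination of per-block estimates,
    returning the combined estimate and its variance."""
    w = 1.0 / np.asarray(variances, dtype=float)
    beta = float(np.sum(w * np.asarray(betas)) / np.sum(w))
    return beta, float(1.0 / np.sum(w))

# Demo: split one sample into 4 equal blocks and combine the block means.
# With equal block sizes (hence equal weights) this recovers the full-data
# mean exactly, and the combined variance matches the full-data variance.
np.random.seed(1)
x = np.random.randn(1000)
blocks = np.split(x, 4)
betas = [b.mean() for b in blocks]
variances = [1.0 / len(b) for b in blocks]   # assumed unit noise variance
beta_hat, var_hat = combine_estimates(betas, variances)
```

Each block can be processed on a separate machine and only (estimate, variance) pairs are shipped to the combiner, which is why the scheme fits MapReduce naturally.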

    Contributions to indoor localization: from modeling to the theoretical and practical calibration of estimators

    Foreshadowing the next big step in the field of navigation, indoor geolocation has been a very active field of research in recent years. While geolocation has entered the daily life of many individuals and professionals, particularly through assisted road navigation, the need to extend its applications indoors is increasingly pressing. However, existing systems face far greater technical constraints than those encountered outdoors, owing in particular to the chaotic propagation of electromagnetic waves in confined and inhomogeneous environments. In this manuscript, we propose a statistical approach to the problem of geolocating a mobile device inside a building using the surrounding WiFi waves. The manuscript is organized around two central questions: the determination of WiFi propagation maps inside a given building, and the construction of estimators of the mobile's positions using these maps. The statistical framework used in this thesis to address these questions is that of hidden Markov models. In a parametric setting, we propose an inference method for the online estimation of the propagation maps based on the information reported by the mobile. In a nonparametric setting, we investigate the possibility of estimating the propagation maps considered simply as regular functions on the environment to be geolocated. Our results on nonparametric estimation in hidden Markov models yield an estimator of the propagation functions whose consistency is established in a general framework. The last part of the manuscript deals with the estimation of the context tree in variable length hidden Markov models.
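The hidden-Markov-model view of localization can be sketched with a forward filter: the hidden state is the mobile's discrete position, transitions encode plausible movement, and the emission likelihood compares observed signal strengths to the propagation map. Everything below (the corridor geometry, the Gaussian RSSI model, the numbers) is an invented toy, not the thesis's model.

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log of a sum of exponentials."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else np.squeeze(s)

def forward_filter(log_trans, emis_ll, init):
    """HMM forward filtering in log space.
    log_trans: (S, S) log transition matrix; emis_ll: (T, S) per-step
    log-likelihood of the observation under each position; init: (S,) prior.
    Returns the (T, S) filtered posterior over positions."""
    alpha = np.log(init) + emis_ll[0]
    alpha -= logsumexp(alpha)
    post = [alpha]
    for t in range(1, emis_ll.shape[0]):
        pred = logsumexp(alpha[:, None] + log_trans, axis=0)  # predict
        alpha = pred + emis_ll[t]                             # update
        alpha -= logsumexp(alpha)                             # normalize
        post.append(alpha)
    return np.exp(np.array(post))

# Three candidate positions along a corridor; the mobile mostly stays put.
trans = np.array([[0.80, 0.15, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.15, 0.80]])
means = np.array([-70.0, -60.0, -50.0])   # assumed mean RSSI (dBm) per position
obs = np.array([-50.5, -49.0, -50.2])     # readings near the map value of position 2
emis_ll = -0.5 * (obs[:, None] - means[None, :]) ** 2 / 4.0   # sigma = 2 dBm
post = forward_filter(np.log(trans), emis_ll, np.ones(3) / 3)
```

The posterior concentrates on the position whose propagation-map value best explains the readings, which is the filtering half of the thesis's two questions (the other half being estimation of the map itself).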

    The written rhythms of Brazilian and Mozambican Portuguese: a comparative study using chains with variable order

    Undergraduate thesis (Trabalho de Conclusão de Curso), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Estatística, 2018. In this work, we compare the written rhythm of the Brazilian and Mozambican Portuguese variants through chains with variable length taking values in the alphabet A = {0, 1, 2, 3, 4}. To this end, we decoded non-literary texts in both variants according to the presence or absence of syllabic stress and prosodic word onset. Initially, three estimators were studied: the Context Algorithm, proposed by Rissanen (1983), a modified version of the Context Algorithm, proposed by Galves and Leonardi (2008), and the BIC estimator, found in Csiszár and Talata (2006). To determine the best estimator for the application, simulation studies were performed for the three estimators, for different penalty constants (in the case of the BIC estimator), pruning thresholds (the Context Algorithm and its modified version), sample sizes, and context structures. The BIC estimator showed the best performance in the cases studied and was therefore used in the application. In total, we obtained ten distinct context trees for the 56 analyzed texts in Brazilian and Mozambican Portuguese.

    Perfect Simulation Of Processes With Long Memory: a 'Coupling Into And From The Past' Algorithm

    We describe a new algorithm for the perfect simulation of variable length Markov chains and random systems with perfect connections. This algorithm, which generalizes Propp and Wilson's simulation scheme, is based on the idea of coupling into and from the past. It improves on existing algorithms by relaxing the conditions on the kernel and by accelerating convergence, even in the simple case of finite order Markov chains. Although chains of variable or infinite order have been investigated for decades, their use in applied probability, from information theory to bioinformatics and linguistics, has recently attracted considerable renewed interest.
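As background for the scheme this abstract generalizes, here is a minimal sketch of classical Propp-Wilson coupling from the past for a finite-state chain: chains started from every state are run with shared randomness from further and further in the past until they coalesce, at which point the common value is an exact stationary draw. The birth-death example is an invented toy.

```python
import random

def cftp(n_states, step, rng):
    """Propp-Wilson 'coupling from the past' for a finite-state chain.
    step(s, u) is a deterministic update rule driven by a uniform u.
    Returns one exact draw from the stationary distribution."""
    T = 1
    us = []                                # shared innovations; us[k] drives time -(k+1)
    while True:
        while len(us) < T:
            us.append(rng.random())
        states = list(range(n_states))     # one copy started in every state
        for t in range(T, 0, -1):          # run from time -T up to time 0
            states = [step(s, us[t - 1]) for s in states]
        if len(set(states)) == 1:          # all copies coalesced: exact sample
            return states[0]
        T *= 2                             # look further into the past, reusing us

# Demo: a birth-death walk on {0, 1, 2} that steps up with probability 0.7,
# whose stationary distribution is proportional to (9, 21, 49).
def step(s, u):
    return min(s + 1, 2) if u < 0.7 else max(s - 1, 0)

rng = random.Random(0)
samples = [cftp(3, step, rng) for _ in range(2000)]
```

The crucial detail, preserved in the generalization above, is that the same innovations are reused when the starting time is pushed further back; redrawing them would bias the output.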