    A Bayesian Approach to Graphical Record Linkage and Deduplication

    © 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.

    SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

    We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.

    Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling

    We introduce a new approach to goodness-of-fit testing in the high dimensional, sparse extended multinomial context. The paper takes a computational information geometric approach, extending classical higher order asymptotic theory. We show why the Wald – equivalently, the Pearson χ² and score statistics – are unworkable in this context, but that the deviance has a simple, accurate and tractable sampling distribution even for moderate sample sizes. Issues of uniformity of asymptotic approximations across model space are discussed. A variety of important applications and extensions are noted.

    The interplay of microscopic and mesoscopic structure in complex networks

    Not all nodes in a network are created equal. Differences and similarities exist at both individual node and group levels. Disentangling single node from group properties is crucial for network modeling and structural inference. Based on unbiased generative probabilistic exponential random graph models and employing distributive message passing techniques, we present an efficient algorithm that allows one to separate the contributions of individual nodes and groups of nodes to the network structure. This leads to improved detection accuracy of latent class structure in real-world data sets compared to models that focus on group structure alone. Furthermore, the inclusion of hitherto neglected group-specific effects in models used to assess the statistical significance of small subgraph (motif) distributions in networks may be sufficient to explain most of the observed statistics. We show the predictive power of such generative models in forecasting putative gene-disease associations in the Online Mendelian Inheritance in Man (OMIM) database. The approach is suitable for both directed and undirected uni-partite as well as for bipartite networks.

    A Semantic Reasoning Method Towards Ontological Model for Automated Learning Analysis

    Semantic reasoning can help solve the problem of regulating the evolving and static measures of knowledge at theoretical and technological levels. The technique has been proven to enhance the capability of process models by making inferences, retaining and applying what they have learned, as well as discovering new processes. The work in this paper proposes a semantic rule-based approach directed towards discovering learners' interaction patterns within a learning knowledge base, and then responding by making decisions based on adaptive rules centred on captured user profiles. The method applies semantic rules and description logic queries to build an ontology model capable of automatically computing the various learning activities within a Learning Knowledge-Base, and to check the consistency of learning object/data types. The approach is grounded on inductive and deductive logic descriptions that allow the use of a Reasoner to check that all definitions within the learning model are consistent and to recognise which concepts fit within each defined class. Inductive reasoning is practically applied in order to discover sets of inferred learner categories, while a deductive approach is used to prove and enhance the discovered rules and logic expressions. Thus, this work applies effective reasoning methods to make inferences over a Learning Process Knowledge-Base that lead to automated discovery of learning patterns/behaviour.

    Web Queries as a Source for Syndromic Surveillance

    In the field of syndromic surveillance, various sources are exploited for outbreak detection, monitoring and prediction. This paper describes a study on queries submitted to a medical web site, with influenza as a case study. The hypothesis of the work was that queries on influenza and influenza-like illness would provide a basis for the estimation of the timing of the peak and the intensity of the yearly influenza outbreaks that would be as good as the existing laboratory and sentinel surveillance. We calculated the occurrence of various queries related to influenza from search logs submitted to a Swedish medical web site for two influenza seasons. These figures were subsequently used to generate two models, one to estimate the number of laboratory verified influenza cases and one to estimate the proportion of patients with influenza-like illness reported by selected General Practitioners in Sweden. We applied an approach designed for highly correlated data, partial least squares regression. In our work, we found that certain web queries on influenza follow the same pattern as that obtained by the two other surveillance systems for influenza epidemics, and that they have equal power for the estimation of the influenza burden in society. Web queries give unique access to ill individuals who are not (yet) seeking care. This paper shows the potential of web queries as an accurate, cheap and labour extensive source for syndromic surveillance.

    Long-term declines in ADLs, IADLs, and mobility among older Medicare beneficiaries

    Background: Most prior studies have focused on short-term (≤ 2 years) functional declines. But those studies cannot address aging effects inasmuch as all participants have aged the same amount. Therefore, the authors studied the extent of long-term functional decline in older Medicare beneficiaries who were followed for varying time lengths, and the authors also identified the risk factors associated with those declines. Methods: The analytic sample included 5,871 self- or proxy-respondents who had complete baseline and follow-up survey data that could be linked to their Medicare claims for 1993-2007. Functional status was assessed using activities of daily living (ADLs), instrumental ADLs (IADLs), and mobility limitations, with declines defined as the development of two or more new difficulties. Multiple logistic regression analysis was used to focus on the associations involving respondent status, health lifestyle, continuity of care, managed care status, health shocks, and terminal drop. Results: The average amount of time between the first and final interviews was 8.0 years. Declines were observed for 36.6% on ADL abilities, 32.3% on IADL abilities, and 30.9% on mobility abilities. Functional decline was more likely to occur when proxy-reports were used, and the effects of baseline function on decline were reduced when proxy-reports were used. Engaging in vigorous physical activity consistently and substantially protected against functional decline, whereas obesity, cigarette smoking, and alcohol consumption were only associated with mobility declines. Post-baseline hospitalizations were the most robust predictors of functional decline, exhibiting a dose-response effect such that the greater the average annual number of hospital episodes, the greater the likelihood of functional status decline. Participants whose final interview preceded their death by one year or less had substantially greater odds of functional status decline. Conclusions: Both the additive and interactive (with functional status) effects of respondent status should be taken into consideration whenever proxy-reports are used. Encouraging exercise could broadly reduce the risk of functional decline across all three outcomes, although interventions encouraging weight reduction and smoking cessation would only affect mobility declines. Reducing hospitalization and re-hospitalization rates could also broadly reduce the risk of functional decline across all three outcomes.

    The Hierarchical Age-Period-Cohort model: Why does it find the results that it finds?

    It is claimed that the hierarchical-age–period–cohort (HAPC) model solves the age–period–cohort (APC) identification problem. However, this is debateable; simulations show situations where the model produces incorrect results, countered by proponents of the model arguing those simulations are not relevant to real-life scenarios. This paper moves beyond questioning whether the HAPC model works, to why it produces the results it does. We argue HAPC estimates are the result not of the distinctive substantive APC processes occurring in the dataset, but are primarily an artefact of the data structure, that is, the way the data has been collected. Were the data collected differently, the results produced would be different. This is illustrated both with simulations and real data, the latter by taking a variety of samples from the National Health Interview Survey (NHIS) data used by Reither et al. (Soc Sci Med 69(10):1439–1448, 2009) in their HAPC study of obesity. When a sample based on a small range of cohorts is taken, such that the period range is much greater than the cohort range, the results produced are very different to those produced when cohort groups span a much wider range than periods, as is structurally the case with repeated cross-sectional data. The paper also addresses the latest defence of the HAPC model by its proponents (Reither et al. in Soc Sci Med 145:125–128, 2015a). The results lend further support to the view that the HAPC model is not able to accurately discern APC effects, and should be used with caution when there appear to be period or cohort near-linear trends.