
    Generating name-like vectors for testing large-scale entity resolution

    Entity resolution (ER), the problem of identifying and linking records that belong to the same real-world entities in structured and unstructured data, is a primary task in data integration. Accurate and efficient ER has a major practical impact on applications across commercial, security and scientific domains. Recently, scalable ER techniques have received enormous attention with the increasing need to combine large-scale datasets. The shortage of training and ground-truth data impedes the development and testing of ER algorithms: good public datasets, especially those containing personal information, are restricted in this area and are usually small. Because of privacy and confidentiality concerns, testing algorithms or techniques with real datasets is challenging in ER research. Simulation is one technique for generating synthetic datasets with characteristics similar to those of real data for testing algorithms, but many existing simulation tools for ER lack support for generating large-scale data and suffer from problems of complexity, scalability, and the limitations of resampling. In our work, we propose a simple, inexpensive, and fast synthetic data generation tool. In its first stage, the tool generates only entity names, which are commonly used as identification keys in ER algorithms. We avoid detail-level simulation of entity names by using a simple vector representation that delivers simplicity and efficiency. In this paper, we discuss how to simulate simple vectors that approximate the properties of entity names, and we describe the overall construction of the tool based on data analysis of a namespace containing entity names collected from a real-world environment.
    Samudra Herath, Matthew Roughan and Gary Glonek
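
    As a rough illustration of the vector idea described above (and not the authors' tool), synthetic names can be represented as character q-gram count vectors whose lengths and q-gram frequencies are sampled to match an observed namespace. The sketch below is hypothetical in all of its names and parameters.

        # Hypothetical sketch: "name-like" vectors as q-gram count vectors whose
        # lengths and q-gram frequencies roughly match an observed namespace.
        # This illustrates the general idea only, not the authors' tool.
        import numpy as np

        rng = np.random.default_rng(0)

        def namespace_statistics(names, q=2):
            """Collect the q-gram vocabulary, q-gram frequencies and name lengths."""
            grams, lengths = {}, []
            for name in names:
                name = name.lower()
                lengths.append(len(name))
                for i in range(len(name) - q + 1):
                    g = name[i:i + q]
                    grams[g] = grams.get(g, 0) + 1
            vocab = sorted(grams)
            freq = np.array([grams[g] for g in vocab], dtype=float)
            return vocab, freq / freq.sum(), np.array(lengths)

        def sample_name_vectors(n, vocab, gram_probs, lengths, q=2):
            """Sample n q-gram count vectors that roughly mimic real names."""
            out = np.zeros((n, len(vocab)))
            for row in out:
                k = max(1, int(rng.choice(lengths)) - q + 1)  # q-grams per name
                idx = rng.choice(len(vocab), size=k, p=gram_probs)
                np.add.at(row, idx, 1)
            return out

        # toy usage on a tiny hypothetical namespace
        vocab, probs, lengths = namespace_statistics(["smith", "nguyen", "garcia", "muller"])
        print(sample_name_vectors(5, vocab, probs, lengths))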

    An experimental evaluation of a loop versus a reference design for two-channel microarrays

    Motivation: Despite theoretical arguments that so-called "loop designs" for two-channel DNA microarray experiments are more efficient, biologists continue to use "reference designs". We describe two sets of microarray experiments with RNA from two different biological systems (TPA-stimulated mammalian cells and Streptomyces coelicolor). In each case, both a loop and a reference design were performed using the same RNA preparations, with the aim of studying their relative efficiency. Results: The results of these experiments show that (1) the loop design attains a much higher precision than the reference design, and (2) multiplicative spot effects are a large source of variability; if they are not accounted for in the mathematical model, for example by taking log-ratios or including spot effects, the model will perform poorly. The first result is reinforced by a simulation study. Practical recommendations are given on how simple loop designs can be extended to more realistic experimental designs and how standard statistical methods allow the experimentalist to use and interpret the results from loop designs in practice.
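
    The second result can be made concrete with a toy simulation: a per-spot multiplicative effect enters both channels, so raw channel differences are dominated by spot-to-spot variability while log-ratios cancel the common factor. The sketch below uses made-up effect sizes and is not the paper's simulation study.

        # Toy illustration (not the paper's study): multiplicative spot effects
        # inflate raw channel differences but cancel on the log-ratio scale.
        import numpy as np

        rng = np.random.default_rng(1)
        n_spots = 1000
        true_log2_fold_change = 1.0                         # assumed treatment effect
        spot_effect = rng.lognormal(0.0, 0.8, n_spots)      # multiplicative, per spot
        base = rng.lognormal(8.0, 1.0, n_spots)             # spot-specific abundance

        # the spot effect multiplies both channels of the same spot
        red = spot_effect * base * 2 ** true_log2_fold_change * rng.lognormal(0, 0.1, n_spots)
        green = spot_effect * base * rng.lognormal(0, 0.1, n_spots)

        raw_diff = red - green                              # dominated by spot variability
        log_ratio = np.log2(red) - np.log2(green)           # spot effect cancels

        print("sd of raw difference:", round(raw_diff.std(), 1))
        print("sd of log-ratio     :", round(log_ratio.std(), 3))
        print("mean log-ratio (true value 1.0):", round(log_ratio.mean(), 3))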

    A comparison of survival models for prediction of eight-year revision risk following total knee and hip arthroplasty

    Background: There is increasing interest in the development and use of clinical prediction models, but a lack of evidence-supported guidance on the merits of different modelling approaches. This is especially true for time-to-event outcomes, where few studies have compared the many modelling approaches available. This study compares prediction accuracy and variable importance measures for four modelling approaches in prediction of time to revision surgery following total knee arthroplasty (TKA) and total hip arthroplasty (THA). Methods: The study included 321,945 TKA and 151,113 THA procedures performed between 1 January 2003 and 31 December 2017. The accuracy of the Cox model, Weibull parametric model, flexible parametric model, and random survival forest was compared, with patient age, sex, comorbidities, and prosthesis characteristics considered as predictors. Prediction accuracy was assessed using the Index of Prediction Accuracy (IPA), c-index, and smoothed calibration curves. Variable importance rankings from the Cox model and random survival forest were also compared. Results: Overall, the Cox and flexible parametric survival models performed best for prediction of both TKA revision (integrated IPA 0.056 (95% CI [0.054, 0.057]) compared to 0.054 (95% CI [0.053, 0.056]) for the Weibull parametric model) and THA revision (0.029 (95% CI [0.027, 0.030]) compared to 0.027 (95% CI [0.025, 0.028]) for the random survival forest). The c-index showed broadly similar discrimination between all modelling approaches. Models were generally well calibrated, but the random survival forest underfitted the predicted risk of TKA revision compared to the regression approaches. The most important predictors of revision were similar in the Cox model and random survival forest for TKA (age, opioid use, and patella resurfacing) and THA (femoral cement, depression, and opioid use). Conclusion: The Cox and flexible parametric models had superior overall performance, although all approaches performed similarly. Notably, this study showed no benefit of a tuned random survival forest over regression models in this setting.
    Alana R. Cuthbert, Lynne C. Giles, Gary Glonek, Lisa M. Kalisch Ellett, and Nicole L. Pratt
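
    The kind of head-to-head comparison described above can be sketched with off-the-shelf survival libraries. The code below uses synthetic data and hypothetical predictors to illustrate the workflow only (Cox model versus random survival forest, compared by apparent c-index); it is not the study's data, models, or tuning, and it assumes the lifelines and scikit-survival packages are available.

        # Minimal sketch on synthetic data: fit a Cox model and a random survival
        # forest and compare apparent c-index. Effect sizes are made up.
        import numpy as np
        import pandas as pd
        from lifelines import CoxPHFitter
        from sksurv.ensemble import RandomSurvivalForest
        from sksurv.util import Surv

        rng = np.random.default_rng(2)
        n = 2000
        X = pd.DataFrame({
            "age": rng.normal(68, 9, n),
            "female": rng.integers(0, 2, n),
            "opioid_use": rng.integers(0, 2, n),
        })
        # synthetic times to revision from an exponential model with assumed effects
        hazard = 0.01 * np.exp(0.02 * (X["age"] - 68) + 0.3 * X["opioid_use"])
        time = rng.exponential(1 / hazard)
        event = time < 8.0                       # administrative censoring at 8 years
        time = np.minimum(time, 8.0)

        # Cox proportional hazards model (lifelines)
        df = X.assign(time=time, event=event.astype(int))
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        print("Cox c-index:", round(cph.concordance_index_, 3))

        # random survival forest (scikit-survival)
        y = Surv.from_arrays(event=event, time=time)
        rsf = RandomSurvivalForest(n_estimators=100, min_samples_leaf=20, random_state=0)
        rsf.fit(X, y)
        print("RSF c-index:", round(rsf.score(X, y), 3))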

    Alignment of time course gene expression data and the classification of developmentally driven genes with hidden Markov models

    BACKGROUND: We consider data from a time course microarray experiment conducted on grapevines over the development cycle of the grape berries at two different vineyards in South Australia. Although the underlying biological process of berry development is the same at both vineyards, there are differences in the timing of development due to local conditions. We aim to align the data from the two vineyards to enable an integrated analysis of the gene expression, and to use the alignment of the expression profiles to classify likely developmental function. RESULTS: We present a novel alignment method based on hidden Markov models (HMMs) and use the method to align the motivating grapevine data. We show that our alignment method is robust against subsets of profiles that are not suitable for alignment, investigate alignment diagnostics under the model, and demonstrate the classification of developmentally driven genes. CONCLUSIONS: The classification of developmentally driven genes both validates that the alignment we obtain is meaningful and gives new evidence that can be used to identify the role of genes with unknown function. Using our alignment methodology, we find at least 1279 grapevine probe sets with no currently annotated function that are likely to be controlled in a developmental manner.
    Sean Robinson, Garique Glonek, Inge Koch, Mark Thomas, and Christopher Davies
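
    To convey what aligning two developmental time courses means, the sketch below warps one synthetic profile onto another using plain dynamic time warping. This is a deliberately simple stand-in for the alignment idea, not the hidden Markov model method developed in the paper.

        # Generic alignment illustration via dynamic time warping on synthetic
        # profiles; a stand-in for the idea of alignment, not the paper's HMM method.
        import numpy as np

        def dtw_path(a, b):
            """Return the optimal warping path between two 1-D profiles."""
            n, m = len(a), len(b)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = (a[i - 1] - b[j - 1]) ** 2
                    cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            path, i, j = [], n, m                 # trace the path back from the end
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
                if step == 0:
                    i, j = i - 1, j - 1
                elif step == 1:
                    i -= 1
                else:
                    j -= 1
            return path[::-1]

        # vineyard B develops later than vineyard A: same profile, shifted in time
        t = np.linspace(0, 1, 20)
        profile_a = np.sin(2 * np.pi * t)
        profile_b = np.sin(2 * np.pi * (t - 0.15))
        print(dtw_path(profile_a, profile_b)[:5])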

    Binary Models for Marginal Independence

    Log-linear models are a classical tool for the analysis of contingency tables. In particular, the subclass of graphical log-linear models provides a general framework for modelling conditional independences. However, with the exception of special structures, marginal independence hypotheses cannot be accommodated by these traditional models. Focusing on binary variables, we present a model class that provides a framework for modelling marginal independences in contingency tables. The approach taken is graphical and draws on analogies to multivariate Gaussian models for marginal independence. For the graphical model representation we use bi-directed graphs, which are in the tradition of path diagrams. We show how the models can be parameterized in a simple fashion, and how maximum likelihood estimation can be performed using a version of the Iterated Conditional Fitting algorithm. Finally, we consider combining these models with symmetry restrictions.
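
    As an illustrative contrast not drawn from the paper, the same missing edge encodes different independence statements in undirected and bi-directed graphs; for a three-variable path the two readings are:

        % Illustrative only: missing edge between X1 and X3 in two graph types.
        \[
        \text{undirected } X_1 - X_2 - X_3:\qquad X_1 \perp\!\!\!\perp X_3 \mid X_2,
        \]
        \[
        \text{bi-directed } X_1 \leftrightarrow X_2 \leftrightarrow X_3:\qquad X_1 \perp\!\!\!\perp X_3,
        \]
        \[
        \text{i.e.}\quad \Pr(X_1 = x_1,\, X_3 = x_3) = \Pr(X_1 = x_1)\,\Pr(X_3 = x_3)
        \quad\text{for all } x_1, x_3 \in \{0, 1\}.
        \]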

    A constrained polynomial regression procedure for estimating the local False Discovery Rate

    Background: In the context of genomic association studies, for which a large number of statistical tests are performed simultaneously, the local False Discovery Rate (lFDR), which quantifies the evidence of a specific gene association with a clinical or biological variable of interest, is a relevant criterion for taking into account the multiple testing problem. The lFDR not only allows an inference to be made for each gene through its specific value, but also provides an estimate of Benjamini-Hochberg's False Discovery Rate (FDR) for subsets of genes. Results: In the framework of estimating procedures without any distributional assumption under the alternative hypothesis, a new and efficient procedure for estimating the lFDR is described. The results of a simulation study indicated good performance for the proposed estimator in comparison with four published ones. The five procedures were applied to real datasets. Conclusion: A novel and efficient procedure for estimating the lFDR was developed and evaluated.
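
    The quantity being estimated can be written, in the standard two-group model, as lFDR(z) = pi0 * f0(z) / f(z). In the toy sketch below the mixture density f is estimated with a kernel density estimate as a crude stand-in for the paper's constrained polynomial procedure, and pi0 is simply fixed at its simulated value.

        # Toy two-group sketch of the local FDR: lfdr(z) = pi0 * f0(z) / f(z).
        # KDE replaces the paper's constrained polynomial fit; pi0 is assumed known.
        import numpy as np
        from scipy.stats import norm, gaussian_kde

        rng = np.random.default_rng(3)
        m, pi0 = 10000, 0.9
        null = rng.standard_normal(int(m * pi0))          # z-scores under H0
        alt = rng.normal(2.5, 1.0, m - int(m * pi0))      # z-scores under H1
        z = np.concatenate([null, alt])

        f = gaussian_kde(z)                               # estimated mixture density f(z)
        f0 = norm.pdf                                     # theoretical null N(0, 1)

        def lfdr(zv, pi0=pi0):
            """Local false discovery rate at z-values zv, clipped to [0, 1]."""
            return np.clip(pi0 * f0(zv) / f(zv), 0.0, 1.0)

        grid = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
        print(np.round(lfdr(grid), 3))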

    The Role of Extramembranous Cytoplasmic Termini in Assembly and Stability of the Tetrameric K+-Channel KcsA

    The membrane-active alcohol 2,2,2-trifluoroethanol has proven to be an attractive tool for investigating the intrinsic stability of integral membrane protein complexes, with the K+-channel KcsA serving as a suitable and representative ion channel. In the present study, the roles of both the cytoplasmic N and C termini in the assembly and stability of the KcsA channel were determined. The N terminus (residues 1–18) slightly increased tetramer stability via electrostatic interactions in the presence of 30 mol.% acidic phosphatidylglycerol (PG) in a phosphatidylcholine lipid bilayer. Furthermore, the N terminus was found to be potentially required for efficient channel (re)assembly. In contrast, truncation of the C terminus (residues 125–160) greatly facilitated channel reversibility from either a partially or a completely unfolded state, and this domain was substantially involved in stabilizing the tetramer in either the presence or the absence of PG in the lipid bilayer. These studies provide new insights into how extramembranous parts play their crucial roles in the assembly and stability of integral membrane protein complexes.

    Lower age at menarche affects survival in older Australian women: results from the Australian Longitudinal Study of Ageing

    Background: While menarche indicates the beginning of a woman's reproductive life, relatively little is known about the association between age at menarche and subsequent morbidity and mortality. We aimed to examine the effect of lower age at menarche on all-cause mortality in older Australian women over 15 years of follow-up. Methods: Data were drawn from the Australian Longitudinal Study of Ageing (n = 1,031 women aged 65-103 years). We estimated the hazard ratio (HR) associated with lower age at menarche using Cox proportional hazards models, adjusting for a broad range of reproductive, demographic, health and lifestyle covariates. Results: During the follow-up period, 673 women (65%) died (an average of 7.3 years (SD 4.1) of follow-up for decedents). Women with menses onset < 12 years of age (10.7%; n = 106) had an increased hazard of death over the follow-up period (adjusted HR 1.28; 95% CI 0.99-1.65) compared with women who began menstruating aged ≥ 12 years (89.3%; n = 883). However, when age at menarche was considered as a continuous variable, the adjusted HRs associated with the linear and quadratic terms for age at menarche were not statistically significant at the 5% level (linear HR 0.76; 95% CI 0.56-1.04; quadratic HR 1.01; 95% CI 1.00-1.02). Conclusion: Women with lower age at menarche may have reduced survival into old age. These results lend support to the known associations between earlier menarche and risk of metabolic disease in early adulthood. Strategies to minimise earlier menarche, such as promoting healthy weights and minimising family dysfunction during childhood, may also have positive longer-term effects on survival in later life.
    Lynne C Giles, Gary FV Glonek, Vivienne M Moore, Michael J Davies and Mary A Luszcz
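
    The two Cox model specifications compared in the Results can be written, with illustrative notation not taken from the paper (t = age at menarche, Z = adjustment covariates, u = time on study), as:

        % Illustrative notation only, not taken from the paper.
        \[
        \text{binary:}\quad h(u \mid t, Z) = h_0(u)\exp\{\beta_1 \mathbf{1}(t < 12) + \gamma^{\top} Z\},
        \]
        \[
        \text{continuous:}\quad h(u \mid t, Z) = h_0(u)\exp\{\beta_1 t + \beta_2 t^{2} + \gamma^{\top} Z\},
        \]
        % the reported hazard ratios correspond to exp(beta) for each term.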