517,932 research outputs found

    Statistical data mining for symbol associations in genomic databases

    A methodology is proposed to automatically detect significant symbol associations in genomic databases. A new statistical test is proposed to assess the significance of a group of symbols when found in several genesets of a given database. Applied to symbol pairs, the thresholded p-values of the test define a graph structure on the set of symbols. The cliques of that graph are significant symbol associations, linked to a set of genesets where they can be found. The method can be applied to any database, and is illustrated on the MSigDB C2 database. Many of the symbol associations detected in C2 or in non-specific selections corresponded to already known interactions. On more specific selections of C2, many previously unknown symbol associations have been detected. These associations reveal new candidates for gene or protein interactions, needing further investigation for biological evidence.
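The pipeline described in this abstract (pairwise test, thresholded graph, cliques) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not specify the new test, so an upper-tail hypergeometric co-occurrence test is swapped in, and the function names, the toy genesets, and the threshold `alpha = 0.05` are all hypothetical.

```python
from itertools import combinations
from math import comb


def pair_pvalue(n_sets, k_a, k_b, k_ab):
    """Upper-tail hypergeometric p-value for two symbols sharing >= k_ab genesets.

    ASSUMPTION: a stand-in for the paper's test, which the abstract does not define.
    """
    total = comb(n_sets, k_b)
    upper = min(k_a, k_b)
    tail = sum(comb(k_a, i) * comb(n_sets - k_a, k_b - i)
               for i in range(k_ab, upper + 1))
    return tail / total


def association_graph(genesets, alpha):
    """Link two symbols when their co-occurrence p-value is at most alpha."""
    symbols = sorted(set().union(*genesets.values()))
    # For each symbol, the names of the genesets that contain it.
    member = {s: {name for name, g in genesets.items() if s in g}
              for s in symbols}
    graph = {s: set() for s in symbols}
    for a, b in combinations(symbols, 2):
        shared = member[a] & member[b]
        p = pair_pvalue(len(genesets), len(member[a]), len(member[b]), len(shared))
        if p <= alpha:
            graph[a].add(b)
            graph[b].add(a)
    return graph, member


def maximal_cliques(graph):
    """Enumerate maximal cliques of size >= 2 (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) > 1:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical toy database: three symbols always appear together.
GENESETS = {
    "s1": {"TP53", "MDM2", "CDKN1A"},
    "s2": {"TP53", "MDM2", "CDKN1A"},
    "s3": {"TP53", "MDM2", "CDKN1A"},
    "s4": {"TP53", "MDM2", "CDKN1A"},
    "s5": {"GAPDH", "ACTB"},
    "s6": {"GAPDH"},
    "s7": {"ACTB"},
    "s8": {"TUBB"},
}

graph, member = association_graph(GENESETS, alpha=0.05)
cliques = maximal_cliques(graph)
for clique in cliques:
    # Genesets in which every symbol of the clique is found.
    support = set.intersection(*(member[s] for s in clique))
    print(sorted(clique), sorted(support))
```

Each clique is reported with the genesets that contain all of its symbols, mirroring the abstract's statement that associations are linked to the genesets where they can be found. A real analysis would also need a multiple-testing correction before thresholding, which this sketch omits.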

    Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation

    The growing expanse of e-commerce and the widespread availability of online databases raise many fears regarding loss of privacy and many statistical challenges. Even with encryption and other nominal forms of protection for individual databases, we still need to protect against the violation of privacy through linkages across multiple databases. These issues parallel those that have arisen and received some attention in the context of homeland security. Following the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. We present an overview of some proposals that have surfaced for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. We also explore their link to the related literature on privacy-preserving data mining. In particular, we focus on the matching problem across databases and the concept of "selective revelation" and their confidentiality implications. Comment: Published at http://dx.doi.org/10.1214/088342306000000240 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Catalog of Radio Galaxies with z>0.3. I:Construction of the Sample

    The procedure for constructing a sample of distant (z>0.3) radio galaxies using the NED, SDSS, and CATS databases for further application in statistical tests is described. The sample is assumed to be cleaned of objects with quasar properties. Primary statistical analysis of the list is performed and the regression dependence of the spectral index on redshift is found. Comment: 9 pages, 6 figures, 2 tables

    From the principles of genomic data sharing to the practices of data access committees

    Sharing genomic research data through controlled-access databases has increased in recent years. Policymakers and funding organizations endorse genomic data sharing in order to optimize the use of public funds and to increase the statistical power of databases. Well-established data access arrangements and data access committees (DACs), responsible for reviewing and managing requests for access to genomic databases, are therefore central for implementing the policies and principles of data sharing. This article aims to investigate the functionality of DACs through the perspective of existing practices.

    Rule-Based and Case-Based Reasoning in Housing Prices

    People reason about real-estate prices both in terms of general rules and in terms of analogies to similar cases. We propose to empirically test which mode of reasoning fits the data better. To this end, we develop the statistical techniques required for the estimation of the case-based model. It is hypothesized that case-based reasoning will have relatively more explanatory power in databases of rental apartments, whereas rule-based reasoning will have a relative advantage in sales data. We motivate this hypothesis on theoretical grounds, and find empirical support for it by comparing the two statistical techniques (rule-based and case-based) on two databases (rentals and sales).
    Keywords: housing, similarity, regression, case-based reasoning, rule-based reasoning

    Caching and Distributing Statistical Analyses in R

    We present the cacher package for R, which provides tools for caching statistical analyses and for distributing these analyses to others in an efficient manner. The cacher package takes objects created by evaluating R expressions and stores them in key-value databases. These databases of cached objects can subsequently be assembled into packages for distribution over the web. The cacher package also provides tools to help readers examine the data and code in a statistical analysis and reproduce, modify, or improve upon the results. In addition, readers can easily conduct alternate analyses of the data. We describe the design and implementation of the cacher package and provide two examples of how the package can be used for reproducible research.
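The idea of storing the objects created by each evaluated expression in a key-value database can be illustrated with a short Python sketch. It does not reproduce the cacher package's API or storage format: the function `run_cached`, the SHA-1 key derived from the statement text, and the plain dictionary standing in for the database are all assumptions made for illustration.

```python
import hashlib
import pickle


def run_cached(source, env, db):
    """Run one statement, or restore the objects it created from `db`.

    Returns True on a cache hit. `db` is any dict-like key-value store.
    ASSUMPTION: the key depends only on the statement text, so a change in an
    upstream statement does not invalidate later entries in this sketch.
    """
    key = hashlib.sha1(source.encode("utf-8")).hexdigest()
    if key in db:
        # Cache hit: load the stored objects instead of re-evaluating.
        env.update(pickle.loads(db[key]))
        return True
    before = dict(env)
    exec(source, env)
    # Keep only names the statement created or rebound.
    created = {name: value for name, value in env.items()
               if name != "__builtins__"
               and (name not in before or before[name] is not value)}
    db[key] = pickle.dumps(created)
    return False


# Hypothetical two-step analysis, run twice against the same store.
db = {}
analysis = [
    "data = [2.0, 4.0, 9.0]",
    "mean = sum(data) / len(data)",
]

first_env = {}
first_hits = [run_cached(stmt, first_env, db) for stmt in analysis]

second_env = {}
second_hits = [run_cached(stmt, second_env, db) for stmt in analysis]
```

On the second pass every statement is served from the store, so a reader who receives only `db` can inspect the results without rerunning the computation. Replacing the dictionary with a persistent store such as `shelve` would make the cache shareable as a file.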