221 research outputs found

    Encoding databases satisfying a given set of dependencies

    Consider a relation schema with a set of dependency constraints. A fundamental question is what is the minimum space in which the possible instances of the schema can be "stored". We study the following model. Encode the instances by giving a function which maps the set of possible instances into the set of words of a given length over the binary alphabet in a decodable way. The problem is to find the minimum length needed. This minimum is called the information content of the database. We investigate several cases where the set of dependency constraints consists of relatively simple sets of functional or multivalued dependencies. We also consider the following natural extension: is it possible to encode the instances in such a way that a small change in the instance causes only a small change in the code? © 2012 Springer-Verlag
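As a rough illustration of the model described in this abstract: a decodable (injective) map from N possible instances into binary words of a fixed length needs at least ceil(log2 N) bits. The sketch below counts instances by brute force for a toy two-attribute schema with a single functional dependency. The function names, the tiny finite domains, and the choice to count the empty relation as an instance are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import combinations, product


def satisfies_fd(instance, lhs, rhs):
    """Return True if rows that agree on the lhs columns also agree on the rhs columns."""
    seen = {}
    for row in instance:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        # setdefault stores val the first time a key is seen, then returns the stored value
        if seen.setdefault(key, val) != val:
            return False
    return True


def count_instances(domains, lhs, rhs):
    """Count all sets of rows over the given finite domains that satisfy lhs -> rhs.

    Brute force over every subset of the full Cartesian product, so only
    usable for very small domains. The empty relation is counted (assumption).
    """
    rows = list(product(*domains))
    count = 0
    for k in range(len(rows) + 1):
        for instance in combinations(rows, k):
            if satisfies_fd(instance, lhs, rhs):
                count += 1
    return count


def information_content(n_instances):
    """Smallest word length L with 2**L >= n_instances, i.e. ceil(log2 n)."""
    if n_instances < 1:
        raise ValueError("need at least one instance")
    return (n_instances - 1).bit_length()


if __name__ == "__main__":
    domains = [range(2), range(3)]  # hypothetical domains for attributes A and B
    with_fd = count_instances(domains, lhs=(0,), rhs=(1,))  # A -> B
    without_fd = count_instances(domains, lhs=(), rhs=())   # no constraint
    print(with_fd, information_content(with_fd))
    print(without_fd, information_content(without_fd))
```

In this toy case the dependency A -> B leaves four choices per A-value (no row, or one of three B-values), so the constraint reduces the number of instances and therefore the number of bits needed.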

    IPAD: Stable Interpretable Forecasting with Knockoffs Inference

    Interpretability and stability are two important features that are desired in many contemporary big data applications arising in economics and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter in the sense of controlling the fraction of wrongly discovered features which can enhance greatly the interpretability is still largely underdeveloped in the econometric settings. To this end, in this paper we exploit the general framework of model-X knockoffs introduced recently in Candès, Fan, Janson and Lv (2018), which is nonconventional for reproducible large-scale inference in that the framework is completely free of the use of p-values for significance testing, and suggest a new method of intertwined probabilistic factors decoupling (IPAD) for stable interpretable forecasting with knockoffs inference in high-dimensional models. The recipe of the method is constructing the knockoff variables by assuming a latent factor model that is exploited widely in economics and finance for the association structure of covariates. Our method and work are distinct from the existing literature in that we estimate the covariate distribution from data instead of assuming that it is known when constructing the knockoff variables, our procedure does not require any sample splitting, we provide theoretical justifications on the asymptotic false discovery rate control, and the theory for the power analysis is also established. Several simulation examples and the real data analysis further demonstrate that the newly suggested method has appealing finite-sample performance with desired interpretability and stability compared to some popularly used forecasting methods

    GeneBins: a database for classifying gene expression data, with application to plant genome arrays

    BACKGROUND: To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms. RESULTS: We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on KEGG ontology. The GeneBins database currently supports the functional classification of expression data from four Affymetrix arrays: Arabidopsis thaliana, Oryza sativa, Glycine max and Medicago truncatula. An online analysis tool to identify relevant functions is also provided. CONCLUSION: GeneBins provides resources to interpret gene expression results from microarray experiments. It is available a

    Investigating the Correlation between Performance Scores and Energy Consumption of Mobile Web Apps

    Context. Developers have access to tools like Google Lighthouse to assess the performance of web apps and to guide the adoption of development best practices. However, when it comes to energy consumption of mobile web apps, these tools seem to be lacking. Goal. This study investigates the correlation between the performance scores produced by Lighthouse and the energy consumption of mobile web apps. Method. We design and conduct an empirical experiment where 21 real mobile web apps are (i) analyzed via the Lighthouse performance analysis tool and (ii) measured on an Android device running a software-based energy profiler. Then, we statistically assess how energy consumption correlates with the obtained performance scores and carry out an effect size estimation. Results. We discover a statistically significant negative correlation between performance scores and the energy consumption of mobile web apps (with medium to large effect sizes), implying that an increase in the performance score tends to lead to a decrease in energy consumption. Conclusions. We recommend that developers strive to improve the performance level of their mobile web apps, as this can also have a positive impact on their energy consumption on Android devices
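To make the kind of analysis described in this abstract concrete, here is a minimal sketch of a rank correlation between performance scores and energy readings. The abstract does not say which correlation statistic the authors used; Spearman's rho is an assumption here, the tie-free formula is a simplification, and the numbers in the usage example are made-up placeholders rather than measurements from the study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation using the tie-free formula 1 - 6*sum(d^2)/(n*(n^2-1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this simplified formula assumes no tied values")
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Placeholder values for illustration only, not data from the paper.
    performance_scores = [35, 50, 62, 78, 90]
    energy_joules = [41.0, 39.5, 36.2, 30.1, 28.7]
    print(spearman_rho(performance_scores, energy_joules))
```

A value near -1 would correspond to the direction reported in the abstract: higher performance scores going together with lower energy consumption. A real analysis would also need a significance test and a method that handles ties.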

    The Impacts of Reduced Access to Abortion and Family Planning Services: Evidence from Texas

    Between 2011 and 2014, Texas enacted three pieces of legislation that significantly reduced funding for family planning services and increased restrictions on abortion clinic operations. Together this legislation creates cross-county variation in access to abortion and family planning services, which we leverage to understand the impact of family planning and abortion clinic access on abortions, births, and contraceptive purchases. In-state abortions fell 20% and births rose 3% in counties that no longer had an abortion provider within 50 miles. Births increased 1% and contraceptive purchases rose 8% in counties without a publicly-funded family planning clinic within 25 miles

    The distribution of genetic diversity in a Brassica oleracea gene bank collection related to the effects on diversity of regeneration, as measured with AFLPs

    The ex situ conservation of plant genetic resources in gene banks involves the selection of accessions to be conserved and the maintenance of these accessions for current and future users. Decisions concerning both these issues require knowledge about the distribution of genetic diversity within and between accessions sampled from the gene pool, but also about the changes in variation of these samples as a result of regenerations. These issues were studied in an existing gene bank collection of a cross-pollinating crop using a selection of groups of very similar Dutch white cabbage accessions, and additional groups of reference material representing the Dutch and the global white cabbage gene pool. Six accessions were sampled both before and after a standard regeneration. Thirty plants of each of 50 accessions plus 6 regeneration populations included in the study were characterised with AFLPs, using scores for 103 polymorphic bands. It was shown that the genetic changes as a result of standard gene bank regenerations, as measured by AFLPs, are of a magnitude comparable to the differences between some of the more similar accessions. The observed changes are mainly due to highly significant changes in allele frequencies for a few fragments, whereas for the majority of fragments the alleles occur in similar frequencies before and after regeneration. It is argued that, given the changes of accessions over generations, accessions that display similar levels of differentiation may be combined safely

    Semi-supervised discovery of differential genes

    BACKGROUND: Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies to detect clinical marker genes for example, the conditions have not necessarily been controlled well, thus condition labels are sometimes hard to obtain due to physical, financial, and time costs. In such a situation, we can consider an unsupervised case where labels are not available or a semi-supervised case where labels are available for a part of the whole sample set, rather than a well-studied supervised case where all samples have their labels. RESULTS: We assume a latent variable model for the expression of active genes and apply the optimal discovery procedure (ODP) proposed by Storey (2005) to the model. Our latent variable model allows gene significance scores to be applied to unsupervised and semi-supervised cases. The ODP framework improves detectability by sharing the estimated parameters of null and alternative models of multiple tests over multiple genes. A theoretical consideration leads to two different interpretations of the latent variable, i.e., it only implicitly affects the alternative model through the model parameters, or it is explicitly included in the alternative model, so that the interpretations correspond to two different implementations of ODP. By comparing the two implementations through experiments with simulation data, we have found that sharing the latent variable estimation is effective for increasing the detectability of truly active genes. We also show that the unsupervised and semi-supervised rating of genes, which takes into account the samples without condition labels, can improve detection of active genes in real gene discovery problems. 
    CONCLUSION: The experimental results indicate that the ODP framework is effective for hypotheses including latent variables and is further improved by sharing the estimations of hidden variables over multiple tests

    A Platform for Processing Expression of Short Time Series (PESTS)

    BACKGROUND: Time course microarray profiles examine the expression of genes over a time domain. They are necessary in order to determine the complete set of genes that are dynamically expressed under given conditions, and to determine the interaction between these genes. Because of cost and resource issues, most time series datasets contain fewer than 9 points and there are few tools available geared towards the analysis of this type of data. RESULTS: To this end, we introduce a platform for Processing Expression of Short Time Series (PESTS). It was designed with a focus on usability and interpretability of analyses for the researcher. As such, it implements several standard techniques for comparability as well as visualization functions. However, it is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points. CONCLUSIONS: PESTS is fully generalizable to other types of time series analyses. PESTS implements novel methods as well as several standard techniques for comparability and visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit. PESTS is available to download for free to academic and non-profit users at http://www.mailman.columbia.edu/academic-departments/biostatistics/research-service/software-development.