110 research outputs found

    How to Host a Data Competition: Statistical Advice for Design and Analysis of a Data Competition

    Data competitions rely on real-time leaderboards to rank competitor entries and stimulate algorithm improvement. While such competitions have become quite popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the underlying problem of interest to the host. This paper outlines some important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition based on our experience. It also describes a post-competition analysis that enables robust and efficient assessment of the strengths and weaknesses of solutions from different competitors, as well as greater understanding of the regions of the input space that are well-solved. The post-competition analysis, which complements the leaderboard, uses exploratory data analysis and generalized linear models (GLMs). The GLMs not only expand the range of results we can explore, they also provide more detailed analysis of individual sub-questions including similarities and differences between algorithms across different types of scenarios, universally easy or hard regions of the input space, and different learning objectives. When coupled with a strategically planned data generation approach, the methods provide richer and more informative summaries to enhance the interpretation of results beyond just the rankings on the leaderboard. The methods are illustrated with a recently completed competition to evaluate algorithms capable of detecting, identifying, and locating radioactive materials in an urban environment. Comment: 36 pages

    Binary Interval Search (BITS): A Scalable Algorithm for Counting Interval Intersections

    Motivation: The comparison of diverse genomic datasets is fundamental to understanding genome biology. Researchers must explore many large datasets of genome intervals (e.g., genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect: that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. Results: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures such as Graphics Processing Units (GPUs) by illustrating its utility for efficient Monte-Carlo simulations measuring the significance of relationships between sets of genomic intervals
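
The abstract above does not spell out how interval intersections are counted. As a rough illustration only, the following Python sketch shows one way to count intersections with two binary searches per query: every database interval overlaps the query unless it ends before the query starts or starts after the query ends. The function name `count_intersections`, the closed-interval convention, and the sample data are assumptions made for this example; they are not taken from the paper and this is not the authors' implementation.

```python
from bisect import bisect_left, bisect_right


def count_intersections(queries, database):
    """Count overlapping (query, database) pairs of closed intervals.

    Each interval is a (start, end) tuple with start <= end. Hypothetical
    helper for illustration; not the published BITS code.
    """
    # Sort starts and ends independently; only counts are needed, so the
    # pairing between a start and its end can be discarded.
    starts = sorted(start for start, _ in database)
    ends = sorted(end for _, end in database)

    total = 0
    for q_start, q_end in queries:
        # Database intervals ending strictly before the query starts.
        ends_before = bisect_left(ends, q_start)
        # Database intervals starting strictly after the query ends.
        starts_after = len(starts) - bisect_right(starts, q_end)
        # The two excluded groups are disjoint, so the rest must overlap.
        total += len(database) - ends_before - starts_after
    return total


if __name__ == "__main__":
    db = [(1, 5), (4, 8), (10, 12)]
    qs = [(5, 10), (0, 0), (13, 20)]
    print(count_intersections(qs, db))
```

Because each query needs only two independent binary searches over pre-sorted arrays, queries do not depend on one another, which is the kind of property that makes a counting scheme like this easy to run in parallel.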

    Rational Design of Temperature-Sensitive Alleles Using Computational Structure Prediction

    Temperature-sensitive (ts) mutations are mutations that exhibit a mutant phenotype at high or low temperatures and a wild-type phenotype at normal temperature. Temperature-sensitive mutants are valuable tools for geneticists, particularly in the study of essential genes. However, finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in silico method that uses Rosetta and machine learning techniques to predict a highly accurate “top 5” list of ts mutations given the structure of a protein of interest. Rosetta is a protein structure prediction and design code, used here to model and score how proteins accommodate point mutations with side-chain and backbone movements. We show that integrating Rosetta relax-derived features with sequence-based features results in accurate temperature-sensitive mutation predictions

    A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE

    We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms

    A Novel Protein LZTFL1 Regulates Ciliary Trafficking of the BBSome and Smoothened

    Many signaling proteins including G protein-coupled receptors localize to primary cilia, regulating cellular processes including differentiation, proliferation, organogenesis, and tumorigenesis. Bardet-Biedl Syndrome (BBS) proteins are involved in maintaining ciliary function by mediating protein trafficking to the cilia. However, the mechanisms governing ciliary trafficking by BBS proteins are not well understood. Here, we show that a novel protein, Leucine-zipper transcription factor-like 1 (LZTFL1), interacts with a BBS protein complex known as the BBSome and regulates ciliary trafficking of this complex. We also show that all BBSome subunits and BBS3 (also known as ARL6) are required for BBSome ciliary entry and that reduction of LZTFL1 restores BBSome trafficking to cilia in BBS3 and BBS5 depleted cells. Finally, we found that BBS proteins and LZTFL1 regulate ciliary trafficking of hedgehog signal transducer, Smoothened. Our findings suggest that LZTFL1 is an important regulator of BBSome ciliary trafficking and hedgehog signaling

    Systematic Evaluation of Factors Influencing ChIP-Seq Fidelity

    We performed a systematic evaluation of how variations in sequencing depth and other parameters influence interpretation of Chromatin immunoprecipitation (ChIP) followed by sequencing (ChIP-seq) experiments. Using Drosophila S2 cells, we generated ChIP-seq datasets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected and had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP library complexity at high coverage. The removal of reads originating at the same base reduced false positives while having little effect on detection sensitivity. Even at a depth of ~1 read/bp coverage of mappable genome, ~1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely-used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle datasets with deep coverage

    Functional similarities between pigeon ‘milk’ and mammalian milk: induction of immune gene expression and modification of the microbiota

    Pigeon ‘milk’ and mammalian milk have functional similarities in terms of nutritional benefit and delivery of immunoglobulins to the young. Mammalian milk has been clearly shown to aid in the development of the immune system and microbiota of the young, but similar effects have not yet been attributed to pigeon ‘milk’. Therefore, using a chicken model, we investigated the effect of pigeon ‘milk’ on immune gene expression in the Gut Associated Lymphoid Tissue (GALT) and on the composition of the caecal microbiota. Chickens fed pigeon ‘milk’ had a faster rate of growth and a better feed conversion ratio than control chickens. There was significantly enhanced expression of immune-related gene pathways and interferon-stimulated genes in the GALT of pigeon ‘milk’-fed chickens. These pathways include the innate immune response, regulation of cytokine production and regulation of B cell activation and proliferation. The caecal microbiota of pigeon ‘milk’-fed chickens was significantly more diverse than control chickens, and appears to be affected by prebiotics in pigeon ‘milk’, as well as being directly seeded by bacteria present in pigeon ‘milk’. Our results demonstrate that pigeon ‘milk’ has further modes of action which make it functionally similar to mammalian milk. We hypothesise that pigeon ‘lactation’ and mammalian lactation evolved independently but resulted in similarly functional products

    Mapping H4K20me3 onto the chromatin landscape of senescent cells indicates a function in control of cell senescence and tumor suppression through preservation of genetic and epigenetic stability

    Background: Histone modification H4K20me3 and its methyltransferase SUV420H2 have been implicated in suppression of tumorigenesis. The underlying mechanism is unclear, although H4K20me3 abundance increases during cellular senescence, a stable proliferation arrest and tumor suppressor process, triggered by diverse molecular cues, including activated oncogenes. Here, we investigate the function of H4K20me3 in senescence and tumor suppression. Results: Using immunofluorescence and ChIP-seq we determine the distribution of H4K20me3 in proliferating and senescent human cells. Altered H4K20me3 in senescence is coupled to H4K16ac and DNA methylation changes in senescence. In senescent cells, H4K20me3 is especially enriched at DNA sequences contained within specialized domains of senescence-associated heterochromatin foci (SAHF), as well as specific families of non-genic and genic repeats. Altered H4K20me3 does not correlate strongly with changes in gene expression between proliferating and senescent cells; however, in senescent cells, but not proliferating cells, H4K20me3 enrichment at gene bodies correlates inversely with gene expression, reflecting de novo accumulation of H4K20me3 at repressed genes in senescent cells, including at genes also repressed in proliferating cells. Although elevated SUV420H2 upregulates H4K20me3, this does not accelerate senescence of primary human cells. However, elevated SUV420H2/H4K20me3 reinforces oncogene-induced senescence-associated proliferation arrest and slows tumorigenesis in vivo. Conclusions: These results corroborate a role for chromatin in underpinning the senescence phenotype but do not support a major role for H4K20me3 in initiation of senescence. Rather, we speculate that H4K20me3 plays a role in heterochromatinization and stabilization of the epigenome and genome of pre-malignant, oncogene-expressing senescent cells, thereby suppressing epigenetic and genetic instability and contributing to long-term senescence-mediated tumor suppression