59 research outputs found

    Data Similarity is Not Enough to Explain Language Model Performance

    Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even with each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.

    A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

    Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate that there is no one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.

    Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making

    Machine learning (ML) is increasingly being used in image retrieval systems for medical decision-making. One application of ML is to retrieve visually similar medical images from past patients (e.g., tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.

    Development of a subcutaneous ear implant to deliver an anaplasmosis vaccine to dairy steers

    Bovine anaplasmosis is the most prevalent tick-transmitted disease of cattle worldwide and a major obstacle to profitable beef production. Use of chlortetracycline-medicated feed to control active anaplasmosis infections during the vector season has raised concerns about the potential emergence of antimicrobial resistance in bacteria that may pose a risk to human health. Furthermore, the absence of effectiveness data for a commercially available, conditionally licensed anaplasmosis vaccine is a major impediment to implementing anaplasmosis control programs. The primary objective of this study was to develop a single-dose vaccine delivery platform to produce long-lasting protective immunity against anaplasmosis infections. Twelve Holstein steers, aged 11-12 weeks, were administered a novel 3-stage, single-dose vaccine against Anaplasma marginale (Am) major surface protein 1a. The vaccine consisted of a soluble vaccine administered subcutaneously (s.c.) for immune priming, a vaccine depot of a biodegradable polyanhydride rod with intermediate slow release of the vaccine for boosting immune response, and an immune-isolated vaccine platform for extended antigen release (VPEAR implant) deposited s.c. in the ear. Six calves were randomly assigned to two vaccine constructs (n=3) that featured rods and implants containing a combination of two different adjuvants, diethylaminoethyl (DEAE)-Dextran and Quil-A (Group A). The remaining 6 calves were randomly assigned to two vaccine constructs (n=3) that featured rods and implants containing the same adjuvant (either DEAE-Dextran or Quil-A) (Group B). Twenty-one months post-implantation, calves were challenged intravenously with Am stabilate and were monitored weekly for signs of fever, decreased packed cell volume (PCV) and bacteremia. Data were analyzed using a mixed effects model and chi-squared tests (SAS v9.04.01, SAS Institute, Cary, NC). Calves in Group A had higher PCV than calves in Group B (P = 0.006) at day 35 post-infection. Calves in Group A were less likely to require antibiotic intervention compared with calves in Group B (P = 0.014). Results indicate that calves exhibited diminished clinical signs of anaplasmosis when antigen was delivered with a combination of adjuvants as opposed to a single adjuvant. This demonstrates the feasibility of providing long-lasting protection against clinical bovine anaplasmosis infections using a subcutaneous ear implant vaccine construct.

    High throughput analysis of epistasis in genome-wide association studies with BiForce

    Motivation: Gene–gene interactions (epistasis) are thought to be important in shaping complex traits, but they have been under-explored in genome-wide association studies (GWAS) due to the computational challenge of enumerating billions of single nucleotide polymorphism (SNP) combinations. Fast screening tools are needed to make epistasis analysis routinely available in GWAS. Results: We present BiForce to support high-throughput analysis of epistasis in GWAS for either quantitative or binary disease (case–control) traits. BiForce achieves great computational efficiency by using memory-efficient data structures, Boolean bitwise operations and multithreaded parallelization. It performs a full pair-wise genome scan to detect interactions involving SNPs with or without significant marginal effects using appropriate Bonferroni-corrected significance thresholds. We show that BiForce is more powerful and significantly faster than published tools for both binary and quantitative traits in a series of performance tests on simulated and real datasets. We demonstrate BiForce in analysing eight metabolic traits in a GWAS cohort (323 697 SNPs, >4500 individuals) and two disease traits in another (>340 000 SNPs, >1750 cases and 1500 controls) on a 32-node computing cluster. BiForce completed analyses of the eight metabolic traits within 1 day, identified nine epistatic pairs of SNPs in five metabolic traits and 18 SNP pairs in two disease traits. BiForce can make the analysis of epistasis a routine exercise in GWAS and thus improve our understanding of the role of epistasis in the genetic regulation of complex traits. Availability and implementation: The software is free and can be downloaded from http://bioinfo.utu.fi/BiForce/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
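    The scale of a full pair-wise scan, and the kind of bitwise counting the abstract alludes to, can be illustrated with a minimal Python sketch. This is not BiForce's implementation: the function names, the 0/1/2 genotype coding, the bitmask layout and the 0.05 family-wise error rate are all assumptions made for illustration. The only figure taken from the abstract is the SNP count (323 697).

```python
def pairwise_bonferroni(n_snps, alpha=0.05):
    """Number of unordered SNP pairs and the Bonferroni per-test threshold.

    alpha=0.05 is an assumed family-wise error rate, not a value from the abstract.
    """
    n_pairs = n_snps * (n_snps - 1) // 2
    return n_pairs, alpha / n_pairs


def genotype_masks(genotypes):
    """Encode one SNP as three integer bitmasks, one per genotype class (0, 1, 2).

    Bit i of masks[g] is set when individual i carries genotype g.
    (Hypothetical encoding chosen for this sketch.)
    """
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def joint_counts(snp_a, snp_b):
    """3x3 contingency table of genotype combinations for two SNPs.

    Each cell is the popcount of the bitwise AND of two genotype masks,
    so no per-individual loop is needed once the masks are built.
    """
    masks_a, masks_b = genotype_masks(snp_a), genotype_masks(snp_b)
    return [[bin(ma & mb).count("1") for mb in masks_b] for ma in masks_a]


if __name__ == "__main__":
    n_pairs, threshold = pairwise_bonferroni(323_697)
    print(f"pairs tested: {n_pairs:,}")
    print(f"per-test threshold: {threshold:.3e}")
    print(joint_counts([0, 1, 2, 1], [0, 1, 1, 1]))
```

    With tens of billions of pairs to test, the per-pair cost dominates, which is why a counting step built on bitwise AND and popcount is attractive in this setting.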

    Bioinformatics challenges for genome-wide association studies

    Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time, thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.

    Epitaxial Growth and Processing of Compound Semiconductors

    Contains an introduction and reports on six research projects. Sponsors: Defense Advanced Research Projects Agency/U.S. Navy - Office of Naval Research University Research Initiative Subcontract N00014-92-J-1893; Joint Services Electronics Program Grant DAAH04-95-1-0038; National Center for Integrated Photonics Technology Contract 542-381; National Science Foundation Grant DMR 92-02957; MIT Lincoln Laboratory Contract BX-6085; National Center for Integrated Photonics Technology Subcontract 542-383; U.S. Air Force - Office of Scientific Research Grant F49620-96-1-0126; U.S. Navy - Office of Naval Research Grant N00014-91-J-1956; National Science Foundation Grant DMR 94-0033

    Gas Source Molecular Beam Epitaxy of Compound Semiconductors

    Contains an introduction and reports on seven research projects. Sponsors: Defense Advanced Research Projects Agency Subcontract 284-25041; Joint Services Electronics Program Contract DAAL04-95-1-0038; National Center for Integrated Photonic Technology Contract 542-381; U.S. Army Research Office/AASERT Contract DAAH04-93-G-0175; National Science Foundation Grant DMR 92-02957; Joint Services Electronics Program Grant DAAL04-95-1-0038; National Science Foundation Grant DMR 90-22933; National Science Foundation Grant DMR 92-02957; National Center for Integrated Photonic Technology Contract 542-381; MIT Lincoln Laboratory; National Center for Integrated Photonic Technology Subcontract 542-383; National Science Foundation DMR 94-0033