59 research outputs found

Data Similarity is Not Enough to Explain Language Model Performance

Author: Mimno David
Reif Emily
Yauney Gregory
Publication date: 15/11/2023

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
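
One way to read the correlation test described in this abstract is as a rank correlation between a per-task similarity score and per-task accuracy. The sketch below is illustrative only: the abstract does not say which correlation statistic the authors use, so Spearman's coefficient is an assumption, and the scores are made-up placeholder values, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-task values (placeholders, not data from the paper).
similarity = [0.12, 0.35, 0.50, 0.61, 0.80]
accuracy = [0.55, 0.48, 0.71, 0.52, 0.66]
rho = spearman(similarity, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near zero across many tasks would correspond to the "not correlated" finding the abstract reports for non-multilingual benchmarks.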

arXiv.org e-Print Archive

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Author: Ippolito Daphne
Lee Katherine
Longpre Shayne
Mimno David
Reif Emily
Roberts Adam
Robinson Kevin
Wei Jason
Yauney Gregory
Zhou Denny
Zoph Barret
Publication date: 13/11/2023

Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.

arXiv.org e-Print Archive

Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making

Author: Cai Carrie J.
Corrado Greg S.
Hegde Narayan
Hipp Jason
Kim Been
Reif Emily
Smilkov Daniel
Stumpe Martin C.
Terry Michael
Viegas Fernanda
Wattenberg Martin
Publication date: 08/02/2019

Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.

arXiv.org e-Print Archive

Crossref

Development of a subcutaneous ear implant to deliver an anaplasmosis vaccine to dairy steers

Author: Anantatat Tippawan
Coetzee Johann
Curtis Andrew
Jaberi-Douraki Majid
Jones Douglas
Kelly Sean
Kleinhenz Michael
Martin Miriam
Montgomery Shawnee
Narasimhan Balaji
Reif Kathryn
Reppert Emily
Skinner Brandt
Publication venue: Iowa State University Digital Repository
Publication date: 31/12/2019

Bovine anaplasmosis is the most prevalent tick-transmitted disease of cattle worldwide and a major obstacle to profitable beef production. Use of chlortetracycline-medicated feed to control active anaplasmosis infections during the vector season has raised concerns about the potential emergence of antimicrobial resistance in bacteria that may pose a risk to human health. Furthermore, the absence of effectiveness data for a commercially available, conditionally licensed anaplasmosis vaccine is a major impediment to implementing anaplasmosis control programs. The primary objective of this study was to develop a single-dose vaccine delivery platform to produce long-lasting protective immunity against anaplasmosis infections. Twelve Holstein steers, aged 11-12 weeks, were administered a novel 3-stage, single-dose vaccine against Anaplasma marginale (Am) major surface protein 1a. The vaccine consisted of a soluble vaccine administered subcutaneously (s.c.) for immune priming, a vaccine depot of a biodegradable polyanhydride rod with intermediate slow release of the vaccine for boosting immune response, and an immune-isolated vaccine platform for extended antigen release (VPEAR implant) deposited s.c. in the ear. Six calves were randomly assigned to two vaccine constructs (n=3) that featured rods and implants containing a combination of two different adjuvants, diethylaminoethyl (DEAE)-Dextran and Quil-A (Group A). The remaining 6 calves were randomly assigned to two vaccine constructs (n=3) that featured rods and implants containing the same adjuvant (either DEAE-Dextran or Quil-A) (Group B). Twenty-one months post-implantation, calves were challenged intravenously with Am stabilate and were monitored weekly for signs of fever, decreased packed cell volume (PCV) and bacteremia. Data were analyzed using a mixed effects model and chi-squared tests (SAS v9.04.01, SAS Institute, Cary, NC).
Calves in Group A had higher PCV than calves in Group B (P = 0.006) at day 35 post-infection. Calves in Group A were less likely to require antibiotic intervention compared with calves in Group B (P = 0.014). Results indicate that calves exhibited diminished clinical signs of anaplasmosis when antigen was delivered with a combination of adjuvants as opposed to a single adjuvant. This demonstrates the feasibility of providing long-lasting protection against clinical bovine anaplasmosis infections using a subcutaneous ear implant vaccine construct.

Digital Repository @ Iowa State University (ISU)

High throughput analysis of epistasis in genome-wide association studies with BiForce

Author: Attila Gyenesei
Chris S. Haley
Colin A.M. Semple
Jonathan Moody
Wen-Hua Wei
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2012

Motivation: Gene–gene interactions (epistasis) are thought to be important in shaping complex traits, but they have been under-explored in genome-wide association studies (GWAS) due to the computational challenge of enumerating billions of single nucleotide polymorphism (SNP) combinations. Fast screening tools are needed to make epistasis analysis routinely available in GWAS. Results: We present BiForce to support high-throughput analysis of epistasis in GWAS for either quantitative or binary disease (case–control) traits. BiForce achieves great computational efficiency by using memory efficient data structures, Boolean bitwise operations and multithreaded parallelization. It performs a full pair-wise genome scan to detect interactions involving SNPs with or without significant marginal effects using appropriate Bonferroni-corrected significance thresholds. We show that BiForce is more powerful and significantly faster than published tools for both binary and quantitative traits in a series of performance tests on simulated and real datasets. We demonstrate BiForce in analysing eight metabolic traits in a GWAS cohort (323 697 SNPs, >4500 individuals) and two disease traits in another (>340 000 SNPs, >1750 cases and 1500 controls) on a 32-node computing cluster. BiForce completed analyses of the eight metabolic traits within 1 day, identified nine epistatic pairs of SNPs in five metabolic traits and 18 SNP pairs in two disease traits. BiForce can make the analysis of epistasis a routine exercise in GWAS and thus improve our understanding of the role of epistasis in the genetic regulation of complex traits. Availability and implementation: The software is free and can be downloaded from http://bioinfo.utu.fi/BiForce/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online
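
The bitwise counting idea named in this abstract can be sketched as follows. This is not BiForce's code: the encoding (one integer bitmask per genotype class), the function names, and the toy genotypes are assumptions made for illustration. The pair count and the Bonferroni threshold follow the standard definitions for an exhaustive pairwise scan, with a significance level of 0.05 assumed.

```python
def encode_bitsets(genotypes):
    """Return three bitmasks, one per genotype class (0, 1, 2).

    Bit i of masks[g] is set when individual i carries genotype g.
    """
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def pair_table(snp_a, snp_b):
    """Build the 3x3 joint genotype count table for two SNPs.

    Each cell is the number of set bits in the AND of two bitmasks,
    so no per-individual loop is needed once the masks exist.
    """
    a = encode_bitsets(snp_a)
    b = encode_bitsets(snp_b)
    return [[bin(a[i] & b[j]).count("1") for j in range(3)] for i in range(3)]


def bonferroni_pairwise(n_snps, alpha=0.05):
    """Number of SNP pairs in a full pairwise scan and the per-test threshold."""
    n_pairs = n_snps * (n_snps - 1) // 2
    return n_pairs, alpha / n_pairs


# Toy genotypes for eight hypothetical individuals (not real data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 0, 2, 1, 1, 2, 1, 0]
table = pair_table(snp_a, snp_b)
for row in table:
    print(row)

# SNP count taken from the abstract's metabolic-trait cohort.
n_pairs, threshold = bonferroni_pairwise(323_697)
print(n_pairs, threshold)
```

In a real scan the masks would be built once per SNP and reused across all pairs, which is where most of the saving over per-individual counting would come from.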

Crossref

PubMed Central

Edinburgh Research Explorer

The University of Manchester - Institutional Repository

Bioinformatics challenges for genome-wide association studies

Author: F. W. Asselbergs
J. H. Moore
S. M. Williams
Publication venue: Oxford University Press
Publication date: 15/02/2010

Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time, thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.

CiteSeerX

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

PubMed Central

UCL Discovery

Dissertations of the University of Groningen

Epitaxial Growth and Processing of Compound Semiconductors

Author: Ahadian Joseph F.
Dahleh Munther A.
Dougherty David J.
Fan Shanhui
Fonstad Clifton G., Jr.
Goorsky Mark S.
Hall Katherine L.
House Jody L.
Ippen Erich P.
Joannopoulos John D.
Kolodziejski Leslie A.
Koontz Elisabeth M.
Lim Kuo-Yi
Lim Michael H. Y.
Milikow Jeremy M.
Patterson Steven G.
Petrich Gale S.
Prasad Sheila
Qi Minghao
Reif L. Rafael
Smith Henry I.
Steinmeyer Günter
Tang Xiao-Feng
Tziligakis Constantine N.
Viadyananthan Praveen T.
Villeneuve Pierre R.
Warlick Emily L.
Warnick Sean
Publication venue: Research Laboratory of Electronics (RLE) at the Massachusetts Institute of Technology (MIT)

Contains an introduction and reports on six research projects. Defense Advanced Research Projects Agency/U.S. Navy - Office of Naval Research University Research Initiative Subcontract N00014-92-J-1893; Joint Services Electronics Program Grant DAAH04-95-1-0038; National Center for Integrated Photonics Technology Contract 542-381; National Science Foundation Grant DMR 92-02957; MIT Lincoln Laboratory Contract BX-6085; National Center for Integrated Photonics Technology Subcontract 542-383; U.S. Air Force - Office of Scientific Research Grant F49620-96-1-0126; U.S. Navy - Office of Naval Research Grant N00014-91-J-1956; National Science Foundation Grant DMR 94-0033

DSpace@MIT

Gas Source Molecular Beam Epitaxy of Compound Semiconductors

Author: Ahadian Joseph F.
Chen Jerry C.
Damask Jay N.
Donnelly Joseph P.
Dougherty David J.
Fan Shanhui
Fonstad Clifton G., Jr.
Hall Katherine L.
Haus Hermann A.
Ho Easen
House Jody L.
Ippen Erich P.
Joannopoulos John D.
Kolodziejski Leslie A.
Lim Kuo-Yi
Lopatnikova Anna
Marley Elisabeth A.
Milikow Jeremy M.
Patterson Steven G.
Petrich Gale S.
Reif L. Rafael
Shenoy Krishna V.
Smith Henry I.
Tang Xiao-feng
Villeneuve Pierre R.
Warlick Emily L.
Publication venue: Research Laboratory of Electronics (RLE) at the Massachusetts Institute of Technology (MIT)

Contains an introduction and reports on seven research projects. Defense Advanced Research Projects Agency Subcontract 284-25041; Joint Services Electronics Program Contract DAAL04-95-1-0038; National Center for Integrated Photonic Technology Contract 542-381; U.S. Army Research Office/AASERT Contract DAAH04-93-G-0175; National Science Foundation Grant DMR 92-02957; Joint Services Electronics Program Grant DAAL04-95-1-0038; National Science Foundation Grant DMR 90-22933; National Science Foundation Grant DMR 92-02957; National Center for Integrated Photonic Technology Contract 542-381; MIT Lincoln Laboratory; National Center for Integrated Photonic Technology Subcontract 542-383; National Science Foundation DMR 94-0033

DSpace@MIT