5 research outputs found

    Large language models shape and are shaped by society: A survey of arXiv publication patterns

    There has been a steep recent increase in the number of large language model (LLM) papers, producing a dramatic shift in the scientific landscape which remains largely undocumented through bibliometric analysis. Here, we analyze 388K papers posted on the CS and Stat arXivs, focusing on changes in publication patterns in 2023 vs. 2018-2022. We analyze how the proportion of LLM papers is increasing; the LLM-related topics receiving the most attention; the authors writing LLM papers; how authors' research topics correlate with their backgrounds; the factors distinguishing highly cited LLM papers; and the patterns of international collaboration. We show that LLM research increasingly focuses on societal impacts: there has been an 18x increase in the proportion of LLM-related papers on the Computers and Society sub-arXiv, and authors newly publishing on LLMs are more likely to focus on applications and societal impacts than more experienced authors. LLM research is also shaped by social dynamics: we document gender and academic/industry disparities in the topics LLM authors focus on, and a US/China schism in the collaboration network. Overall, our analysis documents the profound ways in which LLM research both shapes and is shaped by society, attesting to the necessity of sociotechnical lenses. Comment: Working paper.

    Coarse race data conceals disparities in clinical risk score performance

    Healthcare data in the United States often records only a patient's coarse race group: for example, both Indian and Chinese patients are typically coded as "Asian." It is unknown, however, whether this coarse coding conceals meaningful disparities in the performance of clinical risk scores across granular race groups. Here we show that it does. Using data from 418K emergency department visits, we assess clinical risk score performance disparities across granular race groups for three outcomes, five risk scores, and four performance metrics. Across outcomes and metrics, we show that there are significant granular disparities in performance within coarse race categories. In fact, variation in performance metrics within coarse groups often exceeds the variation between coarse groups. We explore why these disparities arise, finding that outcome rates, feature distributions, and the relationships between features and outcomes all vary significantly across granular race categories. Our results suggest that healthcare providers, hospital systems, and machine learning researchers should strive to collect, release, and use granular race data in place of coarse race data, and that existing analyses may significantly underestimate racial disparities in performance. Comment: The first two authors contributed equally. Under review.

    In-silico Prediction of Synergistic Anti-Cancer Drug Combinations Using Multi-omics Data

    Chemotherapy is a routine treatment approach for early-stage cancers, but the effectiveness of such treatments is often limited by drug resistance, toxicity, and tumor heterogeneity. Combination chemotherapy, in which two or more drugs are applied simultaneously, offers one promising approach to address these concerns, since two single-target drugs may synergize with one another through interconnected biological processes. However, the identification of effective dual therapies has been particularly challenging; because the search space is large, combination success rates are low. Here, we present our method for the DREAM AstraZeneca-Sanger Drug Combination Prediction Challenge to predict synergistic drug combinations. Our approach involves using biologically relevant drug and cell line features with machine learning. Our machine learning model obtained a primary metric of 0.36 and a tie-breaker metric of 0.37 in the extension round of the challenge, which ranked in the top 15 out of 76 submissions. Our approach also achieves a mean primary metric of 0.39 with ten repetitions of 10-fold cross-validation. Further, we analyzed our model's predictions to better understand the molecular processes underlying synergy and discovered that key regulators of tumorigenesis such as TNFA and BRAF are often targets in synergistic interactions, while MYC is often duplicated. Through further analysis of our predictions, we were also able to gain insight into mechanisms and potential biomarkers of synergistic drug pairs.

    Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

    The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.