Large language models shape and are shaped by society: A survey of arXiv publication patterns
There has been a steep recent increase in the number of large language model
(LLM) papers, producing a dramatic shift in the scientific landscape that
remains largely undocumented by bibliometric analysis. Here, we analyze
388K papers posted on the CS and Stat arXivs, focusing on changes in
publication patterns in 2023 vs. 2018-2022. We examine how the proportion of
LLM papers is increasing; the LLM-related topics receiving the most attention;
the authors writing LLM papers; how authors' research topics correlate with
their backgrounds; the factors distinguishing highly cited LLM papers; and the
patterns of international collaboration. We show that LLM research increasingly
focuses on societal impacts: there has been an 18x increase in the proportion
of LLM-related papers on the Computers and Society sub-arXiv, and authors newly
publishing on LLMs are more likely to focus on applications and societal
impacts than more experienced authors. LLM research is also shaped by social
dynamics: we document gender and academic/industry disparities in the topics
LLM authors focus on, and a US/China schism in the collaboration network.
Overall, our analysis documents the profound ways in which LLM research both
shapes and is shaped by society, attesting to the necessity of sociotechnical
lenses.
Comment: Working paper.
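The kind of proportion computation behind findings like the reported 18x increase can be sketched in a few lines. This is a minimal illustration with hypothetical paper records and an invented LLM-matching flag, not the paper's 388K-record dataset:

```python
# Illustrative sketch of a bibliometric proportion computation; the records
# below and the is_llm flag are hypothetical, not the survey's actual data.
papers = [
    # (year, sub_arxiv, is_llm)
    (2018, "cs.CY", False), (2018, "cs.CY", False), (2018, "cs.CY", True),
    (2023, "cs.CY", True),  (2023, "cs.CY", True),  (2023, "cs.CY", False),
]

def llm_proportion(records, year, sub):
    # Share of papers in a given sub-arXiv and year that are LLM-related.
    pool = [r for r in records if r[0] == year and r[1] == sub]
    return sum(r[2] for r in pool) / len(pool) if pool else 0.0

before = llm_proportion(papers, 2018, "cs.CY")
after = llm_proportion(papers, 2023, "cs.CY")
print(f"cs.CY LLM share: {before:.2f} -> {after:.2f} ({after / before:.1f}x)")
# -> cs.CY LLM share: 0.33 -> 0.67 (2.0x)
```

The same ratio computed over real sub-arXiv counts is what yields the 18x figure reported in the abstract.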
Coarse race data conceals disparities in clinical risk score performance
Healthcare data in the United States often records only a patient's coarse
race group: for example, both Indian and Chinese patients are typically coded
as "Asian." It is unknown, however, whether this coarse coding conceals
meaningful disparities in the performance of clinical risk scores across
granular race groups. Here we show that it does. Using data from 418K emergency
department visits, we assess clinical risk score performance disparities across
granular race groups for three outcomes, five risk scores, and four performance
metrics. Across outcomes and metrics, we show that there are significant
granular disparities in performance within coarse race categories. In fact,
variation in performance metrics within coarse groups often exceeds the
variation between coarse groups. We explore why these disparities arise,
finding that outcome rates, feature distributions, and the relationships
between features and outcomes all vary significantly across granular race
categories. Our results suggest that healthcare providers, hospital systems,
and machine learning researchers should strive to collect, release, and use
granular race data in place of coarse race data, and that existing analyses may
significantly underestimate racial disparities in performance.
Comment: The first two authors contributed equally. Under review.
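The within-coarse-group disparity the abstract describes can be illustrated with a small sketch. The data here are synthetic (not the paper's emergency-department dataset), and a Mann-Whitney AUC stands in for the paper's four performance metrics: two granular groups pooled under one coarse label can have very different risk-score performance, which the pooled number hides.

```python
import random

def auc(y_true, scores):
    # Mann-Whitney AUC: probability a random positive outranks a random negative.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)

def simulate(n, slope):
    # Hypothetical group where the "risk score" is a single feature x whose
    # relationship to the outcome has the given logistic slope.
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = 1 if random.random() < 1 / (1 + 2.718 ** -(slope * x)) else 0
        rows.append((y, x))
    return rows

group_a = simulate(1000, slope=2.0)  # score is informative in this group
group_b = simulate(1000, slope=0.3)  # score is weakly informative here

auc_a = auc([y for y, _ in group_a], [x for _, x in group_a])
auc_b = auc([y for y, _ in group_b], [x for _, x in group_b])
pooled = group_a + group_b           # what coarse race coding would report
auc_coarse = auc([y for y, _ in pooled], [x for _, x in pooled])

print(f"granular A: {auc_a:.2f}, granular B: {auc_b:.2f}, coarse: {auc_coarse:.2f}")
```

Because the feature-outcome relationship differs between the two granular groups, the single coarse-group AUC conceals a large gap between them, mirroring the paper's finding that within-coarse-group variation can exceed between-group variation.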
In-silico Prediction of Synergistic Anti-Cancer Drug Combinations Using Multi-omics Data
Chemotherapy is a routine treatment approach for early-stage cancers, but its effectiveness is often limited by drug resistance, toxicity, and tumor heterogeneity. Combination chemotherapy, in which two or more drugs are applied simultaneously, offers one promising approach to address these concerns, since two single-target drugs may synergize with one another through interconnected biological processes. However, identifying effective dual therapies has been particularly challenging: because the search space is large, combination success rates are low. Here, we present our method for the DREAM AstraZeneca-Sanger Drug Combination Prediction Challenge to predict synergistic drug combinations. Our approach applies machine learning to biologically relevant drug and cell line features. Our model achieved a primary metric of 0.36 and a tie-breaker metric of 0.37 in the extension round of the challenge, ranking in the top 15 of 76 submissions. It also achieves a mean primary metric of 0.39 over ten repetitions of 10-fold cross-validation. Further, we analyzed our model's predictions to better understand the molecular processes underlying synergy and found that key regulators of tumorigenesis such as TNFA and BRAF are often targets in synergistic interactions, while MYC is often duplicated. Through further analysis of our predictions, we were also able to gain insight into mechanisms and potential biomarkers of synergistic drug pairs.
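The evaluation protocol mentioned above (ten repetitions of 10-fold cross-validation, reporting the mean metric) can be sketched as follows. The model and metric here are trivial placeholders on synthetic synergy scores; the challenge's actual multi-omics features, model, and primary (weighted-correlation) metric are not reproduced:

```python
import random

def kfold_indices(n, k, rng):
    # Shuffle indices once, then slice them into k disjoint folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(train_y):
    # Placeholder "model": predicts the training mean for every test pair.
    return sum(train_y) / len(train_y)

def metric(pred, true):
    # Placeholder metric: negated mean absolute error (higher is better).
    return -sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

rng = random.Random(42)
y = [rng.gauss(0, 1) for _ in range(200)]  # hypothetical synergy scores

scores = []
for rep in range(10):                       # ten repetitions ...
    folds = kfold_indices(len(y), 10, rng)  # ... of 10-fold cross-validation
    for test_idx in folds:
        held_out = set(test_idx)
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        pred = fit_mean(train_y)
        scores.append(metric([pred] * len(test_idx), [y[i] for i in test_idx]))

print(f"mean metric over {len(scores)} folds: {sum(scores) / len(scores):.3f}")
```

Averaging over 100 train/test splits in this way reduces the variance of the reported score relative to a single 10-fold run, which is why the abstract quotes a mean over ten repetitions.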
Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays.
The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterizing this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within the range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
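Spearman ρ, the rank correlation used above to report concordance between predictions and measured activity, depends only on the ranks of the two value lists. A stdlib-only sketch, using hypothetical activity values rather than the Sharpr-MPRA measurements:

```python
def ranks(values):
    # Assign 1-based ranks, averaging over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho = Pearson correlation of the ranks.
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

measured = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3]   # hypothetical MPRA activities
predicted = [0.2, 0.3, 0.1, 0.8, 0.5, 0.6]  # hypothetical model outputs
print(f"Spearman rho = {spearman(measured, predicted):.2f}")
# prints: Spearman rho = 0.77
```

Because it compares ranks rather than raw values, Spearman ρ is insensitive to any monotone scaling of the model outputs, which makes it a natural choice when predicted and measured activities are on different scales.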