Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach
Elucidating the genetic basis of complex traits and diseases in non-European populations is particularly challenging because US minority populations have been under-represented in genetic association studies. We developed an empirical Bayes approach named XPEB (cross-population empirical Bayes), designed to improve the power for mapping complex-trait-associated loci in a minority population by exploiting information from genome-wide association studies (GWASs) from another ethnic population. Taking as input summary statistics from two GWASs (a target GWAS from an ethnic minority population of primary interest and an auxiliary base GWAS, such as a larger GWAS in Europeans), our XPEB approach reprioritizes SNPs in the target population to compute local false-discovery rates. We demonstrated, through simulations, that whenever the base GWAS harbors relevant information, XPEB gains efficiency. Moreover, XPEB has the ability to discard irrelevant auxiliary information, providing a safeguard against inflated false-discovery rates due to genetic heterogeneity between populations. Applied to a blood-lipids study in African Americans, XPEB more than quadrupled the discoveries from the conventional approach, which used a target GWAS alone, bringing the number of significant loci from 14 to 65. Thus, XPEB offers a flexible framework for mapping complex traits in minority populations.
Genome-wide Characterization of Shared and Distinct Genetic Components that Influence Blood Lipid Levels in Ethnically Diverse Human Populations
Blood lipid concentrations are heritable risk factors associated with atherosclerosis and cardiovascular diseases. Lipid traits exhibit considerable variation among populations of distinct ancestral origin as well as between individuals within a population. We performed association analyses to identify genetic loci influencing lipid concentrations in African American and Hispanic American women in the Women's Health Initiative SNP Health Association Resource. We validated one African-specific high-density lipoprotein cholesterol locus at CD36 as well as 14 known lipid loci that have been previously implicated in studies of European populations. Moreover, we demonstrate striking similarities in genetic architecture (loci influencing the trait, direction and magnitude of genetic effects, and proportions of phenotypic variation explained) of lipid traits across populations. In particular, we found that a disproportionate fraction of lipid variation in African Americans and Hispanic Americans can be attributed to genomic loci exhibiting statistical evidence of association in Europeans, even though the precise genes and variants remain unknown. At the same time, we found substantial allelic heterogeneity within shared loci, characterized both by population-specific rare variants and variants shared among multiple populations that occur at disparate frequencies. The allelic heterogeneity emphasizes the importance of including diverse populations in future genetic association studies of complex traits such as lipids; furthermore, the overlap in lipid loci across populations of diverse ancestral origin argues that additional knowledge can be gleaned from multiple populations.
Single Cell Profiling of Circulating Tumor Cells: Transcriptional Heterogeneity and Diversity from Breast Cancer Cell Lines
BACKGROUND: To improve cancer therapy, it is critical to target metastasizing cells. Circulating tumor cells (CTCs) are rare cells found in the blood of patients with solid tumors and may play a key role in cancer dissemination. Uncovering CTC phenotypes offers a potential avenue to inform treatment. However, CTC transcriptional profiling is limited by leukocyte contamination; an approach to surmount this problem is single cell analysis. Here we demonstrate feasibility of performing high dimensional single CTC profiling, providing early insight into CTC heterogeneity and allowing comparisons to breast cancer cell lines widely used for drug discovery. METHODOLOGY/PRINCIPAL FINDINGS: We purified CTCs using the MagSweeper, an immunomagnetic enrichment device that isolates live tumor cells from unfractionated blood. CTCs that met stringent criteria for further analysis were obtained from 70% (14/20) of primary and 70% (21/30) of metastatic breast cancer patients; none were captured from patients with non-epithelial cancer (n = 20) or healthy subjects (n = 25). Microfluidic-based single cell transcriptional profiling of 87 cancer-associated and reference genes showed heterogeneity among individual CTCs, separating them into two major subgroups, based on 31 highly expressed genes. In contrast, single cells from seven breast cancer cell lines were tightly clustered together by sample ID and ER status. CTC profiles were distinct from those of cancer cell lines, questioning the suitability of such lines for drug discovery efforts for late stage cancer therapy. CONCLUSIONS/SIGNIFICANCE: For the first time, we directly measured high dimensional gene expression in individual CTCs without the common practice of pooling such cells. Elevated transcript levels of genes associated with metastasis (NPTN, S100A4, S100A9) and with epithelial-mesenchymal transition (VIM, TGFβ1, ZEB2, FOXC1, CXCR4) were striking compared to cell lines. Our findings demonstrate that profiling CTCs on a cell-by-cell basis is possible and may facilitate the application of 'liquid biopsies' to better model drug discovery.
Ancestral Components of Admixed Genomes in a Mexican Cohort
For most of the world, human genome structure at a population level is shaped by interplay between ancient geographic isolation and more recent demographic shifts, factors that are captured by the concepts of biogeographic ancestry and admixture, respectively. The ancestry of non-admixed individuals can often be traced to a specific population in a precise region, but current approaches for studying admixed individuals generally yield coarse information in which genome ancestry proportions are identified according to continent of origin. Here we introduce a new analytic strategy for this problem that allows fine-grained characterization of admixed individuals with respect to both geographic and genomic coordinates. Ancestry segments from different continents, identified with a probabilistic model, are used to construct and study "virtual genomes" of admixed individuals. We apply this approach to a cohort of 492 parent-offspring trios from Mexico City. The relative contributions from the three continental-level ancestral populations (Africa, Europe, and America) vary substantially between individuals, and the distribution of haplotype block length suggests an admixing time of 10-15 generations. The European and Indigenous American virtual genomes of each Mexican individual can be traced to precise regions within each continent, and they reveal a gradient of Amerindian ancestry between indigenous people of southwestern Mexico and Mayans of the Yucatan Peninsula. This contrasts sharply with the African roots of African Americans, which have been characterized by a uniform mixing of multiple West African populations. We also use the virtual European and Indigenous American genomes to search for the signatures of selection in the ancestral populations, and we identify previously known targets of selection in other populations, as well as new candidate loci. The ability to infer precise ancestral components of admixed genomes will facilitate studies of disease-related phenotypes and will allow new insight into the adaptive and demographic history of indigenous people.
Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?
Author-supplied citations are a fraction of the related literature for a paper. The "related citations" list on PubMed is typically dozens or hundreds of results long, and does not offer hints as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to more transparently navigate to PubMed updates through search terms that can associate a paper with its citations. The algorithm to generate these search terms involved automatically extracting noun phrases from the paper using natural language processing tools, and ranking them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we define the query as CV-S; when they were written by different authors, we define it as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D versus 65% for the top 20 PubMed "related citations." We hypothesize that these quantities, computed for the 20 million papers on PubMed, would fall within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, and 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper.
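The citation-validation labels defined in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `classify_query`, the use of plain string identifiers for papers, and the choice to let a single query carry both labels when its overlap includes both same-author and different-author citations are all assumptions made for illustration.

```python
from typing import Iterable, Set


def classify_query(
    top_results: Iterable[str],
    cited_ids: Set[str],
    same_author_ids: Set[str],
    top_k: int = 20,
) -> Set[str]:
    """Label one search query as CV-S and/or CV-D (hypothetical helper).

    top_results:     ranked result identifiers returned for the query
    cited_ids:       identifiers of the paper's author-supplied citations
    same_author_ids: subset of cited_ids written by the paper's own authors
    top_k:           how many top results to inspect (20 in the abstract)

    Returns an empty set when the query is not citation validated.
    """
    # Overlap between the top-k results and the author-supplied citations.
    overlap = set(list(top_results)[:top_k]) & cited_ids

    labels: Set[str] = set()
    if overlap & same_author_ids:
        labels.add("CV-S")  # validated by a citation from the same authors
    if overlap - same_author_ids:
        labels.add("CV-D")  # validated by a citation from different authors
    return labels


# Illustrative identifiers only; they do not refer to real papers.
cited = {"p1", "p2", "p3"}
same = {"p1"}
print(classify_query(["x", "p2", "y"], cited, same))
```

A query whose top results contain no author-supplied citation returns an empty set, which corresponds to the "not citation-validated" searches mentioned at the end of the abstract.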
The United States COVID-19 Forecast Hub dataset
Academic researchers, government agencies, industry groups, and individuals have produced forecasts at an unprecedented scale during the COVID-19 pandemic. To leverage these forecasts, the United States Centers for Disease Control and Prevention (CDC) partnered with an academic research lab at the University of Massachusetts Amherst to create the US COVID-19 Forecast Hub. Launched in April 2020, the Forecast Hub is a dataset with point and probabilistic forecasts of incident cases, incident hospitalizations, incident deaths, and cumulative deaths due to COVID-19 at county, state, and national levels in the United States. Included forecasts represent a variety of modeling approaches, data sources, and assumptions regarding the spread of COVID-19. The goal of this dataset is to establish a standardized and comparable set of short-term forecasts from modeling teams. These data can be used to develop ensemble models, communicate forecasts to the public, create visualizations, compare models, and inform policies regarding COVID-19 mitigation. These open-source data are available via download from GitHub, through an online API, and through R packages.
Learning causal networks using inducible transcription factors and transcriptome-wide time series
We present IDEA (the Induction Dynamics gene Expression Atlas), a dataset constructed by independently inducing hundreds of transcription factors (TFs) and measuring timecourses of the resulting gene expression responses in budding yeast. Each experiment captures a regulatory cascade connecting a single induced regulator to the genes it causally regulates. We discuss the regulatory cascade of a single TF, Aft1, in detail; however, IDEA contains > 200 TF induction experiments with 20 million individual observations and 100,000 signal-containing dynamic responses. As an application of IDEA, we integrate all timecourses into a whole-cell transcriptional model, which is used to predict and validate multiple new and underappreciated transcriptional regulators. We also find that the magnitudes of coefficients in this model are predictive of genetic interaction profile similarities. In addition to being a resource for exploring regulatory connectivity between TFs and their target genes, our modeling approach shows that combining rapid perturbations of individual genes with genome-scale time-series measurements is an effective strategy for elucidating gene regulatory networks.