24 research outputs found

    Functionally informed fine-mapping and polygenic localization of complex trait heritability

    Get PDF
    Fine-mapping aims to identify causal variants impacting complex traits. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy by leveraging functional annotations across the entire genome-not just genome-wide-significant loci-to specify prior probabilities for fine-mapping methods such as SuSiE or FINEMAP. In simulations, PolyFun + SuSiE and PolyFun + FINEMAP were well calibrated and identified >20% more variants with a posterior causal probability >0.95 than identified in their nonfunctionally informed counterparts. In analyses of 49 UK Biobank traits (average n = 318,000), PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement versus SuSiE. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 28 (hair color) to 3,400 (height) to 2 million (number of children). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures. PolyFun is a computationally scalable framework for functionally informed fine-mapping that makes full use of genome-wide data. It prioritizes more variants than previous methods when applied to 49 complex traits from UK Biobank.Peer reviewe

    Data from: Evolutionary history inferred from the de novo assembly of a non-model organism, the blue-eyed black lemur

    No full text
    Lemurs, the living primates most distantly related to humans, demonstrate incredible diversity in behaviour, life history patterns and adaptive traits. Although many lemur species are endangered within their native Madagascar, there is no high-quality genome assembly from this taxon, limiting population and conservation genetic studies. One critically endangered lemur is the blue-eyed black lemur Eulemur flavifrons. This species is fixed for blue irises, a convergent trait that evolved at least four times in primates and was subject to positive selection in humans, where 5′ regulatory variation of OCA2 explains most of the brown/blue eye colour differences. We built a de novo genome assembly for E. flavifrons, providing the most complete lemur genome to date, and a high confidence consensus sequence for close sister species E. macaco, the (brown-eyed) black lemur. From diversity and divergence patterns across the genomes, we estimated a recent split time of the two species (160 Kya) and temporal fluctuations in effective population sizes that accord with known environmental changes. By looking for regions of unusually low diversity, we identified potential signals of directional selection in E. flavifrons at MITF, a melanocyte development gene that regulates OCA2 and has previously been associated with variation in human iris colour, as well as at several other genes involved in melanin biosynthesis in mammals. Our study thus illustrates how whole-genome sequencing of a few individuals can illuminate the demographic and selection history of nonmodel species

    Eulemur flavifrons fastq file for PSMC

    No full text
    This fastq file was generated from the all-sites vcf file for Harlow (the E. flavifrons individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 20 or above, PE reads aligned to the same scaffold with correctly oriented read pairs mapping within three standard deviations of the mean, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using the vcf2fq function within the vcfutils.pl script, using a minimum depth of 26 and a maximum depth of 104 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20

    Eulemur macaco fastq files for PSMC

    No full text
    These two files should be combined into one fastq representing the whole genome. The combined fastq file was generated from the all-sites vcf file for Harmonia (the E. macaco individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 10 or above, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using a modification of the vcf2fq function within the vcfutils.pl script (vcf2fqnonref, in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl), using a minimum depth of 10 and a maximum depth of 41 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20

    Eulemur macaco fastq files for PSMC

    No full text
    These two files should be combined into one fastq representing the whole genome. The combined fastq file was generated from the all-sites vcf file for Harmonia (the E. macaco individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 10 or above, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using a modification of the vcf2fq function within the vcfutils.pl script (vcf2fqnonref, in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl), using a minimum depth of 10 and a maximum depth of 41 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20

    Dataset S1: Annotated transcripts in the 1% FST tail

    No full text
    This list contains all transcripts mapped to regions in the 1% tail of FST from the full dataset in either species. Columns represent Ensembl transcript ID, region of high FST where the transcript was annotated, start position of the transcript within the region, end position of the transcript within the region, proportion of the transcript mapped to the region, mean percent identity for parts of the transcript that mapped, Ensembl gene ID, and gene symbol. We annotated genes in the 1% tail of all 20 kb non-overlapping windows by aligning human protein sequences to the blue-eyed black lemur genome. We obtained protein sequences for human genome build hg18 and used TBLASTN version 2.2.22+ (Altschul et al. 1990, 1997), with an e-value threshold of 5 x 10-5 to identify orthologs within the regions of the blue-eyed black lemur reference genome corresponding to the 1% FST tail. We then took the list of all human proteins with hits within candidate regions and performed TBLASTN for these proteins against the entire lemur genome. We retained proteins whose best genome-wide match (containing the lowest e-value or maximum mean percent identity) for any subset of the protein sequence overlapped the candidate region. In cases in which multiple proteins mapped to the same location (>50% protein length overlapping, presumably representing multiple transcripts of the same gene or multiple genes in the same family), we retained the protein with the largest total length spanned by initial TBLASTN hits or the largest mean percent identity

    Variable sites VCF files

    No full text
    These files contain sites with a high posterior probability (> 0.8) of being polymorphic in the 8-sample dataset (4 samples from each species), along with their genotype likelihoods in these samples as estimated by ANGSD (http://popgen.dk/wiki/index.php/ANGSD). Genotype likelihoods were estimated using the GATK model (-GL 2). The probability of a site being variable was estimated using ngsStat (https://github.com/mfumagalli/ngsTools), based on posterior probabilities calculated using the genotype likelihoods and the site frequency spectrum estimated from high quality sites (minimum mapping quality 1, minimum quality 20, minimum 3 samples with data, minimum depth 9) within each species

    EfEmSangerSequencingForDryad

    No full text
    Sanger sequencing results for up to 16 individuals (9 blue-eyed black lemurs and 7 black lemurs from the Duke Lemur Center) sequenced at 17 amplicons throughout scaffold2503 (the scaffold containing the orthologs of HERC2/OCAs) and aligned using EBio
    corecore