9 research outputs found

    ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

    We present ensemblQueryR, an R package that provides an interface to the linkage disequilibrium (LD) endpoints of the Ensembl REST API. It is designed for flexible, fast and user-friendly querying that integrates readily into R workflows, and it is optimised for high-throughput use. ensemblQueryR achieves this through functions that are intuitive and amenable to custom code integration, the use of familiar R object types as inputs and outputs, code optimisation and optional parallelisation. For each LD endpoint, ensemblQueryR provides two functions, permitting both single-query and multi-query modes of operation. The multi-query functions are optimised for large query sizes and provide optional parallelisation to leverage available computational resources and minimise processing time. We demonstrate that ensemblQueryR outperforms analogous software in both random access memory (RAM) usage and speed, delivering a 10-fold speed increase whilst using a third of the RAM. Finally, ensemblQueryR is near-agnostic to operating system and computational architecture through the availability of Docker and Singularity images, making this tool widely accessible to the scientific community.
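    The single-query and multi-query modes described above can be illustrated with a minimal Python sketch. It is not ensemblQueryR's code or API (the package itself is written in R). It assumes the public Ensembl REST route `ld/:species/:id/:population_name` with its optional `r2`, `d_prime` and `window_size` filters; the function names `ld_url`, `fetch_json` and `query_many` are hypothetical, and threads stand in for whatever parallelisation the package actually uses.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote, urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def ld_url(rsid, population, species="human", **filters):
    """Build a URL for the LD endpoint; `filters` may hold r2, d_prime, window_size."""
    path = "/".join(
        [ENSEMBL_REST, "ld", quote(species), quote(rsid), quote(population, safe=":")]
    )
    # Drop filters that were not set so the URL stays minimal.
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{path}?{urlencode(params)}" if params else path


def fetch_json(url):
    """Single-query mode: one HTTP GET, decoded from JSON (needs network access)."""
    req = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def query_many(rsids, population, fetch=fetch_json, workers=4, **filters):
    """Multi-query mode: run the single-query step over many variants in parallel.

    `fetch` is injectable so the function can be exercised without the network.
    """
    urls = [ld_url(rsid, population, **filters) for rsid in rsids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, urls))  # map preserves input order
    return dict(zip(rsids, results))
```

    A call such as `query_many(["rs1042779"], "1000GENOMES:phase_3:KHV", r2=0.8)` would return a dictionary keyed by variant ID. The public server enforces rate limits, so a real high-throughput client would also need throttling and retry logic, which this sketch omits.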

    ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

    We present ensemblQueryR, an R package for querying Ensembl linkage disequilibrium (LD) endpoints. This package is flexible, fast and user-friendly, and optimised for high-throughput querying. ensemblQueryR offers functions that are intuitive and amenable to custom code integration, uses familiar R object types as inputs and outputs, and provides parallelisation functionality. For each Ensembl LD endpoint, ensemblQueryR provides two functions, permitting both single- and multi-query modes of operation. The multi-query functions are optimised for large query sizes and provide optional parallelisation to leverage available computational resources and minimise processing time. We demonstrate improved computational performance of ensemblQueryR over an existing tool in terms of random access memory (RAM) usage and speed, delivering a 10-fold speed increase whilst using a third of the RAM. Finally, ensemblQueryR is near-agnostic to operating system and computational architecture through Docker and Singularity images, making this tool widely accessible to the scientific community.

    aws-s3-integrity-check: an open-source bash tool to verify the integrity of a dataset stored on Amazon S3

    Amazon Simple Storage Service (Amazon S3) is a widely used platform for storing large biomedical datasets. Unintended data alterations can occur during data writing and transmission, altering the original content and generating unexpected results. However, no open-source and easy-to-use tool exists to verify end-to-end data integrity. Here, we present aws-s3-integrity-check, a user-friendly, lightweight, and reliable bash tool to verify the integrity of a dataset stored in an Amazon S3 bucket. Using this tool, we only needed ∼114 min to verify the integrity of 1,045 records ranging between 5 bytes and 10 gigabytes and occupying ∼935 gigabytes of the Amazon S3 cloud. Our aws-s3-integrity-check tool also provides file-by-file on-screen and log-file-based information about the status of each integrity check. To our knowledge, this tool is the only open-source one that allows verifying the integrity of a dataset uploaded to the Amazon S3 Storage quickly, reliably, and efficiently. The tool is freely available for download and use at https://github.com/SoniaRuiz/aws-s3-integrity-check and https://hub.docker.com/r/soniaruiz/aws-s3-integrity-check
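    The abstract does not say how the tool performs its checks. One common way to verify an S3 upload, sketched here in Python as an illustration only, is to recompute locally the ETag that S3 reports: for a plain single-request upload this is usually the MD5 of the object, and for a multipart upload it is the MD5 of the concatenated per-part MD5 digests followed by `-<number of parts>`. The function names and the 8 MiB default part size are assumptions, and the scheme does not hold for every object (for example, some server-side encryption modes produce ETags that are not MD5 digests).

```python
import hashlib


def s3_style_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Recompute an S3-style ETag for `data`.

    Assumption: data that fits in one part was uploaded in a single request,
    so its ETag is the plain MD5; larger data was uploaded as a multipart
    upload using `part_size`-byte parts.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    if len(parts) == 1:
        return hashlib.md5(parts[0]).hexdigest()
    # Multipart: MD5 over the concatenated binary digests of every part.
    joined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(joined).hexdigest()}-{len(parts)}"


def matches_remote(data: bytes, remote_etag: str, part_size: int = 8 * 1024 * 1024) -> bool:
    """Compare a local recomputation with the ETag string returned by S3."""
    # S3 returns the ETag wrapped in double quotes.
    return s3_style_etag(data, part_size) == remote_etag.strip('"')
```

    For multi-gigabyte files such as those in the benchmark above, a real implementation would stream each part from disk rather than hold the whole object in memory, and the part size must match the one used for the original upload.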

    IntroVerse: a comprehensive database of introns across human tissues

    Dysregulation of RNA splicing contributes to both rare and complex diseases. RNA-sequencing data from human tissues has shown that this process can be inaccurate, resulting in the presence of novel introns detected at low frequency across samples and within an individual. To enable the full spectrum of intron use to be explored, we have developed IntroVerse, which offers an extensive catalogue on the splicing of 332,571 annotated introns and a linked set of 4,679,474 novel junctions covering 32,669 different genes. This dataset has been generated through the analysis of 17,510 human control RNA samples from 54 tissues provided by the Genotype-Tissue Expression Consortium. IntroVerse has two unique features: (i) it provides a complete catalogue of novel junctions and (ii) each novel junction has been assigned to a specific annotated intron. This unique, hierarchical structure offers multiple uses, including the identification of novel transcripts from known genes and their tissue-specific usage, and the assessment of background splicing noise for introns thought to be mis-spliced in disease states. IntroVerse provides a user-friendly web interface and is freely available at https://rytenlab.com/browser/app/introverse
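    The hierarchical structure described above, in which each novel junction is linked to one annotated intron, can be pictured with a small Python sketch. The abstract does not state the assignment rule, so the rule used here (a novel junction is attached to an annotated intron that shares its start or end coordinate on the same chromosome and strand) is purely an assumption for illustration, as are the `Junction` type and the function name.

```python
from collections import defaultdict
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def assign_novel_junctions(annotated, novel):
    """Group novel junctions under the annotated intron they share a splice site with.

    Returns (mapping from annotated intron to its novel junctions, unassigned junctions).
    If several annotated introns share a coordinate, the last one listed wins;
    a real pipeline would need an explicit tie-breaking rule.
    """
    by_start = {(j.chrom, j.strand, j.start): j for j in annotated}
    by_end = {(j.chrom, j.strand, j.end): j for j in annotated}
    children = defaultdict(list)
    unassigned = []
    for j in novel:
        parent = by_start.get((j.chrom, j.strand, j.start))
        if parent is None:
            parent = by_end.get((j.chrom, j.strand, j.end))
        if parent is None:
            unassigned.append(j)  # shares no splice site with any annotated intron
        else:
            children[parent].append(j)
    return dict(children), unassigned
```

    With this layout, the novel junctions recorded for a given annotated intron can be retrieved in one lookup, which is the kind of query needed to judge background splicing noise around an intron of interest.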

    Human-lineage-specific genomic elements are associated with neurodegenerative disease and APOE transcript usage

    Knowledge of genomic features specific to the human lineage may provide insights into brain-related diseases. We leverage high-depth whole genome sequencing data to generate a combined annotation identifying regions simultaneously depleted for genetic variation (constrained regions) and poorly conserved across primates. We propose that these constrained, non-conserved regions (CNCRs) have been subject to human-specific purifying selection and are enriched for brain-specific elements. We find that CNCRs are depleted from protein-coding genes but enriched within lncRNAs. We demonstrate that per-SNP heritability of a range of brain-relevant phenotypes is enriched within CNCRs. We find that genes implicated in neurological diseases, including APOE, have high CNCR density, and that in APOE this highlights an unannotated intron-3 retention event. Using human brain RNA-sequencing data, we show the intron-3-retaining transcript to be more abundant in Alzheimer's disease with more severe tau and amyloid pathological burden. Thus, we demonstrate a potential association of human-lineage-specific sequences with brain development and neurological disease.
    Funding and acknowledgements: The authors are grateful to the participants in the Religious Orders Study and the Memory and Aging Project. Z.C. and R.H.R. were supported by grants from the Leonard Wolfson Foundation. M.R. was supported by the United Kingdom Medical Research Council (MRC) through the award of a Tenure Track Clinician Scientist Fellowship (MR/N008324/1). J.H. was supported by the UK Dementia Research Institute which receives its funding from DRI Limited, funded by the UK Medical Research Council, Alzheimer’s Society and Alzheimer’s Research UK. J.H. has also been funded by the Medical Research Council (award MR/N026004/1), Wellcome Trust (award 202903/Z/16/Z), Dolby Family Fund and National Institute for Health Research University College London Hospitals Biomedical Research Centre. J.B. is supported through the Science and Technology Agency, Séneca Foundation, CARM, Spain (research project 00007/COVI/20).

    Functional genomics provide key insights to improve the diagnostic yield of hereditary ataxia

    Improvements in functional genomic annotation have led to a critical mass of neurogenetic discoveries. This is exemplified in hereditary ataxia, a heterogeneous group of disorders characterised by incoordination from cerebellar dysfunction. Associated pathogenic variants in more than 300 genes have been described, leading to a detailed genetic classification partitioned by age-of-onset. Despite these advances, up to 75% of patients with ataxia remain molecularly undiagnosed even following whole genome sequencing, as exemplified in the 100,000 Genomes Project. This study aimed to understand whether we can improve our knowledge of the genetic architecture of hereditary ataxia by leveraging functional genomic annotations, and as a result, generate insights and strategies that raise the diagnostic yield. To achieve these aims, we used publicly-available multi-omics data to generate 294 genic features, capturing information relating to a gene's structure, genetic variation, tissue-specific, cell-type-specific and temporal expression, as well as protein products of a gene. We studied these features across genes typically causing childhood-onset, adult-onset or both types of disease first individually, then collectively. This led to the generation of testable hypotheses which we investigated using whole genome sequencing data from up to 2,182 individuals presenting with ataxia and 6,658 non-neurological probands recruited in the 100,000 Genomes Project. Using this approach, we demonstrated a high short tandem repeat (STR) density within childhood-onset genes suggesting that we may be missing pathogenic repeat expansions within this cohort. This was verified in both childhood- and adult-onset ataxia patients from the 100,000 Genomes Project who were unexpectedly found to have a trend for higher repeat sizes even at naturally-occurring STRs within known ataxia genes, implying a role for STRs in pathogenesis. 
    Using unsupervised analysis, we found significant similarities in genomic annotation across the gene panels, which suggested adult- and childhood-onset patients should be screened using a common diagnostic gene set. We tested this within the 100,000 Genomes Project by assessing the burden of pathogenic variants among childhood-onset genes in adult-onset patients and vice versa. This demonstrated a significantly higher burden of rare, potentially pathogenic variants in conventional childhood-onset genes among individuals with adult-onset ataxia. Our analysis has implications for the current clinical practice in genetic testing for hereditary ataxia. We suggest that the diagnostic rate for hereditary ataxia could be increased by removing the age-of-onset partition, and through a modified screening for repeat expansions in naturally-occurring STRs within known ataxia-associated genes, in effect treating these regions as candidate pathogenic loci.
