589 research outputs found

    HLA predictions from long sequence read alignments, streamed directly into HLAminer

    The rapidly changing landscape of sequencing technologies brings new opportunities to genomics research. Longer sequence reads and higher sequence throughput, coupled with ever-improving base accuracy and decreasing per-base cost, are now making long reads suitable for analyzing polymorphic regions of the human genome, such as those of the human leucocyte antigen (HLA) gene complex. Here I present a simple protocol for predicting HLA signatures from whole genome shotgun (WGS) long sequencing reads, by directly streaming sequence alignments into HLAminer. The method is as simple as running minimap2, it scales with the number of sequences to align, and it can be used with any read aligner capable of SAM format output without the need to store bulky alignment files to disk. I show how the predictions are robust even with older and less [base] accurate WGS nanopore datasets and relatively low (10X) sequence coverage, and present a step-by-step protocol to predict HLA class I and II genes from the long sequencing reads of modern third-generation technologies. Comment: 4 pages, 3 tables
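    The streaming idea in this abstract (consuming SAM records as the aligner emits them, rather than writing an alignment file first) can be illustrated with a minimal sketch. This is not HLAminer's logic: the function name `count_alignments` and the reference names in the example are hypothetical, and the code only tallies mapped records per reference sequence from any iterable of SAM lines.

```python
from collections import Counter


def count_alignments(sam_lines):
    """Tally mapped SAM records per reference name from a line stream.

    Hypothetical helper for illustration only; it is not part of HLAminer.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference sequence name
        # FLAG bit 0x4 marks an unmapped read; '*' means no reference.
        if flag & 0x4 or rname == "*":
            continue
        counts[rname] += 1
    return counts


# Example with made-up records (reference names are placeholders).
example = [
    "@HD\tVN:1.6\n",
    "read1\t0\tHLA-A_example\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "read2\t16\tHLA-A_example\t42\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n",
]
result = count_alignments(example)
```

    Because the function accepts any iterable of lines, passing `sys.stdin` would let it sit at the end of a shell pipe, so records are processed one at a time and nothing needs to be stored on disk.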

    ntLink: a toolkit for de novo genome assembly scaffolding and mapping using long reads

    With the increasing affordability and accessibility of genome sequencing data, de novo genome assembly is an important first step to a wide variety of downstream studies and analyses. Therefore, bioinformatics tools that enable the generation of high-quality genome assemblies in a computationally efficient manner are essential. Recent developments in long-read sequencing technologies have greatly benefited genome assembly work, including scaffolding, by providing long-range evidence that can aid in resolving the challenging repetitive regions of complex genomes. ntLink is a flexible and resource-efficient genome scaffolding tool that utilizes long-read sequencing data to improve upon draft genome assemblies built from any sequencing technologies, including the same long reads. Instead of using read alignments to identify candidate joins, ntLink utilizes minimizer-based mappings to infer how input sequences should be ordered and oriented into scaffolds. Recent improvements to ntLink have added important features such as overlap detection, gap-filling and in-code scaffolding iterations. Here, we present three basic protocols demonstrating how to use each of these new features to yield highly contiguous genome assemblies, while still maintaining ntLink's proven computational efficiency. Further, as we illustrate in the alternate protocols, the lightweight minimizer-based mappings that enable ntLink scaffolding can also be utilized for other downstream applications, such as misassembly detection. With its modularity and multiple modes of execution, ntLink has broad benefit to the genomics community, from genome scaffolding and beyond. ntLink is an open-source project and is freely available from https://github.com/bcgsc/ntLink. Comment: 23 pages, 2 figures
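    The minimizer idea mentioned in this abstract can be sketched in a few lines. The following is a textbook-style illustration, not ntLink's implementation: it orders k-mers lexicographically and looks at the forward strand only, whereas production tools typically hash k-mers and account for both strands. The function names and the parameter values `k` and `w` are assumptions chosen for the example.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer in each
    window of w consecutive k-mers (lexicographic order, leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer in this window; ties go to the left.
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def shared_minimizers(a, b, k=5, w=4):
    """Minimizer k-mers present in both sequences (a crude mapping signal)."""
    return {m for _, m in minimizers(a, k, w)} & {m for _, m in minimizers(b, k, w)}


sketch = minimizers("ACGTACGT", k=3, w=2)
```

    Two sequences that overlap tend to select the same minimizers in the shared region, which is why comparing these small sketches can stand in for a full alignment when deciding which sequences belong next to each other.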

    Targeted Assembly of Short Sequence Reads

    As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants, by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphism, fusion and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
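    The read-recruitment step described in this abstract (keeping only reads that exactly match a short word from the target) can be sketched as follows. This is an illustrative approximation, not TASR's code: the word length `k`, the function names, and the decision to also index the reverse complement of the target are assumptions, and the stringent assembly step that follows recruitment is not reproduced.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(target, reads, k=15):
    """Keep reads sharing at least one exact k-mer with the target
    (either strand). k=15 is an arbitrary illustrative default."""
    words = kmers(target, k) | kmers(revcomp(target), k)
    return [
        read for read in reads
        if any(read[i:i + k] in words for i in range(len(read) - k + 1))
    ]


target = "ACGGTCATGCAGTCCA"
reads = ["ACGGTCATGC", revcomp("GGTCATGCAG"), "TTTTTTTTTT"]
recruited = recruit_reads(target, reads, k=8)
```

    Indexing the target's words in a set makes each lookup constant time, so the cost of scanning a data set grows with the number of read bases rather than with the size of a whole-genome index.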

    Activation of an Endogenous Retrovirus-Associated Long Non-Coding RNA in Human Adenocarcinoma

    Background Long non-coding RNAs (lncRNAs) are emerging as molecules that significantly impact many cellular processes and have been associated with almost every human cancer. Compared to protein-coding genes, lncRNA genes are often associated with transposable elements, particularly with endogenous retroviral elements (ERVs). ERVs can have potentially deleterious effects on genome structure and function, so these elements are typically silenced in normal somatic tissues, albeit with varying efficiency. The aberrant regulation of ERVs associated with lncRNAs (ERV-lncRNAs), coupled with the diverse range of lncRNA functions, creates significant potential for ERV-lncRNAs to impact cancer biology. Methods We used RNA-seq analysis to identify and profile the expression of a novel lncRNA in six large cohorts, including over 7,500 samples from The Cancer Genome Atlas (TCGA). Results We identified the tumor-specific expression of a novel lncRNA that we have named Endogenous retroViral-associated ADenocarcinoma RNA or ‘EVADR’, by analyzing RNA-seq data derived from colorectal tumors and matched normal control tissues. Subsequent analysis of TCGA RNA-seq data revealed the striking association of EVADR with adenocarcinomas, which are tumors of glandular origin. Moderate to high levels of EVADR were detected in 25 to 53% of colon, rectal, lung, pancreas and stomach adenocarcinomas (mean = 30 to 144 FPKM), and EVADR expression correlated with decreased patient survival (Cox regression; hazard ratio = 1.47, 95% confidence interval = 1.06 to 2.04, P = 0.02). In tumor sites of non-glandular origin, EVADR expression was detectable at only very low levels and in less than 10% of patients. For EVADR, a MER48 ERV element provides an active promoter to drive its transcription. 
Genome-wide, MER48 insertions are associated with nine lncRNAs, but none of the MER48-associated lncRNAs other than EVADR were consistently expressed in adenocarcinomas, demonstrating the specific activation of EVADR. The sequence and structure of the EVADR locus are highly conserved among Old World monkeys and apes but not New World monkeys or prosimians, where the MER48 insertion is absent. Conservation of the EVADR locus suggests a functional role for this novel lncRNA in humans and our closest primate relatives. Conclusions Our results describe the specific activation of a highly conserved ERV-lncRNA in numerous cancers of glandular origin, a finding with diagnostic, prognostic and therapeutic implications.

    The Sensitivity of Massively Parallel Sequencing for Detecting Candidate Infectious Agents Associated with Human Tissue

    Massively parallel sequencing technology now provides the opportunity to sample the transcriptome of a given tissue comprehensively. Transcripts at only a few copies per cell are readily detectable, allowing the discovery of low-abundance viral and bacterial transcripts in human tissue samples. Here we describe an approach for mining large sequence data sets for the presence of microbial sequences. Further, we demonstrate the sensitivity of this approach by sequencing human RNA-seq libraries spiked with decreasing amounts of an RNA virus. At a modest depth of sequencing, viral transcripts can be detected at frequencies less than 1 in 1,000,000. With current sequencing platforms approaching outputs of one billion reads per run, this is a highly sensitive method for detecting putative infectious agents associated with human tissues.
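    To make the relationship between sequencing depth and detectable frequency concrete, here is a back-of-the-envelope sketch. It assumes reads are sampled independently, so the number of reads from a transcript present at a given frequency is approximately Poisson distributed. That model, the function name, and the example depths are assumptions for illustration and are not taken from the study's own analysis.

```python
import math


def detection_probability(total_reads, frequency, min_reads=1):
    """Probability of observing at least `min_reads` reads from a transcript
    present at `frequency` (fraction of all reads), under a Poisson model."""
    expected = total_reads * frequency
    # P(X < min_reads) for X ~ Poisson(expected)
    p_below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - p_below


# A transcript at 1 in 1,000,000 reads, at two illustrative depths.
p_one_million = detection_probability(1_000_000, 1e-6)
p_ten_million = detection_probability(10_000_000, 1e-6)
```

    Under this model, the expected number of reads from the transcript is simply depth times frequency, so each tenfold increase in depth lowers the frequency detectable with the same confidence by roughly a factor of ten.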

    Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach

    BACKGROUND: High-throughput sequencing-by-synthesis is an emerging technology that allows the rapid production of millions of bases of data. Although the sequence reads are short, they can readily be used for re-sequencing. By re-sequencing the mRNA products of a cell, one may rapidly discover polymorphisms and splice variants particular to that cell. RESULTS: We present the utility of massively parallel sequencing-by-synthesis for profiling the transcriptome of a human prostate cancer cell line, LNCaP, that has been treated with the synthetic androgen R1881. Through the generation of approximately 20 megabases (MB) of EST data, we detect transcription from over 10,000 gene loci, 25 previously undescribed alternative splicing events involving known exons, and over 1,500 high-quality single nucleotide discrepancies with the reference human sequence. Further, we map nearly 10,000 ESTs to positions on the genome where no transcription is currently predicted to occur. We also characterize various obstacles to using sequencing-by-synthesis for transcriptome analysis and propose solutions to these problems. CONCLUSION: The use of high-throughput sequencing-by-synthesis methods for transcript profiling allows the specific and sensitive detection of many of a cell's transcripts, and also allows the discovery of high-quality base discrepancies and alternative splice variants. Thus, this technology may provide an effective means of understanding various disease states, discovering novel targets for disease treatment, and discovering novel transcripts.

    3D genomics across the tree of life reveals condensin II as a determinant of architecture type

    We investigated genome folding across the eukaryotic tree of life. We find two types of three-dimensional (3D) genome architectures at the chromosome scale. Each type appears and disappears repeatedly during eukaryotic evolution. The type of genome architecture that an organism exhibits correlates with the absence of condensin II subunits. Moreover, condensin II depletion converts the architecture of the human genome to a state resembling that seen in organisms such as fungi or mosquitoes. In this state, centromeres cluster together at nucleoli, and heterochromatin domains merge. We propose a physical model in which lengthwise compaction of chromosomes by condensin II during mitosis determines chromosome-scale genome architecture, with effects that are retained during the subsequent interphase. This mechanism likely has been conserved since the last common ancestor of all eukaryotes.
    C.H. is supported by the Boehringer Ingelheim Fonds; C.H., Á.S.C., and B.D.R. are supported by an ERC CoG (772471, “CohesinLooping”); A.M.O.E. and B.D.R. are supported by the Dutch Research Council (NWO-Echo); and J.A.R. and R.H.M. are supported by the Dutch Cancer Society (KWF). T.v.S. and B.v.S. are supported by NIH Common Fund “4D Nucleome” Program grant U54DK107965. H.T. and E.d.W. are supported by an ERC StG (637597, “HAP-PHEN”). J.A.R., T.v.S., H.T., R.H.M., B.v.S., and E.d.W. are part of the Oncode Institute, which is partly financed by the Dutch Cancer Society. Work at the Center for Theoretical Biological Physics is sponsored by the NSF (grants PHY-2019745 and CHE-1614101) and by the Welch Foundation (grant C-1792). V.G.C. is funded by FAPESP (São Paulo State Research Foundation and Higher Education Personnel) grants 2016/13998-8 and 2017/09662-7. J.N.O. is a CPRIT Scholar in Cancer Research. E.L.A. was supported by an NSF Physics Frontiers Center Award (PHY-2019745), the Welch Foundation (Q-1866), a USDA Agriculture and Food Research Initiative grant (2017-05741), the Behavioral Plasticity Research Institute (NSF DBI-2021795), and an NIH Encyclopedia of DNA Elements Mapping Center Award (UM1HG009375). Hi-C data for the 24 species were created by the DNA Zoo Consortium (www.dnazoo.org). DNA Zoo is supported by Illumina, Inc.; IBM; and the Pawsey Supercomputing Center. P.K. is supported by the University of Western Australia. L.L.M. was supported by NIH (1R01NS114491) and NSF awards (1557923, 1548121, and 1645219) and the Human Frontiers Science Program (RGP0060/2017). The draft A. californica project was supported by NHGRI. J.L.G.-S. received funding from the ERC (grant agreement no. 740041), the Spanish Ministerio de Economía y Competitividad (grant no. BFU2016-74961-P), and the institutional grant Unidad de Excelencia María de Maeztu (MDM-2016-0687). R.D.K. is supported by NIH grant RO1DK121366. V.H. is supported by NIH grant NIH1P41HD071837. K.M. is supported by a MEXT grant (20H05936). M.C.W. is supported by the NIH grants R01AG045183, R01AT009050, R01AG062257, and DP1DK113644 and by the Welch Foundation. E.F. was supported by NHGR

    A Giant Planet Candidate Transiting a White Dwarf

    Astronomers have discovered thousands of planets outside the solar system, most of which orbit stars that will eventually evolve into red giants and then into white dwarfs. During the red giant phase, any close-orbiting planets will be engulfed by the star, but more distant planets can survive this phase and remain in orbit around the white dwarf. Some white dwarfs show evidence for rocky material floating in their atmospheres, in warm debris disks, or orbiting very closely, which has been interpreted as the debris of rocky planets that were scattered inward and tidally disrupted. Recently, the discovery of a gaseous debris disk with a composition similar to ice giant planets demonstrated that massive planets might also find their way into tight orbits around white dwarfs, but it is unclear whether the planets can survive the journey. So far, the detection of intact planets in close orbits around white dwarfs has remained elusive. Here, we report the discovery of a giant planet candidate transiting the white dwarf WD 1856+534 (TIC 267574918) every 1.4 days. The planet candidate is roughly the same size as Jupiter and is no more than 14 times as massive (with 95% confidence). Other cases of white dwarfs with close brown dwarf or stellar companions are explained as the consequence of common-envelope evolution, wherein the original orbit is enveloped during the red-giant phase and shrinks due to friction. In this case, though, the low mass and relatively long orbital period of the planet candidate make common-envelope evolution less likely. Instead, the WD 1856+534 system seems to demonstrate that giant planets can be scattered into tight orbits without being tidally disrupted, and motivates searches for smaller transiting planets around white dwarfs. Comment: 50 pages, 12 figures, 2 tables. Published in Nature on Sept. 17, 2020. The final authenticated version is available online at: https://www.nature.com/articles/s41586-020-2713-

    Rare and low-frequency coding variants alter human adult height

    Height is a highly heritable, classic polygenic trait with ~700 common associated variants identified so far through genome-wide association studies. Here, we report 83 height-associated coding variants with lower minor allele frequencies (range of 0.1-4.8%) and effects of up to 2 cm/allele (e.g. in IHH, STC2, AR and CRISPLD2), >10 times the average effect of common variants. In functional follow-up studies, rare height-increasing alleles of STC2 (+1-2 cm/allele) compromised proteolytic inhibition of PAPP-A and increased cleavage of IGFBP-4 in vitro, resulting in higher bioavailability of insulin-like growth factors. These 83 height-associated variants overlap genes mutated in monogenic growth disorders and highlight new biological candidates (e.g. ADAMTS3, IL11RA, NOX4) and pathways (e.g. proteoglycan/glycosaminoglycan synthesis) involved in growth. Our results demonstrate that sufficiently large sample sizes can uncover rare and low-frequency variants of moderate to large effect associated with polygenic human phenotypes, and that these variants implicate relevant genes and pathways.