HLA predictions from long sequence read alignments, streamed directly into HLAminer
The rapidly changing landscape of sequencing technologies brings new
opportunities to genomics research. Longer sequence reads and higher sequence
throughput, coupled with ever-improving base accuracy and decreasing per-base
cost, are now making long reads suitable for analyzing polymorphic regions of the
human genome, such as those of the human leucocyte antigen (HLA) gene complex.
Here I present a simple protocol for predicting HLA signatures from whole
genome shotgun (WGS) long sequencing reads, by directly streaming sequence
alignments into HLAminer. The method is as simple as running minimap2; it
scales with the number of sequences to align and can be used with any read
aligner capable of SAM format output, without the need to store bulky alignment
files to disk. I show how the predictions are robust even with older and less
[base] accurate WGS nanopore datasets and relatively low (10X) sequence
coverage and present a step-by-step protocol to predict HLA class I and II
genes from the long sequencing reads of modern third-generation technologies. Comment: 4 pages, 3 tables
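The streaming idea in this abstract (consume SAM records as the aligner emits them, without writing an alignment file) can be illustrated with a minimal Python sketch. It is not HLAminer's code: the function name, the toy reference names, and the per-reference tally are assumptions made for illustration only. In practice the stream could be the standard output of any aligner that writes SAM.

```python
import io
from collections import Counter

def count_alignments_per_reference(sam_stream):
    """Tally mapped reads per reference name from a SAM text stream.

    Records are consumed one line at a time, so nothing is stored on disk.
    """
    counts = Counter()
    for line in sam_stream:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory fields
            continue
        flag = int(fields[1])
        rname = fields[2]
        if flag & 0x4 or rname == "*":    # 0x4 = read unmapped
            continue
        counts[rname] += 1
    return counts

# Toy SAM text standing in for an aligner's stdout (reference names are hypothetical).
TOY_SAM = (
    "@SQ\tSN:HLA-A_example\tLN:1000\n"
    "read1\t0\tHLA-A_example\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read2\t16\tHLA-A_example\t50\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "read4\t0\tHLA-B_example\t5\t30\t4M\t*\t0\t0\tACGT\tIIII\n"
)

toy_counts = count_alignments_per_reference(io.StringIO(TOY_SAM))
print(dict(toy_counts))
```

With a real aligner, the same function could be given the `stdout` handle of a `subprocess.Popen` call opened in text mode, so that alignments are processed as they arrive.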
ntLink: a toolkit for de novo genome assembly scaffolding and mapping using long reads
With the increasing affordability and accessibility of genome sequencing
data, de novo genome assembly is an important first step to a wide variety of
downstream studies and analyses. Therefore, bioinformatics tools that enable
the generation of high-quality genome assemblies in a computationally efficient
manner are essential. Recent developments in long-read sequencing technologies
have greatly benefited genome assembly work, including scaffolding, by
providing long-range evidence that can aid in resolving the challenging
repetitive regions of complex genomes. ntLink is a flexible and
resource-efficient genome scaffolding tool that utilizes long-read sequencing
data to improve upon draft genome assemblies built from any sequencing
technologies, including the same long reads. Instead of using read alignments
to identify candidate joins, ntLink utilizes minimizer-based mappings to infer
how input sequences should be ordered and oriented into scaffolds. Recent
improvements to ntLink have added important features such as overlap detection,
gap-filling and in-code scaffolding iterations. Here, we present three basic
protocols demonstrating how to use each of these new features to yield highly
contiguous genome assemblies, while still maintaining ntLink's proven
computational efficiency. Further, as we illustrate in the alternate protocols,
the lightweight minimizer-based mappings that enable ntLink scaffolding can
also be utilized for other downstream applications, such as misassembly
detection. With its modularity and multiple modes of execution, ntLink has
broad benefit to the genomics community, from genome scaffolding and beyond.
ntLink is an open-source project and is freely available from
https://github.com/bcgsc/ntLink. Comment: 23 pages, 2 figures
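To make the notion of minimizer-based mapping concrete, here is a minimal Python sketch of window minimizers. It is an illustration under simplifying assumptions, not ntLink's implementation: k-mers are ordered lexicographically (real tools typically order by hash value and use canonical k-mers), and the function names and toy sequences are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)

def shared_minimizers(a, b, k, w):
    """Minimizer k-mers present in both sequences: a crude mapping signal."""
    return {m for _, m in minimizers(a, k, w)} & {m for _, m in minimizers(b, k, w)}

contig = "ACGTTGCA"   # toy draft-assembly sequence
read = "CGTTGCAT"     # toy long read overlapping the contig
contig_minimizers = minimizers(contig, k=3, w=3)
common = shared_minimizers(contig, read, k=3, w=3)
print(contig_minimizers)
print(sorted(common))
```

Sequences that share many minimizers in a consistent order are likely to overlap, which is why such a sparse sketch can stand in for full read alignment when looking for candidate joins.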
Targeted Assembly of Short Sequence Reads
As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants, by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphism, fusion and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
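The read-recruitment step described in this abstract (keep only reads that exactly match a short word from the target) can be sketched in Python as follows. This is an illustrative simplification, not TASR's code: the word length, the handling of both strands, and all names are assumptions, and the stringent assembly step that follows recruitment is not shown.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def words(seq, k):
    """All distinct k-length words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def recruit_reads(target, reads, k):
    """Keep reads that share at least one exact k-mer with the target.

    Both strands of the target are indexed so that reads sequenced from
    either strand are recruited.
    """
    target_words = words(target, k) | words(reverse_complement(target), k)
    return [r for r in reads if words(r, k) & target_words]

target = "ACGTACGGTCA"                              # toy target sequence
reads = ["TTACGTACAA", "GGGGGGGGGG", "TGACCGTA"]    # toy read set
recruited = recruit_reads(target, reads, k=5)
print(recruited)
```

Because only the target's words are indexed, memory use depends on the size of the target rather than on the size of the read set, which is what makes this kind of filtering practical on very large data sets.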
Activation of an Endogenous Retrovirus-Associated Long Non-Coding RNA in Human Adenocarcinoma
Background
Long non-coding RNAs (lncRNAs) are emerging as molecules that significantly impact many cellular processes and have been associated with almost every human cancer. Compared to protein-coding genes, lncRNA genes are often associated with transposable elements, particularly with endogenous retroviral elements (ERVs). ERVs can have potentially deleterious effects on genome structure and function, so these elements are typically silenced in normal somatic tissues, albeit with varying efficiency. The aberrant regulation of ERVs associated with lncRNAs (ERV-lncRNAs), coupled with the diverse range of lncRNA functions, creates significant potential for ERV-lncRNAs to impact cancer biology.
Methods
We used RNA-seq analysis to identify and profile the expression of a novel lncRNA in six large cohorts, including over 7,500 samples from The Cancer Genome Atlas (TCGA).
Results
We identified the tumor-specific expression of a novel lncRNA that we have named Endogenous retroViral-associated ADenocarcinoma RNA or ‘EVADR’, by analyzing RNA-seq data derived from colorectal tumors and matched normal control tissues. Subsequent analysis of TCGA RNA-seq data revealed the striking association of EVADR with adenocarcinomas, which are tumors of glandular origin. Moderate to high levels of EVADR were detected in 25 to 53% of colon, rectal, lung, pancreas and stomach adenocarcinomas (mean = 30 to 144 FPKM), and EVADR expression correlated with decreased patient survival (Cox regression; hazard ratio = 1.47, 95% confidence interval = 1.06 to 2.04, P = 0.02). In tumor sites of non-glandular origin, EVADR expression was detectable at only very low levels and in less than 10% of patients. For EVADR, a MER48 ERV element provides an active promoter to drive its transcription. Genome-wide, MER48 insertions are associated with nine lncRNAs, but none of the MER48-associated lncRNAs other than EVADR were consistently expressed in adenocarcinomas, demonstrating the specific activation of EVADR. The sequence and structure of the EVADR locus is highly conserved among Old World monkeys and apes but not New World monkeys or prosimians, where the MER48 insertion is absent. Conservation of the EVADR locus suggests a functional role for this novel lncRNA in humans and our closest primate relatives.
Conclusions
Our results describe the specific activation of a highly conserved ERV-lncRNA in numerous cancers of glandular origin, a finding with diagnostic, prognostic and therapeutic implications.
The Sensitivity of Massively Parallel Sequencing for Detecting Candidate Infectious Agents Associated with Human Tissue
Massively parallel sequencing technology now provides the opportunity to sample the transcriptome of a given tissue comprehensively. Transcripts at only a few copies per cell are readily detectable, allowing the discovery of low abundance viral and bacterial transcripts in human tissue samples. Here we describe an approach for mining large sequence data sets for the presence of microbial sequences. Further, we demonstrate the sensitivity of this approach by sequencing human RNA-seq libraries spiked with decreasing amounts of an RNA virus. At a modest depth of sequencing, viral transcripts can be detected at frequencies less than 1 in 1,000,000. With current sequencing platforms approaching outputs of one billion reads per run, this is a highly sensitive method for detecting putative infectious agents associated with human tissues.
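The relationship between sequencing depth and detection sensitivity can be made concrete with a small worked calculation. The Python sketch below assumes that reads are sampled independently, so the number of reads from a rare transcript is approximately Poisson distributed. That model, the function names, and the example depth of 10 million reads are assumptions for illustration and are not taken from the study.

```python
import math

def expected_target_reads(total_reads, frequency):
    """Expected number of reads from a transcript present at the given frequency."""
    return total_reads * frequency

def prob_at_least_one(total_reads, frequency):
    """Poisson approximation of the chance of observing one or more such reads."""
    return 1.0 - math.exp(-total_reads * frequency)

depth = 10_000_000   # hypothetical number of reads in a run
freq = 1e-6          # transcript frequency of 1 in 1,000,000 reads
expected = expected_target_reads(depth, freq)
p_detect = prob_at_least_one(depth, freq)
print(f"expected reads: {expected:.1f}, P(at least one read): {p_detect:.5f}")
```

Under these assumptions, a transcript at 1 in 1,000,000 yields about ten reads in a 10-million-read run, and proportionally more as run output grows.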
Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach
BACKGROUND: High-throughput sequencing-by-synthesis is an emerging technology that allows the rapid production of millions of bases of data. Although the sequence reads are short, they can readily be used for re-sequencing. By re-sequencing the mRNA products of a cell, one may rapidly discover polymorphisms and splice variants particular to that cell. RESULTS: We present the utility of massively parallel sequencing-by-synthesis for profiling the transcriptome of a human prostate cancer cell line, LNCaP, that has been treated with the synthetic androgen R1881. Through the generation of approximately 20 megabases (MB) of EST data, we detect transcription from over 10,000 gene loci, 25 previously undescribed alternative splicing events involving known exons, and over 1,500 high-quality single nucleotide discrepancies with the reference human sequence. Further, we map nearly 10,000 ESTs to positions on the genome where no transcription is currently predicted to occur. We also characterize various obstacles to using sequencing-by-synthesis for transcriptome analysis and propose solutions to these problems. CONCLUSION: The use of high-throughput sequencing-by-synthesis methods for transcript profiling allows the specific and sensitive detection of many of a cell's transcripts, and also allows the discovery of high-quality base discrepancies and alternative splice variants. Thus, this technology may provide an effective means of understanding various disease states, discovering novel targets for disease treatment, and discovering novel transcripts.
3D genomics across the tree of life reveals condensin II as a determinant of architecture type
We investigated genome folding across the eukaryotic tree of life. We find two types of three-dimensional (3D) genome architectures at the chromosome scale. Each type appears and disappears repeatedly during eukaryotic evolution. The type of genome architecture that an organism exhibits correlates with the absence of condensin II subunits. Moreover, condensin II depletion converts the architecture of the human genome to a state resembling that seen in organisms such as fungi or mosquitoes. In this state, centromeres cluster together at nucleoli, and heterochromatin domains merge. We propose a physical model in which lengthwise compaction of chromosomes by condensin II during mitosis determines chromosome-scale genome architecture, with effects that are retained during the subsequent interphase. This mechanism likely has been conserved since the last common ancestor of all eukaryotes. C.H. is supported by the Boehringer Ingelheim Fonds; C.H., Á.S.C., and B.D.R. are supported by an ERC CoG (772471, “CohesinLooping”); A.M.O.E. and B.D.R. are supported by the Dutch Research Council (NWO-Echo); and J.A.R. and R.H.M. are supported by the Dutch Cancer Society (KWF). T.v.S. and B.v.S. are supported by NIH Common Fund “4D Nucleome” Program grant U54DK107965. H.T. and E.d.W. are supported by an ERC StG (637597, “HAP-PHEN”). J.A.R., T.v.S., H.T., R.H.M., B.v.S., and E.d.W. are part of the Oncode Institute, which is partly financed by the Dutch Cancer Society. Work at the Center for Theoretical Biological Physics is sponsored by the NSF (grants PHY-2019745 and CHE-1614101) and by the Welch Foundation (grant C-1792). V.G.C. is funded by FAPESP (São Paulo State Research Foundation and Higher Education Personnel) grants 2016/13998-8 and 2017/09662-7. J.N.O. is a CPRIT Scholar in Cancer Research. E.L.A.
was supported by an NSF Physics Frontiers Center Award (PHY-2019745), the Welch Foundation (Q-1866), a USDA Agriculture and Food Research Initiative grant (2017-05741), the Behavioral Plasticity Research Institute (NSF DBI-2021795), and an NIH Encyclopedia of DNA Elements Mapping Center Award (UM1HG009375). Hi-C data for the 24 species were created by the DNA Zoo Consortium (www.dnazoo.org). DNA Zoo is supported by Illumina, Inc.; IBM; and the Pawsey Supercomputing Center. P.K. is supported by the University of Western Australia. L.L.M. was supported by NIH (1R01NS114491) and NSF awards (1557923, 1548121, and 1645219) and the Human Frontiers Science Program (RGP0060/2017). The draft A. californica project was supported by NHGRI. J.L.G.-S. received funding from the ERC (grant agreement no. 740041), the Spanish Ministerio de Economía y Competitividad (grant no. BFU2016-74961-P), and the institutional grant Unidad de Excelencia María de Maeztu (MDM-2016-0687). R.D.K. is supported by NIH grant RO1DK121366. V.H. is supported by NIH grant NIH1P41HD071837. K.M. is supported by a MEXT grant (20H05936). M.C.W. is supported by the NIH grants R01AG045183, R01AT009050, R01AG062257, and DP1DK113644 and by the Welch Foundation. E.F. was supported by NHGR
A Giant Planet Candidate Transiting a White Dwarf
Astronomers have discovered thousands of planets outside the solar system,
most of which orbit stars that will eventually evolve into red giants and then
into white dwarfs. During the red giant phase, any close-orbiting planets will
be engulfed by the star, but more distant planets can survive this phase and
remain in orbit around the white dwarf. Some white dwarfs show evidence for
rocky material floating in their atmospheres, in warm debris disks, or orbiting
very closely, which has been interpreted as the debris of rocky planets that
were scattered inward and tidally disrupted. Recently, the discovery of a
gaseous debris disk with a composition similar to ice giant planets
demonstrated that massive planets might also find their way into tight orbits
around white dwarfs, but it is unclear whether the planets can survive the
journey. So far, the detection of intact planets in close orbits around white
dwarfs has remained elusive. Here, we report the discovery of a giant planet
candidate transiting the white dwarf WD 1856+534 (TIC 267574918) every 1.4
days. The planet candidate is roughly the same size as Jupiter and is no more
than 14 times as massive (with 95% confidence). Other cases of white dwarfs
with close brown dwarf or stellar companions are explained as the consequence
of common-envelope evolution, wherein the original orbit is enveloped during
the red-giant phase and shrinks due to friction. In this case, though, the low
mass and relatively long orbital period of the planet candidate make
common-envelope evolution less likely. Instead, the WD 1856+534 system seems to
demonstrate that giant planets can be scattered into tight orbits without being
tidally disrupted, and motivates searches for smaller transiting planets around
white dwarfs. Comment: 50 pages, 12 figures, 2 tables. Published in Nature on Sept. 17,
2020. The final authenticated version is available online at:
https://www.nature.com/articles/s41586-020-2713-
Rare and low-frequency coding variants alter human adult height
Height is a highly heritable, classic polygenic trait with ~700 common associated variants identified so far through genome-wide association studies. Here, we report 83 height-associated coding variants with lower minor allele frequencies (range of 0.1-4.8%) and effects of up to 2 cm/allele (e.g. in IHH, STC2, AR and CRISPLD2), >10 times the average effect of common variants. In functional follow-up studies, rare height-increasing alleles of STC2 (+1-2 cm/allele) compromised proteolytic inhibition of PAPP-A and increased cleavage of IGFBP-4 in vitro, resulting in higher bioavailability of insulin-like growth factors. These 83 height-associated variants overlap genes mutated in monogenic growth disorders and highlight new biological candidates (e.g. ADAMTS3, IL11RA, NOX4) and pathways (e.g. proteoglycan/glycosaminoglycan synthesis) involved in growth. Our results demonstrate that sufficiently large sample sizes can uncover rare and low-frequency variants of moderate to large effect associated with polygenic human phenotypes, and that these variants implicate relevant genes and pathways.