A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

Author: Bansal Vikas
Publication venue: eScholarship, University of California
Publication date: 01/03/2017

Background: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data, and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments. Results: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates, likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression, and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. Conclusions: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates
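The intuition behind using heterozygous variants can be illustrated with a deliberately simplified sketch. It is not the estimator from the paper or from the linked repository. It assumes duplicate clusters of exactly two reads that overlap a heterozygous site, equal sampling of both alleles, and no sequencing errors. Under those assumptions, two reads carrying different alleles must come from independent fragments, while natural duplicate pairs carry different alleles only half of the time, so the fraction of PCR duplicate pairs is roughly 1 - 2 × (fraction of discordant pairs). The function name and input layout are hypothetical.

```python
def estimate_pcr_pair_fraction(allele_pairs):
    """Estimate the fraction of two-read duplicate clusters that are PCR duplicates.

    allele_pairs: list of (allele_read1, allele_read2) tuples, one per duplicate
    cluster of size two overlapping a heterozygous site (hypothetical input format).

    Assumptions (illustrative only): both alleles are sampled with probability 1/2,
    and there are no sequencing errors or allelic imbalance.
    """
    if not allele_pairs:
        raise ValueError("need at least one duplicate pair")
    # Pairs with different alleles cannot be PCR copies of one fragment.
    discordant = sum(1 for a, b in allele_pairs if a != b)
    frac_discordant = discordant / len(allele_pairs)
    # Natural duplicate pairs are discordant half of the time, so the natural
    # fraction is about 2 * frac_discordant; the remainder is attributed to PCR.
    return max(0.0, 1.0 - 2.0 * frac_discordant)


if __name__ == "__main__":
    pairs = [("A", "A")] * 8 + [("A", "G")] * 2
    print(estimate_pcr_pair_fraction(pairs))
```

With real data, larger clusters, sequencing errors, and allele-specific expression would all need to be modelled, which this sketch does not attempt.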

PubMed Central

eScholarship - University of California

A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

Author: A Adey
A Auton
A Mortazavi
AM Mezlini
CS Chilamakuri
D Aird
DR Bentley
EN Smith
H Li
I Kozarewa
IF Bronner
JA Casbon
KD Hansen
MA DePristo
MN Bainbridge
N Whiteford
PA ’t Hoen
S Islam
SR Head
T Daley
T Kivioja
T Lappalainen
Vikas Bansal
W Zhou
Y Chen
Y Kukita
Z Wang
Publication venue: 'Springer Science and Business Media LLC'

Crossref

UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy.

Author: Heger A.
Smith T.S.
Sudbery I.
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 01/01/2017

Unique Molecular Identifiers (UMIs) are random oligonucleotide barcodes that are increasingly used in high-throughput sequencing experiments. Through a UMI, identical copies arising from distinct molecules can be distinguished from those arising through PCR amplification of the same molecule. However, bioinformatic methods to leverage the information from UMIs have yet to be formalised. In particular, sequencing errors in the UMI sequence are often ignored, or else resolved in an ad hoc manner. We show that errors in the UMI sequence are common and introduce network-based methods to account for these errors when identifying PCR duplicates. Using these methods, we demonstrate improved quantification accuracy both under simulated conditions and in real iCLIP and single-cell RNA-seq datasets. Reproducibility between iCLIP replicates and single-cell RNA-seq clustering are both improved using our proposed network-based method, demonstrating the value of properly accounting for errors in UMIs. These methods are implemented in the open-source UMI-tools software package.
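One way to picture a network-based treatment of UMI errors is the minimal sketch below. It is an illustration only and does not reproduce the UMI-tools implementation or its API. It assumes all UMIs come from reads at the same mapping position, that UMIs have equal length, and that two UMIs within one mismatch of each other may derive from the same molecule. UMIs are treated as nodes, an edge joins any two UMIs within the mismatch threshold, and each connected component is counted as one molecule. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_molecules(umis, max_dist=1):
    """Count connected components in the UMI network at one mapping position.

    umis: iterable of UMI strings observed at a single position (assumed equal length).
    max_dist: maximum Hamming distance for two UMIs to be linked (assumed threshold).
    """
    nodes = sorted(set(umis))
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        # Depth-first search over UMIs reachable through single-error edges.
        stack = [start]
        seen.add(start)
        while stack:
            current = stack.pop()
            for other in nodes:
                if other not in seen and hamming(current, other) <= max_dist:
                    seen.add(other)
                    stack.append(other)
    return components


if __name__ == "__main__":
    observed = ["ACGT", "ACGT", "ACGA", "TTTT", "TTTA", "GGGG"]
    print(count_molecules(observed))
```

Counting every distinct UMI would report five molecules for the example input, whereas merging UMIs within one mismatch reports three. Refinements that use UMI read counts to decide which nodes to merge are beyond this sketch.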

PubMed Central

Oxford University Research Archive

White Rose Research Online

Comparative analysis of RNA sequencing methods for degraded or low-input samples

Author: A Roberts
Aaron M Berlin
AL Beyer
Alec Wysoker
Andreas Gnirke
Andrey Sivachenko
Aviv Regev
B Langmead
B Li
BE Maden
C Trapnell
D Aird
D Ramsköld
David S DeLuca
Dawn Anne Thompson
Diego Borges-Rivera
DS DeLuca
F Tang
G Giannoukos
H Aviv
H Li
H Yi
JD Morlan
Joshua Z Levin
JZ Levin
L Yang
M Griffin
MA Tariq
Michele A Busby
Nathalie Pochet
R Huang
R Rosenkranz
Rahul Satija
S Islam
Timothy Fennell
TR Dreszer
X Pan
Xian Adiconis
YH Yang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/02/2013

Available in PMC 2014 January 01.

RNA-seq is an effective method for studying the transcriptome, but it can be difficult to apply to scarce or degraded RNA from fixed clinical samples, rare cell populations or cadavers. Recent studies have proposed several methods for RNA-seq of low-quality and/or low-quantity samples, but the relative merits of these methods have not been systematically analyzed. Here we compare five such methods using metrics relevant to transcriptome annotation, transcript discovery and gene expression. Using a single human RNA sample, we constructed and sequenced ten libraries with these methods and compared them against two control libraries. We found that the RNase H method performed best for chemically fragmented, low-quality RNA, and we confirmed this through analysis of actual degraded samples. RNase H can even effectively replace oligo(dT)-based methods for standard RNA-seq. SMART and NuGEN had distinct strengths for measuring low-quantity RNA. Our analysis allows biologists to select the most suitable methods and provides a benchmark for future method development.

Funding: National Institutes of Health (U.S.) (Pioneer Award DP1-OD003958-01); National Human Genome Research Institute (U.S.) (NHGRI 1P01HG005062-01); National Human Genome Research Institute (U.S.) (NHGRI Center of Excellence in Genome Science Award 1P50HG006193-01); Howard Hughes Medical Institute (Investigator); Merkin Family Foundation for Stem Cell Research; Broad Institute of MIT and Harvard (Klarman Cell Observatory); National Human Genome Research Institute (U.S.) (NHGRI grant HG03067); Fonds voor Wetenschappelijk Onderzoek--Vlaandere

DSpace@MIT

Crossref

PubMed Central

Comprehensive comparative analysis of RNA sequencing methods for degraded or low input samples

Author: Adiconis Xian
Berlin Aaron M.
Borges-Rivera Diego
Busby Michele A.
DeLuca David S.
Fennell Timothy
Gnirke Andreas
Levin Joshua Z.
Pochet Nathalie
Regev Aviv
Satija Rahul
Sivachenko Andrey
Thompson Dawn Anne
Wysoker Alec
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/03/2014

RNA-Seq is an effective method to study the transcriptome, but it can be difficult to apply to scarce or degraded RNA from fixed clinical samples, rare cell populations, or cadavers. Recent studies have proposed several methods for RNA-Seq of low-quality and/or low-quantity samples, but their relative merits have not been systematically analyzed. Here, we compare five such methods using metrics relevant to transcriptome annotation, transcript discovery, and gene expression. Using a single human RNA sample, we constructed and sequenced ten libraries with these methods and two control libraries. We find that the RNase H method performed best for low-quality RNA, and confirmed this with actual degraded samples. RNase H can even effectively replace oligo(dT)-based methods for standard RNA-Seq. SMART and NuGEN had distinct strengths for low-quantity RNA. Our analysis allows biologists to select the most suitable methods and provides a benchmark for future method development.

Harvard University - DASH

Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data

Author: A Barski
A Boyle
A Fejes
A Mortazavi
A Valouev
AC Roman
AFA Smit
B Langmead
B Li
B Pasaniuc
B Ren
B Rhead
Bo Li
C Feschotte
C Spyrou
C Wu
Colin Dewey
D Day
D Huang
D Johnson
D Nix
Dongjun Chung
E Gonzalez
E Portales-Casamar
Emery H. Bresnick
G Bourque
G Dennis
G Tuteja
G Yuan
GJ Faulkner
H Im
H Ji
I Albert
J Bailey
J Bailey
J Dohm
J Jurka
J Rozowsky
J Wang
JBZ Gu
K Blahnik
Kun Liang
L Rowen
M Hurles
M Nicolae
M Taub
P Kuan
P Polak
Pei Fen Kuan
PM Fenwick
PV Kharchenko
R Gentleman
R Jothi
Rajendran Sanalkumar
RK Auerbach
S Cawley
S Kurdistani
Sündüz Keleş
T Bailey
T Evans
T Fujiwara
T Marques-Bonet
T Mikkelsen
T Nicholas
T Wang
TL Bailey
Wyeth W. Wasserman
Y Cheng
Y Seo
Y Zhang
Z Qin
Publication venue: Public Library of Science
Publication date: 01/01/2011

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments
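The idea of allocating multi-reads as fractional counts can be sketched as follows. The abstract does not specify the weighting scheme, so this example makes its own assumption: each candidate location of a multi-read is weighted in proportion to the number of uni-reads already mapped there, plus a pseudocount so that locations without uni-reads can still receive signal. The function name, the location keys, and the pseudocount are all hypothetical and are not taken from the paper.

```python
def allocate_multireads(uni_counts, multi_reads, pseudocount=1.0):
    """Distribute each multi-read across its candidate locations as fractional counts.

    uni_counts: dict mapping location -> number of uniquely mapped reads.
    multi_reads: list of lists; each inner list holds the candidate locations
                 of one multi-read (hypothetical input format).
    pseudocount: positive value added to every weight (assumed; avoids zero weights).

    Returns a dict mapping location -> uni-read count plus fractional multi-read count.
    """
    totals = {loc: float(n) for loc, n in uni_counts.items()}
    for candidates in multi_reads:
        # Weight each candidate by its existing uni-read support.
        weights = [uni_counts.get(loc, 0) + pseudocount for loc in candidates]
        norm = sum(weights)
        for loc, weight in zip(candidates, weights):
            # Each multi-read contributes a total of exactly one count.
            totals[loc] = totals.get(loc, 0.0) + weight / norm
    return totals


if __name__ == "__main__":
    uni = {"chr1:1000": 3, "chr5:2000": 1}
    multi = [["chr1:1000", "chr5:2000"]]
    print(allocate_multireads(uni, multi))
```

Because each multi-read adds one count in total, the overall read depth rises by the number of multi-reads, which matches the abstract's point that including multi-reads increases sequencing depth.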

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Carolina Digital Repository

RNA-Seq optimization with eQTL gold standards.

Author: Arking Dan E
Ashar Foram N
Bader Joel S
Ellis Shannon E
Gupta Simone
West Andrew B
Publication venue: eScholarship, University of California
Publication date: 01/01/2013

Background: RNA-Sequencing (RNA-Seq) experiments have been optimized for library preparation, mapping, and gene expression estimation. These methods, however, have revealed weaknesses in the next stages of analysis of differential expression, with results sensitive to systematic sample stratification or, in more extreme cases, to outliers. Further, a method to assess normalization and adjustment measures imposed on the data is lacking. Results: To address these issues, we utilize previously published eQTLs as a novel gold standard at the center of a framework that integrates DNA genotypes and RNA-Seq data to optimize analysis and aid in the understanding of genetic variation and gene expression. After detecting sample contamination and sequencing outliers in RNA-Seq data, a set of previously published brain eQTLs was used to determine if sample outlier removal was appropriate. Improved replication of known eQTLs supported removal of these samples in downstream analyses. eQTL replication was further employed to assess normalization methods, covariate inclusion, and gene annotation. This method was validated in an independent RNA-Seq blood data set from the GTEx project and a tissue-appropriate set of eQTLs. eQTL replication in both data sets highlights the necessity of accounting for unknown covariates in RNA-Seq data analysis. Conclusion: As each RNA-Seq experiment is unique with its own experiment-specific limitations, we offer an easily-implementable method that uses the replication of known eQTLs to guide each step in one's data analysis pipeline. In the two data sets presented herein, we highlight not only the necessity of careful outlier detection but also the need to account for unknown covariates in RNA-Seq experiments.

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Determining the quality and complexity of next-generation sequencing data without a reference genome

Author: Irina Pulyakhina
Jeroen FJ Laros
Johan T den Dunnen
Ken Kraaijeveld
Lusine Khachatryan
Martijn Vermaat
Michiel van Galen
Peter AC ’t Hoen
Peter de Knijff
Seyed Yahya Anvar
Yavuz Ariyurek
Publication venue: Springer Nature
Publication date: 01/01/2014
Field of study

Genomics, epigenetics, population genetics and bioinformatic

VU Research Portal

Springer - Publisher Connector

PubMed Central

Leiden University Scholary Publications

Genome-scale analysis identifies paralog lethality as a vulnerability of chromosome 1p loss in cancer.

Author: Berger Ashton C
Bhatia Sangeeta N
Buss Colin G
Carr Steven A
Cherniack Andrew D
Choi Peter S
Hahn William C
Krill-Burger John M
Lage Kasper
Malolepsza Edyta
Meyerson Matthew
Nogueira Marina F
Pedamallu Chandra Sekhar
Schenone Monica
Shih Juliann
Strathdee Craig A
Tamayo Pablo
Tanenbaum Benjamin
Taylor Alison M
Tsherniak Aviad
Vazquez Francisca
Viswanathan Srinivas R
Wawer Mathias J
Publication venue: eScholarship, University of California
Publication date: 01/07/2018

Functional redundancy shared by paralog genes may afford protection against genetic perturbations, but it can also result in genetic vulnerabilities due to mutual interdependency [1-5]. Here, we surveyed genome-scale short hairpin RNA and CRISPR screening data on hundreds of cancer cell lines and identified MAGOH and MAGOHB, core members of the splicing-dependent exon junction complex, as top-ranked paralog dependencies [6-8]. MAGOHB is the top gene dependency in cells with hemizygous MAGOH deletion, a pervasive genetic event that frequently occurs due to chromosome 1p loss. Inhibition of MAGOHB in a MAGOH-deleted context compromises viability by globally perturbing alternative splicing and RNA surveillance. Dependency on IPO13, an importin-β receptor that mediates nuclear import of the MAGOH/B-Y14 heterodimer [9], is highly correlated with dependency on both MAGOH and MAGOHB. Both MAGOHB and IPO13 represent dependencies in murine xenografts with hemizygous MAGOH deletion. Our results identify MAGOH and MAGOHB as reciprocal paralog dependencies across cancer types and suggest a rationale for targeting the MAGOHB-IPO13 axis in cancers with chromosome 1p deletion.

DSpace@MIT

Crossref

eScholarship - University of California