
eScholarship - University of California

Differential expression analysis for sequence count data

Author: A Agresti
A Mortazavi
AC Cameron
AM Smith
AS Morrissy
B Langmead
C Loader
CI Bliss
DD Licatalosi
G Robertson
GK Smyth
I Lönnstedt
J Bullard
JC Marioni
JF Lawless
JS Bloom
K Saha
L Wang
L Whitaker
M Kasowski
MD Robinson
P Engström
P McCullagh
RC Gentleman
Simon Anders
SJ Clark
U Nagalakshmi
Wolfgang Huber
Y Benjamini
Publication date: 01/01/2010

*Motivation:* High-throughput nucleotide sequencing provides quantitative readouts in assays for RNA expression (RNA-Seq), protein-DNA binding (ChIP-Seq) or cell counting (barcode sequencing). Statistical inference of differential signal in such data requires estimation of their variability throughout the dynamic range. When the number of replicates is small, error modelling is needed to achieve statistical power.

*Results:* We propose an error model that uses the negative binomial distribution, with variance and mean linked by local regression, to model the null distribution of the count data. The method controls type-I error and provides good detection power. 
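The mean-variance link behind a negative binomial error model can be illustrated with a minimal sketch. It assumes a single, constant dispersion parameter so that the variance is `mean + dispersion * mean**2`; this is a simplification chosen for illustration and is not the local-regression fit described in the abstract. The function names and the dispersion value 0.1 are hypothetical.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial with the given mean and dispersion.

    Assumption: quadratic mean-variance relation with constant dispersion.
    """
    return mean + dispersion * mean ** 2


def nb_size_prob(mean, dispersion):
    """Convert (mean, dispersion) to the classic (size, prob) parameters."""
    size = 1.0 / dispersion
    prob = size / (size + mean)
    return size, prob


# The Poisson term (mean) dominates at low counts, while the
# dispersion term dominates at high counts.
for mean in (10.0, 100.0, 1000.0):
    print(mean, nb_variance(mean, 0.1))
```

With a small number of replicates, sharing information about the dispersion across genes of similar expression strength is what makes such a variance estimate usable; the sketch only shows the resulting relation, not how it is estimated.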

*Availability:* A free open-source R software package, _DESeq_, is available from the Bioconductor project and from http://www-huber.embl.de/users/anders/DESeq

Springer - Publisher Connector

Springer

Institute of Mathematics AS CR, v. v. i.

Nature Precedings

The examination of baseline noise and the impact on the interpretation of low-template DNA samples

Author: Wellner Genevieve A.
Publication date: 22/01/2016

It is common practice for DNA STR profiles to be analyzed using an analytical threshold (AT), but as more low-template DNA (LT-DNA) samples are tested it has become evident that these thresholds do not adequately separate signal from noise. In order to confidently examine LT-DNA samples, the behavior and characteristics of the background noise of STR profiles must be better understood. Thus, the background noise of single-source LT-DNA STR profiles was examined to characterize the noise distribution and determine how it changes with DNA template mass and injection time. Current noise models typically assume the noise is independent of fragment size but, given the tendency of the baseline noise to increase with template amount, it is important to establish whether the baseline noise is randomly found throughout the capillary electrophoresis (CE) run or whether it is situated in specific regions of the electropherogram. While it has been shown that the baseline noise of negative samples does not behave similarly to the baseline noise of profiles generated using optimal levels of DNA, the ATs determined using negative samples have been shown to be similar to those developed with near-zero, low-template mass samples. The distinction between low-template samples, where the noise is consistent regardless of target mass, and standard samples could be made at approximately 0.063 ng for samples amplified using the Identifiler™ Plus amplification kit (29-cycle protocol) and injected for 5 and 10 seconds. At amplification target masses greater than 0.063 ng, the average noise peak height increased and began to plateau between 0.5 and 1.0 ng for samples injected for 5 and 10 seconds. To examine the time-dependent nature of the baseline noise, the baselines of over 400 profiles were combined onto one axis for each target mass and each injection time. Areas of reproducibly higher noise peak heights were identified as areas of potential non-specific amplified product.
When the samples were injected for five seconds, the baseline noise did not appear to be time dependent. However, when the samples were injected for either 10 or 20 seconds, there were three areas that exhibited an increase in noise; these areas were identified at 118 bases in green, 231 bases in yellow, and 106 bases in red. If a probabilistic analysis or AT is to be employed for DNA interpretation, consideration must be given as to how the validation or calibration samples are prepared. Ideally the validation data should include all the variation seen within typical samples. To this end, a study was performed to examine possible sources of variation in the baseline noise within the electropherogram. Specifically, three samples were prepared at seven target masses using four different kit lots, four capillary lots, in four amplification batches or four injection batches. The distribution of the noise peak heights in the blue and green channels for samples with variable capillary lots, amplifications, and injections were similar, but the distribution of the noise heights for samples with variable kit lots was shifted. This shift in the distribution of the samples with variable kit lots was due to the average peak height of the individual kit lots varying by approximately two. The yellow and red channels showed a general agreement between the distributions of the samples run with variable kit lots, amplifications, and injections, but the samples run with various capillary lots had a distribution shifted to the left. When the distribution of the noise height for each capillary was examined, the average peak height variation was less than two RFU between capillary lots. Use of a probabilistic method requires an accurate description of the distribution of the baseline noise. Three distributions were tested: Gaussian, log-normal, and Poisson. The Poisson distribution did not approximate the noise distributions well. 
The log-normal distribution was a better approximation than the Gaussian, resulting in a smaller sum of squared residuals. It was also shown that the choice of distribution affected the probability that a peak was noise; however, how significant an impact this difference has on the final probability of an entire STR profile was not determined and may be of interest for future studies.
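The comparison of candidate noise distributions by sum of squared residuals can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the peak heights are simulated from a log-normal distribution with arbitrary parameters, both models are fitted by simple moment estimates, and residuals are taken between a histogram density and each fitted density at the bin centers. All function names are hypothetical, and the Poisson case is omitted for brevity.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (x * sigma * math.sqrt(2 * math.pi))


def mean_sd(values):
    """Sample mean and standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def histogram_density(values, n_bins):
    """Return bin centers and the normalized histogram density."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density


def sum_squared_residuals(centers, density, pdf):
    return sum((d - pdf(c)) ** 2 for c, d in zip(centers, density))


# Simulated noise peak heights (assumption: illustrative values, not real RFU data).
rng = random.Random(0)
heights = [rng.lognormvariate(1.5, 0.6) for _ in range(5000)]

centers, density = histogram_density(heights, 60)
g_mean, g_sd = mean_sd(heights)
l_mu, l_sigma = mean_sd([math.log(h) for h in heights])

ssr_gaussian = sum_squared_residuals(
    centers, density, lambda x: gaussian_pdf(x, g_mean, g_sd))
ssr_lognormal = sum_squared_residuals(
    centers, density, lambda x: lognormal_pdf(x, l_mu, l_sigma))
print(ssr_gaussian, ssr_lognormal)
```

Because the simulated data are right-skewed, the log-normal fit yields the smaller residual sum here; with real baseline data the ranking would have to be checked per dye channel and injection time.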

Boston University Institutional Repository (OpenBU)

2b-RAD genotyping for population genomic studies of Chagas disease vectors: Rhodnius ecuadoriensis in Ecuador

Author: Andersson Björn
Costales Jaime A.
De Noia Michele
Grijalva Mario J.
Hernandez Castro Luis Enrique
Hernandez-Castro Luis E.
Llewellyn Martin S.
Ocaña-Mayorga Sofía
Paterno Marta
Villacís Anita G.
Yumiseva Cesar A.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/07/2017

Background: Rhodnius ecuadoriensis is the main triatomine vector of Chagas disease, American trypanosomiasis, in Southern Ecuador and Northern Peru. Genomic approaches and next-generation sequencing technologies have become powerful tools for investigating population diversity and structure, which is a key consideration for vector control. Here we assess the effectiveness of three different 2b restriction site-associated DNA (2b-RAD) genotyping strategies in R. ecuadoriensis to provide sufficient genomic resolution to tease apart microevolutionary processes and undertake some pilot population genomic analyses. Methodology/Principal findings: The 2b-RAD protocol was carried out in-house at a non-specialized laboratory using 20 R. ecuadoriensis adults collected from the central coast and southern Andean region of Ecuador, from June 2006 to July 2013. 2b-RAD sequencing was performed on an Illumina MiSeq instrument and the data were analyzed with the STACKS de novo pipeline for loci assembly and Single Nucleotide Polymorphism (SNP) discovery. Preliminary population genomic analyses (global AMOVA and Bayesian clustering) were implemented. Our results showed that the 2b-RAD genotyping protocol is effective for R. ecuadoriensis and likely for other triatomine species. However, only BcgI and CspCI restriction enzymes provided a number of markers suitable for population genomic analysis at the read depth we generated. Our preliminary genomic analyses detected a signal of genetic structuring across the study area. Conclusions/Significance: Our findings suggest that 2b-RAD genotyping is both a cost-effective and methodologically simple approach for generating high-resolution genomic data for Chagas disease vectors with the power to distinguish between different vector populations at epidemiologically relevant scales. As such, 2b-RAD represents a powerful tool in the hands of medical entomologists with limited access to specialized molecular biological equipment.
Author summary: Understanding Chagas disease vector (triatomine) population dispersal is key for the design of control measures tailored for the epidemiological situation of a particular region. In Ecuador, Rhodnius ecuadoriensis is a cause of concern for Chagas disease transmission, since it is widely distributed from the central coast to southern Ecuador. Here, a genome-wide sequencing (2b-RAD) approach was performed in 20 specimens from four communities from Manabí (central coast) and Loja (southern) provinces of Ecuador, and the effectiveness of three type IIB restriction enzymes was assessed. The findings of this study show that this genotyping methodology is cost-effective in R. ecuadoriensis and likely in other triatomine species. In addition, preliminary population genomic analysis results detected a signal of population structure among geographically distinct communities and genetic variability within communities. As such, 2b-RAD shows significant promise as a relatively low-tech solution for determination of vector population genomics, dynamics, and spread.

ZENODO

Electronic Archiving System

Enlighten

Machine learning-guided directed evolution for protein engineering

Author: Arnold Frances H.
Wu Zachary
Yang Kevin K.
Publication date: 19/04/2019

Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.
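One round of the loop described above (learn from all measured variants, then select sequences predicted to be improved) can be sketched with a deliberately simple model. The sketch assumes a toy additive sequence-function model in which each residue at each position contributes the average deviation from the mean fitness of the variants that carry it; the review discusses far richer models, and the two-letter alphabet, sequences, and fitness values below are hypothetical.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(measured):
    """Fit per-position residue effects from a dict of sequence -> fitness."""
    baseline = sum(measured.values()) / len(measured)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, fitness in measured.items():
        for pos, residue in enumerate(seq):
            totals[(pos, residue)] += fitness - baseline
            counts[(pos, residue)] += 1
    effects = {key: totals[key] / counts[key] for key in totals}
    return baseline, effects


def predict(model, seq):
    """Predicted fitness: baseline plus the summed site effects (0 if unseen)."""
    baseline, effects = model
    return baseline + sum(effects.get((pos, res), 0.0)
                          for pos, res in enumerate(seq))


def propose(model, alphabet, length, measured, top_k=3):
    """Rank all unmeasured sequences by predicted fitness and return the best."""
    candidates = ("".join(p) for p in product(alphabet, repeat=length))
    unseen = [s for s in candidates if s not in measured]
    return sorted(unseen, key=lambda s: predict(model, s), reverse=True)[:top_k]


# Hypothetical measurements from a first round of screening.
measured = {"AAA": 1.0, "ABA": 1.5, "AAB": 2.0, "BAA": 0.5}
model = fit_additive_model(measured)
best = propose(model, "AB", 3, measured)
print(best)
```

In practice the proposed sequences would be synthesized and measured, then added to `measured` before refitting; exhaustive enumeration as used here is only feasible for very small sequence spaces.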

arXiv.org e-Print Archive

Caltech Authors

The Effect of Transposable Element Insertions on Gene Expression Evolution in Rodents

Author: A Nekrutenko
A Smit
Adam Eyre-Walker
AI Su
AO Urrutia
B McClintock
B-Y Liao
C Feschotte
CB Lowe
David Enard
G Bejerano
I. King Jordan
IK Jordan
J Brosius
JC Silva
JR Walker
JS Han
L Marino-Ramirez
LA Pennacchio
LN van de Lagemaat
M Kamal
P Khaitovich
P Medstrand
PD Keightley
RA Irizarry
RC Gentleman
RJ Britten
RM Kuhn
TJ Hubbard
TS Mikkelsen
V Pereira
Vini Pereira
W Enard
W Makalowski
X Xie
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2009

Background: Many genomes contain a substantial number of transposable elements (TEs), a few of which are known to be involved in regulating gene expression. However, recent observations suggest that TEs may have played a very important role in the evolution of gene expression because many conserved non-genic sequences, some of which are known to be involved in gene regulation, resemble TEs. Results: Here we investigate whether new TE insertions affect gene expression profiles by testing whether gene expression divergence between mouse and rat is correlated with the numbers of new transposable elements inserted near genes. We show that expression divergence is significantly correlated with the number of new LTR and SINE elements, but not with the numbers of LINEs. We also show that expression divergence is not significantly correlated with the numbers of ancestral TEs in most cases, which suggests that the correlations between expression divergence and the numbers of new TEs are causal in nature. We quantify the effect and estimate that TE insertion has accounted for ~20% (95% confidence interval: 12% to 26%) of all expression profile divergence in rodents. Conclusions: We conclude that TE insertions may have had a major impact on the evolution of gene expression levels in rodents.
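The kind of test described in the Results, correlating per-gene expression divergence with the number of new TE insertions nearby, can be sketched with a rank correlation. The abstract does not state which correlation statistic or divergence measure was used, so Spearman's rho is an assumption here, and the per-gene values are made up for illustration.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-gene data: new TE insertions near each gene and a
# mouse-rat expression divergence score for the same gene.
new_te_counts = [0, 1, 1, 2, 3, 5]
expression_divergence = [0.10, 0.15, 0.12, 0.30, 0.28, 0.50]
rho = spearman(new_te_counts, expression_divergence)
print(rho)
```

Repeating the same calculation with ancestral TE counts in place of new insertions mirrors the control the abstract describes: a correlation that appears only for new insertions is harder to explain by genomic context alone.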

Public Library of Science (PLOS)

arXiv.org e-Print Archive

Sussex Research Online

The EM Algorithm in Genetics, Genomics and Public Health

Author: Laird Nan M.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 11/04/2011

The popularity of the EM algorithm owes much to the 1977 paper by Dempster, Laird and Rubin. That paper gave the algorithm its name, identified the general form and some key properties of the algorithm and established its broad applicability in scientific research. This review gives a nontechnical introduction to the algorithm for a general scientific audience, and presents a few examples characteristic of its application. Comment: Published at http://dx.doi.org/10.1214/08-STS270 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
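A classic genetics application of the algorithm is gene counting for ABO blood-group allele frequencies, sketched below. Whether the review uses this particular example is not stated in the abstract, so it is offered only as a familiar illustration. The sketch assumes Hardy-Weinberg proportions, and the phenotype counts are hypothetical values constructed to match allele frequencies of 0.3, 0.1 and 0.6.

```python
def em_abo(n_a, n_b, n_ab, n_o, iterations=200):
    """Estimate ABO allele frequencies (p, q, r) from phenotype counts.

    Assumption: genotypes follow Hardy-Weinberg proportions.
    """
    total = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # arbitrary starting values
    for _ in range(iterations):
        # E-step: split the ambiguous A and B phenotypes into the
        # expected counts of their underlying genotypes.
        n_aa = n_a * p * p / (p * p + 2 * p * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q * q / (q * q + 2 * q * r)
        n_bo = n_b - n_bb
        # M-step: count alleles as if the expected genotypes were observed.
        p = (2 * n_aa + n_ao + n_ab) / (2 * total)
        q = (2 * n_bb + n_bo + n_ab) / (2 * total)
        r = (n_ao + n_bo + 2 * n_o) / (2 * total)
    return p, q, r


# Hypothetical phenotype counts for 10,000 individuals.
p, q, r = em_abo(n_a=4500, n_b=1300, n_ab=600, n_o=3600)
print(p, q, r)
```

The missing data here are the genotypes hidden behind the A and B phenotypes; each iteration fills them in with their expectations and then re-estimates the frequencies, which is the general pattern the review describes.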

Public Library of Science (PLOS)

Microevolution of Helicobacter pylori during prolonged infection of single hosts and within families

Author: A Gelman
A Mena
A Tomitani
B Bjorkholm
B Linz
Barica Kusecek
BF Voight
Christelle Bahlawane
D Falush
D Kersulyte
Daniel Falush
DE Berg
DJ Wilson
EA Lin
EC Holmes
EE Smith
EJ Javaux
EJ Kuipers
EP Rocha
FU Battistuzzi
GI Peterson
Giovanna Morelli
GM Pupo
H Ochman
Harmit S. Malik
HD Holland
J Kang
J Parkhill
J Raymond
JF Tomb
JK Pritchard
JM Kang
K Thornton
KA Jolley
L Feng
M Achtman
M Eppinger
MA Beaumont
Mark Achtman
MM Mwangi
NA Moran
NJ Butterfield
NS Taylor
P Marjoram
P Roumagnac
PK Ingvarsson
PP Sheridan
RJ Meinersmann
S Chattopadhyay
S Kryazhimskiy
S Kulick
S Schwarz
S Sreevatsan
S Suerbaum
S Talarico
S Tavare
Sandra Schwarz
Sebastian Suerbaum
SJ Weissman
SR Harris
SR Leopold
SY Ho
T Wirth
U Nübel
X Didelot
Xavier Didelot
Y Moodley
Z Lin
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2010

Our understanding of basic evolutionary processes in bacteria is still very limited. For example, multiple recent dating estimates are based on a universal inter-species molecular clock rate, but that rate was calibrated using estimates of geological dates that are no longer accepted. We therefore estimated the short-term rates of mutation and recombination in Helicobacter pylori by sequencing an average of 39,300 bp in 78 gene fragments from 97 isolates. These isolates included 34 pairs of sequential samples, which were sampled at intervals of 0.25 to 10.2 years. They also included single isolates from 29 individuals (average age: 45 years) from 10 families. The accumulation of sequence diversity increased with time of separation in a clock-like manner in the sequential isolates. We used Approximate Bayesian Computation to estimate the rates of mutation, recombination, mean length of recombination tracts, and average diversity in those tracts. The estimates indicate that the short-term mutation rate is 1.4×10⁻⁶ (serial isolates) to 4.5×10⁻⁶ (family isolates) per nucleotide per year and that three times as many substitutions are introduced by recombination as by mutation. The long-term mutation rate over millennia is 5–17-fold lower, partly because purifying selection removes non-synonymous mutations. Comparisons with the recent literature show that short-term mutation rates vary dramatically in different bacterial species and can span a range of several orders of magnitude.
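The idea behind Approximate Bayesian Computation can be sketched with a rejection sampler for a mutation rate alone. This is a toy model under loud assumptions and not the authors' analysis: mutations are taken to be Poisson-distributed with mean rate × sequence length × years, recombination is ignored, the prior is log-uniform between 10⁻⁷ and 10⁻⁴ per nucleotide per year, and the observed count of 5 differences over 10 years is hypothetical. Only the 39,300 bp sequence length comes from the abstract.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def abc_rejection(observed, length_bp, years, n_draws, rng):
    """Keep prior draws whose simulated mutation count equals the observed one."""
    accepted = []
    for _ in range(n_draws):
        # Log-uniform prior on the per-nucleotide, per-year mutation rate.
        rate = 10 ** rng.uniform(-7.0, -4.0)
        simulated = sample_poisson(rng, rate * length_bp * years)
        if simulated == observed:
            accepted.append(rate)
    return accepted


rng = random.Random(1)
posterior = sorted(abc_rejection(observed=5, length_bp=39300, years=10.0,
                                 n_draws=20000, rng=rng))
median_rate = posterior[len(posterior) // 2]
print(len(posterior), median_rate)
```

The accepted draws approximate the posterior distribution of the rate without ever evaluating a likelihood, which is what makes the approach attractive for models with mutation and recombination combined, where the likelihood is hard to write down.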