
    A New Approach to Intensity-Dependent Normalization of Two-Channel Microarrays

    A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard for accomplishing this is smoothing MA-plots to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions. We review these assumptions and derive several practical scenarios in which they fail. The dye-swap normalization method has been much less frequently used because it requires two arrays per pair of samples. We show that a dye-swap is accurate under general assumptions, even under intensity-dependent dye bias, and that a dye-swap provides the minimal information required for removing dye bias from a pair of samples in general. Based on a flexible model of the relationship between mRNA amount and single channel fluorescence intensity, we demonstrate the general applicability of a dye-swap approach. We then propose a common array dye-swap (CADS) method for the normalization of two-channel microarrays. We show that CADS removes both dye bias and array-specific effects, and preserves the true differential expression signal for every gene. Finally, we discuss some possible extensions of CADS that circumvent the need to use two arrays per pair of samples.
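The dye-swap idea can be illustrated on simulated data: if each array's log-ratio carries a dye bias common to both arrays of a swapped pair, the half-difference of the pair cancels the bias gene by gene. A minimal numpy sketch (the toy data and the purely additive, constant bias are illustrative assumptions, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dye-swap pair: array 1 measures sample A in red vs. B in green;
# array 2 swaps the dye assignment, so the true log-ratio flips sign
# while the dye bias does not.
n_genes = 1000
true_logratio = rng.normal(0.0, 1.0, n_genes)  # true log2 differential expression
dye_bias = 0.5                                 # additive dye bias (assumed constant here)

M1 = true_logratio + dye_bias + rng.normal(0.0, 0.1, n_genes)   # array 1
M2 = -true_logratio + dye_bias + rng.normal(0.0, 0.1, n_genes)  # array 2 (dyes swapped)

# Half-difference of the swapped pair: the common dye bias cancels exactly,
# leaving an estimate of the true log-ratio for every gene.
M_corrected = (M1 - M2) / 2.0
```

In practice dye bias is intensity-dependent rather than constant, which is exactly the setting where the abstract argues a dye-swap remains valid while MA smoothing may not.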

    Optimal Feature Selection for Nearest Centroid Classifiers, With Applications to Gene Expression Microarrays

    Nearest centroid classifiers have recently been successfully employed in high-dimensional applications. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is typically carried out by computing univariate statistics for each feature individually, without consideration for how a subset of features performs as a whole. For subsets of a given size, we characterize the optimal choice of features, corresponding to those yielding the smallest misclassification rate. Furthermore, we propose an algorithm for estimating this optimal subset in practice. Finally, we investigate the applicability of shrinkage ideas to nearest centroid classifiers. We use gene-expression microarrays for our illustrative examples, demonstrating that our proposed algorithms can improve the performance of a nearest centroid classifier.
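The univariate screening step the abstract contrasts with, followed by nearest centroid classification, can be sketched as follows (toy data; the t-like score and top-k rule are illustrative of common practice, not the paper's optimal subset criterion):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class "expression matrix": 40 samples x 100 genes,
# only the first 10 genes carry signal.
n, p, k = 40, 100, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(0.0, 1.0, (n, p))
X[y == 1, :k] += 1.5

# Univariate scores: absolute t-like statistic computed per gene, ignoring
# how the selected genes perform jointly.
m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
s = X[y == 0].std(0, ddof=1) + X[y == 1].std(0, ddof=1)
scores = np.abs(m0 - m1) / s
selected = np.argsort(scores)[::-1][:k]  # keep the top-k genes

# Nearest centroid classification on the selected features.
c0, c1 = m0[selected], m1[selected]

def predict(x):
    d0 = np.sum((x[selected] - c0) ** 2)
    d1 = np.sum((x[selected] - c1) ** 2)
    return int(d1 < d0)

accuracy = np.mean([predict(x) == yi for x, yi in zip(X, y)])
```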

    Normalization of two-channel microarrays accounting for experimental design and intensity-dependent relationships

    eCADS is a new method for multiple-array normalization of two-channel microarrays that accounts for general experimental designs and intensity-dependent relationships. It also allows a more efficient dye-swap design that requires only one array per sample pair.

    Liquid Chromatography Mass Spectrometry-Based Proteomics: Biological and Technological Aspects

    Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen a tremendous improvement in instrument performance and the computational tools used, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used "bottom-up" approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage, the resulting peptide products are separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics are as follows: (1) Identifying the proteins that are present in a sample, and (2) Quantifying the abundance levels of the identified proteins. Both of these challenges require knowledge of the biological and technological context that gives rise to observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification. (Published at http://dx.doi.org/10.1214/10-AOAS341 in the Annals of Applied Statistics, http://www.imstat.org/aoas/, by the Institute of Mathematical Statistics, http://www.imstat.org.)

    Normalization and missing value imputation for label-free LC-MS analysis

    Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.
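Two of the simplest such approaches, global median normalization and half-minimum imputation for left-censored missingness, can be sketched on a toy log-intensity matrix (both choices are illustrative; the paper surveys several alternatives):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy log-intensity matrix: 200 peptides x 4 runs, with run-specific shifts
# and missing values (NaN), mimicking label-free LC-MS data. The first run
# is kept complete so every peptide has at least one observation.
X = rng.normal(20.0, 2.0, (200, 4)) + np.array([0.0, 0.8, -0.5, 0.3])
X[:, 1:][rng.random((200, 3)) < 0.25] = np.nan

# Median normalization: align each run's median to the overall median,
# removing the run-specific shifts.
run_medians = np.nanmedian(X, axis=0)
X_norm = X - run_medians + np.median(run_medians)

# Half-minimum imputation: replace each peptide's missing values with half
# of that peptide's smallest observed intensity (a crude left-censoring model).
row_min = np.nanmin(X_norm, axis=1, keepdims=True)
X_imp = np.where(np.isnan(X_norm), row_min / 2.0, X_norm)
```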

    Optimality Driven Nearest Centroid Classification from Genomic Data

    Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
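One widely used shrinkage scheme for centroids is soft-thresholding each class centroid toward the overall centroid, as in nearest shrunken centroids; the sketch below illustrates the idea on toy data and is not necessarily the estimator studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two balanced classes, 30 samples x 50 genes,
# only the first 5 genes carry signal.
n, p = 30, 50
y = np.repeat([0, 1], n // 2)
X = rng.normal(0.0, 1.0, (n, p))
X[y == 1, :5] += 2.0

overall = X.mean(0)
centroids = np.vstack([X[y == c].mean(0) for c in (0, 1)])

# Soft-threshold the per-gene deviation of each class centroid from the
# overall centroid; genes whose deviation falls below the threshold are
# zeroed out, giving implicit feature selection.
delta = 0.6
dev = centroids - overall
shrunk = overall + np.sign(dev) * np.maximum(np.abs(dev) - delta, 0.0)

# Classify each sample by its nearest shrunken centroid.
preds = np.argmin(((X[:, None, :] - shrunk[None]) ** 2).sum(-1), axis=1)
accuracy = (preds == y).mean()
```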

    An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    Many mass spectrometry-based studies, as well as other biological experiments produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size which led to poorer classification and variable selection accuracy. Perhaps most importantly our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. 
Source code and stand-alone compiled command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
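The core of subject-level bootstrapping is to resample subjects rather than individual spectra, carrying every replicate of a drawn subject into the bag together, so replicates of one subject never straddle the in-bag/out-of-bag split. A minimal sketch of the resampling step (toy cluster structure; RF++'s actual C++ implementation is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

# Cluster-correlated toy design: 10 subjects, 3 replicate spectra each,
# giving 30 correlated observations.
subjects = np.repeat(np.arange(10), 3)

# Subject-level bootstrap: draw SUBJECTS with replacement, then take all
# of each drawn subject's replicates into the bag together.
drawn = rng.choice(np.unique(subjects), size=10, replace=True)
in_bag = np.concatenate([np.where(subjects == s)[0] for s in drawn])

# Out-of-bag observations belong only to subjects that were never drawn,
# so OOB error estimation is not contaminated by within-subject correlation.
out_of_bag = np.setdiff1d(np.arange(len(subjects)), np.unique(in_bag))
```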

    Genome wide association mapping for arabinoxylan content in a collection of tetraploid wheats

    BACKGROUND: Arabinoxylans (AXs) are major components of plant cell walls in bread wheat and are important in bread-making and starch extraction. Furthermore, arabinoxylans are components of soluble dietary fibre that has potential health-promoting effects in human nutrition. Despite their high value for human health, few studies have been carried out on the genetics of AX content in durum wheat. RESULTS: The genetic variability of AX content was investigated in a set of 104 tetraploid wheat genotypes, and genomic regions associated with AX content were identified through a genome wide association study (GWAS). The amount of arabinoxylan, expressed as percentage (w/w) of the dry weight of the kernel, ranged from 1.8% to 5.5% with a mean value of 4.0%. The GWAS revealed a total of 37 significant marker-trait associations (MTA), identifying 19 quantitative trait loci (QTL) associated with AX content. The highest number of MTAs was identified on chromosome 5A (seven), where three QTL regions were associated with AX content, while the lowest number of MTAs was detected on chromosomes 2B and 4B, where only one MTA identified a single locus. Conservation of synteny between SNP marker sequences and the annotated genes and proteins in Brachypodium distachyon, Oryza sativa and Sorghum bicolor allowed the identification of nine QTL coincident with candidate genes. These included a glycosyl hydrolase GH35, which encodes Gal7, and a glucosyltransferase GT31 on chromosome 1A; a cluster of GT1 genes on chromosome 2B that includes TaUGT1 and cisZog1; a glycosyl hydrolase that encodes a CelC gene on chromosome 3A; Ugt12887 and TaUGT1 genes on chromosome 5A; a (1,3)-β-D-glucan synthase (Gsl12 gene) and a glucosyl hydrolase (Cel8 gene) on chromosome 7A. CONCLUSIONS: This study identifies significant MTAs for the AX content in the grain of tetraploid wheat genotypes.
We propose that these may be used for molecular breeding of durum wheat varieties with higher soluble fibre content.
Ilaria Marcotuli, Kelly Houston, Robbie Waugh, Geoffrey B. Fincher, Rachel A. Burton, Antonio Blanco, Agata Gadalet
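A single-marker association scan of the kind that underlies MTAs can be sketched as a per-SNP linear regression of the trait on genotype dosage (toy data; the allele coding, effect size, and noise level are illustrative, and the study itself used dedicated GWAS software):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy GWAS: 104 genotypes x 500 biallelic SNPs coded as 0/1/2 dosages,
# with a single causal SNP in the first column.
n, p = 104, 500
G = rng.integers(0, 3, size=(n, p)).astype(float)
ax = 4.0 + 0.4 * G[:, 0] + rng.normal(0.0, 0.5, n)  # simulated AX content (% w/w)

# Single-marker scan: the per-SNP correlation with the trait, converted to a
# t-statistic on the regression slope (the two tests are equivalent).
Gc = G - G.mean(0)
yc = ax - ax.mean()
r = (Gc * yc[:, None]).sum(0) / (np.linalg.norm(Gc, axis=0) * np.linalg.norm(yc))
t = r * np.sqrt((n - 2) / (1 - r**2))  # t-statistic per marker

top_snp = int(np.argmax(np.abs(t)))  # strongest marker-trait association
```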

    Reconstructing the Deep Population History of Central and South America

    We report genome-wide ancient DNA from 49 individuals forming four parallel time transects in Belize, Brazil, the Central Andes, and the Southern Cone, each dating to at least 9,000 years ago. The common ancestral population radiated rapidly from just one of the two early branches that contributed to Native Americans today. We document two previously unappreciated streams of gene flow between North and South America. One affected the Central Andes by 4,200 years ago, while the other explains an affinity between the oldest North American genome associated with the Clovis culture and the oldest Central and South Americans from Chile, Brazil, and Belize. However, this was not the primary source for later South Americans, as the other ancient individuals derive from lineages without specific affinity to the Clovis-associated genome, suggesting a population replacement that began at least 9,000 years ago and was followed by substantial population continuity in multiple regions.