Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping
Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to frequent gross chromosomal alterations occurring in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows for direct phasing of the haplotype derived from each individual parent. In addition, genome-wide estimates of homologue-specific copy-ratios (HSCRs) are important for interpretation of the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method to interpret bi-allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data which are suitable for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error-model for interpretation of systematic variations arising in microarray platforms. The ability to directly determine haplotypes from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may have general interest in various population genetic applications. In addition, this property may be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotypes. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples.
Data analysis tools for mass spectrometry proteomics
ABSTRACT
Proteins are large biomolecules which consist of amino acid chains. They differ from one another in their amino acid sequences, which are mainly dictated by the nucleotide sequences of their corresponding genes. Proteins fold into specific three-dimensional structures that determine their activity. Because many proteins act as catalysts in biochemical reactions, they are considered the executive molecules of the cell, and their study is therefore fundamental in biotechnology and medicine.
Currently, the most common method to investigate the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers measure the masses of molecules, or more specifically, their mass-to-charge ratios. Typically, the proteins are digested into peptides and their masses are measured by mass spectrometry. The masses are matched against known sequences to acquire peptide identifications, and subsequently, the proteins from which the peptides originated are quantified. The data gathered from these experiments contain substantial noise, leading to loss of relevant information and even to wrong conclusions. The noise can be related, for example, to differences in sample preparation or to technical limitations of the analysis equipment. In addition, assumptions regarding the data might be wrong or the chosen statistical methods might not be suitable. Taken together, these issues can lead to irreproducible results. Developing algorithms and computational tools to overcome them is of utmost importance. Thus, this work aims to develop new computational tools to address these problems.
In this PhD thesis, the performance of existing label-free proteomics methods is evaluated and new statistical data analysis methods are proposed. The tested methods include several widely used normalization methods, which are thoroughly evaluated using multiple gold-standard datasets. Various statistical methods for differential expression analysis are also evaluated. Furthermore, new methods to calculate differential expression statistics are developed, and their superior performance compared to the existing methods is shown using a wide set of metrics. The tools are published as open-source software packages.
TIIVISTELMÄ
Proteins are large biomolecules composed of amino acid chains. They differ from one another in the order of their amino acids, which is largely determined by the genes that encode them. In addition, proteins fold into three-dimensional structures, which in part define their function. Because proteins act as catalysts in biochemical reactions, they are considered to play a central role in cells, and their study is therefore regarded as important.
At present, the most common method for studying the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure the masses of molecules, or more precisely, their mass-to-charge ratios. Typically, proteins are digested into peptides for mass measurement. The masses observed by the mass spectrometer are compared against a database compiled from known protein sequences so that the peptides can be identified. From the peptides, the proteins can in turn be inferred and quantified. The data collected in these experiments normally contain substantial noise, which can drown out relevant information and, at worst, lead to wrong conclusions. This noise can stem, for example, from differences in sample handling or from technical limitations of the instruments. In addition, assumptions about the nature of the data may be incorrect, or statistical models unsuited to the data may be used. At worst, this leads to situations in which the results of a study cannot be reproduced. Developing computational tools and algorithms to prevent these problems is therefore of primary importance for the reliability of research. This work accordingly focuses on applications that aim to solve problems in this area.
The study compares widely used quantitative proteomics software and the most common data normalization methods, and develops new data analysis tools. The methods are compared against each other using several benchmark datasets whose true content is known. The study also compares a set of statistical methods for detecting differences between samples, develops entirely new and efficient methods, and demonstrates their superior performance relative to earlier methods. All tools developed in the study have been published as open-source software.
Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure
Several algorithms for the normalization of proteomic data are currently available, each based on a priori assumptions. Among these is the extent to which differential expression (DE) can be present in the dataset. This factor is usually unknown in explorative biomarker screens. Simultaneously, the increasing depth of proteomic analyses often requires the selection of subsets with a high probability of being DE to obtain meaningful results in downstream bioinformatical analyses. Based on the relationship between technical variation and (true) biological DE of an unknown share of proteins, we propose the "Normics" algorithm: proteins are ranked based on their expression level-corrected variance and their mean correlation with all other proteins. The latter serves as a novel indicator of the non-DE likelihood of a protein in a given dataset. Subsequent normalization is based on a subset of non-DE proteins only. No a priori information such as batch, clinical, or replicate group is necessary. Simulation data demonstrated robust and superior performance across a wide range of stochastically chosen parameters. Five publicly available spike-in and biologically variant datasets were reliably and quantitatively accurately normalized by Normics, with improved performance compared to standard variance stabilization as well as median, quantile, and LOESS normalizations. In complex biological datasets, Normics correctly determined proteins as being DE that had been cross-validated by an independent transcriptome analysis of the same samples. In both complex datasets, Normics identified the most DE proteins. We demonstrate that combining variance analysis and data-inherent correlation structure to identify non-DE proteins improves data normalization. Standard normalization algorithms can thus be made robust against high shares of (one-sided) biological regulation, and the statistical power of downstream analyses can be increased by focusing on Normics-selected subsets of high DE likelihood.
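The selection idea can be sketched as follows. This is a deliberately simplified illustration, not the published Normics implementation: the variance-trend correction, the rank-sum score, and the anchor-based normalization below are crude stand-ins for the paper's method, and all function names are hypothetical.

```python
import numpy as np

def rank_non_de_candidates(X):
    """Rank proteins (rows of X, a proteins x samples matrix of
    log-intensities) by a Normics-style non-DE score: low
    expression-corrected variance plus high mean correlation with
    all other proteins suggests a protein is not differentially
    expressed and is a safe normalization anchor."""
    means = X.mean(axis=1)
    var = X.var(axis=1)
    # remove the mean-variance trend with a simple linear fit
    # (the published method uses a more refined correction)
    corrected_var = var - np.polyval(np.polyfit(means, var, 1), means)
    # mean correlation with all other proteins (diagonal excluded)
    C = np.corrcoef(X)
    np.fill_diagonal(C, np.nan)
    mean_corr = np.nanmean(C, axis=1)
    # sum of ranks: low corrected variance and high correlation win
    score = corrected_var.argsort().argsort() + (-mean_corr).argsort().argsort()
    return np.argsort(score)  # most non-DE-like candidates first

def normics_like_normalize(X, k=50):
    """Shift each sample so the top-k non-DE candidates align."""
    anchors = rank_non_de_candidates(X)[:k]
    offsets = np.median(X[anchors], axis=0)
    return X - (offsets - offsets.mean())
```

Because the anchors are chosen for their low non-DE likelihood, the normalization offsets are not distorted by one-sided biological regulation in the remaining proteins, which is the core point of the algorithm.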
Impact of the spotted microarray preprocessing method on fold-change compression and variance stability
Background: The standard approach for preprocessing spotted microarray data is to subtract the local background intensity from the spot foreground intensity, to perform a log2 transformation, and to normalize the data with a global median or a lowess normalization. Although well motivated, the standard approaches for background correction and for transformation have been widely criticized because they produce high variance at low intensities. Although various alternatives to the standard background correction methods and to the log2 transformation have been proposed, the impacts of these two successive preprocessing steps have not been compared objectively.
Results: In this study, we assessed the impact of eight preprocessing methods combining four background correction methods and two transformations (the log2 and the glog), using data from the MAQC study. The results indicate that most preprocessing methods produce fold-change compression at low intensities. Fold-change compression was minimized using the Standard and the Edwards background correction methods coupled with a log2 transformation. The drawback of both methods is a high variance at low intensities, which consequently produced poor estimations of the p-values. On the other hand, effective stabilization of the variance as well as better estimations of the p-values were observed after the glog transformation.
Conclusion: As both fold-change magnitudes and p-values are important in the context of microarray class comparison studies, we recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.
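For reference, the glog transformation discussed above has a simple closed form. The sketch below fixes the tuning constant c for illustration; in practice c is estimated from the data (typically from the low-intensity variance):

```python
import numpy as np

def glog2(y, c=1.0):
    """Generalized log (base 2): behaves like log2(y) for large
    intensities but stays finite and variance-stabilizing near zero,
    avoiding the exploding variance of log2 at low intensities.
    c is a per-dataset constant (fixed here for illustration)."""
    return np.log2((y + np.sqrt(y ** 2 + c ** 2)) / 2.0)

# glog2(0) is finite (log2(c/2)), while log2(0) diverges;
# for y >> c, glog2(y) converges to log2(y).
values = glog2(np.array([0.0, 1.0, 1000.0]))
```

This is why glog stabilizes variance where background-corrected intensities approach zero, at the cost of compressing fold-changes there, which motivates the hybrid log2/glog recommendation in the conclusion.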
maigesPack: A Computational Environment for Microarray Data Analysis
Microarray technology is still an important way to assess gene expression in molecular biology, mainly because it measures expression profiles for thousands of genes simultaneously, which makes it a good option for studies focused on systems biology. One of its main problems is the complexity of the experimental procedure, which presents several sources of variability that hinder statistical modeling. So far, there is no standard protocol for the generation and evaluation of microarray data. To streamline the analysis process, this paper presents an R package, named maigesPack, that helps with data organization. Besides that, it makes the data analysis process more robust, reliable, and reproducible. maigesPack also aggregates several data analysis procedures reported in the literature, for instance: cluster analysis, differential expression, supervised classifiers, relevance networks, and functional classification of gene groups or gene networks.
Power analysis for RNA sequencing and mass spectrometry-based proteomics data
RNA sequencing and mass spectrometry technologies have facilitated differential expression discoveries in transcriptome and proteome studies. However, determining the sample size needed to achieve adequate statistical power has been a major challenge in experimental design. The objective of this study is to develop a power analysis tool applicable to both RNA-seq and MS-based proteomics data. The methods proposed in this study are capable of both prospective and retrospective power analyses. In terms of performance, the benchmarking results indicated that the proposed methods can give distinct power estimates for both differentially and equivalently expressed genes or proteins without prior differential expression analysis or other assumptions, such as the expected fraction of differentially expressed features, minimal fold changes, and expected mean expressions. Using the proposed methods, researchers can not only evaluate the reliability of their acquired significant results but also estimate the sample size sufficient for a desired power. The proposed methods were implemented as an R package, which can be freely accessed from the Bioconductor project at http://bioconductor.org/packages/PowerExplorer/
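As a generic illustration of what prospective power analysis computes, the sketch below gives the textbook power of a two-sided two-sample t-test via the noncentral t distribution. This is not the PowerExplorer method, which works directly on the observed expression distributions without such assumptions; it only shows the kind of question (effect size, sample size, power) the tool answers.

```python
import numpy as np
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Prospective power of a two-sided two-sample t-test with equal
    group sizes. effect_size is Cohen's d (mean difference divided
    by the common standard deviation)."""
    df = 2 * n_per_group - 2
    # noncentrality parameter under the alternative hypothesis
    ncp = effect_size * np.sqrt(n_per_group / 2.0)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # probability the test statistic falls in either rejection region
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
```

For example, a large effect (d = 1) needs roughly 17 samples per group to reach about 80% power at alpha = 0.05, and power increases monotonically with sample size.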
Testing for Differentially-Expressed MicroRNAs with Errors-in-Variables Nonparametric Regression
MicroRNAs are a set of small RNA molecules mediating gene expression at the post-transcriptional/translational level. Most well-established high-throughput discovery platforms, such as microarrays, real-time quantitative PCR, and sequencing, have been adapted to study microRNAs in various human diseases. The total number of microRNAs in humans is approximately 1,800, which challenges some analytical methodologies requiring a large number of entries. Unlike messenger RNA, the majority of microRNAs (60%) maintain relatively low abundance in the cells. When analyzed using microarrays, the signals of these low-expressed microRNAs are influenced by other non-specific signals, including the background noise. It is crucial to distinguish true microRNA signals from measurement errors in microRNA array data analysis. In this study, we propose a novel measurement error model-based normalization method and a differentially-expressed microRNA detection method for microRNA profiling data acquired from locked nucleic acid (LNA) microRNA arrays. Compared with some existing methods, the proposed method significantly improves detection among low-expressed microRNAs when assessed by quantitative real-time PCR assay.
The metabolome regulates the epigenetic landscape during naive-to-primed human embryonic stem cell transition.
For nearly a century, developmental biologists have recognized that cells from embryos can differ in their potential to differentiate into distinct cell types. Recently, it has been recognized that embryonic stem cells derived from both mice and humans exhibit two stable yet epigenetically distinct states of pluripotency: naive and primed. We now show that nicotinamide N-methyltransferase (NNMT) and the metabolic state regulate pluripotency in human embryonic stem cells (hESCs). Specifically, in naive hESCs, NNMT and its enzymatic product 1-methylnicotinamide are highly upregulated, and NNMT is required for low S-adenosyl methionine (SAM) levels and the H3K27me3 repressive state. NNMT consumes SAM in naive cells, making it unavailable for the histone methylation that represses Wnt and activates the HIF pathway in primed hESCs. These data support the hypothesis that the metabolome regulates the epigenetic landscape of the earliest steps in human development.
A gene selection method for GeneChip array data with small sample sizes
Background: In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and to choose appropriate cutoff p-values for gene selection. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available for very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed for true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate.
Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
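The modeling idea, treating the bulk of per-gene mean differences as draws from a shared normal null, can be sketched as follows. This is a simplified illustration with robust moment estimates, not the published MBIS procedure (which includes a data transformation step and its own parameter estimation); the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def mbis_like_pvalues(d):
    """Sketch of model-based information sharing: model the per-gene
    mean differences d between two conditions as mostly null draws
    from N(mu, sigma^2); estimate mu and sigma robustly from ALL
    genes (median and MAD, so a minority of truly DE genes has
    little influence) and compute two-sided p-values under that
    shared null. Because every gene contributes to the parameter
    estimates, the p-values are continuous, unlike the highly
    discrete permutation p-values available at small sample sizes."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)
    sigma = 1.4826 * np.median(np.abs(d - mu))  # MAD scaled to SD
    z = (d - mu) / sigma
    return 2.0 * stats.norm.sf(np.abs(z))
```

Under this shared model, p-values for true nulls are approximately uniform, which is the property the abstract highlights as the prerequisite for downstream FDR estimation.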
Normalization and Gene p-Value Estimation: Issues in Microarray Data Processing
Introduction: Numerous methods exist for basic processing, e.g. normalization, of microarray gene expression data. These methods have an important effect on the final analysis outcome. Therefore, it is crucial to select methods appropriate for a given dataset in order to assure the validity and reliability of expression data analysis. Furthermore, biological interpretation requires expression values for genes, which are often represented by several spots or probe sets on a microarray. How best to integrate spot/probe set values into gene values has so far been a somewhat neglected issue.