Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping
Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to frequent gross chromosomal alterations occurring in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows for direct phasing of the haplotype derived from each individual parent. In addition, genome-wide estimates of homologue-specific copy-ratios (HSCRs) are important for interpretation of the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method to interpret bi-allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data which are suitable for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error-model for interpretation of systematic variations arising in microarray platforms. The ability to directly determine haplotypes from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may have general interest in various population genetic applications. In addition, this property may be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotypes. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples.
Data analysis tools for mass spectrometry proteomics
ABSTRACT
Proteins are large biomolecules which consist of amino acid chains. They differ from one another in their amino acid sequences, which are mainly dictated by the nucleotide sequences of their corresponding genes. Proteins fold into specific three-dimensional structures that determine their activity. Because many proteins act as catalysts in biochemical reactions, they are considered the executive molecules of the cell, and their study is therefore fundamental in biotechnology and medicine.
Currently, the most common method to investigate the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers measure the masses of molecules, or more specifically, their mass-to-charge ratios. Typically, the proteins are digested into peptides and their masses are measured by mass spectrometry. The masses are matched against known sequences to acquire peptide identifications, and subsequently, the proteins from which the peptides originated are quantified. The data gathered from these experiments contain substantial noise, leading to loss of relevant information and even to wrong conclusions. The noise can be related, for example, to differences in sample preparation or to technical limitations of the analysis equipment. In addition, assumptions regarding the data might be wrong or the chosen statistical methods might not be suitable. Taken together, these issues can lead to irreproducible results. Developing algorithms and computational tools to overcome them is of utmost importance. Thus, this work aims to develop new computational tools to address these problems.
In this PhD thesis, the performance of existing label-free proteomics methods is evaluated and new statistical data analysis methods are proposed. The tested methods include several widely used normalization methods, which are thoroughly evaluated using multiple gold-standard datasets. Various statistical methods for differential expression analysis are also evaluated. Furthermore, new methods to calculate differential expression statistics are developed, and their superior performance compared to the existing methods is shown using a wide set of metrics. The tools are published as open-source software packages.
TIIVISTELMÄ
Proteins are large biomolecules composed of amino acid chains. They differ from one another in the order of their amino acids, which is largely determined by the genes that encode them. In addition, proteins fold into three-dimensional structures, which in part define their function. Because proteins act as catalysts in biochemical reactions, they are considered to play a central role in cells, and their study is therefore regarded as important.
At present, the most common method for studying the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure the masses of molecules, or more precisely, their mass-to-charge ratios. Typically, proteins are digested into peptides for mass measurement. The masses observed by the mass spectrometer are compared against a database compiled from known protein sequences so that the peptides can be identified. From the peptides, the proteins can in turn be inferred and quantified. The data collected in these experiments normally contain substantial noise, which can drown out relevant information and, at worst, lead to wrong conclusions. This noise can stem, for example, from differences in sample handling or from technical limitations of the instruments. In addition, assumptions about the nature of the data may be incorrect, or statistical models unsuited to the data may be used. At worst, this leads to situations in which the results of a study cannot be reproduced. Developing computational tools and algorithms to prevent these problems is therefore of primary importance for the reliability of research. This work accordingly focuses on applications that aim to solve problems in this area.
The study compares widely used quantitative proteomics software and the most common data normalization methods, and develops new data analysis tools. The methods are compared against each other using several benchmark datasets whose true content is known. The study also compares a set of statistical methods for detecting differences between samples, develops entirely new and efficient methods, and demonstrates their superior performance relative to earlier methods. All tools developed in the study have been published as open-source software.
Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure
Several algorithms for the normalization of proteomic data are currently available, each based on a priori assumptions. Among these is the extent to which differential expression (DE) can be present in the dataset. This factor is usually unknown in explorative biomarker screens. Simultaneously, the increasing depth of proteomic analyses often requires the selection of subsets with a high probability of being DE to obtain meaningful results in downstream bioinformatical analyses. Based on the relationship between technical variation and (true) biological DE of an unknown share of proteins, we propose the "Normics" algorithm: proteins are ranked based on their expression level-corrected variance and their mean correlation with all other proteins. The latter serves as a novel indicator of the non-DE likelihood of a protein in a given dataset. Subsequent normalization is based on a subset of non-DE proteins only. No a priori information such as batch, clinical, or replicate group is necessary. Simulation data demonstrated robust and superior performance across a wide range of stochastically chosen parameters. Five publicly available spike-in and biologically variant datasets were reliably and quantitatively accurately normalized by Normics, with improved performance compared to standard variance stabilization as well as median, quantile, and LOESS normalizations. In complex biological datasets, Normics correctly determined proteins as being DE that had been cross-validated by an independent transcriptome analysis of the same samples. In both complex datasets, Normics identified the most DE proteins. We demonstrate that combining variance analysis and data-inherent correlation structure to identify non-DE proteins improves data normalization. Standard normalization algorithms can thus be made robust against high shares of (one-sided) biological regulation, and the statistical power of downstream analyses can be increased by focusing on Normics-selected subsets of high DE likelihood.
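The selection idea can be sketched as follows. This is a deliberately simplified illustration, not the published Normics implementation: the variance-trend correction, the rank-sum score, and the anchor-based normalization below are crude stand-ins for the paper's method, and all function names are hypothetical.

```python
import numpy as np

def rank_non_de_candidates(X):
    """Rank proteins (rows of X, a proteins x samples matrix of
    log-intensities) by a Normics-style non-DE score: low
    expression-corrected variance plus high mean correlation with
    all other proteins suggests a protein is not differentially
    expressed and is a safe normalization anchor."""
    means = X.mean(axis=1)
    var = X.var(axis=1)
    # remove the mean-variance trend with a simple linear fit
    # (the published method uses a more refined correction)
    corrected_var = var - np.polyval(np.polyfit(means, var, 1), means)
    # mean correlation with all other proteins (diagonal excluded)
    C = np.corrcoef(X)
    np.fill_diagonal(C, np.nan)
    mean_corr = np.nanmean(C, axis=1)
    # sum of ranks: low corrected variance and high correlation win
    score = corrected_var.argsort().argsort() + (-mean_corr).argsort().argsort()
    return np.argsort(score)  # most non-DE-like candidates first

def normics_like_normalize(X, k=50):
    """Shift each sample so the top-k non-DE candidates align."""
    anchors = rank_non_de_candidates(X)[:k]
    offsets = np.median(X[anchors], axis=0)
    return X - (offsets - offsets.mean())
```

Because the anchors are chosen for their low non-DE likelihood, the normalization offsets are not distorted by one-sided biological regulation in the remaining proteins, which is the core point of the algorithm.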
Impact of the spotted microarray preprocessing method on fold-change compression and variance stability
Background: The standard approach for preprocessing spotted microarray data is to subtract the local background intensity from the spot foreground intensity, to perform a log2 transformation, and to normalize the data with a global median or a lowess normalization. Although well motivated, the standard approaches for background correction and for transformation have been widely criticized because they produce high variance at low intensities. Although various alternatives to the standard background correction methods and to the log2 transformation have been proposed, the impacts of these two successive preprocessing steps have not been compared objectively.
Results: In this study, we assessed the impact of eight preprocessing methods combining four background correction methods and two transformations (the log2 and the glog), using data from the MAQC study. The results indicate that most preprocessing methods produce fold-change compression at low intensities. Fold-change compression was minimized using the Standard and the Edwards background correction methods coupled with a log2 transformation. The drawback of both methods is a high variance at low intensities, which consequently produced poor estimations of the p-values. On the other hand, effective stabilization of the variance as well as better estimations of the p-values were observed after the glog transformation.
Conclusion: As both fold-change magnitudes and p-values are important in the context of microarray class comparison studies, we recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.
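For reference, the glog transformation discussed above has a simple closed form. The sketch below fixes the tuning constant c for illustration; in practice c is estimated from the data (typically from the low-intensity variance):

```python
import numpy as np

def glog2(y, c=1.0):
    """Generalized log (base 2): behaves like log2(y) for large
    intensities but stays finite and variance-stabilizing near zero,
    avoiding the exploding variance of log2 at low intensities.
    c is a per-dataset constant (fixed here for illustration)."""
    return np.log2((y + np.sqrt(y ** 2 + c ** 2)) / 2.0)

# glog2(0) is finite (log2(c/2)), while log2(0) diverges;
# for y >> c, glog2(y) converges to log2(y).
values = glog2(np.array([0.0, 1.0, 1000.0]))
```

This is why glog stabilizes variance where background-corrected intensities approach zero, at the cost of compressing fold-changes there, which motivates the hybrid log2/glog recommendation in the conclusion.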
maigesPack: A Computational Environment for Microarray Data Analysis
Microarray technology is still an important way to assess gene expression in molecular biology, mainly because it measures expression profiles for thousands of genes simultaneously, which makes it a good option for studies focused on systems biology. One of its main problems is the complexity of the experimental procedure, which presents several sources of variability that hinder statistical modeling. So far, there is no standard protocol for the generation and evaluation of microarray data. To streamline the analysis process, this paper presents an R package, named maigesPack, that helps with data organization. Besides that, it makes the data analysis process more robust, reliable, and reproducible. maigesPack also aggregates several data analysis procedures reported in the literature, for instance: cluster analysis, differential expression, supervised classifiers, relevance networks, and functional classification of gene groups or gene networks.
Power analysis for RNA sequencing and mass spectrometry-based proteomics data
RNA sequencing and mass spectrometry technologies have facilitated differential expression discoveries in transcriptome and proteome studies. However, determining the sample size needed to achieve adequate statistical power has been a major challenge in experimental design. The objective of this study is to develop a power analysis tool applicable to both RNA-seq and MS-based proteomics data. The methods proposed in this study are capable of both prospective and retrospective power analyses. In terms of performance, the benchmarking results indicated that the proposed methods can give distinct power estimates for both differentially and equivalently expressed genes or proteins without prior differential expression analysis or other assumptions, such as the expected fraction of differentially expressed features, minimal fold changes, and expected mean expressions. Using the proposed methods, researchers can not only evaluate the reliability of their acquired significant results but also estimate the sample size sufficient for a desired power. The proposed methods were implemented as an R package, which can be freely accessed from the Bioconductor project at http://bioconductor.org/packages/PowerExplorer/
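As a generic illustration of what prospective power analysis computes, the sketch below gives the textbook power of a two-sided two-sample t-test via the noncentral t distribution. This is not the PowerExplorer method, which works directly on the observed expression distributions without such assumptions; it only shows the kind of question (effect size, sample size, power) the tool answers.

```python
import numpy as np
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Prospective power of a two-sided two-sample t-test with equal
    group sizes. effect_size is Cohen's d (mean difference divided
    by the common standard deviation)."""
    df = 2 * n_per_group - 2
    # noncentrality parameter under the alternative hypothesis
    ncp = effect_size * np.sqrt(n_per_group / 2.0)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # probability the test statistic falls in either rejection region
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
```

For example, a large effect (d = 1) needs roughly 17 samples per group to reach about 80% power at alpha = 0.05, and power increases monotonically with sample size.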
Testing for Differentially-Expressed MicroRNAs with Errors-in-Variables Nonparametric Regression
MicroRNAs are a set of small RNA molecules mediating gene expression at the post-transcriptional/translational level. Most well-established high-throughput discovery platforms, such as microarrays, real-time quantitative PCR, and sequencing, have been adapted to study microRNAs in various human diseases. The total number of microRNAs in humans is approximately 1,800, which challenges some analytical methodologies requiring a large number of entries. Unlike messenger RNA, the majority of microRNAs (60%) maintain relatively low abundance in the cells. When analyzed using microarrays, the signals of these low-expressed microRNAs are influenced by other non-specific signals, including the background noise. It is crucial to distinguish true microRNA signals from measurement errors in microRNA array data analysis. In this study, we propose a novel measurement error model-based normalization method and a differentially-expressed microRNA detection method for microRNA profiling data acquired from locked nucleic acid (LNA) microRNA arrays. Compared with some existing methods, the proposed method significantly improves detection among low-expressed microRNAs when assessed by quantitative real-time PCR assay.
The metabolome regulates the epigenetic landscape during naive-to-primed human embryonic stem cell transition.
For nearly a century, developmental biologists have recognized that cells from embryos can differ in their potential to differentiate into distinct cell types. Recently, it has been recognized that embryonic stem cells derived from both mice and humans exhibit two stable yet epigenetically distinct states of pluripotency: naive and primed. We now show that nicotinamide N-methyltransferase (NNMT) and the metabolic state regulate pluripotency in human embryonic stem cells (hESCs). Specifically, in naive hESCs, NNMT and its enzymatic product 1-methylnicotinamide are highly upregulated, and NNMT is required for low S-adenosyl methionine (SAM) levels and the H3K27me3 repressive state. NNMT consumes SAM in naive cells, making it unavailable for the histone methylation that represses Wnt and activates the HIF pathway in primed hESCs. These data support the hypothesis that the metabolome regulates the epigenetic landscape of the earliest steps in human development.
A gene selection method for GeneChip array data with small sample sizes
Background: In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and to choose appropriate cutoff p-values for gene selection. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available for very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed for true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate.
Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
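The modeling idea, treating the bulk of per-gene mean differences as draws from a shared normal null, can be sketched as follows. This is a simplified illustration with robust moment estimates, not the published MBIS procedure (which includes a data transformation step and its own parameter estimation); the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def mbis_like_pvalues(d):
    """Sketch of model-based information sharing: model the per-gene
    mean differences d between two conditions as mostly null draws
    from N(mu, sigma^2); estimate mu and sigma robustly from ALL
    genes (median and MAD, so a minority of truly DE genes has
    little influence) and compute two-sided p-values under that
    shared null. Because every gene contributes to the parameter
    estimates, the p-values are continuous, unlike the highly
    discrete permutation p-values available at small sample sizes."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)
    sigma = 1.4826 * np.median(np.abs(d - mu))  # MAD scaled to SD
    z = (d - mu) / sigma
    return 2.0 * stats.norm.sf(np.abs(z))
```

Under this shared model, p-values for true nulls are approximately uniform, which is the property the abstract highlights as the prerequisite for downstream FDR estimation.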
Normalization and Gene p-Value Estimation: Issues in Microarray Data Processing
Introduction: Numerous methods exist for basic processing, e.g. normalization, of microarray gene expression data. These methods have an important effect on the final analysis outcome. Therefore, it is crucial to select methods appropriate for a given dataset in order to assure the validity and reliability of expression data analysis. Furthermore, biological interpretation requires expression values for genes, which are often represented by several spots or probe sets on a microarray. How best to integrate spot/probe set values into gene values has so far been a somewhat neglected issue.