2,597 research outputs found

    Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping

    Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to the frequent gross chromosomal alterations occurring in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows direct phasing of the haplotype derived from each parent. In addition, genome-wide estimates of homologue-specific copy-ratios (HSCRs) are important for interpreting the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method for interpreting bi-allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data that are suitable for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error model for interpreting the systematic variations arising in microarray platforms. The ability to determine haplotypes directly from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may be of general interest in various population-genetic applications. This property may also be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotypes. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples.
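The phasing idea in the abstract can be illustrated with a toy sketch: at each heterozygous marker inside a segment with homologous-allele imbalance, the allele with the larger signal is assigned to the higher-copy homologue. This is a deliberately simplified majority rule under assumed allele counts, not the probabilistic four-genotype model HAPSEG actually fits; the function name and inputs are illustrative.

```python
def phase_by_imbalance(het_sites):
    """het_sites: (ref_count, alt_count) pairs at heterozygous markers
    inside one segment showing homologous-allele imbalance.
    Assign each allele to the higher- or lower-copy homologue by which
    count is larger -- a toy rule for illustration only; HAPSEG itself
    fits a probabilistic model of the four genotypes per segment."""
    hap_high, hap_low = [], []
    for ref, alt in het_sites:
        if ref >= alt:
            hap_high.append("REF")
            hap_low.append("ALT")
        else:
            hap_high.append("ALT")
            hap_low.append("REF")
    return hap_high, hap_low
```

Under this rule, consistent imbalance across a segment lets the two parental haplotypes be read off directly from the allele assignments.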

    Data analysis tools for mass spectrometry proteomics

    Proteins are large biomolecules consisting of amino acid chains. They differ from one another in their amino acid sequences, which are mainly dictated by the nucleotide sequences of their corresponding genes. Proteins fold into specific three-dimensional structures that determine their activity. Because many proteins act as catalysts in biochemical reactions, they are considered the executive molecules of the cell, and their study is therefore fundamental in biotechnology and medicine. Currently, the most common method for investigating the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers measure molecular masses, or more specifically, mass-to-charge ratios. Typically, the proteins are digested into peptides and their masses are measured by mass spectrometry. The masses are matched against known sequences to obtain peptide identifications, and subsequently the proteins from which the peptides originated are quantified. The data gathered from these experiments contain substantial noise, leading to loss of relevant information and even to wrong conclusions. The noise can be related, for example, to differences in sample preparation or to technical limitations of the analysis equipment. In addition, assumptions regarding the data might be wrong, or the chosen statistical methods might not be suitable. Taken together, these issues can lead to irreproducible results, so developing algorithms and computational tools to overcome them is of utmost importance. This PhD thesis therefore aims to develop new computational tools to address these problems. The performance of existing label-free proteomics methods is evaluated and new statistical data analysis methods are proposed. The tested methods include several widely used normalization methods, which are thoroughly evaluated using multiple gold-standard datasets. Various statistical methods for differential expression analysis are also evaluated. Furthermore, new methods to calculate differential expression statistics are developed, and their superior performance compared to the existing methods is shown using a wide set of metrics. The tools are published as open-source software packages.

    Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure

    Several algorithms for the normalization of proteomic data are currently available, each based on a priori assumptions. Among these is the extent to which differential expression (DE) can be present in the dataset; this factor is usually unknown in explorative biomarker screens. At the same time, the increasing depth of proteomic analyses often requires selecting subsets with a high probability of being DE to obtain meaningful results in downstream bioinformatic analyses. Based on the relationship between technical variation and the (true) biological DE of an unknown share of proteins, we propose the “Normics” algorithm: proteins are ranked based on their expression level–corrected variance and their mean correlation with all other proteins. The latter serves as a novel indicator of the non-DE likelihood of a protein in a given dataset. Subsequent normalization is based on a subset of non-DE proteins only. No a priori information such as batch, clinical, or replicate group is necessary. Simulation data demonstrated robust and superior performance across a wide range of stochastically chosen parameters. Five publicly available spike-in and biologically variant datasets were reliably and quantitatively accurately normalized by Normics, with improved performance compared to standard variance stabilization as well as median, quantile, and LOESS normalizations. In complex biological datasets, Normics correctly determined as DE proteins that had been cross-validated by an independent transcriptome analysis of the same samples, and in both complex datasets it identified the most DE proteins. We demonstrate that combining variance analysis and data-inherent correlation structure to identify non-DE proteins improves data normalization. Standard normalization algorithms can thus be made robust against high shares of (one-sided) biological regulation, and the statistical power of downstream analyses can be increased by focusing on Normics-selected subsets with high DE likelihood.
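The core ranking idea can be sketched as follows: proteins with low variance and high mean correlation with all other proteins are treated as likely non-DE and would anchor the subsequent normalization. This is a minimal illustration only; the function names are invented, and Normics itself additionally corrects the variance for expression level and fits the normalization on the selected subset.

```python
import math

def pearson(a, b):
    # Plain Pearson correlation between two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rank_by_non_de_likelihood(data):
    """data: protein name -> list of log-intensities across samples.
    Rank proteins so the most likely non-DE ones come first:
    low variance, then high mean correlation with all other proteins
    (an illustrative stand-in for the Normics ranking)."""
    names = list(data)

    def key(p):
        vals = data[p]
        n = len(vals)
        m = sum(vals) / n
        var = sum((v - m) ** 2 for v in vals) / (n - 1)
        mean_corr = sum(pearson(vals, data[q])
                        for q in names if q != p) / (len(names) - 1)
        # Low variance and high mean correlation sort first.
        return (var, -mean_corr)

    return sorted(names, key=key)
```

A protein whose profile deviates from the shared structure of the dataset (higher variance, lower correlation with the rest) lands at the bottom of the ranking and would be excluded from the normalization subset.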

    Impact of the spotted microarray preprocessing method on fold-change compression and variance stability

    <p>Abstract</p> <p>Background</p> <p>The standard approach for preprocessing spotted microarray data is to subtract the local background intensity from the spot foreground intensity, to perform a log2 transformation, and to normalize the data with a global median or a lowess normalization. Although well motivated, the standard approaches to background correction and transformation have been widely criticized because they produce high variance at low intensities. Although various alternatives to the standard background correction methods and to the log2 transformation have been proposed, the impacts of these two successive preprocessing steps had not been compared objectively.</p> <p>Results</p> <p>In this study, we assessed the impact of eight preprocessing methods combining four background correction methods and two transformations (the log2 and the glog), using data from the MAQC study. The results indicate that most preprocessing methods produce fold-change compression at low intensities. Fold-change compression was minimized using the Standard and the Edwards background correction methods coupled with a log2 transformation. The drawback of both methods is high variance at low intensities, which consequently produces poor estimates of the p-values. On the other hand, effective stabilization of the variance as well as better estimates of the p-values were observed after the glog transformation.</p> <p>Conclusion</p> <p>As both fold-change magnitudes and p-values are important in the context of microarray class-comparison studies, we recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.</p>
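For readers unfamiliar with the generalized log, a common base-2 form (in the style of Durbin and Rocke) can be sketched as below. The tuning constant c is estimated from the data in practice; here it is a fixed illustrative parameter, and the exact glog variant used in the study may differ.

```python
import math

def glog2(x, c=1.0):
    # Generalized log (glog) transform, base 2.
    # Behaves like log2 for large x (up to a constant shift), but stays
    # finite and variance-stabilizing near zero, where plain log2 of
    # background-corrected intensities blows up.
    return math.log2(x + math.sqrt(x * x + c))
```

For large intensities glog2(x, c) approaches log2(2x) = log2(x) + 1, so fold-changes between high-intensity spots are preserved, while at low intensities the added constant damps the variance inflation that motivates the hybrid recommendation above.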

    maigesPack: A Computational Environment for Microarray Data Analysis

    Microarray technology is still an important way to assess gene expression in molecular biology, mainly because it measures expression profiles for thousands of genes simultaneously, which makes it a good option for studies focused on systems biology. One of its main problems is the complexity of the experimental procedure, which presents several sources of variability that hinder statistical modeling. So far, there is no standard protocol for the generation and evaluation of microarray data. To streamline the analysis process, this paper presents an R package named maigesPack that helps with data organization and makes the data analysis process more robust, reliable, and reproducible. maigesPack also aggregates several data analysis procedures reported in the literature, for instance: cluster analysis, differential expression, supervised classifiers, relevance networks, and functional classification of gene groups or gene networks.

    Power analysis for RNA sequencing and mass spectrometry-based proteomics data

    RNA-sequencing and mass spectrometry technologies have facilitated differential expression discoveries in transcriptome and proteome studies. However, determining the sample size needed to achieve adequate statistical power has been a major challenge in experimental design. The objective of this study is to develop a power analysis tool applicable to both RNA-seq and MS-based proteomics data. The methods proposed in this study are capable of both prospective and retrospective power analyses. In terms of performance, benchmarking results indicated that the proposed methods can give distinct power estimates for both differentially and equivalently expressed genes or proteins without prior differential expression analysis or other assumptions, such as the expected fraction of differentially expressed features, minimal fold changes, or expected mean expressions. Using the proposed methods, researchers can not only evaluate the reliability of their acquired significant results but also estimate the sample size sufficient for a desired power. The proposed methods were implemented as an R package, which can be freely accessed from the Bioconductor project at http://bioconductor.org/packages/PowerExplorer/.

    Testing for Differentially-Expressed MicroRNAs with Errors-in-Variables Nonparametric Regression

    MicroRNAs are small RNA molecules that mediate gene expression at the post-transcriptional/translational level. Most well-established high-throughput discovery platforms, such as microarrays, real-time quantitative PCR, and sequencing, have been adapted to study microRNAs in various human diseases. The total number of microRNAs in humans is approximately 1,800, which challenges analytical methodologies that require a large number of entries. Unlike messenger RNA, the majority of microRNAs (60%) maintain relatively low abundance in the cell. When analyzed using microarrays, the signals of these low-expressed microRNAs are influenced by non-specific signals, including background noise. It is therefore crucial to distinguish true microRNA signals from measurement errors in microRNA array data analysis. In this study, we propose a novel measurement error model-based normalization method and a differentially-expressed microRNA detection method for profiling data acquired from locked nucleic acid (LNA) microRNA arrays. Compared with existing methods, the proposed method significantly improves detection among low-expressed microRNAs, as assessed by quantitative real-time PCR assays.

    A gene selection method for GeneChip array data with small sample sizes

    <p>Abstract</p> <p>Background</p> <p>In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and to choose an appropriate cutoff p-value for gene selection. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available at very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.</p> <p>Results</p> <p>We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values that are uniformly distributed for true nulls are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes at a given cutoff p-value and then estimate the false discovery rate.</p> <p>Conclusion</p> <p>Simulation studies and analyses using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods and has wide application to a variety of situations.</p>
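The information-sharing step described above can be sketched as follows: fit a normal distribution to the per-gene mean differences (assuming most genes are true nulls), then compute two-sided p-values under that fitted null. This is a bare-bones illustration with invented function names; MBIS estimates the null on appropriately transformed data and more carefully than the plain mean/stdev used here.

```python
import math
import statistics

def modelbased_pvalues(mean_diffs):
    """mean_diffs: per-gene mean differences between two conditions.
    Fit a single normal null model across all genes (information
    sharing) and return a two-sided p-value for each gene under it."""
    mu = statistics.mean(mean_diffs)
    sd = statistics.stdev(mean_diffs)

    def pval(d):
        z = abs(d - mu) / sd
        # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)): the two-sided
        # tail probability of a standard normal.
        return math.erfc(z / math.sqrt(2.0))

    return [pval(d) for d in mean_diffs]
```

Because every gene contributes to estimating mu and sd, the resulting p-values vary smoothly rather than taking the few discrete values a small-sample permutation test can produce.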

    Normalization and Gene p-Value Estimation: Issues in Microarray Data Processing

    Introduction: Numerous methods exist for the basic processing, e.g. normalization, of microarray gene expression data. These methods have an important effect on the final analysis outcome, so it is crucial to select methods appropriate for a given dataset in order to ensure the validity and reliability of the expression data analysis. Furthermore, biological interpretation requires expression values for genes, which are often represented by several spots or probe sets on a microarray. How best to integrate spot/probe set values into gene values has so far been somewhat neglected.