202 research outputs found

    Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics

    In recent years, new high-throughput measurement techniques have emerged that allow biological research to focus on the fundamental building blocks of living organisms, such as genes, proteins, and lipids. Modern data analysis techniques have emerged alongside this new type of data, referred to as omics data. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status and on learning unobservable network structures that represent functional associations in biological regulatory systems. Omics data have specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations, and very high dimensionality, and the interest often lies in the connections among the large number of variables. This thesis has two major aims. The first is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum likelihood based covariance estimation method for data with censored values is developed, and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction and to differential network analyses on both multiply imputed censored data and next-generation sequencing count data.
    New measurement technologies have made it possible to build a more comprehensive understanding of the molecular-level processes of living organisms. The so-called omics technologies, such as genomics, proteomics, and lipidomics, can produce vast amounts of measurement data on the expression or concentration levels of individual genes, proteins, and lipids with unprecedented precision. At the same time, the need to develop new analysis methods has grown. Particular interest has focused on identifying biomarkers that predict the risk or prognosis of certain diseases and on reconstructing biological networks. Omics data sets have several special characteristics that limit the direct and efficient application of conventional methods. The most important of these are left-censored and missing observations and the large number of observed variables. The first aim of this dissertation is to provide tailored analysis methods for visualising incomplete omics data and for model selection using, for example, regularised regression models. We also describe a maximum likelihood estimator of the covariance matrix suitable for censored data. The second aim is to develop new methods for examining the association structures of omics data. For examining, visualising, and comparing more complex structures, several variants of an algorithm based on the partial least squares method are presented, with which association networks can be reconstructed for both multiply imputed censored data and count data.
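As a rough illustration of how left-censoring enters a likelihood, the sketch below fits a single normal variable whose values below a detection limit are only known to be "below the limit". It is a simplified univariate analogue, not the thesis's covariance estimation algorithm. The data, the detection limit, the grid search, and all function names are assumptions made up for this example.

```python
import math
from statistics import NormalDist

def censored_loglik(observed, n_censored, limit, mu, sigma):
    """Log-likelihood of a normal sample that is left-censored at `limit`.

    Observed values contribute the log density; each censored value
    contributes log P(X < limit), the probability of falling below the limit.
    """
    dist = NormalDist(mu, sigma)
    ll = sum(math.log(dist.pdf(x)) for x in observed)
    ll += n_censored * math.log(dist.cdf(limit))
    return ll

def grid_mle(observed, n_censored, limit):
    """Crude grid search over (mu, sigma); a real analysis would use an optimiser."""
    best = None
    for i in range(41):            # mu in [-1.0, 1.0], step 0.05
        mu = -1.0 + 0.05 * i
        for j in range(31):        # sigma in [0.5, 2.0], step 0.05
            sigma = 0.5 + 0.05 * j
            ll = censored_loglik(observed, n_censored, limit, mu, sigma)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]

# Synthetic data (made up for illustration): 50 evenly spaced quantiles of N(0, 1).
values = [NormalDist(0.0, 1.0).inv_cdf((k + 0.5) / 50) for k in range(50)]
limit = -0.5                                   # hypothetical detection limit
observed = [x for x in values if x >= limit]
n_censored = len(values) - len(observed)

mu_hat, sigma_hat = grid_mle(observed, n_censored, limit)
# Naive alternative: substitute the detection limit for every censored value.
naive_mean = (sum(observed) + n_censored * limit) / len(values)
```

Substituting the detection limit for censored values pulls the mean upward, because the true values lie below the limit, whereas the censored likelihood uses only the information that those values fell below it.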

    Design and Development of Oligonucleotide Microarrays and their Application in Diagnostic and Prognostic Estimation of Human Gliomas

    DNA microarrays represent an ultra-high throughput gene expression assay employed to study the transcriptomic profiles of biological tissues. These devices are increasingly being used to study many aspects of gene regulation, and there is growing interest in the biotechnology and pharmaceutical industries in developing such devices as part of efforts toward rational product and drug design. The DNA microarray also provides a unique and objective means for diagnosis and prognosis of human diseases based on patterns of gene expression. This is especially important in cancer research and the thrust toward personalized medicine. This dissertation details the design and development of oligonucleotide microarrays and the design and execution of a gene expression study conducted using human glioma specimens. Chapter 2 details the design and development of a ~10,000 gene human oligonucleotide microarray. This device consisted of 21,168 features, each composed of a particular human gene probe, and was applied to the challenge of diagnostic and prognostic estimation for human gliomas (chapter 3). Gliomas are the most frequent and deadly neoplasms of the human brain, characterized by a high misdiagnosis rate and low survival. The study in chapter 3 demonstrated that the specified design and development parameters were appropriate for conducting gene expression analysis and that this platform can be used successfully to predict malignancy grade and survival for glioma patients.

    Statistical methods for differential proteomics at peptide and protein level


    Statistical methods for preprocessing microarray gene expression data


    Bayesian Model-based Methods for the Analysis of DNA Microarrays with Survival, Genetic, and Sequence Data

    DNA microarrays simultaneously measure the expression of thousands of genes or DNA fragments, using probes that hybridize specifically to their complementary sequences. Gene expression or microarray data analysis problems have a prominent role in biostatistics, the biological sciences, and clinical medicine. The first paper proposes a method for finding associations between the survival time of the subjects and the gene expression of tumor microarrays. Measurement error is known to bias the estimates for survival regression coefficients, and this method minimizes bias. The latent variable model is shown to detect associations between potentially important genes and survival in a breast cancer dataset that conventional models did not detect, and the method is demonstrated to have robustness to misspecification with simulated data. The second paper considers the Expression Quantitative Trait Loci (eQTL) detection problem. An eQTL is a genetic locus that influences gene expression, and the major challenges with this type of data are multiple testing and computational issues. The proposed method extends the Mixture Over Marker (MOM) model to include a structured prior probability that accounts for the transcript location. The new technique exploits the fact that genetic markers are more likely to influence transcripts that share the same location on the genome. The third paper improves the analysis of Chromatin (Ch)-Immunoprecipitation (IP) (ChIP) microarray data. ChIP-chip data analysis estimates the motif of specific Transcription Factor Binding Sites (TFBSs) by comparing the IP DNA sample that is enriched for the TFBS and a control sample of general genomic DNA. The probes on the ChIP-chip array are uniformly spaced on the genome, and the probes that have relatively high intensity in the IP sample will have corresponding sequences that are likely to contain the TFBS motif. Present analytical methods use the array data to discover peaks or regions of IP enrichment and then analyze the sequences of these peaks in a separate procedure to discover the motif. The proposed model will integrate enrichment peak finding and motif discovery through a Hidden Markov Model (HMM). Performance comparisons are made between the proposed HMM and the previously developed methods.

    Proceedings of the 35th International Workshop on Statistical Modelling: July 20-24, 2020, Bilbao, Basque Country, Spain

    466 p. The International Workshop on Statistical Modelling (IWSM) is a reference workshop for promoting statistical modelling and applications of statistics among researchers, academics, and industrialists in a broad sense. Unfortunately, the global COVID-19 pandemic did not allow the 35th edition of the IWSM to be held in Bilbao in July 2020. Despite the situation, and following the spirit of the Workshop and the Statistical Modelling Society, we are delighted to bring you the proceedings book of extended abstracts.

    Resampling-based tests of functional categories in gene expression studies

    DNA microarrays allow researchers to measure the coexpression of thousands of genes, and are commonly used to identify changes in expression either across experimental conditions or in association with some clinical outcome. With increasing availability of gene annotation, researchers have begun to ask global questions of functional genomics that explore the interactions of genes in cellular processes and signaling pathways. A common hypothesis test for gene categories is constructed as a post hoc analysis performed once a list of significant genes is identified, using classically derived tests for 2x2 contingency tables. We note several drawbacks to this approach, including the violation of an independence assumption by the correlation in expression that exists among genes. To test gene categories in a more appropriate manner, we propose a flexible, permutation-based framework, termed SAFE (for Significance Analysis of Function and Expression). SAFE is a two-stage approach: gene-specific statistics are first calculated for the association between expression and the response of interest, and a global statistic is then used to detect a shift within a gene category to more extreme associations. Significance is assessed by repeatedly permuting whole arrays, so that the correlation between all genes is held constant and accounted for. This permutation scheme also preserves the relatedness of categories containing overlapping genes, such that error rate estimates can be readily obtained for multiple dependent tests. Through a detailed survey of gene category tests and simulations based on real microarray data, we demonstrate how SAFE generates appropriate Type I error rates as compared to other methods. Under a more rigorously defined null hypothesis, permutation-based tests of gene categories are shown to be conservative by inducing a special case with a maximum variance for the test statistic. A bootstrap-based approach to hypothesis testing is incorporated into the SAFE framework, providing better coverage and improved power under a defined class of alternatives. Lastly, we extend the SAFE framework to consider gene categories in a probabilistic manner. This allows for a hypothesis test of co-regulation, using models of transcription factor binding sites to score for the presence of motifs in the upstream regions of genes.
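The two-stage, whole-array permutation idea described in this abstract can be sketched roughly as follows. This is not the SAFE software or its exact statistics: the local statistic (difference in group means), the global statistic (mean absolute local statistic inside the category minus outside), the toy expression matrix, and every function name are assumptions chosen for illustration.

```python
import random

def local_stats(expr, labels):
    """Stage 1: per-gene difference in mean expression between label 1 and label 0."""
    stats = []
    for gene in expr:
        g1 = [v for v, l in zip(gene, labels) if l == 1]
        g0 = [v for v, l in zip(gene, labels) if l == 0]
        stats.append(sum(g1) / len(g1) - sum(g0) / len(g0))
    return stats

def global_stat(stats, category):
    """Stage 2: mean absolute local statistic inside the category minus outside it."""
    inside = [abs(s) for i, s in enumerate(stats) if i in category]
    outside = [abs(s) for i, s in enumerate(stats) if i not in category]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

def category_permutation_test(expr, labels, category, n_perm=500, seed=0):
    """Permute sample labels, which reassigns whole arrays (columns) at once,
    so the correlation between genes is left intact in every permutation."""
    rng = random.Random(seed)
    observed = global_stat(local_stats(expr, labels), category)
    exceed = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if global_stat(local_stats(expr, perm), category) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data (made up): 20 genes x 12 arrays, genes 0-4 form the category and
# are shifted upward in the first six arrays.
labels = [1] * 6 + [0] * 6
category = {0, 1, 2, 3, 4}
expr = []
for i in range(20):
    row = []
    for j in range(12):
        noise = ((i * 7 + j * 13) % 5 - 2) * 0.01   # small deterministic "noise"
        shift = 2.0 if (i in category and labels[j] == 1) else 0.0
        row.append(noise + shift)
    expr.append(row)

observed, p_value = category_permutation_test(expr, labels, category)
```

Because labels are shuffled rather than individual gene values, overlapping categories tested on the same permutations would also keep their dependence, which is the property the abstract highlights.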

    Targeting the role of statins in breast cancer – through translationally edged clinical trials

    Breast cancer incidence is increasing, and despite major progress in treatment, breast cancer is still the leading cause of death from cancer among women. Thus, there is a constant need for new treatment options. Statins are peroral drugs that have been widely used since the early 1990s, due to their well-documented effect of lowering plasma cholesterol levels and preventing cardiovascular disease. Statins have also been recognized for their pleiotropic effects extending beyond their plasma cholesterol-lowering properties, and preclinical experiments have shown that statins exert anti-tumoral effects in breast cancer cell lines. Further, epidemiological studies have shown reduced breast cancer recurrence and mortality among statin users. These findings led to the phase II, window-of-opportunity MAmmary cancer and STatins trial (MAST), which aims to further explore the effects of statins on breast cancer. Papers I and II are based on the MAST trial, which included 50 patients who received a high dose of atorvastatin (80 mg/day) for two weeks during the treatment-free window between diagnosis and breast surgery. Before the start of atorvastatin treatment, core needle biopsies were taken from the tumors and blood samples were collected. After two weeks of atorvastatin treatment, tumor tissue was retrieved during the standard surgical procedure and, at the same time, new blood samples were collected. In paper I, the protein expression of the cell-cycle regulators cyclin D1 and p27 was evaluated by immunohistochemistry on paired samples of formalin-fixed paraffin-embedded tumor tissue, before and after atorvastatin treatment. Project I revealed a significant down-regulation of the oncogene cyclin D1 and a significant up-regulation of the tumor suppressor p27 following two weeks of statin treatment. In paper II, lipids were extracted from fresh frozen paired tumor samples taken pre- and post-atorvastatin treatment, and cholesterol levels were measured using a cholesterol quantification assay in order to evaluate changes. The expression of the LDL-receptor (LDLR) was analyzed by immunohistochemistry on formalin-fixed paraffin-embedded tumor tissue, pre- and post-atorvastatin treatment. Project II revealed a statin-induced up-regulation of the LDLR and preserved intratumoral cholesterol levels. In vitro experiments on MCF-7 cells treated with atorvastatin were performed for comparison on the cellular level and showed no significant changes in the intracellular cholesterol levels after atorvastatin treatment. There was a higher expression of the LDLR, in agreement with the clinical findings, but it was non-significant. Paper III is based on the large, prospective, population-based Malmo Diet and Cancer Study. Tumor expression of HMGCR, the rate-limiting enzyme of the cholesterol biosynthesis pathway, which is inhibited by statins, was assessed by immunohistochemistry on tissue microarrays from 657 women diagnosed with primary invasive breast cancer between 1991 and 2010. Tumoral expression of HMGCR was found to be associated with unfavorable tumor characteristics. The associations between statin use, HMGCR expression, and breast cancer mortality were investigated, but no statistically significant associations were found. Paper IV is a descriptive publication of a clinical phase II trial, ABC-SE, in which the effect and tolerability of atorvastatin in combination with endocrine based treatment among patients with advanced breast cancer will be compared to standard endocrine based treatment. The goal of this study is to improve the understanding of the mechanisms behind resistance to endocrine treatment of breast cancer, and also to test the hypothesis that the addition of statins will enhance the effect of the endocrine based treatment. In conclusion, these results provide new insights into the mechanisms of statins in breast cancer, which together with earlier published studies, and hopefully the results from the ABC-SE trial, will form the basis for future large, phase III randomized clinical trials, which are needed to clarify the role of statins in breast cancer.

    Aitchison's Compositional Data Analysis 40 Years On: A Reappraisal

    The development of John Aitchison's approach to compositional data analysis is followed since his paper read to the Royal Statistical Society in 1982. Aitchison's logratio approach, which was proposed to solve the problematic aspects of working with data with a fixed sum constraint, is summarized and reappraised. It is maintained that the principles on which this approach was originally built, the main one being subcompositional coherence, are not required to be satisfied exactly; quasi-coherence is sufficient, that is, near enough to being coherent for all practical purposes. This opens up the field to using simpler data transformations, such as power transformations, that permit zero values in the data. The additional principle of exact isometry, which was introduced subsequently and was not part of Aitchison's original conception, imposed the use of isometric logratio transformations, but these are complicated and problematic to interpret, involving ratios of geometric means. If this principle is regarded as important in certain analytical contexts, for example unsupervised learning, it can be relaxed by showing that regular pairwise logratios, as well as the alternative quasi-coherent transformations, can also be quasi-isometric, meaning they are close enough to exact isometry for all practical purposes. It is concluded that the isometric and related logratio transformations, such as pivot logratios, are not a prerequisite for good practice, although many authors insist on their obligatory use. This conclusion is fully supported here by case studies in geochemistry and in genomics, which demonstrate the good performance of pairwise logratios, as originally proposed by Aitchison, and of Box-Cox power transforms of the original compositions, for which no zero replacements are necessary.
    Comment: 26 pages, 18 figures, plus Supplementary Material. This is a complete revision of the first version of this paper, placing the geochemical example upfront and adding a large section on CoDA of wide matrice
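The transformations named in this abstract can be illustrated with a small numerical sketch. It shows that a pairwise logratio is unchanged when parts are dropped and the composition is re-closed (subcompositional coherence), that centered logratio values do change, and that a Box-Cox style power transform stays finite at zero. The four-part composition and the function names are made up for illustration and do not come from the paper.

```python
import math

def close(parts):
    """Rescale non-negative parts so they sum to 1 (the closure operation)."""
    total = sum(parts)
    return [p / total for p in parts]

def pairwise_logratio(comp, i, j):
    """Log of the ratio between parts i and j of a composition."""
    return math.log(comp[i] / comp[j])

def clr(comp):
    """Centered logratio: log of each part over the geometric mean of all parts."""
    log_parts = [math.log(p) for p in comp]
    mean_log = sum(log_parts) / len(log_parts)
    return [lp - mean_log for lp in log_parts]

def box_cox(x, lam):
    """Box-Cox power transform (x**lam - 1) / lam for lam > 0; defined at x == 0."""
    return (x ** lam - 1.0) / lam

# Hypothetical 4-part composition (made-up numbers).
full = close([10.0, 20.0, 30.0, 40.0])
# Subcomposition: keep the first three parts and re-close them.
sub = close(full[:3])

lr_full = pairwise_logratio(full, 0, 1)   # log(part 0 / part 1) in the full composition
lr_sub = pairwise_logratio(sub, 0, 1)     # the same ratio after dropping part 3
clr_full = clr(full)
clr_sub = clr(sub)

zero_ok = box_cox(0.0, 0.5)               # finite, whereas log(0) is undefined
near_log = box_cox(0.1, 1e-6)             # approaches log(0.1) as lam -> 0
```

Here `lr_full` and `lr_sub` agree up to rounding, while the first centered logratio value differs between `clr_full` and `clr_sub` because the geometric mean in the denominator depends on which parts are present.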