    Unbalanced data processing using oversampling: machine Learning

    DL algorithms currently show good results on problems that share characteristics such as large amounts of data and high dimensionality. However, one of the main current challenges is the classification of high-dimensional databases with very few samples and high class imbalance. Biomedical databases of gene expression microarrays have exactly these characteristics: class imbalance, few samples, and high dimensionality. Class imbalance arises when the set of samples belonging to one class is much larger than the set of samples of the other class or classes. This problem has been identified as one of the main challenges for algorithms applied in the context of Big Data. The objective of this research is to study gene expression databases using conventional undersampling and oversampling methods for class balancing, such as RUS, ROS and SMOTE. The databases were modified by increasing their imbalance in one case and by generating artificial noise in another.
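
The three resampling families named in this abstract can be illustrated with a minimal sketch. It is not the authors' code. The toy data, the function names, and the simplified SMOTE-style interpolation (choose a minority sample, choose one of its k nearest minority neighbours, interpolate between them) are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, n_new, k=2, seed=0):
    """SMOTE-style: interpolate between a minority sample and a near minority neighbour."""
    rng = random.Random(seed)
    pool = [x for x, lab in zip(X, y) if lab == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pool)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (p for p in pool if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: six majority samples (class 0), three minority samples (class 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.1, 0.1],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]
y = [0] * 6 + [1] * 3

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
synthetic = smote_like(X, y, minority=1, n_new=3)
print(Counter(y_ros), Counter(y_rus), len(synthetic))
```

ROS and RUS only reuse or discard existing samples, whereas the SMOTE-style step creates new points inside the region spanned by the minority class. In practice a maintained library implementation would normally be preferred over this sketch.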

    Sufficient principal component regression for pattern discovery in transcriptomic data

    Methods for global measurement of transcript abundance such as microarrays and RNA-seq generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives, or ignore any unknown grouping structures for the features. We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks including regression and classification, especially in the typical context of omics with correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions will depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also demonstrate near-optimal theoretical guarantees. Comment: 26 pages, 9 figures, 9 tables
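
The two-stage idea described here (estimate a sparse principal component, then fit a linear model on the recovered subspace) can be sketched roughly as follows. This is not SuffPCR itself. It substitutes a truncated power iteration for the sparse component and a one-dimensional least-squares fit for the linear model, and the toy data and function names are assumptions.

```python
from math import sqrt


def center_columns(X):
    """Subtract each column's mean so the components describe variance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def sparse_leading_component(X, s, iters=100):
    """Truncated power iteration: keep only the s largest loadings at each step."""
    n, p = len(X), len(X[0])
    v = [1.0 / sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in X]        # X v
        w = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T X v
        keep = set(sorted(range(p), key=lambda j: abs(w[j]), reverse=True)[:s])
        w = [w[j] if j in keep else 0.0 for j in range(p)]
        norm = sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    return v


# Hypothetical toy data: features 0 and 1 share one signal z, features 2-4 are small noise.
z = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
noise = [[0.10, -0.05, 0.02], [-0.10, 0.10, -0.02], [0.05, -0.10, 0.02],
         [-0.05, 0.05, -0.02], [0.10, 0.05, 0.02], [-0.10, -0.05, -0.02]]
X = center_columns([[zi, 0.8 * zi] + e for zi, e in zip(z, noise)])
y = [3.0 * zi for zi in z]

# Stage 1: sparse component; its non-zero loadings are the selected features.
v = sparse_leading_component(X, s=2)
support = [j for j, a in enumerate(v) if a != 0.0]

# Stage 2: least-squares fit of the response on the component score.
t = [sum(a * b for a, b in zip(row, v)) for row in X]
y_mean = sum(y) / len(y)
beta = sum(ti * (yi - y_mean) for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
pred = [y_mean + beta * ti for ti in t]
print(support, [round(p_, 3) for p_ in pred])
```

Because the component has only two non-zero loadings, the fitted predictions depend on just those two features, which mirrors the abstract's point that predictions depend on a small subset of genes.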

    Gene expression studies from basic research to the clinic

    Human genetics is an exciting, multidisciplinary field of science. Its focus of investigation is on hereditary diseases in humans, with the ultimate aim of alleviating suffering by treating and preventing various diseases. The biggest project in biology to date, the Human Genome Project, was completed in 2003 and it unravelled the DNA sequence of the human genome. Since then, the cost and time needed for sequencing and genotyping human genomes have fallen dramatically, which now allows for massive studies to be performed consisting of hundreds of thousands of individuals. In the near future, studies of millions of people are likely to appear. As a result of the large-scale studies during the past decade, we have come to appreciate the true complexity of genetics. In genome-wide association studies, hundreds of genetic variants have been associated with a specific phenotype or disease. However, these variants together typically only explain a small proportion of the heritability of the phenotype, while the mechanisms by which such variants lead to disease are still mostly unknown. DNA is a marvellous molecule. It contains information which enables it to replicate itself and make the organism develop in its environment. In cells, parts of DNA (i.e. genes) are transcribed into RNA. Some RNA molecules (those transcribed from protein-coding genes) are further translated into proteins. Other RNAs play different roles, such as regulating the transcription of other genes. RNA transcription (gene expression) is the first step in the DNA-to-phenotype process. Therefore, studying abundances of RNA in samples from various tissues can help us to understand the cellular phenomena that lead to disease. In this thesis, I report on my research using publicly available gene expression data on a large scale to predict the functions of genes and to prioritize genes and pathways that may be relevant to different phenotypes and diseases.

    Modelling the transcriptional regulation of androgen receptor in prostate cancer

    Transcription of genes and production of proteins are essential functions of a normal cell. Misregulation of crucial genes leads to aberrant cell behaviour and, in some cases, to the development of diseased states such as cancer. One major mechanism of transcriptional regulation involves the binding of transcription factors to enhancer sequences, which encourages or represses transcription depending on the role of the transcription factor. In prostate cells, misregulation of the androgen receptor (AR), a key transcriptional regulator, leads to the development and maintenance of prostate cancer. The androgen receptor binds to numerous locations in the genome, but it is still unclear how and which other key transcription factors aid and repress AR-mediated transcription. Here I analyzed data on the transcriptional activity of 4139 putative AR binding sites (ARBS) in the genome, measured with and without hormone using the STARR-seq assay. Only a small fraction of ARBS showed significant differential expression when treated with hormone. To understand the essential factors underlying hormone-dependent behaviour, we developed both machine learning and biophysical models to identify active enhancers in prostate cancer cells. We also identify potentially crucial transcription factors for androgen-dependent behaviour and discuss the benefits and shortcomings of each modelling method.

    The use of scRNA-seq to characterise the tumour microenvironment of high grade serous ovarian carcinoma (HGSOC)

    High Grade Serous Ovarian Carcinoma (HGSOC) is the most common type of ovarian cancer. Patients with this disease typically experience relapse following surgical debulking and initially effective chemotherapy. HGSOC has been intensely studied at the genomic and transcriptomic levels in efforts to advance knowledge of the biological mechanisms that drive the behaviour of this malignancy, and so that new treatment strategies may curb disease progression and relapse. This body of work contributes an optimised protocol for generating robust 10X scRNA-seq libraries from fresh and preserved HGSOC tissue, aiming to dissect the cellular heterogeneity of HGSOC's tumour microenvironment (TME). Through unsupervised clustering analysis, it uncovers distinct cellular communities, elucidates transcriptomic signatures across HGSOC tumours, and augments bulk RNA-seq datasets via computational deconvolution, enhancing understanding of HGSOC's cellular complexity across an expanded clinical cohort. The sequencing and analysis of these HGSOC patient tumours revealed 11 distinct cell types, including 2 that are novel in this tumour type, namely ciliated epithelial cells and metallothionein-expressing T-cells. These 11 distinct cell types can be broadly categorised into 3 TME components (Tumour, Stroma and Immune), as in previous tumour scRNA-seq studies. An additional analysis of these components examined the copy number variation (CNV) in the profiled cells and revealed HGSOC tumour cells to be mostly aneuploid while ciliated epithelial cells were diploid. A novel integrative subcluster analysis of HGSOC aneuploid tumour cells identified several apparently tumourigenic gene expression signatures. These include a KRT17+, protease inhibitory signature, an increased cellular metabolism signature, and an immune-reactive signature. Additionally, a ciliated cluster re-emerged within the HGSOC tumour cells, even though the diploid ciliated epithelial cells were not included in the integrative analysis. Finally, the high granularity of HGSOC cellular composition revealed by scRNA-seq is utilised to perform deconvolution analyses to estimate cellular proportions and infer the TME of earlier bulk RNA-seq profiled HGSOC tumour samples. This investigation of earlier sequenced HGSOC samples revealed heterogeneity in the proportions of the TME compartments across the patient cohorts. Survival analysis using these inferred cellular proportions suggests that immune cell presence alone is not associated with survival, but metastatic fibroblast burden in tumour samples is significantly associated with worse overall survival in HGSOC patients. In conclusion, the laboratory protocol, the scRNA-seq datasets produced, and their analysis and application presented in this work expand the collective knowledge base of HGSOC, specifically by characterising the cells of the HGSOC tumour microenvironment and the nuances of expression signatures of the malignant cells. The deconvolution approach showcases how scRNA-seq data can expand the clinical utility of earlier RNA-seq HGSOC datasets in a way that is scalable.
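
The deconvolution step mentioned in this abstract, estimating cell-type proportions in a bulk sample from single-cell reference profiles, can be sketched as a non-negative least-squares problem. The abstract does not say which deconvolution tool the thesis used. The projected-gradient solver, the signature matrix, and the three compartment labels in the toy data are illustrative assumptions.

```python
def deconvolve(signature, bulk, iters=2000):
    """Estimate non-negative cell-type proportions p so that signature @ p ~ bulk.

    signature: rows are genes, columns are cell types (mean expression per type).
    bulk: expression of the same genes in one bulk sample.
    Uses projected gradient descent on 0.5 * ||S p - b||^2 with p >= 0.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # Safe step size: 1 / trace(S^T S), which is at most 1 / largest eigenvalue.
    step = 1.0 / sum(v * v for row in signature for v in row)
    p = [1.0 / n_types] * n_types
    for _ in range(iters):
        resid = [sum(s * q for s, q in zip(row, p)) - b
                 for row, b in zip(signature, bulk)]
        grad = [sum(signature[g][t] * resid[g] for g in range(n_genes))
                for t in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, q - step * g) for q, g in zip(p, grad)]
    total = sum(p)
    return [q / total for q in p]  # rescale so the proportions sum to 1


# Hypothetical signature: 4 marker genes x 3 compartments (tumour, stroma, immune).
signature = [[10.0, 1.0, 1.0],
             [1.0, 10.0, 1.0],
             [1.0, 1.0, 10.0],
             [5.0, 5.0, 5.0]]
true_props = [0.5, 0.3, 0.2]
# Simulated noise-free bulk sample: a weighted mixture of the signature columns.
bulk = [sum(s * q for s, q in zip(row, true_props)) for row in signature]

estimated = deconvolve(signature, bulk)
print([round(q, 4) for q in estimated])
```

On real bulk data the fit would not be exact, and the estimated proportions are only as good as the reference signatures derived from the single-cell clusters.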

    Predicting phenotypes from microarrays using amplified, initially marginal, eigenvector regression

    Motivation: The discovery of relationships between gene expression measurements and phenotypic responses is hampered by both computational and statistical impediments. Conventional statistical methods are less than ideal because they either fail to select relevant genes, predict poorly, ignore the unknown interaction structure between genes, or are computationally intractable. Thus, the creation of new methods which can handle many expression measurements on relatively small numbers of patients while also uncovering gene–gene relationships and predicting well is desirable. Results: We develop a new technique for using the marginal relationship between gene expression measurements and patient survival outcomes to identify a small subset of genes which appear highly relevant for predicting survival, produce a low-dimensional embedding based on this small subset, and amplify this embedding with information from the remaining genes. We motivate our methodology by using gene expression measurements to predict survival time for patients with diffuse large B-cell lymphoma, illustrate the behavior of our methodology on carefully constructed synthetic examples, and test it on a number of other gene expression datasets. Our technique is computationally tractable, generally outperforms other methods, is extensible to other phenotypes, and also identifies different genes (relative to existing methods) for possible future study
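
The first step described in this abstract, using the marginal relationship between each gene and the outcome to pick a small subset of genes, can be sketched as a simple screening rule. This covers only the screening idea, not the embedding or amplification steps. Ranking by absolute Pearson correlation, ignoring censoring of survival times, and the toy data are all assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def marginal_screen(X, y, k):
    """Return indices of the k genes with the largest |correlation| with y."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 6 patients x 4 genes, with an uncensored outcome.
# Gene 0 rises with the outcome, gene 2 falls with it, genes 1 and 3 are weakly related.
X = [[1.1, 0.5, 6.2, 2.0],
     [1.9, 0.4, 4.8, 1.0],
     [3.2, 0.6, 4.1, 2.0],
     [3.9, 0.5, 3.0, 1.0],
     [5.1, 0.4, 2.1, 2.0],
     [6.0, 0.6, 0.9, 1.0]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

selected = marginal_screen(X, y, k=2)
print(sorted(selected))
```

Screening each gene separately is cheap, but it misses genes that matter only jointly with others, which is the gap the abstract's amplification step is meant to address.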

    Development of advanced methods for large-scale transcriptomic profiling and application to screening of metabolism disrupting compounds

    High-throughput transcriptomic profiling has become a ubiquitous tool to assay an organism's transcriptome and to characterize gene expression patterns in different cellular states or disease conditions, as well as in response to molecular and pharmacologic perturbations. Refinements to data preparation techniques have enabled integration of transcriptomic profiling into large-scale biomedical studies, generally devised to elucidate phenotypic factors contributing to transcriptional differences across a cohort of interest. Understanding these factors and the mechanisms through which they contribute to disease is a principal objective of numerous projects, such as The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia. Additionally, transcriptomic profiling has been applied in toxicogenomic screening studies, which profile molecular responses to chemical perturbations in order to identify environmental toxicants and characterize their mechanisms of action. Further adoption of high-throughput transcriptomic profiling requires continued effort to improve implementation and lower its costs. Accordingly, my dissertation work encompasses both the development and assessment of cost-effective RNA sequencing platforms, and of novel machine learning techniques applicable to the analysis of large-scale transcriptomic data sets. The utility of these techniques is evaluated through their application to a toxicogenomic screen in which our lab profiled exposures of adipocytes to metabolic disrupting chemicals. Such exposures have been implicated in metabolic dyshomeostasis, the predominant cause of obesity pathogenesis. Considering that an estimated 10% of the global population is obese, understanding the role these exposures play in disrupting metabolic balance has the potential to help combat this pervasive health threat. This dissertation consists of three sections. 
In the first section, I assess data generated by a highly-multiplexed RNA sequencing platform developed by our section, and report on its significantly better quality relative to similar platforms, and on its comparable quality to more expensive platforms. Next, I present the analysis of a toxicogenomic screen of metabolic disrupting compounds. This analysis crucially relied on novel supervised and unsupervised machine learning techniques which I specifically developed to take advantage of the experimental design we adopted for data generation. Lastly, I describe the further development, evaluation, and optimization of one of these methods, K2Taxonomer, into a computational tool for unsupervised molecular subgrouping of bulk and single-cell gene expression data, and for the comprehensive in-silico annotation of the discovered subgroups