818 research outputs found

    Clustering Algorithms: Their Application to Gene Expression Data

    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a major opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved make it harder to comprehend and interpret the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Clustering techniques are therefore a first step toward addressing these challenges and an essential part of the data mining process, revealing natural structures and identifying interesting patterns in the underlying data. The clustering of gene expression data has proven useful for revealing the natural structure inherent in the data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the clustering algorithms applicable to gene expression data in order to identify the clustering techniques that ensure stability and a high degree of accuracy in the analysis procedure
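As an illustration of the kind of clustering the review surveys, the following is a minimal k-means sketch on a toy expression matrix. The review does not prescribe this code: the data values, the function names, and the deterministic farthest-point initialization are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(profiles, k):
    """Pick k starting centroids: the first profile, then repeatedly the
    profile farthest from the centroids chosen so far (an assumption made
    here to keep the example deterministic)."""
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        farthest = max(
            profiles,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(profiles, k, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until the
    assignment stops changing or max_iterations is reached."""
    centroids = farthest_point_init(profiles, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged: no profile changed cluster
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log-ratio expression values: 6 genes x 4 conditions.
expression = [
    [2.1, 2.0, 1.9, 2.2],
    [1.8, 2.2, 2.0, 2.1],
    [2.0, 1.9, 2.1, 2.0],
    [-2.0, -1.9, -2.1, -2.0],
    [-1.9, -2.1, -2.0, -2.2],
    [-2.2, -2.0, -1.8, -1.9],
]
gene_labels, gene_centroids = kmeans(expression, k=2)
print(gene_labels)
```

On this toy matrix the up-regulated and down-regulated genes fall into separate clusters. Real expression data would normally need normalization and a principled choice of k first.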

    Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets

    The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual/progressive changes within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends driven by independent factors that are mixed in with each other. These factors can obscure the trends in traditional dimension reduction and projection based data visualizations. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose to use an efficient online density-based representation to make the algorithm scalable for massive datasets. The representation not only assists in trend discovery, but also in cluster detection including rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarray and arbor morphology of neurons and microglia in brain tissue. Derived representations revealed biologically meaningful hidden subspace trend(s) that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data including visualization, hypothesis generation, knowledge discovery, and prediction in diverse other applications. Electrical and Computer Engineering, Department o

    Nanogenomics and Nanoproteomics Enabling Personalized, Predictive and Preventive Medicine

    Since the discovery of nucleic acids, molecular biology has made tremendous progress. Despite this, there is still a gap between the classical, traditional medical approach and the molecular world. Inspired by the wealth of data generated by "omics"-driven techniques and “high-throughput technologies” (HTTs), I have tried to develop a protocol that could reduce the existing barrier between phenomenological medicine and molecular medicine, facilitating a translational shift from the lab to the patient bedside. I also felt the urgent need to integrate the most important omics sciences, namely genomics and proteomics. Nucleic Acid Programmable Protein Arrays (NAPPA) can do this by using a complex mammalian cell-free expression system to produce proteins in situ. As an alternative to fluorescent-labeled approaches, a new label-free method, emerging from the combined use of three independent and complementary nanobiotechnological approaches, appears capable of analyzing gene and protein function and gene-protein, gene-drug, protein-protein and protein-drug interactions in studies promising for personalized medicine. Quartz Micro Circuit nanogravimetry (QCM), based on frequency and dissipation factor, mass spectrometry (MS) and anodic porous alumina (APA) together overcome the limits of correlated fluorescence detection, which is plagued by background still present after extensive washes. Work is in progress to further optimize this approach with a homogeneous and well-defined bacterial cell-free expression system able to realize the ambitious objective of quantifying the regulatory gene and protein networks in humans. The implications for personalized medicine of the above label-free protein array, using different test genes and proteins, are reported in this PhD thesis

    Study of disease- and tissue-specific DNA methylation-based biomarkers (Haiguste ja koespetsiifiliste DNA metülatsioonil põhinevate biomarkerite uurimine)

    The electronic version of the dissertation does not include the publications. DNA contains the genetic information required for the growth and development of the organism. In addition to the nucleotide sequence, chemical modifications of the DNA also influence these processes. The most studied DNA modification is DNA methylation, in which a methyl group is added to a cytosine base. DNA is often methylated across whole genomic regions, forming so-called “methylation patterns.” These patterns are involved in the regulation of gene expression, switching genes on and off in certain cells or adjusting their activity. Environmental factors strongly influence DNA methylation: depending on environmental conditions, certain genomic regions may be methylated or demethylated. Thus, DNA methylation serves as a link between genetics and the environment. Many of these patterns are characteristic of normal biological processes, but some indicate the presence of disease. For example, specific methylation patterns have been observed in diabetes, neurological disorders, and cancer. Methylation patterns are therefore considered good biomarker candidates, suitable for characterizing, for example, the progression of certain diseases. This thesis focuses on the study of DNA methylation in different tissues and conditions to identify potential biomarkers, using various bioinformatics and statistical methods. In total, three published studies were carried out, investigating both tissue-specific and endometriosis-specific biomarker candidates as well as changes in DNA methylation during the transition of the endometrium from the pre-receptive to the receptive state. 
In addition, a novel and user-friendly web application, MethSurv, was developed in this thesis. MethSurv uses methylation and clinical data from the publicly available “The Cancer Genome Atlas” (TCGA) and allows users to explore the survival of cancer patients based on a specific DNA methylation-based prognostic marker. 
The tool is aimed at assisting the scientific community in exploring methylation-based prognostic biomarkers. https://www.ester.ee/record=b522744

    Predicting breast cancer risk, recurrence and survivability

    This thesis focuses on predicting breast cancer at early stages by using machine learning algorithms based on biological datasets. The accuracy of those algorithms has been improved to enable the physicians to enhance the success of treatment, thus saving lives and avoiding several further medical tests

    Studying the origins of primary tumours and residual disease in breast cancer

    Breast cancer is the leading cause of death in women worldwide and these deaths are mostly attributed to metastasis and tumour recurrence following initially successful therapy. Metastasis refers to the development of invasive disease, wherein malignant cells dissociate from primary tumours, infiltrating other organs and tissues to give rise to secondary outgrowths. Previously, metastasis was thought to be initiated in advanced tumours, but breast cancer cells with metastatic potential have now been shown to disseminate very early from the primary site via largely unknown mechanisms. These early interactions of tumour cells with their cellular micro-environment and normal neighbours also result in early tumour cell heterogeneity and must therefore be elucidated so that we can prevent metastatic spread in patients and better treat the resulting heterogeneous tumours. However, studying tumour initiation is not possible in patients because it happens at a cellular level not detectable by current technology. Tumour recurrence is another major cause of breast cancer related death and is believed to be caused by residual disease cells that survive initial therapy. These are a reservoir of refractory cells that can lie dormant for many years (sometimes decades) before resulting in relapse tumours. They are also difficult to obtain from human patients, since they are very few and cannot be detected easily, and thus their molecular mechanisms have not been fully explored. In addition to the unavailability of human tissue, mouse models of breast cancer also fall short in helping us study early cancer initiation, because they allow oncogenic expression in all cells of the tissue instead of initiating cancer as in the human situation: one neoplastic transformed cell proliferating unchecked in a normal epithelium. 
To address this issue, we used primary organoids from an inducible mouse model of breast cancer and lentivirally transduced single cells within these organoids to express oncogenes. We further optimized parameters for long term imaging using light sheet microscopy and developed big data analysis pipelines that led us to discern that single transformed cells had a lower chance of establishing tumorigenic foci than clusters of cells. Thus, we postulate a proximity-controlled signalling that is imperative to tumour initiation within epithelial tissues, using the first ever in vitro stochastic breast tumorigenesis model system. This new stochastic tumorigenesis system can be further used to identify the molecular interactions in early breast cancer cells. Our group has already revealed distinct characteristics, such as dysregulated lipid metabolism, of the residual disease correlate obtained from an inducible mouse model. As survival mechanisms invoked by residual cells remain largely unknown, we analysed the dynamic transcriptome of regressing tumours at important timepoints during the establishment of residual disease. Key molecular players upregulated during regression (such as c-Jun and BCL6) were identified and the inflammatory arm of the NF-κB cascade was found to be dysregulated, among others. Further validation of these molecular targets as potentially synthetic lethal interactors remains to be performed so that they can be used to limit the residual disease reservoir and eventually tumour recurrence

    Cluster analysis of gene expression data on cancerous tissue samples.

    The cluster analysis of gene expression data is an important unsupervised learning method that is commonly used to discover the inherent structure in the large amounts of data generated by microarray measurements. The focus of this research is to develop a novel clustering algorithm that adheres to the definition of unsupervised learning whilst minimising any sources of bias. The developed diffractive clustering algorithm is based on the fundamental diffraction properties of light, which provides a novel view and framework for clustering data. The algorithm is tested on multiple cancerous tissue data sets that are well established in the literature. The overall result is a clustering algorithm that outperforms conventional clustering algorithms, such as k-means and fuzzy c-means, by 10% in terms of accuracy and by more than 30% in terms of cluster validity. The diffraction-based clustering algorithm is also independent of any parameters and is able to automatically determine the correct number of clusters in the data
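The abstract does not say which cluster validity index was used. As one common possibility, the sketch below computes the mean silhouette coefficient for a given partition. The choice of index, the function names, and the toy sample values are assumptions for illustration, not the thesis's evaluation code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sample profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the lowest mean distance to any other cluster; the
    point's score is (b - a) / max(a, b), ranging from -1 to 1.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical two-feature tissue samples forming two tight groups.
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good_partition = [0, 0, 1, 1]   # matches the visible groups
bad_partition = [0, 1, 0, 1]    # mixes the groups
print(silhouette_score(samples, good_partition))
print(silhouette_score(samples, bad_partition))
```

A partition that matches the two groups scores close to 1, while the mixed partition scores below 0. This is the sense in which such an index can rank the outputs of competing algorithms without reference labels.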