1,157 research outputs found

    Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects

    Time course microarray data provide insight into dynamic biological processes. While several clustering methods have been proposed for the analysis of these data structures, comparison and selection of appropriate clustering methods are seldom discussed. We compared 3 probabilistic-based clustering methods and 3 distance-based clustering methods for time course microarray data. Among probabilistic methods, we considered smoothing spline clustering (SSC), also known as model-based functional data analysis (MFDA); functional clustering models for sparsely sampled data (FCM); and model-based clustering (MCLUST). Among distance-based methods, we considered weighted gene co-expression network analysis (WGCNA), clustering with dynamic time warping distance (DTW), and clustering with autocorrelation-based distance (ACF). We studied these algorithms in both simulated settings and case study data. Our investigations showed that FCM performed very well when gene curves were short and sparse. DTW and WGCNA performed well when gene curves were medium or long (>=10 observations). SSC performed very well when there were clusters of gene curves similar to one another. Overall, ACF performed poorly in these applications. In terms of computation time, FCM, SSC and DTW were considerably slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more accurate and biologically meaningful clustering results. WGCNA and MCLUST are the best of the 6 methods compared when performance and computation time are both taken into account. WGCNA outperforms MCLUST, but MCLUST provides model-based inference and an uncertainty measure for clustering results.
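As a rough illustration of the dynamic time warping (DTW) distance named in this abstract, the following is a minimal sketch of the textbook dynamic-programming recurrence with an absolute-difference local cost. The function name `dtw_distance` and the toy curves are hypothetical; the abstract does not say which DTW variant, step pattern, or window constraint the authors used.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (assumed variant, no window)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy expression curves (not data from the paper).
gene_a = [0.0, 1.0, 2.0, 1.0, 0.0]
gene_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape as gene_a, delayed by one time point
gene_c = [2.0, 1.0, 0.0, 1.0, 2.0]       # inverted shape

shifted = dtw_distance(gene_a, gene_b)
inverted = dtw_distance(gene_a, gene_c)
```

Because warping absorbs the one-step delay, the delayed curve ends up at distance zero from the original while the inverted curve does not. This is the property that makes DTW attractive for time-shifted expression profiles.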

    Optimizing parameters in fuzzy k-means for clustering microarray data.

    Rapid advances in microarray technologies are making it possible to analyze and manipulate large amounts of gene expression data. Clustering algorithms, such as hierarchical clustering, self-organizing maps, k-means clustering and fuzzy k-means clustering, have become important tools for expression analysis of microarray data. However, the need for prior knowledge of the number of clusters, k, and the fuzziness parameter, b, limits the usage of fuzzy clustering. Few approaches have been proposed for assigning the best possible values to such parameters. In this thesis, we use simulated annealing and fuzzy k-means clustering to determine the optimal parameters, namely the number of clusters, k, and the fuzziness parameter, b. To assess the performance of our method, we have used synthetic and real gene experiment data sets. To improve our approach, two methods, searching with a Tabu list and shrinking the scope of randomization, are applied. Our results show that a nearly optimal pair of k and b can be obtained without exploring the entire search space. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .Y37. Source: Masters Abstracts International, Volume: 44-03, page: 1419. Thesis (M.Sc.)--University of Windsor (Canada), 2005.
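To make the role of the fuzziness parameter b concrete, here is a minimal sketch of the standard fuzzy k-means (fuzzy c-means) membership update, u_ij = 1 / sum_c (d_ij / d_ic)^(2/(b-1)), where d_ij is the Euclidean distance from point i to center j. The thesis's exact formulation is not given in the abstract, so this formula, the function name `fuzzy_memberships`, and the toy data are assumptions; the simulated annealing search over k and b is not shown.

```python
def fuzzy_memberships(points, centers, b=2.0):
    """Return u[i][j], the membership of point i in cluster j; each row sums to 1."""
    if b <= 1.0:
        raise ValueError("fuzziness parameter b must be greater than 1")
    exponent = 2.0 / (b - 1.0)
    memberships = []
    for p in points:
        # Euclidean distance from this point to every cluster center
        dists = [sum((x - c) ** 2 for x, c in zip(p, center)) ** 0.5 for center in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on one or more centers: share full membership among them
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_j / d_c) ** exponent for d_c in dists) for d_j in dists]
        )
    return memberships


# Hypothetical one-dimensional toy data (not from the thesis).
points = [(0.0,), (1.0,), (4.0,)]
centers = [(0.0,), (4.0,)]

u_sharp = fuzzy_memberships(points, centers, b=2.0)
u_soft = fuzzy_memberships(points, centers, b=3.0)
```

For the middle point, raising b from 2 to 3 moves its membership in the nearer cluster from 0.9 to 0.75. Larger b gives softer assignments, which is why b has to be tuned together with k.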

    Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue

    Background: Micro- and macroarray technologies help acquire thousands of gene expression patterns covering important biological processes during plant ontogeny. Particularly, faithful visualization methods are beneficial for revealing interesting gene expression patterns and functional relationships of coexpressed genes. Such screening helps to gain deeper insights into regulatory behavior and cellular responses, as will be discussed for expression data of developing barley endosperm tissue. For that purpose, high-throughput multidimensional scaling (HiT-MDS), a recent method for similarity-preserving data embedding, is substantially refined and used for (a) assessing the quality and reliability of centroid gene expression patterns, and for (b) derivation of functional relationships of coexpressed genes of endosperm tissue during barley grain development (0–26 days after flowering). Results: Temporal expression profiles of 4824 genes at 14 time points are faithfully embedded into two-dimensional displays. Thereby, similar shapes of coexpressed genes get closely grouped by a correlation-based similarity measure. As a main result, by using power transformation of correlation terms, a characteristic cloud of points with bipolar sandglass shape is obtained that is inherently connected to expression patterns of pre-storage, intermediate and storage phase of endosperm development. Conclusion: The new HiT-MDS-2 method helps to create global views of expression patterns and to validate centroids obtained from clustering programs. Furthermore, functional gene annotation for developing endosperm barley tissue is successfully mapped to the visualization, making easy localization of major centroids of enriched functional categories possible.

    Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets

    The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual/progressive changes within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends driven by independent factors that are mixed in with each other. These factors can obscure the trends in traditional dimension reduction and projection based data visualizations. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose to use an efficient online density-based representation to make the algorithm scalable for massive datasets. The representation not only assists in trend discovery, but also in cluster detection including rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarray and arbor morphology of neurons and microglia in brain tissue. Derived representations revealed biologically meaningful hidden subspace trend(s) that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data including visualization, hypothesis generation, knowledge discovery, and prediction in diverse other applications. Electrical and Computer Engineering, Department o

    Identifying Three-Way Gene Interactions from Microarray Data using Kolmogorov-Smirnov and Cross-Match Tests

    The human gene network is much more complex than pairwise interactions among genes. Zhang et al. [6] extracted microarray data from the International Genomics Consortium (IGC) and presented the detection of three-way gene interactions using Fisher’s z-transformation test. Three-way gene interactions come closer than pairwise correlations to representing the complex gene structures, and they are more tractable than interactions among four or more genes. In this paper, we simulate different models in which Fisher’s test might not be as effective. Zhang et al.’s approach used Pearson’s correlation coefficients and detected linear interactions only. Since gene interactions could show any kind of behavior, their evaluation approach might not work most of the time. Therefore, we use the dataset Zhang et al. provided to detect three-way gene interactions with non-parametric tests such as Kolmogorov-Smirnov and Cross-Match.
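For orientation, the two-sample Kolmogorov-Smirnov statistic mentioned above is the largest gap between two empirical cumulative distribution functions. The sketch below computes only that statistic. How the paper applies it to gene triplets is not described in the abstract, so the function name `ks_statistic` and the toy samples are hypothetical, and no p-value is computed.

```python
def ks_statistic(x, y):
    """Two-sample KS statistic: max |F_x(t) - F_y(t)| over all observed values t."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        t = min(xs[i], ys[j])
        # advance both empirical CDFs past every observation <= t (handles ties)
        while i < n and xs[i] <= t:
            i += 1
        while j < m and ys[j] <= t:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


# Hypothetical toy samples (not data from the paper).
same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Because the statistic depends only on the ordering of the observations, it can pick up distributional differences that a Pearson-correlation test restricted to linear relationships may miss.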

    Modeling And Identification Of Differentially Regulated Genes Using Transcriptomics And Proteomics Data

    Photosynthetic organisms are complex dynamical systems, showing a remarkable ability to adapt to different environmental conditions for their survival. Mechanisms underlying the coordination between different cellular processes in these organisms are still poorly understood. In this dissertation we utilize various computational and modeling techniques to analyze transcriptomics and proteomics data sets from several photosynthetic organisms. We use changes in expression levels of genes to study responses of these organisms to various environmental conditions such as availability of nutrients, concentrations of chemicals in growth media, and temperature. Three specific problems studied here are transcriptomic modifications in photosynthetic organisms under reduction-oxidation (redox) stress conditions, circadian and diurnal rhythms of cyanobacteria and the effect of incident light patterns on these rhythms, and the coordination between biological processes in cyanobacteria under various growth conditions. Under redox stresses caused by high light treatments, a strong transcriptomic level response, spread across many biological processes, is discovered in the cyanobacterium Synechocystis sp. PCC 6803. Based on statistical tests, expression levels of about 20% of genes in Synechocystis 6803 are identified as significantly affected by high light. Gene clustering methods reveal that these responses can mainly be classified as transient and consistent responses, depending on the duration of modified behaviors. Many genes related to energy production as well as energy utilization are shown to be strongly affected. Analysis of microarray data under two stress conditions, high light and DCMU treatment, combined with data mining and motif finding algorithms, led to the discovery of a novel transcription factor, RRTF1, that responds to redox stresses in Arabidopsis thaliana. Time course transcriptomics data from Cyanothece sp. ATCC 51142 have shown strong diurnal rhythms. By combining multiple experimental conditions and using gene classification algorithms based on Fourier scores and angular distances, it is shown that the majority of the diurnal genes are in fact light responding. Only about 10% of genes in the genome are categorized as being circadian controlled. A transcription control model based on dynamical systems is employed to identify the interactions between diurnal genes. A phase oscillator network is proposed to model the behavior of different biological processes. Both these models are shown to carry biologically meaningful features. To study the coordination between different biological processes under various environmental and genetic modifications, an interaction model is derived using a Bayesian network approach, combining all publicly available microarray data sets for Synechocystis sp. PCC 6803. Several novel relationships between biological processes are discovered from the model. The model is used to simulate several experimental conditions, and the response of the model is shown to agree with the experimentally observed behaviors.

    ClockOME: searching for oscillatory genes in early vertebrate development

    Embryo development is a dynamic process regulated in space and time. Cells must integrate biochemical and mechanical signals to generate fully functional organisms, where oscillatory gene expression plays a key role. The embryo molecular clock (EMC) is the best known genetic oscillator active in embryo segmentation, involving genes from the Notch, FGF, and WNT pathways. However, the list of cyclic genes is still incomplete, mostly due to the challenges involved in studying periodic systems. Recently, such studies have become more feasible with the development of pseudo-time ordering algorithms that search for candidate oscillatory genes using large transcriptomics datasets sampled without explicit time measurements. This study aims to find candidate oscillatory genes (the ClockOME) active in early chick embryo development. Two Gallus gallus microarray transcriptomics datasets from presomitic mesoderm (PSM) and one dataset from limb segmentation were gathered from GEO and ArrayExpress. To normalize these data from different experiments, an R package, FrozenChicken, was developed to apply frozen Robust MultiArray (fRMA) normalization to the data. Next, the datasets were processed with Oscope (a pseudo-time ordering algorithm) to search for candidate periodic genes clustered by similar oscillatory behaviour. The clusters of predicted oscillators were then subjected to functional enrichment and interaction network analyses to highlight the biological functions associated with these genes. Oscope predicted three clusters of oscillators: two in PSM (106 and 32 genes) and one in limb (162 genes). Overall, the genes are associated with regulatory, morphological, and developmental processes. Mesp2, a gene involved in the EMC, was found in this dataset, validating the approach; however, the majority of genes are novel oscillatory candidates, associated with chromatin and transcriptional regulation, as well as protein and oxygen metabolism. 
The list of candidate oscillators represents a valuable resource for guided experimental validation to discover additional members of the chick EMC. Six genes have been proposed for high-priority experimental validation: SRC, PTCH1, NOTCH2, YAP1, KDR, CTR9. Embryonic development is a dynamic process involving molecular changes in space and time. Embryonic cells are constantly exposed to biochemical and mechanical stimuli and respond to their environment by altering their genetic program. When correctly integrated, these cellular responses culminate in the successful development of a functional organism. Embryogenesis thus involves tightly regulated molecular processes, and oscillatory gene expression is one possible way of regulating cell behaviour over time. The embryo molecular clock is a well-known genetic oscillator involved in the segmentation of embryonic paraxial tissue. The concept of a molecular clock was first proposed in 1976 by Cooke and Zeeman, who named it the Clock and Wavefront model [1]. This model was conceived to describe theoretically the rhythmic formation of somites on both sides of the paraxial mesoderm (PSM) in vertebrates, and it rests on the existence of genetic oscillators that regulate this PSM segmentation process over time. In addition to the clock, as the name indicates, the model includes a wavefront that spatially determines the behaviour of cells in the presomitic mesoderm (PSM). Together, the two mechanisms guide the differentiation of PSM cells, which consequently undergo genetic changes that precede somite formation. The basis of this molecular clock is the periodic expression of genes belonging to the Notch, FGF, and WNT molecular pathways. 
However, the list of genes involved in the embryonic clock is still incomplete, mainly because of the experimental difficulties of studying periodic systems when the periodicity/rhythm of expression of the genes involved is not known beforehand. With the advent of new transcriptomics techniques that measure the expression values of all genes simultaneously, namely microarrays or, more recently, sequencing methods such as RNA-sequencing or single-cell RNA-sequencing, comes the opportunity to extend the list of genes with oscillatory expression. However, these methods require extracting RNA from the sampled cells, resulting in cell death. This processing therefore precludes studying the same cells over time and yields static molecular data; that is, the expression levels obtained represent a single time sample. To study periodic processes, it would be necessary to build a time series by sampling different individuals across developmental time, greatly increasing the number of biological samples needed to resolve the oscillation cycle for each gene studied. Thus, without explicitly measured temporal information, oscillatory gene expression can only be studied using appropriate mathematical models, namely by applying pseudo-time ordering algorithms. These methods order the samples along the time of one oscillation so as to obtain the cyclic behaviour pattern for all genes whose expression oscillates concomitantly. It thus becomes possible, bioinformatically, to infer the oscillatory potential of genes measured by these transcriptomics techniques without explicit temporal information. 
The aim of this study is therefore to find new oscillatory genes, which we collectively call the ClockOME, that are active during the early stages of chick embryonic development (somitogenesis) in the tissues of the presomitic mesoderm (PSM) and the upper limb (Limb), tissues in which the molecular clock has been described as acting as a temporal regulator of the underlying genetic changes. To this end, 3 microarray transcriptomics datasets were collected from two public data repositories: GEO (from the American institution NCBI) and ArrayExpress (from the European institution EMBL-EBI). Two datasets contained data from paraxial mesoderm (PSM), the tissue where somitogenesis occurs, and one dataset contained data obtained from the chick embryo upper limb. To normalize the three datasets and make them comparable (since they come from different experimental procedures), an R package named “FrozenChicken: Promoting the meta-analysis of chicken microarray data” was developed (published in 2021) (https://doi.org/10.1101/2021.02.25.432894). This package contains summarized data from 472 chick embryo microarray datasets, making fRMA (frozen Robust MultiArray) normalization of Gallus gallus microarrays possible. After normalization and quality control of the gene expression values, the PSM and limb data were processed with Oscope (a pseudo-time ordering algorithm) in order to predict oscillatory genes. This algorithm evaluates all pairwise combinations of genes and groups those with similar expression patterns, that is, those whose expression values across samples follow similar trajectories, suggesting a potentially similar oscillation period. 
The gene clusters predicted by Oscope were then subjected to functional enrichment analysis and functional interaction analysis in order to understand their potential biological role and underlying molecular functions. Oscope reported three lists of potentially oscillatory genes: two groups were found in the PSM data (with 106 and 32 genes) and the third group, of 162 genes, was found in the upper limb data. In total, the list of genes we call the ClockOME comprises 296 potentially oscillatory genes involved in various regulatory mechanisms important for embryonic development and morphogenesis. Most genes in this list are not described in the literature as oscillatory (novel candidates) and therefore represent an asset for the scientific community studying the embryo molecular clock. These genes appear to be associated with functions such as chromatin remodelling, transcriptional regulation, protein metabolism, and oxygen metabolism, and are therefore good candidates for future experimental validation. Notably, Oscope successfully identified Mesp2, an oscillatory gene well described in the literature, demonstrating the validity and potential of this theoretical approach. In summary, this work produced a list of 296 potentially oscillatory genes. Based on their novelty and annotated molecular function, a list of six candidate genes of particular relevance was proposed for experimental validation in the near future, namely SRC, PTCH1, NOTCH2, YAP1, KDR, and CTR9. The lists resulting from this thesis can now guide future laboratory experiments capable of adding new molecular interactors to the current model of the embryo molecular clock.