Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects
Time course microarray data provide insight about dynamic biological
processes. While several clustering methods have been proposed for the analysis
of these data structures, comparison and selection of appropriate clustering
methods are seldom discussed. We compared probabilistic based clustering
methods and distance based clustering methods for time course microarray
data. Among probabilistic methods, we considered: smoothing spline clustering
(SSC), also known as model-based functional data analysis (MFDA), functional
clustering models for sparsely sampled data (FCM) and model-based clustering
(MCLUST). Among distance based methods, we considered: weighted gene
co-expression network analysis (WGCNA), clustering with dynamic time warping
distance (DTW) and clustering with autocorrelation based distance (ACF). We
studied these algorithms in both simulated settings and case study data. Our
investigations showed that FCM performed very well when gene curves were short
and sparse. DTW and WGCNA performed well when gene curves were medium or long
( observations). SSC performed very well when there were clusters of gene
curves similar to one another. Overall, ACF performed poorly in these
applications. In terms of computation time, FCM, SSC and DTW were considerably
slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more
accurate and biologically meaningful clustering results. WGCNA and MCLUST are the
best of the six methods compared when performance and computation
time are both taken into account. WGCNA outperforms MCLUST, but MCLUST provides
model-based inference and an uncertainty measure for the clustering results.
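For readers unfamiliar with the dynamic time warping distance behind the DTW method above, the following is a minimal sketch of the classic dynamic-programming recurrence. It assumes absolute difference as the local cost and no warping-window constraint; the function name `dtw_distance` and the toy curves are illustrative and are not taken from the paper.

```python
from math import inf


def dtw_distance(x, y):
    """Classic DTW distance between two numeric sequences.

    Assumptions (not from the paper): the local cost is |a - b| and the
    warping path is unconstrained (no Sakoe-Chiba window).
    """
    n, m = len(x), len(y)
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


# Toy expression curves: curve_b is curve_a with one extra leading time point.
curve_a = [0.0, 1.0, 2.0, 1.0, 0.0]
curve_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(curve_a, curve_b))
```

Because the warping path may reuse a time point, curves of different lengths can be compared directly. The resulting pairwise distances could then be passed to any distance-based clustering routine.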
Optimizing parameters in fuzzy k-means for clustering microarray data.
Rapid advances in microarray technologies are making it possible to analyze and manipulate large amounts of gene expression data. Clustering algorithms, such as hierarchical clustering, self-organizing maps, k-means clustering and fuzzy k-means clustering, have become important tools for expression analysis of microarray data. However, the need for prior knowledge of the number of clusters, k, and the fuzziness parameter, b, limits the usage of fuzzy clustering. Few approaches have been proposed for assigning the best possible values to such parameters. In this thesis, we use simulated annealing and fuzzy k-means clustering to determine the optimal parameters, namely the number of clusters, k, and the fuzziness parameter, b. To assess the performance of our method, we have used synthetic and real gene experiment data sets. To improve our approach, two methods, searching with a Tabu List and shrinking the scope of randomization, are applied. Our results show that a nearly optimal pair of k and b can be obtained without exploring the entire search space. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .Y37. Source: Masters Abstracts International, Volume: 44-03, page: 1419. Thesis (M.Sc.)--University of Windsor (Canada), 2005.
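To make the roles of k and b concrete, here is a minimal sketch of standard fuzzy k-means with alternating membership and center updates. It is not the thesis's method: the simulated annealing search over (k, b), the Tabu List and the scope shrinking are not shown, and the function name `fuzzy_kmeans`, the iteration count and the toy data are assumptions made for illustration.

```python
def fuzzy_kmeans(points, centers, b=2.0, iters=50):
    """Fuzzy k-means sketch. k is len(centers); b > 1 is the fuzziness.

    Returns (centers, memberships); memberships[j][i] is the degree to
    which point j belongs to cluster i, as computed in the last pass.
    """
    def sqdist(p, q):
        return sum((a - c) ** 2 for a, c in zip(p, q))

    dim = len(points[0])
    power = 1.0 / (b - 1.0)
    memberships = []
    for _ in range(iters):
        # Membership update: u_i = 1 / sum_l (d_i^2 / d_l^2)^(1/(b-1))
        memberships = []
        for p in points:
            d2 = [max(sqdist(p, c), 1e-12) for c in centers]  # guard against 0
            memberships.append(
                [1.0 / sum((di / dl) ** power for dl in d2) for di in d2]
            )
        # Center update: mean of the points weighted by u^b
        new_centers = []
        for i in range(len(centers)):
            w = [row[i] ** b for row in memberships]
            total = sum(w)
            new_centers.append(
                [sum(wj * p[d] for wj, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        centers = new_centers
    return centers, memberships


# Toy 1-D data with two obvious groups (illustrative values only).
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
final_centers, u = fuzzy_kmeans(data, centers=[[0.0], [5.0]], b=2.0)
print(final_centers)
```

As b approaches 1 the memberships approach hard assignments, and larger b makes them more diffuse. This is why a search over b, together with k, changes the clustering that a validity criterion would score.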
Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue
Background: Micro- and macroarray technologies help acquire thousands of gene expression patterns covering important biological processes during plant ontogeny. Particularly, faithful visualization methods are beneficial for revealing interesting gene expression patterns and functional relationships of coexpressed genes. Such screening helps to gain deeper insights into regulatory behavior and cellular responses, as will be discussed for expression data of developing barley endosperm tissue. For that purpose, high-throughput multidimensional scaling (HiT-MDS), a recent method for similarity-preserving data embedding, is substantially refined and used for (a) assessing the quality and reliability of centroid gene expression patterns, and for (b) derivation of functional relationships of coexpressed genes of endosperm tissue during barley grain development (0–26 days after flowering). Results: Temporal expression profiles of 4824 genes at 14 time points are faithfully embedded into two-dimensional displays. Thereby, similar shapes of coexpressed genes get closely grouped by a correlation-based similarity measure. As a main result, by using power transformation of correlation terms, a characteristic cloud of points with bipolar sandglass shape is obtained that is inherently connected to expression patterns of pre-storage, intermediate and storage phase of endosperm development. Conclusion: The new HiT-MDS-2 method helps to create global views of expression patterns and to validate centroids obtained from clustering programs. Furthermore, functional gene annotation for developing endosperm barley tissue is successfully mapped to the visualization, making easy localization of major centroids of enriched functional categories possible.
Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets
The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual/progressive changes within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends driven by independent factors that are mixed in with each other. These factors can obscure the trends in traditional dimension reduction and projection based data visualizations. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose to use an efficient online density-based representation to make the algorithm scalable for massive datasets.
The representation not only assists in trend discovery, but also in cluster detection including rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarray and arbor morphology of neurons and microglia in brain tissue. Derived representations revealed biologically meaningful hidden subspace trend(s) that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data including visualization, hypothesis generation, knowledge discovery, and prediction in diverse other applications. Electrical and Computer Engineering, Department o
Identifying Three-Way Gene Interactions from Microarray Data using Kolmogorov-Smirnov and Cross-Match Tests
The human gene network is much more complex than pairwise interactions among genes. Zhang et al. [6] extracted microarray data from the International Genomics Consortium (IGC) and detected three-way gene interactions using Fisher’s z-transformation test. Three-way gene interactions come closer than pairwise correlations to representing complex gene structures, and they are more tractable than interactions among four or more genes. In this paper, we simulate different models in which Fisher’s test might not be as effective. Zhang et al.’s approach used Pearson’s correlation coefficients and detected linear interactions only. Since gene interactions can show any kind of behavior, their evaluation approach might often fail. Therefore, we use the dataset Zhang et al. provided to detect three-way gene interactions with non-parametric tests such as the Kolmogorov-Smirnov and Cross-Match tests.
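As background for the Fisher's z-transformation test mentioned above, the sketch below compares two Pearson correlations with the textbook two-sample form of the test. Treating a three-way interaction as a change in the X-Y correlation between samples with low and high levels of a third gene Z is an assumption made here for illustration; it is not necessarily the exact procedure of Zhang et al., and the function names and toy values are hypothetical.

```python
from math import atanh, erf, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of H0: rho1 == rho2 via Fisher's z-transformation.

    Requires |r1| < 1, |r2| < 1 and n1, n2 > 3.
    """
    z = (atanh(r1) - atanh(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    # p = 2 * (1 - Phi(|z|)), with the normal CDF written through erf
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p


# Hypothetical toy values: expression of genes X and Y in samples where a
# third gene Z is low versus high.
x_low, y_low = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
x_high, y_high = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [3.0, 1.0, 4.0, 2.0, 3.5, 2.5]
r_low, r_high = pearson(x_low, y_low), pearson(x_high, y_high)
z_stat, p_value = fisher_z_test(r_low, len(x_low), r_high, len(x_high))
print(r_low, r_high, z_stat, p_value)
```

Because the statistic is built from Pearson's r, it only responds to changes in linear association, which is the limitation that motivates the non-parametric tests in the abstract.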
Modeling And Identification Of Differentially Regulated Genes Using Transcriptomics And Proteomics Data
Photosynthetic organisms are complex dynamical systems, showing a remarkable ability to adapt to different environmental conditions for their survival. Mechanisms underlying the coordination between different cellular processes in these organisms are still poorly understood. In this dissertation we utilize various computational and modeling techniques to analyze transcriptomics and proteomics data sets from several photosynthetic organisms. We use changes in expression levels of genes to study responses of these organisms to various environmental conditions such as availability of nutrients, concentrations of chemicals in growth media, and temperature. Three specific problems studied here are transcriptomics modifications in photosynthetic organisms under reduction-oxidation (redox) stress conditions, circadian and diurnal rhythms of cyanobacteria and the effect of incident light patterns on these rhythms, and the coordination between biological processes in cyanobacteria under various growth conditions. Under redox stresses caused by high light treatments, a strong transcriptomic level response, spread across many biological processes, is discovered in the cyanobacterium Synechocystis sp. PCC 6803. Based on statistical tests, expression levels of about 20% of genes in Synechocystis 6803 are identified as significantly affected by high light. Gene clustering methods reveal that these responses can mainly be classified as transient and consistent responses, depending on the duration of modified behaviors. Many genes related to energy production as well as energy utilization are shown to be strongly affected. Analysis of microarray data under two stress conditions, high light and DCMU treatment, combined with data mining and motif finding algorithms led to the discovery of a novel transcription factor, RRTF1, that responds to redox stresses in Arabidopsis thaliana. Time course transcriptomics data from Cyanothece sp.
ATCC 51142 have shown strong diurnal rhythms. By combining multiple experimental conditions and using gene classification algorithms based on Fourier scores and angular distances, it is shown that the majority of the diurnal genes are in fact light responding. Only about 10% of genes in the genome are categorized as being circadian controlled. A transcription control model based on dynamical systems is employed to identify the interactions between diurnal genes. A phase oscillator network is proposed to model the behavior of different biological processes. Both these models are shown to carry biologically meaningful features. To study the coordination between different biological processes under various environmental and genetic modifications, an interaction model is derived using a Bayesian network approach, combining all publicly available microarray data sets for Synechocystis sp. PCC 6803. Several novel relationships between biological processes are discovered from the model. The model is used to simulate several experimental conditions, and the response of the model is shown to agree with the experimentally observed behaviors.
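The abstract does not define its Fourier scores or angular distances, so the following is only a sketch of one common formulation: the magnitude and phase of the mean-centered profile projected onto a sine and cosine at an assumed period, plus the smallest angle between two phases. The function names, the 24-hour period and the sampling times are assumptions made for illustration.

```python
from math import atan2, cos, pi, sin, sqrt


def fourier_score(times, values, period):
    """Return (score, phase) of a profile at the given period.

    score = sqrt(A^2 + B^2), with A and B the projections of the
    mean-centered profile onto sin(w t) and cos(w t), w = 2*pi/period.
    """
    w = 2.0 * pi / period
    mean = sum(values) / len(values)
    a = sum((v - mean) * sin(w * t) for t, v in zip(times, values))
    b = sum((v - mean) * cos(w * t) for t, v in zip(times, values))
    return sqrt(a * a + b * b), atan2(a, b)


def angular_distance(phase1, phase2):
    """Smallest absolute angle between two phases, in [0, pi]."""
    d = abs(phase1 - phase2) % (2.0 * pi)
    return min(d, 2.0 * pi - d)


# Toy profiles sampled every 4 h over two days (illustrative values only).
hours = list(range(0, 48, 4))
gene_cos = [cos(2.0 * pi * t / 24.0) for t in hours]
gene_sin = [sin(2.0 * pi * t / 24.0) for t in hours]
gene_flat = [1.0 for _ in hours]

score_cos, phase_cos = fourier_score(hours, gene_cos, period=24.0)
score_sin, phase_sin = fourier_score(hours, gene_sin, period=24.0)
score_flat, _ = fourier_score(hours, gene_flat, period=24.0)
print(score_cos, score_flat, angular_distance(phase_cos, phase_sin))
```

A high score marks a profile that oscillates at the chosen period, and the angular distance between phases groups genes that peak at similar times of day.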
ClockOME: searching for oscillatory genes in early vertebrate development
Embryo development is a dynamic process regulated in space and time. Cells must
integrate biochemical and mechanical signals to generate fully functional organisms, where
oscillatory gene expression plays a key role. The embryo molecular clock (EMC) is the best
known genetic oscillator active in embryo segmentation, involving genes from the Notch, FGF,
and WNT pathways. However, the list of cyclic genes is still incomplete mostly due to the
challenges involved with studying periodic systems. Recently, such studies have become more
feasible with the development of pseudo-time ordering algorithms that search for candidate
oscillatory genes using large transcriptomics datasets sampled without explicit time
measurements.
This study aims at finding candidate oscillatory genes - ClockOME - active in early
chick embryo development.
Two Gallus gallus microarray transcriptomics datasets from Presomitic mesoderm
(PSM), and one dataset from limb segmentation were gathered from GEO and ArrayExpress.
To normalize these data from different experiments, an RData package - FrozenChicken - was
developed to apply a frozen Robust MultiArray (fRMA) normalization to the data. Next the
datasets were processed with Oscope (a pseudo-time ordering algorithm) to search for candidate
periodic genes clustered by similar oscillatory behaviour. The clusters of predicted oscillators
were then subject to functional enrichment and interaction network analyses to highlight the
biological functions associated with these genes. Oscope predicted three clusters of oscillators:
two in PSM (106 and 32 genes), and one in Limb (162 genes). Overall, the genes are associated
with regulatory, morphological, and developmental processes. Mesp2, a gene involved with the
EMC, was found in this dataset, validating the approach; however, the majority of genes are
novel oscillatory candidates, associated with chromatin and transcriptional regulation, as well
as protein and oxygen metabolism. The list of candidate oscillators represents a valuable
resource for guided experimental validation to discover additional members of the chick EMC.
Six genes have been proposed for high-priority experimental validation: SRC, PTCH1,
NOTCH2, YAP1, KDR, CTR9.
Embryonic development is a dynamic process that involves molecular changes in space and
time. Embryonic cells are constantly exposed to biochemical and mechanical stimuli, and they
respond to their environment by altering their genetic program. When correctly integrated,
these cellular responses culminate in the successful development of a functional organism.
Embryogenesis therefore involves tightly regulated molecular processes, and oscillatory gene
expression is one of the possible ways of regulating cell behaviour over time.
The embryo molecular clock is a well-known genetic oscillator involved in the segmentation
of the embryonic paraxial tissue. The concept of a molecular clock was first proposed in 1976
by Cooke and Zeeman, who called it the Clock and Wavefront model [1]. The model was conceived
to describe theoretically the rhythmic formation of somites on both sides of the paraxial
mesoderm in vertebrates, and it rests on the existence of genetic oscillators that regulate
the segmentation of the presomitic mesoderm (PSM) over time. Besides the clock, as the name
indicates, the model includes a wavefront that spatially determines the behaviour of the
cells in the PSM. The two mechanisms thus guide the differentiation of PSM cells, which
consequently undergo genetic changes that precede somite formation. The basis of this
molecular clock is the periodic expression of genes belonging to the Notch, FGF and WNT
pathways. However, the list of genes involved in the embryonic clock is still incomplete,
mainly because of the experimental difficulties of studying periodic systems when the
periodicity/rhythm of expression of the genes involved is not known in advance.
With the advent of new transcriptomics techniques that measure the expression of all genes
simultaneously, namely microarrays and, more recently, sequencing methods such as
RNA-sequencing or single-cell RNA-sequencing, there is an opportunity to extend the list of
genes with oscillatory expression. However, these methods require extracting RNA from the
sampled cells, which kills them. This processing makes it impossible to follow the same cells
over time and yields static molecular data; that is, the expression levels obtained represent
a single time sample. To study periodic processes, one would therefore need a time series that
samples different individuals along developmental time, greatly increasing the number of
biological samples required to resolve the oscillation cycle of each gene studied.
Thus, without explicitly measured temporal information, oscillatory gene expression can only
be studied with appropriate mathematical models, namely by applying pseudo-time ordering
algorithms. These methods order the samples along the time of one oscillation so as to
recover the cyclic pattern of all genes whose expression oscillates concomitantly. It thus
becomes possible to infer bioinformatically the oscillatory potential of genes measured by
these transcriptomics techniques, without explicit temporal information.
The aim of this study is therefore to find new oscillatory genes, collectively called the
ClockOME, that are active during the early stages of chick embryonic development
(somitogenesis) in the presomitic mesoderm (PSM) and in the upper limb (Limb), tissues in
which the molecular clock has been described as a temporal regulator of the underlying
genetic changes.
To this end, three microarray transcriptomics datasets were collected from two public data
repositories: GEO (from the American institution NCBI) and ArrayExpress (from the European
institution EMBL-EBI). Two datasets contained data from the presomitic mesoderm (PSM), the
tissue in which somitogenesis occurs, and one dataset contained data from the upper limb of
the chick embryo. To normalize the three datasets and make them comparable (since they come
from different experimental procedures), an R package named “FrozenChicken: Promoting the
meta-analysis of chicken microarray data” was developed (published in 2021)
(https://doi.org/10.1101/2021.02.25.432894). This package contains summarized data from 472
chick embryo microarray datasets, making fRMA (frozen Robust MultiArray) normalization of
Gallus gallus microarrays possible. After normalization and quality control of the gene
expression values, the PSM and limb data were processed with Oscope (a pseudo-time ordering
algorithm) in order to predict oscillatory genes. This algorithm evaluates all pairwise
combinations of genes and groups those with similar expression patterns, that is, those whose
expression values follow similar trajectories across the samples, indicating a potentially
similar oscillation period. The gene clusters predicted by Oscope were then subjected to
functional enrichment analysis and functional interaction analysis in order to understand
their potential biological role and underlying molecular functions.
Oscope reported three lists of potentially oscillatory genes: two groups were found in the
PSM data (with 106 and 32 genes, respectively) and a third group of 162 genes was found in the
upper limb data. In total, the gene list we call the ClockOME comprises 296 potentially
oscillatory genes involved in several regulatory mechanisms important for embryonic
development and morphogenesis. Most genes in this list have not been described in the
literature as oscillatory (novel candidates) and therefore represent an asset for the
scientific community studying the embryo molecular clock. These genes appear to be associated
with functions such as chromatin remodelling, transcriptional regulation, protein metabolism
and oxygen metabolism, and are therefore good candidates for future experimental validation.
Notably, Oscope successfully identified Mesp2, an oscillatory gene well described in the
literature, demonstrating the validity and potential of this theoretical approach.
In summary, this work produced a list of 296 potentially oscillatory genes. Based on their
novelty and annotated molecular function, a list of six candidate genes of particular
relevance for experimental validation in the near future was proposed, namely SRC, PTCH1,
NOTCH2, YAP1, KDR and CTR9. The lists resulting from this thesis can now guide future
laboratory experiments capable of adding new molecular interactors to the current model of
the embryo molecular clock.
- …