794 research outputs found

    Machine learning approaches to supporting the identification of photoreceptor-enriched genes based on expression data

    BACKGROUND: Retinal photoreceptors are highly specialised cells that detect light and are central to mammalian vision. Many retinal diseases result from inherited dysfunction of the rod and cone photoreceptor cells. Development and maintenance of photoreceptors require appropriate regulation of the many genes specifically or highly expressed in these cells. Over recent decades, different experimental approaches have been developed to identify photoreceptor-enriched genes. Recent progress in RNA analysis technology has generated large amounts of gene expression data relevant to retinal development. This paper assesses a machine learning methodology for supporting the identification of photoreceptor-enriched genes based on expression data. RESULTS: Based on the analysis of publicly available gene expression data from the developing mouse retina generated by serial analysis of gene expression (SAGE), this paper presents a predictive methodology comprising several in silico models for detecting key complex features and relationships encoded in the data, which may be useful for distinguishing genes by their functional roles. To understand temporal patterns of photoreceptor gene expression during retinal development, a two-way cluster analysis was first performed. Clustering SAGE libraries yielded a hierarchical tree reflecting relationships between developmental stages. Clustering SAGE tags revealed a more comprehensive expression profile for photoreceptor cells. To demonstrate the usefulness of machine learning-based models in predicting functional associations from the SAGE data, three supervised classification models were compared. The results indicated that a relatively simple instance-based model (KStar) performed significantly better than more complex algorithms, e.g. neural networks. 
To address the functional class imbalance in the dataset, two data re-sampling techniques were studied. A random over-sampling method supported the implementation of the most powerful prediction models. The KStar model also achieved higher predictive sensitivities and specificities when random over-sampling was used. CONCLUSION: The approaches assessed in this paper represent an efficient and relatively inexpensive in silico methodology for supporting large-scale analysis of photoreceptor gene expression by SAGE. They may be applied as complementary methodologies to support functional predictions before implementing more comprehensive, experimental prediction and validation methods. They may also be combined with other large-scale, data-driven methods to facilitate the inference of transcriptional regulatory networks in the developing retina. Furthermore, the methodology assessed may be applied to other data domains.
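
The random over-sampling step mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `random_oversample`, the toy tag counts, and the class labels are hypothetical, and the sketch assumes that over-sampling means duplicating randomly chosen minority-class examples until every class matches the largest one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the class to fill the gap.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: each row is one tag's counts across three libraries.
tags = [[5, 0, 1], [4, 1, 0], [0, 7, 9], [1, 6, 8], [0, 8, 7], [2, 9, 9]]
classes = ["photoreceptor", "photoreceptor", "other", "other", "other", "other"]

balanced_x, balanced_y = random_oversample(tags, classes)
print(Counter(balanced_y))
```

In a cross-validation setting, re-sampling of this kind is normally applied to the training folds only, so that duplicated examples do not leak into the held-out data.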

    Clustering-based approaches to SAGE data mining

    Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It emphasises current limitations and opportunities in this area for supporting biologically meaningful data mining and visualisation.

    Network inference from sparse single-cell transcriptomics data: Exploring, exploiting, and evaluating the single-cell toolbox

    Large-scale transcriptomics studies have revolutionised the fields of systems biology and medicine, yielding deeper mechanistic insights into biological pathways and molecular functions. However, conventional bulk RNA-sequencing analyses an averaged signal from many input cells, which are homogenised during the experimental procedure. Hence, those insights represent only a coarse-grained picture, potentially missing information from rare or unidentified cell types. Offering an unprecedented level of resolution, single-cell transcriptomics may help to identify and characterise new cell types, unravel developmental trajectories, and facilitate inference of cell type-specific networks. Despite these tempting promises, one main limitation currently hampers many downstream tasks: single-cell RNA-sequencing data is characterised by a high degree of sparsity. Because of this limitation, no reliable network inference tool has so far been able to disentangle the hidden information in single-cell data. Single-cell correlation networks likely hold previously masked information and could yield new insights into cell type-specific networks. To harness the potential of single-cell transcriptomics data, this dissertation sought to evaluate the influence of data dropout on network inference and how it might be alleviated. Two premises must be met to fulfil the promise of cell type-specific networks: (I) cell type annotation and (II) reliable network inference. Since any experimentally generated scRNA-seq data is associated with an unknown degree of dropout, a benchmarking framework was set up using a synthetic gold-standard data set, which was subsequently subjected to different defined degrees of dropout. Aiming to desparsify the dropout-afflicted data, the influence of various imputation tools on the network structure was further evaluated. 
The results highlighted that, for moderate dropout levels, a deep count autoencoder (DCA) outperformed the other tools and the unimputed data. To fulfil the premise of cell type annotation, the impact of data imputation on cell-cell correlations was investigated using a human retina organoid data set. The results showed that no imputation tool interfered with cell cluster annotation. Based on the encouraging results of the benchmarking analysis, a window of opportunity was identified that allowed for meaningful network inference from imputed single-cell RNA-seq data. Therefore, the inference of cell type-specific networks following DCA imputation was evaluated in a human retina organoid data set. To understand the differences and commonalities of cell type-specific networks, these were analysed for cones and rods, two closely related photoreceptor cell types of the retina. Comparing the importance of rod and cone marker genes between the respective cell type-specific networks showed that these genes were of high importance, i.e. had hub-gene-like properties, in one module of the corresponding network but were of less importance in the opposing network. Furthermore, it was analysed how many hub genes in general preserved their status across cell type-specific networks and whether they were associated with similar or diverging sub-networks. While a set of preserved hub genes was identified, a few were linked to completely different network structures. One candidate was EIF4EBP1, a eukaryotic translation initiation factor binding protein, which is associated with a retinal pathology called age-related macular degeneration (AMD). These results suggest that, given well-defined prerequisites, data imputation via DCA can indeed facilitate cell type-specific network inference, delivering promising biological insights. 
For AMD, a major cause of central vision loss in patients older than 65, neither well-defined mechanisms of pathogenesis nor treatment options are available. However, light can be shed on this disease by using organoid model systems, since they resemble the in vivo organ composition while reducing its complexity and ethical concerns. Therefore, a recently developed human retina organoid system (HRO) was investigated using the single-cell toolbox to evaluate whether it provides a useful basis for studying the onset and progression of AMD in the future. In particular, different workflows for a robust and in-depth annotation of cell types were used, including literature-based and transfer learning approaches. These showed that the organoid system may reproduce hallmarks of a more central retina, which is an important determinant of AMD pathogenesis. Trajectory analysis further showed that the organoids in part reproduce major developmental hallmarks of the retina, but that different HRO samples exhibited developmental differences that point to different degrees of maturation. Altogether, this analysis characterised a human retinal organoid system in depth, revealing in vivo-like outcomes and features as well as pinpointing discrepancies. These results could be used to refine culture conditions during organoid differentiation to optimise its utility as a disease model. In summary, this dissertation describes a workflow that, in contrast to the current state of the art in the literature, enables the inference of cell type-specific gene regulatory networks. The thesis illustrates that such networks indeed differ even between closely related cells. Thus, single-cell transcriptomics can yield unprecedented insights into cell regulatory principles that are not yet understood, particularly for rare cell types that are so far hardly reflected in bulk-derived RNA-seq data.
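
The benchmarking idea described in this abstract, taking a dropout-free reference data set and degrading it with defined degrees of dropout, can be sketched as follows. This is only an illustration under simplifying assumptions: the names `apply_dropout` and `sparsity` and the toy matrix are hypothetical, and dropout is modelled as uniform random zeroing of non-zero counts, whereas real dropout typically depends on expression level. The dissertation's actual framework is not specified in the abstract.

```python
import random


def apply_dropout(counts, rate, seed=0):
    """Return a copy of a cells x genes count matrix in which each
    non-zero entry is set to zero with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    return [
        [0 if (v != 0 and rng.random() < rate) else v for v in row]
        for row in counts
    ]


def sparsity(counts):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for v in row if v == 0)
    return zeros / total


# Hypothetical "gold" matrix: 3 cells x 4 genes.
gold = [[3, 0, 5, 2], [1, 4, 0, 6], [2, 2, 7, 1]]

for rate in (0.0, 0.3, 0.6):
    noisy = apply_dropout(gold, rate, seed=42)
    print(rate, round(sparsity(noisy), 2))
```

A network inferred from each degraded matrix, with or without imputation, could then be compared against the network inferred from the unmodified reference.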

    Novel Approaches to Studying the Effects of Cis-Regulatory Variants in the Central Nervous System

    For decades, studies of the genetic basis of disease have focused on rare coding mutations that disrupt protein function, leading to the identification of hundreds of genes underlying Mendelian diseases. However, many complex diseases are non-Mendelian, and less than 2% of the genome is coding. It is now clear that non-coding variants contribute to disease susceptibility, but the precise underlying mechanisms are generally unknown. Cis-regulatory elements (CREs) are transcription factor (TF)-bound genomic regions that regulate gene expression, and variants within CREs can therefore modify gene expression. The putative locations of CREs in a variety of cell types have been identified through genome-wide assays of TF binding and epigenomic signatures, providing a starting point for probing the effects of cis-regulatory variants. Unlike coding mutations, which can be interpreted based on the genetic code, the functional consequence of any given cis-regulatory variant is difficult to predict even at the molecular level. Therefore, a major bottleneck lies in interpreting the functional significance of these variants. In the present work, I study the effects of cis-regulatory variants in the central nervous system (CNS), specifically in retina and brain. The retina is composed of well-characterized neuronal cell types and an extensively studied transcriptional network, while the brain is the center of human cognition and a target of devastating neuropsychiatric diseases. First, I take advantage of the genetic diversity between two distantly related mouse strains to describe the relationship between cis-regulatory variants and differences in retinal gene expression. I identify cis- and trans-regulatory effects, as well as parent-of-origin effects. Second, I develop a new technology based on an existing massively parallel reporter assay, CRE-seq, to enable the functional study of long CREs in the CNS in vivo for the first time. 
I demonstrate the ability of this approach to measure tissue-specific cis-regulatory activity in the brain and to pinpoint DNA bases critical for activity. Finally, I conduct a detailed mechanistic study of a non-coding region containing variants associated with both human cognitive performance and bipolar disorder. This last study illustrates the complexities and challenges of establishing the causal role of non-coding variants in disease.

    Data Representation for Learning and Information Fusion in Bioinformatics

    This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis. The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While their asymptotic manifold recovery behavior has been well-characterized, the clustering properties of Laplacian embeddings with finite data are largely motivated by heuristic arguments. We develop a precise bound, characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data.
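
For orientation, the standard Laplacian Eigenmaps construction that this abstract refers to can be written out as below. The notation ($W$, $D$, $L$, the heat-kernel parameter $t$, the embedding dimension $m$) is the usual textbook notation and is an assumption here, since the abstract introduces no symbols; the thesis's own bound on cluster preservation is not reproduced.

```latex
% Edge weights on a neighbourhood graph over data points x_1, ..., x_n
% (heat-kernel weights; W_{ij} = 0 for non-neighbouring points)
W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{t}\right)

% Degree matrix and graph Laplacian
D_{ii} = \sum_{j} W_{ij}, \qquad L = D - W

% Generalized eigenproblem, with eigenvalues ordered
% 0 = \lambda_0 \le \lambda_1 \le \dots \le \lambda_{n-1}
L f_k = \lambda_k D f_k

% m-dimensional embedding from the first m non-trivial eigenvectors
x_i \mapsto \bigl(f_1(i), \dots, f_m(i)\bigr)

% Equivalent objective: neighbouring points stay close in the embedding
\min_{Y^\top D Y = I} \; \sum_{i,j} W_{ij} \lVert y_i - y_j \rVert^2
  = \min_{Y^\top D Y = I} \; 2\,\operatorname{tr}\!\left(Y^\top L Y\right)
```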

    Transcriptomic Analysis of Light-Induced Genes in Nasonia vitripennis: Possible Implications for Circadian Light Entrainment Pathways

    Circadian entrainment to the environmental day–night cycle is essential for the optimal use of environmental resources. In insects, opsin-based photoreception in the compound eye and ocelli and CRYPTOCHROME1 (CRY1) in circadian clock neurons are thought to be involved in sensing photic information, but the genetic regulation of circadian light entrainment in species without light-sensitive CRY1 remains unclear. To elucidate a possible CRY1-independent light transduction cascade, we analyzed light-induced gene expression through RNA-sequencing in Nasonia vitripennis. Entrained wasps were subjected to a light pulse in the subjective night to reset the circadian clock, and light-induced changes in gene expression were characterized at four different time points in wasp heads. We used co-expression, functional annotation, and transcription factor binding motif analyses to gain insight into the molecular pathways in response to acute light stimulus and to form hypotheses about the circadian light-resetting pathway. Maximal gene induction was found after 2 h of light stimulation (1432 genes), and this included the opsin gene opblue and the core clock genes cry2 and npas2. Pathway and cluster analyses revealed light activation of glutamatergic and GABA-ergic neurotransmission, including CREB and AP-1 transcription pathway signaling. This suggests that circadian photic entrainment in Nasonia may require pathways that are similar to those in mammals. We propose a model for hymenopteran circadian light-resetting that involves opsin-based photoreception, glutamatergic neurotransmission, and gene induction of cry2 and npas2 to reset the circadian clock.

    Combining Support Vector Machines to Predict Novel Angiogenesis Genes

    Cancer is today one of the most common and dangerous diseases, causing 13% of all deaths worldwide each year. Despite years of effort, no effective treatment for the disease has yet been found. It is known, however, that angiogenesis plays an important role in cancer development: the tumor causes the surrounding blood vessels to branch and grow. A better understanding of this process could potentially enable new and more effective treatments. Experiments conducted over the years have measured the expression of most human genes under more than 5000 conditions. In addition, our collaborators have compiled a list of 341 genes related to blood vessel formation. The aim of this work is to investigate how new angiogenesis-related genes can be predicted from gene expression data and a small set of known angiogenesis genes. To this end, several existing machine learning methods and publicly available bioinformatics tools that could be used for candidate gene prediction are first compared. All of these methods are given inputs that are as similar as possible, and 10-fold cross-validation is used to measure how well they recover already known angiogenesis genes. The second part of the work proposes a novel Comb-SVM method for predicting candidate genes. Its main idea rests on three steps. First, known angiogenesis genes and randomly selected negative genes are used to train several Support Vector Machine-based classifiers in parallel. Next, these classifiers are used to predict new angiogenesis genes. Finally, the results of all classifiers are aggregated into a single prediction. The work shows that, based on 10-fold cross-validation, Comb-SVM is more accurate than most existing methods. It also shows that Comb-SVM predictions are considerably more stable under small changes in the training data than the results of the second-best algorithm. Finally, the scientific literature and the Gene Ontology database are used to confirm that the newly predicted genes are indeed related to angiogenesis.
Angiogenesis is the process of growing new blood vessels. It is part of normal bodily functions like wound healing, but it also plays an important role in cancer development. Without angiogenesis, tumors would not be able to grow larger than 1-2 millimeters in diameter due to the lack of oxygen and nutrients. However, only some of the genes involved in angiogenesis are known. In this work, we propose a new Comb-SVM machine learning method to predict new members of the positive class without requiring clearly defined negative examples. The idea is to train multiple Support Vector Machines (SVMs) using known genes as positive samples and various randomly selected sets of genes as negative examples. The multiple SVMs are then used to separately classify all remaining human genes, and the results are finally aggregated using a rank aggregation algorithm. The outcome is a list of genes ranked according to their similarity to known input genes. We applied this method to 341 known angiogenesis genes. Experiments were conducted on a large Affymetrix microarray gene expression matrix consisting of 5732 experiments and 22283 probe sets obtained from ArrayExpress. We compared Comb-SVM to many other state-of-the-art approaches. According to cross-validation experiments, our method outperformed most of the existing methods when looking at areas under Receiver Operating Characteristic and Precision-Recall curves. We also determined that our method gave significantly more stable results than the second-best approach. Finally, we verified the biological relevance of the predicted genes by searching the literature and Gene Ontology.
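
The three steps described in this abstract (train several classifiers on the known positives plus different random negative sets, score the remaining genes with each, aggregate the rankings) can be sketched in a few lines. This is an illustration of the ensemble structure only, not the authors' Comb-SVM: to keep the example dependency-free, a simple centroid-difference linear scorer stands in for each SVM, mean rank stands in for the unspecified rank aggregation algorithm, and all names (`train_scorer`, `comb_rank`, the gene identifiers, the toy expression values) are hypothetical.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_scorer(pos_rows, neg_rows):
    """Stand-in for an SVM (assumption): score a vector by projecting it
    onto the direction from the negative to the positive centroid."""
    w = [p - n for p, n in zip(centroid(pos_rows), centroid(neg_rows))]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def to_ranks(scores):
    """Map each gene to its rank, where 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def comb_rank(expr, positives, n_models=5, seed=0):
    """Rank unlabeled genes by mean rank over an ensemble of scorers,
    each trained against a different random negative set."""
    rng = random.Random(seed)
    unlabeled = sorted(g for g in expr if g not in positives)
    pos_rows = [expr[g] for g in sorted(positives)]
    all_ranks = []
    for _ in range(n_models):
        # Random "negatives" drawn from the unlabeled genes.
        neg = rng.sample(unlabeled, k=min(len(pos_rows), len(unlabeled)))
        score = train_scorer(pos_rows, [expr[g] for g in neg])
        all_ranks.append(to_ranks({g: score(expr[g]) for g in unlabeled}))
    mean_rank = {g: sum(r[g] for r in all_ranks) / n_models for g in unlabeled}
    return sorted(unlabeled, key=lambda g: mean_rank[g])


# Hypothetical expression profiles (gene -> values in two conditions).
expr = {
    "P1": [9.0, 1.0], "P2": [8.0, 2.0],          # known positives
    "U1": [8.5, 1.5],                            # resembles the positives
    "U2": [1.0, 9.0], "U3": [2.0, 8.0],
    "U4": [1.5, 7.0], "U5": [0.0, 9.0],
}
ranking = comb_rank(expr, positives={"P1", "P2"})
print(ranking)
```

Because each scorer sees a different random negative set, an unlabeled gene that happens to be drawn as a negative in one model can still rank highly in the aggregate, which is one plausible reason for the stability the abstract reports.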

    Discovering circadian clocks in microbes

    We humans experience the influence of our circadian clock every day. This clock mechanism causes, for example, jet lag during transatlantic air travel. We now believe that almost all organisms have developed a circadian clock mechanism. In this thesis I describe the analysis techniques we developed and the newly discovered molecular components of a circadian mechanism in Saccharomyces cerevisiae and Bacillus subtilis. To identify these molecular components, I applied structured zeitgebers, i.e. light and temperature cycling, to yeast and Bacillus cultures, in conjunction with bioinformatic in silico approaches. In Bacillus biofilm populations, we found a free-running rhythm of ytvA and KinC activity of nearly 24 hours after entrainment and release to constant dark and temperature conditions. The free-running oscillations are temperature compensated. This is one of the most important features of a circadian clock mechanism, making it very likely that such a system exists in B. subtilis. We found in yeast that temperature appears to mainly regulate metabolic processes. Light appears to act more indirectly via photo-oxidation of mitochondrial cytochromes. Finally, I present a hypothetical model for an integrated circadian clock mechanism in unicellular microbes with an emphasis on S. cerevisiae. This mechanism involves several metabolic pathways, and the main regulator is the stress-sensitive transcriptional activator Msn2p. The model shows that energy metabolism appears to be an important theme in the yeast circadian clock mechanism. Other relevant processes are nitrogen compound metabolism, oxidation-reduction processes, and fatty acid metabolism. All could serve as a starting point for further research on the circadian clock in yeast.