227 research outputs found

    Disease gene recognition and editing optimization through knowledge learned from domain feature spaces

    University of Technology Sydney. Faculty of Engineering and Information Technology. This thesis presents computational methods for recognizing disease genes and for optimally designing CRISPR/Cas9 systems that edit them. The key innovation in these methods is the feature space and characteristics captured from biological domain knowledge through machine learning algorithms. Disease-gene association prediction is studied in Chapters 3-5. Disease gene recognition is an active topic in many fields, especially biology, medicine and pharmacology. Non-coding genes, genes without protein products, have been shown to play important roles in disease development. In particular, two kinds of non-coding gene products, microRNA (miRNA) and long non-coding RNA (lncRNA), have attracted much attention because they are abundantly expressed in various tissues and frequently interact with other biomolecules, e.g. DNA, RNA and protein. Disease-ncRNA relationships remain largely unknown, and computational methods can help fill this gap. To overcome the limitations of existing computational methods, such as heavy reliance on network structures and similarity measurements or a lack of reliable negative samples, this thesis presents two novel methods. The first, described in Chapter 3, is a precomputed kernel matrix support vector machine (SVM) method for predicting disease-related miRNAs. The precomputed kernel matrix was built by integrating several kinds of similarities computed from effective characteristics of miRNAs and diseases. Reliable negative samples were collected by analyzing published array and sequencing data. This binary classification method accurately predicts disease-miRNA associations and outperforms state-of-the-art methods.
In Chapter 4, the predicted novel disease-miRNA associations were combined with known relationships among diseases, miRNAs and genes to reconstruct a disease-gene-miRNA (DGR) tripartite network. Reliable multi-disease associated co-functional miRNA pairs were extracted from this DGR network for cross-disease analysis by defining a co-function score. This not only demonstrates the proposed method’s effectiveness but also contributes to the study of multi-purpose miRNA therapeutics. The second method, described in Chapter 5, is a bagging SVM-based positive-unlabeled learning method for disease-lncRNA prioritization. It characterizes a disease by the chromosome distribution and pathway enrichment properties of its related genes. Disease-lncRNA pairs were represented as novel feature vectors to train the bagging SVM for predicting disease-lncRNA associations. This representation contributes to the superior performance of the proposed method in disease-lncRNA prediction, even when a given disease has no currently recognized lncRNA genes. After the relationships between genes and diseases have been confirmed, one of the most difficult tasks is to investigate the molecular mechanisms and treatment of the diseases in light of their related genes. The CRISPR/Cas9 system is a promising gene editing tool for manipulating genes in order to clarify disease-gene function and cure genetic diseases. Designing an optimal CRISPR/Cas9 system can not only improve its editing efficiency but also reduce its side effects, i.e. off-target editing. Furthermore, off-target site detection requires genome-wide sequence inspection, which makes it a more challenging task. On-target cutting efficiency prediction and off-target site detection for the CRISPR/Cas9 system are discussed in Chapters 6 and 7 respectively.
To accurately measure the CRISPR/Cas9 system’s cutting efficiency, the profiled Markov properties and some cutting position related features were merged into the feature space representing the single-guide RNAs (sgRNAs). These features were learned by a two-step averaging method in which the predictions of an XGBoost model and an SVM were averaged to give the final results. Performance evaluations and comparisons demonstrate that this method predicts an sgRNA’s cutting efficiency with consistently good performance, regardless of whether the sgRNA is expressed from a U6 promoter in cells or from a T7 promoter in vitro. For off-target site detection, a sample was defined as an on-target-off-target site sequence pair, turning the problem into a classification task. Each sample was numerically described by nucleotide composition change features and mismatch distribution properties. An ensemble classifier was constructed to distinguish real off-target sites from no-editing sites of a given sgRNA. Its excellent performance was confirmed in different test scenarios and case studies.
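To make the pair-based encoding in the abstract above more concrete, here is a minimal Python sketch of how an on-target/off-target sequence pair might be turned into numeric features. It is only an illustration under stated assumptions: the function name `pair_features`, the feature names, and the split of the guide into two halves are hypothetical and are not taken from the thesis, whose actual feature set is not given in the abstract.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def pair_features(on_target: str, candidate: str) -> dict:
    """Encode an (on-target, candidate off-target) pair as simple features.

    Hypothetical illustration only; not the thesis's actual feature set.
    """
    on_target, candidate = on_target.upper(), candidate.upper()
    if len(on_target) != len(candidate):
        raise ValueError("sequences must have the same length")

    # 0-based positions at which the two sequences differ.
    mismatches = [
        i for i, (a, b) in enumerate(zip(on_target, candidate)) if a != b
    ]

    on_counts, cand_counts = Counter(on_target), Counter(candidate)
    features = {"n_mismatches": len(mismatches)}

    # Nucleotide composition change: per-base count difference.
    for nt in NUCLEOTIDES:
        features[f"delta_{nt}"] = cand_counts[nt] - on_counts[nt]

    # Mismatch distribution: counts in the first and second half of the
    # sequence (an assumed, deliberately coarse binning).
    half = len(on_target) // 2
    features["mismatches_first_half"] = sum(1 for i in mismatches if i < half)
    features["mismatches_second_half"] = sum(1 for i in mismatches if i >= half)
    return features


if __name__ == "__main__":
    print(pair_features("ACGTACGTACGTACGTACGT", "GCGTACGTACGTACGTACGC"))
```

Feature dictionaries of this kind could then be stacked into a matrix and passed to any classifier; the ensemble classifier mentioned in the abstract is not reproduced here.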

    Genetic and epigenetic changes associated with polygenic left ventricular hypertrophy

    Cardiac hypertrophy (CH) is the thickening of the heart muscle, which reduces cardiac function and increases the risk of cardiac disease. Pathological CH commonly presents as left ventricular hypertrophy (LVH); genetic factors are known to be involved, but their contribution is still poorly understood. I used the hypertrophic heart rat (HHR), a unique normotensive polygenic model of LVH, and its control strain, the normal heart rat (NHR), to investigate genetic and epigenetic contributions to LVH independent of high blood pressure. To address this question, I used a systematic approach. Firstly, I sequenced the whole genomes of HHR and NHR to identify genes related to LVH, focusing on quantitative trait locus Cm22. I found that the gene for tripartite motif-containing 55 (Trim55) was significantly downregulated and showed decreased protein expression, together with one exonic missense mutation that altered the protein structure. Interestingly, Trim55 mRNA expression was also reduced in idiopathic dilated cardiomyopathic hearts. Secondly, I selected 42 genes previously described in monogenic forms of human cardiomyopathies and studied DNA variants, mRNA and microRNA (miRNA) expression to determine their involvement in this polygenic model of LVH at five ages. This comprehensive approach identified the differential expression of 29 genes in at least one age group and two miRNAs in validated miRNA-mRNA interactions. These two miRNAs have binding sites for five of the genes studied. Lastly, I found that the circular RNA (circRNA) Hrcr was upregulated in the hypertrophic heart. I then silenced Hrcr expression in human primary cardiomyocytes to investigate its downstream miRNA targets and elucidate possible regulatory mechanisms. I described four miRNAs (miR-1-3p, miR-330, miR-27a-5p, miR-299-5p) as novel targets for HRCR and predicted 359 mRNA targets in the circRNA-miRNA-mRNA regulatory axis.
In silico analysis identified 206 enriched gene ontology terms based on the predicted mRNA target list, including cardiomyocyte differentiation and ventricular cardiac muscle cell differentiation. The findings in this thesis suggest that 1) Trim55 is a novel functional candidate gene for polygenic LVH; 2) genes implicated in monogenic forms of cardiomyopathy may be involved in this condition; and 3) circRNA expression is associated with changes in hypertrophic hearts and deserves further attention. Doctor of Philosoph

    Development of bioinformatics tools and studies in biomedical association networks for the analysis of human genetic diseases

    Doctoral thesis defense date: 18 March 2019. This doctoral thesis focuses on network analysis and the development of bioinformatics tools for determining the causes of diseases with a genetic basis. Network analysis makes it possible to associate pathological phenotypes with the genomic regions that potentially cause them, using patient information. These phenotype-genotype associations can be used to develop tools that support the genetic diagnosis of patients with a complex phenotypic profile, providing information on the genomic regions potentially affected in a patient based on the pathological phenotypes observed. Likewise, the regions associated with pathological phenotypes can be analyzed to determine the functional elements of the genome that cause the disease. This analysis covers both genes and regulatory elements, since it has been shown that 80% of the diseases characterized through whole-genome analysis have been associated with non-coding regions of the genome, where regulatory elements are located. Once the functional elements present in the genomic regions associated with pathological phenotypes have been determined, the biological systems affected in the patient can be identified. However, not all genes have functional annotations showing which systems they affect. This functionality is conferred by the gene products, the proteins, which in turn consist of domains that give them their function and/or structure. Again, network analysis can be used to associate protein domains with functional annotations based on protein information, so that these domain-function associations can be used to predict the possible unknown function of proteins from their domains.

    The genetic architecture of language functional connectivity

    Available online 18 December 2021. Language is a unique trait of the human species, of which the genetic architecture remains largely unknown. Through studies of language disorders, many candidate genes have been identified. However, such a complex and multifactorial trait is unlikely to be driven by only a few genes, and case-control studies, which suffer from a lack of power, struggle to uncover significant variants. In parallel, neuroimaging has significantly contributed to the understanding of structural and functional aspects of language in the human brain, and the recent availability of large-scale cohorts like UK Biobank has made it possible to study language via image-derived endophenotypes in the general population. Because of its strong relationship with task-based fMRI (tbfMRI) activations and its ease of acquisition, resting-state functional MRI (rsfMRI) has become more popular, making it a good surrogate of functional neuronal processes. Taking advantage of such a synergistic system by aggregating effects across spatially distributed traits, we performed a multivariate genome-wide association study (mvGWAS) between genetic variations and resting-state functional connectivity (FC) of classical brain language areas in the inferior frontal (pars opercularis, triangularis and orbitalis), temporal and inferior parietal lobes (angular and supramarginal gyri), in 32,186 participants from UK Biobank. Twenty genomic loci were found to be associated with language FCs, of which three were replicated in an independent replication sample. A locus in 3p11.1, regulating EPHA3 gene expression, is associated with FCs of the semantic component of the language network, while a locus in 15q14, regulating THBS1 gene expression, is associated with FCs of perceptual-motor language processing, bringing novel insights into the neurobiology of language. This research was conducted using the UK Biobank resource under application #64984.
This project was supported by the Marie Sklodowska-Curie program awarded to Stephanie J. Forkel (Grant agreement No. 101028551). Amaia Carrion-Castillo was supported by a Juan de la Cierva fellowship from the Spanish Ministry of Science and Innovation, and a Gipuzkoa Fellows fellowship from the Basque Governmen

    DUAL FUNCTIONS FOR INSULINOMA-ASSOCIATED 1 IN RETINAL DEVELOPMENT

    Proper visual system function requires tightly controlled proliferation of a pool of relatively homogeneous retinal progenitor cells, followed by the stepwise specification and differentiation of multiple distinct cell types. These retinal cells, both neuronal and glial, must be generated in the correct numbers, and the correct laminar location to permit the formation of synaptic connections between individual cell types. After synapses are made, constant signaling is required as part of normal retinal function, and to maintain cellular identity and connectivity. These processes rely on both extrinsic and intrinsic signaling, with regulation of gene expression by cascades of transcription factors having a key role. While considerable work has been done to identify key regulators of retinal development, maturation, and homeostasis, many factors remain unidentified or poorly characterized, either at large or within the retina. One such factor is Insulinoma-associated 1 (Insm1). Known to function in endocrine cell, sympathetic and monoaminergic neuron, and olfactory epithelial cell differentiation and maturation, Insm1 is also a regulator of cell cycle progression in the adrenal system and cerebral cortex. Although Insm1 was previously considered a transcriptional regulator, recently, non-nuclear functions have also been identified. However, the retinal function of Insm1 remained a mystery. To determine the role of Insm1 in retinal development, I characterized the retinal expression pattern of Insm1, as well as the effect of perturbation of Insm1 expression levels at both the cellular and molecular level. Chapter 1 of this dissertation provides an overview of the retina and its development, vision and retinal degenerative diseases, and a review of Insm1 expression and function in tissues outside the retina. 
Chapter 2 presents data generated from the knockdown of the retinal-expressed zebrafish co-ortholog of Insm1, insm1a, which demonstrated a requirement for insm1a in proper differentiation of photoreceptor cells. Additionally, these experiments showed a cell cycle regulatory function for insm1a in retinal development. Characterization of a zebrafish insm1a mutant and functional examination of insm1a truncation variants is discussed in Chapter 3. Chapter 4 presents data from an RNA-Seq analysis of wild-type and Insm1 knockout mouse retinas at two developmental time points, and details transcriptional changes during retinal development in the absence of Insm1. Finally, Chapter 5 discusses the conclusions from the data generated for this dissertation, additional studies identified as the result of this work, and the implications of these results on our understanding of retinal development

    Characterizing alternative splicing and long non-coding RNA with high-throughput sequencing technology

    Indiana University-Purdue University Indianapolis (IUPUI). Several experimental methods have been developed for the study of the central dogma since the late 20th century. Protein mass spectrometry and next generation sequencing (including DNA-Seq and RNA-Seq) form a triangle of experimental methods, corresponding to the three vertices of the central dogma, i.e., DNA, RNA and protein. Numerous RNA sequencing and protein mass spectrometry experiments have been carried out in an attempt to understand how expression changes of known genes affect biological functions in various organisms. However, it was long overlooked that the resulting data from these experiments are in fact holograms that also reveal other delicate biological mechanisms, such as RNA splicing and the expression of long non-coding RNAs. In this dissertation, we carried out five studies based on high-throughput sequencing data, in an attempt to understand how RNA splicing and differential expression of long non-coding RNAs are associated with biological functions. In the first two studies, we identified and characterized 197 stimulant-induced and 477 developmentally regulated alternative splicing events from RNA sequencing data. In the third study, we introduced a method for identifying novel alternative splicing events that had never been documented. In the fourth study, we introduced a method for identifying known and novel RNA splicing junctions from protein mass spectrometry data. In the fifth study, we introduced a method for identifying long non-coding RNAs from poly-A selected RNA sequencing data. Taking advantage of these methods, we turned RNA sequencing and protein mass spectrometry data into an information gold mine of splicing and long non-coding RNA activities. 2019-05-0