11 research outputs found

    Computational methods for the discovery and analysis of genes and other functional DNA sequences

    The need for automating genome analysis is a result of the tremendous amount of genomic data. As of today, a high-throughput DNA sequencing machine can run millions of sequencing reactions in parallel, and it is becoming faster and cheaper to sequence the entire genome of an organism. Public databases containing genomic data are growing exponentially, and hence the rise in demand for intuitive automated methods of DNA analysis and subsequent gene identification. However, the complexity of gene organization makes automation a challenging task, and smart algorithm design and parallelization are necessary to perform accurate analyses in reasonable amounts of time. This work describes two such automated methods for the identification of novel genes within given DNA sequences. The first method utilizes negative selection patterns as an evolutionary rationale for the identification of additional members of a gene family. As input it requires a known protein-coding gene in that family. The second method is a massively parallel data mining algorithm that searches a whole genome for inverted repeats (palindromic sequences) and identifies potential precursors of non-coding RNA genes. Both methods were validated successfully on the fully sequenced and well-studied plant species Arabidopsis thaliana. --Abstract, page iv
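
    The inverted-repeat search mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's massively parallel algorithm: it is a naive serial scan, and the function names and the parameters `stem_len`, `min_loop`, and `max_loop` are assumptions chosen for illustration. It reports a stem whose reverse complement occurs a short loop distance downstream, the pattern that lets a transcript fold back into a hairpin.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[base] for base in reversed(seq))


def find_inverted_repeats(seq, stem_len=4, min_loop=3, max_loop=10):
    """Naive scan for inverted repeats (hypothetical helper, illustration only).

    Returns (left_start, right_start, stem) tuples where the stem at
    left_start is followed, after a loop of min_loop..max_loop bases,
    by its exact reverse complement starting at right_start.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n - 2 * stem_len - min_loop + 1):
        stem = seq[i:i + stem_len]
        if any(base not in "ACGT" for base in stem):
            continue  # skip ambiguous bases such as N
        target = reverse_complement(stem)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop
            if j + stem_len > n:
                break  # right arm would run past the end of the sequence
            if seq[j:j + stem_len] == target:
                hits.append((i, j, stem))
    return hits


hits = find_inverted_repeats("GGGAAATTTCCC")
```

    A genome-scale search would need indexing, tolerance for mismatches, and parallel execution, none of which this sketch attempts.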

    A Predictive Model Which Uses Descriptors of RNA Secondary Structures Derived from Graph Theory.

    The secondary structures of ribonucleic acid (RNA) have been successfully modeled with graph-theoretic structures. Often, simple graphs are used to represent secondary RNA structures; however, in this research, a multigraph representation of RNA is used, in which vertices represent stems and edges represent the internal motifs. Any type of RNA secondary structure may be represented by a graph in this manner. We define novel graphical invariants to quantify the multigraphs and obtain characteristic descriptors of the secondary structures. These descriptors are used to train an artificial neural network (ANN) to recognize the characteristics of secondary RNA structure. Using the ANN, we classify the multigraphs as either RNA-like or not RNA-like. This classification method produced results similar to those of other classification methods. Given the expanding library of secondary RNA motifs, this method may provide a tool to help identify new structures and to guide the rational design of RNA molecules.
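
    As a rough illustration of turning a multigraph into numeric descriptors, the sketch below computes a few textbook invariants from an edge list. The abstract does not specify its novel invariants, so the ones here (degree sequence, maximum degree, number of vertex pairs joined by parallel edges) are generic stand-ins, and the function name and input format are assumptions.

```python
from collections import Counter


def multigraph_descriptors(num_vertices, edges):
    """Compute simple invariants of a multigraph (illustrative stand-ins).

    num_vertices: number of vertices (stems), labeled 0..num_vertices-1.
    edges: list of (u, v) pairs; a repeated pair is a parallel edge.
    """
    degree = [0] * num_vertices
    multiplicity = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1  # a self-loop (u == v) therefore adds 2 to the degree
        multiplicity[frozenset((u, v))] += 1
    return {
        "num_vertices": num_vertices,
        "num_edges": len(edges),
        "max_degree": max(degree, default=0),
        "degree_sequence": sorted(degree, reverse=True),
        # vertex pairs connected by more than one edge
        "parallel_edge_pairs": sum(1 for m in multiplicity.values() if m > 1),
    }


# Hypothetical structure: three stems, with stems 0 and 1 joined by two motifs.
desc = multigraph_descriptors(3, [(0, 1), (0, 1), (1, 2)])
```

    A fixed-length vector built from such values is the kind of input a small neural network classifier could be trained on.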

    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high throughput methodologies offer researchers the ability to understand complex systems via high dimensional and multi-relational data. One example is the realm of molecular biology where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high dimensional and multirelational data allows for unprecedented detailed analysis, but also presents challenges in accounting for all the variability. High dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high dimensional and multirelational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM) which incorporates the rich information of multirelational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM on categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences where incorporating data related to varying features is often regarded as a daunting task.

    Targeting protein kinases to manage or prevent Alzheimer’s disease

    Due to the pressing need for new disease-modifying drugs for Alzheimer’s disease (AD), new treatment strategies and alternative drug targets are currently being heavily researched. One such strategy is to modulate protein kinases such as cyclin-dependent kinase 1 (CDK1), cyclin-dependent kinase 5 (CDK5), glycogen synthase kinase-3 (GSK-3α and GSK-3β), and the protein kinase RNA-like endoplasmic reticulum kinase (PERK). AD intervention by reduction of amyloid beta (Aβ) levels is also possible through development of protein kinase C-epsilon (PKC-ϵ) activators to recover α-secretase levels and decrease toxic Aβ levels, thereby restoring synaptogenesis and cognitive function. In this way, we aim to develop new AD drugs by targeting kinases that participate in AD pathophysiology. In our studies, comparative modeling was performed to construct 3D models for kinases whose crystal structures have not yet been identified. The information from structurally similar proteins was used to define the amino acid residues in the ATP binding site as well as other important sites and motifs. We searched for the common structural motifs and domains of GSK-3β, CDK5 and PERK. Further, we identified the conserved water molecules in GSK-3β, CDK5 and PERK through calculation of the degree of water conservation. We investigated the protein-ligand interaction profiles of CDK1, CDK5, GSK-3α, GSK-3β and PERK based on molecular dynamics (MD) simulations, which provided a time-dependent demonstration of the interactions and contacts for each ligand. In addition, we explored the protein-protein interactions between CDK5 and p25. Small molecules which target this interaction may offer a prospective therapeutic benefit for AD. In order to identify new modulators for protein kinase targets in AD, we implemented three virtual screening protocols. The first protocol was a combined ligand- and protein structure-based approach to find new PERK inhibitors. In the second protocol, protein structure-based virtual screening was applied to find multiple-kinase inhibitors through parallel docking simulations into validated models of CDK1, CDK5 and GSK-3 kinases. In the third protocol, we searched for potential activators of PKC-ϵ based on the structure of its C1B domain.

    A NOVEL COMPUTATIONAL FRAMEWORK FOR TRANSCRIPTOME ANALYSIS WITH RNA-SEQ DATA

    Advances in high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Starting directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, this framework then performs precise and efficient differential analysis at automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency among samples in transcription, breaking the restriction of presumed grouping. The performance of the framework has been demonstrated by a series of simulation studies and real datasets, including the Cancer Genome Atlas (TCGA) breast cancer analysis. These successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of disease.

    Knowledge discovery on the integrative analysis of electrical and mechanical dyssynchrony to improve cardiac resynchronization therapy

    Cardiac resynchronization therapy (CRT) is a standard method of treating heart failure by coordinating the function of the left and right ventricles. However, up to 40% of CRT recipients do not experience improvement in clinical symptoms or cardiac function. The main reasons for CRT non-response include: (1) suboptimal patient selection based on electrical dyssynchrony measured by electrocardiogram (ECG) in current guidelines; (2) mechanical dyssynchrony has been shown to be effective but has not been fully explored; and (3) inappropriate placement of the CRT left ventricular (LV) lead in a significant number of patients. In terms of mechanical dyssynchrony, we utilize an autoencoder to extract new predictive features from nuclear medicine images, characterizing local mechanical dyssynchrony and improving the CRT response rate. Although machine learning can identify complex patterns and make accurate predictions from large datasets, the low interpretability of these black box methods makes it difficult to integrate them with clinical decisions made by physicians in the healthcare setting. Therefore, we use visualization techniques to enable physicians to understand the physical meaning of new features and the reasoning behind the clinical decisions made by the artificial intelligence model. For electrical dyssynchrony, we use short-time Fourier transform (STFT) to transform one-dimensional waveforms into two-dimensional frequency-time spectra. Transfer learning is used to leverage the knowledge learned from a large arrhythmia ECG dataset of related medical conditions to improve patient selection for CRT with limited data. This improves prediction accuracy, reduces the time and resources required, and potentially leads to better patient outcomes. Furthermore, an innovative approach is proposed for using three-dimensional spatial vectorcardiogram (VCG) information to describe the characteristics of electrical dyssynchrony, locate the latest activation site, and combine it with the latest mechanical contraction site to select the optimal LV lead position. In addition, we apply deep reinforcement learning to the decision-making problem of CRT patients. We investigate discrete state space/specific action space models to find the best treatment strategy, improve the reward equation based on the physician's experience, and learn the approximation of the best action-value function that can improve the treatment policy used by clinicians and provide interpretability.
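
    The STFT step named in this abstract, turning a one-dimensional waveform into a two-dimensional frequency-time spectrum, can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: the function name, the Hann window, the frame length, the hop size, and the synthetic 16 Hz test signal are all assumptions.

```python
import numpy as np


def stft_magnitude(signal, fs, frame_len=64, hop=32):
    """Magnitude STFT of a 1-D signal (illustrative sketch).

    Returns (freqs, times, spectrum), where spectrum has shape
    (frame_len // 2 + 1, n_frames): rows are frequencies, columns are frames.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)  # taper each frame to reduce leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[k * hop:k * hop + frame_len] * window for k in range(n_frames)]
    )
    spectrum = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs  # frame centers
    return freqs, times, spectrum


# Synthetic stand-in for a waveform: a 16 Hz sine sampled at 256 Hz for 2 s.
fs = 256
t = np.arange(0, 2.0, 1 / fs)
wave = np.sin(2 * np.pi * 16 * t)
freqs, times, spec = stft_magnitude(wave, fs)
peak_freqs = freqs[spec.argmax(axis=0)]  # dominant frequency in each frame
```

    The resulting 2-D array can be treated like an image, which is what makes it a natural input for convolutional models and for transfer learning from a larger dataset.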