
    Modern Computing Techniques for Solving Genomic Problems

    With the advent of high-throughput genomics, biological big data poses challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project combines concepts and principles from multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, and (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. As in speech/voice recognition, similarity is calculated between two signal series, and the signals are subsequently stitched/matched into a temporal sequence. Because the operations are binary, all calculations can be performed efficiently and accurately. Improving accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Because the encoding scheme greatly influences the performance of deep learning algorithms, finding the best scheme for a particular application is significant. Three applications (detection of protein-coding splice sites, detection of lincRNA splice sites, and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned an energy value and a Gaussian filter is applied to detect CpG islands. Using the CpG box and a Markov model, we investigate the properties of CGIs and redefine them using the emerging epigenetic data.
In summary, these three problems and their solutions are not isolated; they are linked by modern techniques from such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
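The binary transformation underlying problem #1 can be sketched as follows. This is an illustrative reconstruction, not the dissertation's code: the Voss-style indicator encoding and the function names are assumptions.

```python
import numpy as np

def voss_encode(seq):
    """Map a DNA string to four binary indicator tracks (A, C, G, T)."""
    seq = seq.upper()
    return np.array([[1 if ch == base else 0 for ch in seq]
                     for base in "ACGT"])

def similarity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree,
    computed with purely binary operations on the indicator tracks."""
    xa, xb = voss_encode(seq_a), voss_encode(seq_b)
    return (xa & xb).sum() / xa.shape[1]

print(similarity("ACGTAC", "ACGAAC"))  # 5 of 6 positions match
```

Because the tracks are 0/1, the position-wise comparison reduces to a bitwise AND followed by a sum, which is the kind of efficient binary arithmetic the abstract alludes to.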

    Exploring spatial-frequency-sequential relationships for motor imagery classification with recurrent neural network

    Abstract Background Conventional methods for motor imagery brain-computer interfaces (MI-BCIs) suffer from a limited number of samples and simplified features, and thus produce poor performance with spatial-frequency features and shallow classifiers. Methods Alternatively, this paper applies a deep recurrent neural network (RNN) with a sliding window cropping strategy (SWCS) to signal classification for MI-BCIs. The spatial-frequency features are first extracted by the filter bank common spatial pattern (FB-CSP) algorithm, and these features are cropped by the SWCS into time slices. The cropped time slices are then fed into the RNN, which classifies them by extracting spatial-frequency-sequential relationships. To overcome memory distractions, the commonly used gated recurrent unit (GRU) and long short-term memory (LSTM) unit are applied to the RNN architecture, and experimental results are used to determine which unit is more suitable for processing EEG signals. Results Experimental results on common BCI benchmark datasets show that the spatial-frequency-sequential relationships outperform all competing spatial-frequency methods. In particular, the proposed GRU-RNN architecture achieves the lowest misclassification rates on all BCI benchmark datasets. Conclusion By introducing spatial-frequency-sequential relationships with cropped time-slice samples, the proposed method gives a novel way to construct high-accuracy, robust MI-BCIs from limited trials of EEG signals.
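The sliding window cropping strategy can be illustrated with a short sketch. The array shapes and window parameters here are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def crop_windows(features, win_len, stride):
    """Crop a (channels, time) feature array into overlapping time slices.
    Each slice becomes one RNN input sample, multiplying the number of
    training samples obtained from a single trial."""
    n_time = features.shape[1]
    starts = range(0, n_time - win_len + 1, stride)
    return np.stack([features[:, s:s + win_len] for s in starts])

trial = np.random.randn(8, 100)     # e.g. 8 FB-CSP feature channels, 100 time points
slices = crop_windows(trial, win_len=40, stride=10)
print(slices.shape)                 # (7, 8, 40)
```

A single trial thus yields several partially overlapping sequences, which is how the method compensates for the limited number of recorded trials.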

    Clustering of Bulk RNA-Seq Data and Missing Data Methods in Deep Learning

    Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori what genes may be informative in discriminating between clusters, and what the optimal number of clusters is. In addition, few methods exist for unsupervised clustering of bulk RNA-seq samples, and no method exists that can do so while simultaneously adjusting for between-sample global normalization factors, accounting for potential confounding variables, and selecting cluster-discriminatory genes. In Chapter 2, we present FSCseq (Feature Selection and Clustering of RNA-seq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and employs a quadratic penalty method with a SCAD penalty. The maximization is done by a penalized EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership. The field of deep learning has also boomed in popularity in recent years, fueled initially by its performance in the classification and manipulation of image data, and, more recently, in areas of public health, medicine, and biology. However, the presence of missing data in these latter areas is very common, and involves more complicated mechanisms of missingness than the former. While a rich statistical literature exists regarding the characterization and treatment of missing data in traditional statistical models, it is unclear how such methods may extend to deep learning methods. 
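The subtype-prediction step, posterior probabilities of cluster membership, can be illustrated with a one-dimensional Gaussian-mixture simplification of the E-step. FSCseq's actual FMR model additionally handles normalization factors, confounders, and the SCAD penalty; this sketch shows only the Bayes'-rule computation:

```python
import numpy as np

def posterior_membership(x, means, sds, weights):
    """Posterior probability that observation x belongs to each cluster,
    by Bayes' rule over the mixture components."""
    means, sds, weights = (np.asarray(a, dtype=float) for a in (means, sds, weights))
    dens = weights * np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return dens / dens.sum()

# A sample near the second cluster's mean gets a high posterior for cluster 2.
p = posterior_membership(2.0, means=[0.0, 2.0], sds=[1.0, 1.0], weights=[0.5, 0.5])
print(p)   # roughly [0.12, 0.88]
```

A new patient's expression profile is assigned to the subtype with the highest posterior, which is what "subtype prediction in new patients" amounts to once the model is fitted.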
In Chapter 3, we present NIMIWAE (Non-Ignorably Missing Importance Weighted AutoEncoder), an unsupervised learning algorithm that provides a formal treatment of missing data in the context of Importance Weighted Autoencoders (IWAEs), an unsupervised Bayesian deep learning architecture, in order to perform single and multiple imputation of missing data. We review existing methods that handle missingness up to the missing at random (MAR) setting, and propose methods to handle the more difficult missing not at random (MNAR) scenario. We show that this extension is critical to the performance of data imputation, as well as downstream coefficient estimation. We use simulation examples to illustrate the impact of missingness on such tasks, and compare the performance of several proposed methods for handling missing data. We applied our proposed methods to a large electronic health record dataset, and illustrated their utility through a qualitative look at the downstream fitted models after imputation. Finally, in Chapter 4, we present dlglm (deeply-learned generalized linear model), a supervised learning algorithm that extends the missing data methods from Chapter 3 directly to supervised learning tasks such as classification and regression. We show that dlglm can be trained in the presence of missing data in both the predictors and the response, and under the MCAR, MAR, and MNAR missing data settings. We also demonstrate that the trained dlglm model can directly predict the response for partially-observed samples in the prediction or test set, drawing from the variational posterior distribution of the missing values conditional on the observed values learned during model training. We use statistical simulations and real-world datasets to show the impact of our method in increasing the accuracy of coefficient estimation and prediction under different mechanisms of missingness.
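Why MNAR handling matters can be seen in a small simulation, which is illustrative only and not NIMIWAE itself: when the probability of missingness depends on the unobserved value, naive summaries of the observed data are biased.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)

# MCAR: each value is missing with the same probability, independent of x.
mcar_missing = rng.random(x.shape) < 0.3

# MNAR (self-masking): larger values are more likely to go missing, so the
# mean of the observed values is biased downward.
mnar_missing = rng.random(x.shape) < 1 / (1 + np.exp(-2 * x))

mean_mcar = x[~mcar_missing].mean()   # close to the true mean of 0
mean_mnar = x[~mnar_missing].mean()   # noticeably below 0
```

Under MCAR the observed sample remains representative; under self-masking MNAR, any method that ignores the missingness mechanism inherits this bias, which is the scenario NIMIWAE is designed to address.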

    Bioinformatics Applications Based On Machine Learning

    The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all of them related to the application of IT-related techniques within the bioinformatics sector: from new applications created by adapting and applying existing techniques, to the creation of new methodologies to solve existing problems.

    Engineering Tools to Probe and Manipulate the Immune System at Single-Cell Resolution

    My thesis focuses on developing experimental and computational tools to probe and manipulate cellular transcriptomes in the context of human health and disease. Chapters 1 and 2 focus on published work in which we leverage single-cell RNA sequencing (scRNA-seq) to understand human immune variability, characterize cell-type-specific biases of multiple viral variants within an animal, and assess the temporal immune response in the brain to delivery of genetic cargo via an adeno-associated virus (AAV). Chapters 3 and 4 present progress I have made on tools for exporting RNA extracellularly and on engineering a transcription factor for modulating macrophage state. For probing cellular transcriptome states, we have developed a platform that uses multiplexed single-cell sequencing and out-of-clinic capillary blood extraction to understand temporal and inter-individual variability of gene expression within immune cell types. Our platform enables simplified, cost-effective profiling of the human immune system across subjects and time at single-cell resolution. To demonstrate the power of our platform, we performed a three-day time-of-day study of four healthy individuals, generating gene expression data for 24,087 cells across 22 samples. We detected genes with cell-type-specific time-of-day expression and identified robust genes and pathways particular to each individual, all of which could have been missed if analyzed with bulk RNA sequencing. Also using scRNA-seq, we have developed a method to screen and characterize the cellular tropism of multiple AAV variants. Additionally, I have examined AAV-mediated transcriptomic changes in animals injected with AAV-PHP.eB three days and twenty-five days post-injection. I found an upregulation of genes involved in p53 signaling in endothelial cells three days post-injection.
In the context of manipulating cellular transcriptomic states, I demonstrate that a fusion between the RNA-targeting enzyme dCas13 and the capsid-forming neuronal protein Arc is able to form a capsid-like structure capable of encapsulating RNA. I also present methods and preliminary data for tuning macrophage states through mutations in transcription factor EB (TFEB), using scRNA-seq as a readout.

    Principles of Massively Parallel Sequencing for Engineering and Characterizing Gene Delivery

    The advent of massively parallel sequencing and synthesis technologies has ushered in a new paradigm of biology, in which high-throughput screening of billions of nucleic acid molecules and production of libraries of millions of genetic mutants are now routine in labs and clinics. During my Ph.D., I worked to develop data analysis and experimental methods that take advantage of the scale of this data while making the minimal assumptions necessary for deriving value from their application. My Ph.D. work began with the development of software and principles for analyzing deep mutational scanning data from libraries of engineered AAV capsids. By looking not only at the top variant in a round of directed evolution, but at a broad distribution of variants and their phenotypes, we were able to identify AAV variants with an enhanced ability to transduce specific cells in the brain after intravenous injection. I then shifted to better understanding the phenotypic profile of these engineered variants. To that end, I turned to single-cell RNA sequencing to identify, with high resolution, the delivery profile of these variants across all cell types present in the cortex of a mouse brain. I began by developing infrastructure and tools to handle the data analysis demands of these experiments. Then, by delivering an engineered variant to the animal, I was able to use the single-cell RNA sequencing profile, coupled with a sequencing readout of the delivered genetic cargo present in each cell type, to define the variant's tropism across the full spectrum of cell types in a single step. To increase the throughput of this experimental paradigm, I then worked to develop a multiplexing strategy for delivering up to 7 engineered variants in a single animal and obtaining the same high-resolution readout for each variant in a single experiment.
Finally, to take a step towards translation to human diagnostics, I leveraged the tools I built for scaling single-cell RNA sequencing studies and worked to develop a protocol for obtaining single-cell immune profiles from low volumes of self-collected blood. This study enabled repeat sampling within a short period of time, and revealed remarkable richness in the individual variability and time-of-day dependence of human immune gene expression. Together, my Ph.D. work provides strategies for employing massively parallel sequencing and synthesis in new biological applications, and builds towards a future paradigm in which personalized, high-resolution sequencing might be coupled with modular, customized gene therapy delivery.
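The idea of scoring a broad distribution of variants, rather than just the top hit, can be sketched as a log-enrichment computation over pre- and post-selection read counts. The variant names, counts, and pseudocount handling below are hypothetical, not the thesis software:

```python
import math

def log2_enrichment(pre_counts, post_counts, pseudocount=1.0):
    """Log2 change in each variant's library frequency between the
    pre-selection and post-selection sequencing pools."""
    pre_total = sum(pre_counts.values()) + pseudocount * len(pre_counts)
    post_total = sum(post_counts.values()) + pseudocount * len(pre_counts)
    scores = {}
    for v in pre_counts:
        f_pre = (pre_counts[v] + pseudocount) / pre_total
        f_post = (post_counts.get(v, 0) + pseudocount) / post_total
        scores[v] = math.log2(f_post / f_pre)
    return scores

pre = {"AAV9": 500, "varA": 100, "varB": 100}    # hypothetical read counts
post = {"AAV9": 200, "varA": 400, "varB": 10}
scores = log2_enrichment(pre, post)              # varA enriched, varB depleted
```

Ranking every variant by such a score, rather than picking the single most abundant post-selection sequence, is what allows a whole distribution of phenotypes to be read out from one directed-evolution round.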

    A deep learning method for lincRNA detection using auto-encoder algorithm

    Abstract Background RNA sequencing (RNA-seq) enables scientists to develop novel data-driven methods for discovering more unidentified lincRNAs. Meanwhile, knowledge-based technologies are experiencing a potential revolution ignited by new deep learning methods. By scanning newly found RNA-seq data sets, scientists have found that: (1) the expression of lincRNAs appears to be regulated, that is, relevance exists along the DNA sequences; (2) lincRNAs contain conserved patterns/motifs tethered together by non-conserved regions. These two observations motivate the adoption of knowledge-based deep learning methods for lincRNA detection. As in coding-region transcription, non-coding regions are split at transcriptional sites; however, regulatory RNAs rather than messenger RNAs are generated. That is, the transcribed RNAs participate in biological processes as regulatory units instead of generating proteins. Identifying these transcriptional regions within non-coding regions is the first step towards lincRNA recognition. Results The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show the excellent performance of the predictive deep neural network on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, the method is validated on the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has extensive ability for lincRNA prediction. Conclusions The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data.
Subsequently, a two-layer deep neural network is developed for lincRNA detection, which adopts the auto-encoder algorithm and utilizes different encoding schemes to obtain the best performance over intergenic DNA sequence data. Driven by the newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can exert their knowledge-learning capability to capture useful features and the information correlation along DNA genome sequences for lincRNA detection. To our knowledge, this is the first application of deep learning techniques to identifying lincRNA transcription sequences.
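A minimal sketch of the approach, assuming a one-hot encoding scheme and a single tied-weight auto-encoder layer trained by plain gradient descent. The paper's network has two layers and compares several encoding schemes; everything here, including the random training sequences, is illustrative:

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a flat 4-bits-per-base binary vector."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros(4 * len(seq))
    for i, ch in enumerate(seq.upper()):
        x[4 * i + idx[ch]] = 1.0
    return x

rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list("ACGT"), size=20)) for _ in range(64)]
X = np.stack([one_hot(s) for s in seqs])            # shape (64, 80)

n_in, n_hid, lr = X.shape[1], 16, 0.05
W = rng.normal(scale=0.1, size=(n_in, n_hid))       # tied encoder/decoder weights
losses = []
for _ in range(200):
    H = np.tanh(X @ W)                              # encode
    X_hat = H @ W.T                                 # decode with tied weights
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    # gradient through the decoder (err.T @ H) plus the encoder (chain rule through tanh)
    dW = X.T @ ((err @ W) * (1 - H ** 2)) + err.T @ H
    W -= lr * dW / len(X)
```

The hidden activations `H` are the learned features; in the paper's setting, a classifier on top of such features distinguishes lincRNA transcription sites from other intergenic sequence.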