Deep Learning to Analyze RNA-Seq Gene Expression Data

Author: B Li
DC Cireşan
F Ciompi
J Friedman
M Leung
N Srivastava
R Tibshirani
Y Bengio
Y LeCun
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Deep learning models are currently being applied in several areas with great success. However, their application to the analysis of high-throughput sequencing data remains a challenge for the research community because this family of models is known to work well on large datasets with many samples, the opposite of the scenario typically found in biomedical areas. In this work, a first approximation to the use of deep learning for the analysis of RNA-Seq gene expression profile data is provided. Three public cancer-related databases are analyzed using a regularized linear model (standard LASSO) as the baseline model, and two deep learning models that differ in the feature selection technique used prior to the application of a deep neural net model. The results indicate that a straightforward application of deep net implementations available in public scientific tools, under the conditions described within this work, is not enough to outperform simpler models like LASSO. Therefore, smarter and more complex approaches that incorporate prior biological knowledge into the estimation procedure of deep learning models may be necessary in order to obtain better predictive performance.

Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech

Crossref

Repositorio Institucional Universidad de Málaga

Applications of High-Throughput Sequencing Data Analysis in Transcriptional Studies

Author: Guo Zhengyu
Publication date: 16/01/2019

High-throughput sequencing (HTS) has become one of the most powerful tools for studies in genomics, transcriptomics, epigenomics, and metagenomics. In recent years, HTS protocols for enhancing the understanding of the diverse cellular roles of RNA have been designed, such as RNA-Seq, CLIP-Seq, and RIP-Seq. In this work, we explore the applications of HTS data analysis in transcriptional studies. First, the differential expression analysis of RNA-Seq data is discussed and applied to a sheep RNA-Seq dataset to examine the biological mechanisms of sheep resistance to worm infection. We develop an automatic pipeline to analyze the RNA-Seq dataset, and use a negative binomial model for gene expression analysis. Functional analysis is conducted over the differentially expressed genes, and a broad range of mechanisms providing protection against the parasite are identified in the resistant sheep breed. This study provides insights into the underlying biology of sheep host resistance. Then, a deep learning method is proposed to predict the binding preferences of RNA binding proteins (RBPs) using CLIP-Seq data. The proposed method uses a deep convolutional autoencoder to learn robust sequence features, and a softmax classifier to predict the RBP binding sites. To demonstrate the efficacy of the proposed method, we evaluate its performance over a dataset containing 31 CLIP-Seq experiments. This benchmarking shows that the proposed method improves the prediction performance in terms of AUC, compared with the existing methods. The analysis also shows that the proposed method is able to provide insights to identify new RBP binding motifs. Therefore, the proposed method will be of great help in understanding the dynamic regulation of RBPs in various biological processes and diseases. Finally, a database is created to facilitate the reuse of the publicly available mouse RNA-Seq datasets. The metadata of the publicly available mouse RNA-Seq datasets is manually curated and is served by a well-designed website. The database can be scaled up in the future to serve more types of HTS data.

Texas A&M Repository

A Robust scRNA-seq Data Analysis Pipeline for Measuring Gene Expression Noise

Publication date: 01/01/2017

The past decade has seen a drastic increase in collaboration between Computer Science (CS) and Molecular Biology (MB). Current foci in CS such as deep learning require very large amounts of data, and MB research can often be rapidly advanced by analysis and models from CS. CS could aid MB in, for example, the analysis of sequences to find binding sites and the prediction of protein folding patterns. Stem-like cells can be maintained and replicated over long periods, and can differentiate into various tissue types. These behaviors are made possible by controlling the expression of specific genes. These genes then cascade into a network effect by either promoting or repressing downstream gene expression. The expression level of all gene transcripts within a single cell can be analyzed using single cell RNA sequencing (scRNA-seq). A significant portion of the noise in scRNA-seq data results from extrinsic factors and can only be removed by a customized scRNA-seq analysis pipeline. scRNA-seq experiments utilize next-gen sequencing to measure genome-scale gene expression levels with single cell resolution. Almost every step during analysis and quantification requires the use of an often empirically determined threshold, which makes quantification of noise less accurate. In addition, each research group often develops its own data analysis pipeline, making it impossible to compare data from different groups. To remedy this problem, a streamlined and standardized scRNA-seq data analysis and normalization protocol was designed and developed. After analyzing multiple experiments, we identified the possible pipeline stages and the tools needed. Our pipeline is capable of handling data with adapters and barcodes, which was not the case with pipelines from some experiments. Our pipeline can be used to analyze single-experiment scRNA-seq data and also to compare scRNA-seq data across experiments. Various processes like data gathering, file conversion, and data merging were automated in the pipeline. The main focus was to standardize and normalize single-cell RNA-seq data to minimize technical noise introduced by disparate platforms.

Dissertation/Thesis Masters Thesis Bioengineering 201

ASU Digital Repository

GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization

Author: Chen Hung-I Harry
Chen Yidong
Chiu Yu-Chiao
Huang Yufei
Zhang Songyao
Zhang Tinghe
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 22/12/2018

Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set based analyses improve biologists' ability to discover the functional relevance of their experiment design. While gene sets are elucidated individually, associations between gene sets are rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene sets, and to determine the biological relevance and analysis consistency of these combined gene sets by leveraging large genomic data sets. In this study, we proposed a gene superset autoencoder (GSAE), a multi-layer autoencoder model that incorporates a priori defined gene sets and retains the crucial biological features in the latent layer. We introduced the concept of the gene superset, an unbiased combination of gene sets with weights trained by the autoencoder, where each node in the latent layer is a superset. Trained with genomic data from TCGA and evaluated with their accompanying clinical parameters, we showed gene supersets' ability to discriminate tumor subtypes and their prognostic capability. We further demonstrated the biological relevance of the top component gene sets in the significant supersets. Using an autoencoder model with gene supersets at its latent layer, we demonstrated that gene supersets retain sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Supersets also provide high reproducibility in survival analysis and accurate prediction of cancer subtypes.

Comment: Presented at the International Conference on Intelligent Biology and Medicine (ICIBM 2018) in Los Angeles, CA, USA and published in BMC Systems Biology 2018, 12(Suppl 8):14

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

The Expanding Landscape of Alternative Splicing Variation in Human Populations.

Author: Lin Lan
Pan Zhicheng
Park Eddie
Xing Yi
Zhang Zijun
Publication venue: eScholarship, University of California
Publication date: 01/01/2018

Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine.

eScholarship - University of California

Immune DNA signature of T-cell infiltration in breast tumor exomes.

Author: Armisen Ricardo
Carter Hannah
Dow Michelle
Gárate Calderón Valentina
Harismendy Olivier
Levy Eric
Marty Rachel
Woo Brian
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

Tumor infiltrating lymphocytes (TILs) have been associated with favorable prognosis in multiple tumor types. The Cancer Genome Atlas (TCGA) represents the largest collection of cancer molecular data, but lacks detailed information about the immune environment. Here, we show that exome reads mapping to the complementarity-determining region 3 (CDR3) of mature T-cell receptor beta (TCRB) can be used as an immune DNA (iDNA) signature. Specifically, we propose a method to identify CDR3 reads in a breast tumor exome and validate it using deep TCRB sequencing. In 1,078 TCGA breast cancer exomes, the fraction of CDR3 reads was associated with TIL fraction, tumor purity, adaptive immunity gene expression signatures and improved survival in Her2+ patients. Only 2/839 TCRB clonotypes were shared between patients, and none was associated with a specific HLA allele or somatic driver mutations. The iDNA biomarker enriches the comprehensive dataset collected through TCGA, revealing associations with other molecular features and clinical outcomes.

PubMed Central

eScholarship - University of California

Repositorio Académico de la Universidad de Chile

Functional interpretation of single cell similarity maps.

Author: Ashuach Tal
DeTomaso David
Jones Matthew G
Subramaniam Meena
Ye Chun J
Yosef Nir
Publication venue: eScholarship, University of California
Publication date: 01/09/2019

We present Vision, a tool for annotating the sources of variation in single cell RNA-seq data in an automated and scalable manner. Vision operates directly on the manifold of cell-cell similarity and employs a flexible annotation approach that can operate either with or without preconceived stratification of the cells into groups or along a continuum. We demonstrate the utility of Vision in several case studies and show that it can derive important sources of cellular variation and link them to experimental meta-data even with relatively homogeneous sets of cells. Vision produces an interactive, low-latency and feature-rich web-based report that can be easily shared among researchers, thus facilitating data dissemination and collaboration.

eScholarship - University of California