Recommended from our members
Meta-analysis of massively parallel reporter assays enables prediction of regulatory function across cell types.
Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications for understanding the genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure the regulatory activity of thousands of DNA sequences and their variants in a single experiment. With an increasing number of publicly available MPRA data sets, one can now develop data-driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta-analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrate that predictive models trained in one cellular context can be used to predict MPRA output in another, with loss of accuracy attributed to cell-type-specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" Challenge for predicting effects of single-nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide effective design of future experiments by better prioritizing regions of interest.
The Expanding Landscape of Alternative Splicing Variation in Human Populations.
Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine.
Unconventional machine learning of genome-wide human cancer data
Recent advances in high-throughput genomic technologies coupled with
exponential increases in computer processing and memory have allowed us to
interrogate the complex aberrant molecular underpinnings of human disease from
a genome-wide perspective. While the deluge of genomic information is expected
to increase, a bottleneck in conventional high-performance computing is rapidly
approaching. Inspired in part by recent advances in physical quantum
processors, we evaluated several unconventional machine learning (ML)
strategies on actual human tumor data. Here we show for the first time the
efficacy of multiple annealing-based ML algorithms for classification of
high-dimensional, multi-omics human cancer data from the Cancer Genome Atlas.
To assess algorithm performance, we compared these classifiers to a variety of
standard ML methods. Our results indicate the feasibility of using
annealing-based ML to provide competitive classification of human cancer types
and associated molecular subtypes and superior performance with smaller
training datasets, thus providing compelling empirical evidence for the
potential future application of unconventional computing architectures in the
biomedical sciences.
Cell type identification, differential expression analysis and trajectory inference in single-cell transcriptomics
Single-cell RNA-sequencing (scRNA-seq) is a cutting-edge technology that quantifies the transcriptome, the set of expressed RNA transcripts, of a group of cells at the single-cell level. It represents a significant upgrade from bulk RNA-seq, which measures the combined signal of thousands of cells. Measuring gene expression by bulk RNA-seq is an invaluable tool for biomedical researchers who want to understand how cells alter their gene expression due to an illness, differentiation, an external stimulus, or other events. Similarly, scRNA-seq has become an essential method for biomedical researchers, and it has brought several new applications previously unavailable with bulk RNA-seq.
scRNA-seq has the same applications as bulk RNA-seq. However, the single-cell resolution also enables cell annotation based on gene markers of clusters, that is, cell populations that machine learning has identified as being, on average, dissimilar at the transcriptomic level. Researchers can use the cell clusters to detect cell-type-specific gene expression changes between conditions such as case and control groups. Clustering can sometimes even discover entirely new cell types. Besides the cluster-level representation, the single-cell resolution also makes it possible to model cells as a trajectory, which represents how the cells are related at the cell level and which dynamic differentiation process they undergo in a tissue.
This thesis introduces new computational methods for cell type identification and trajectory inference from scRNA-seq data. A new cell type identification method (ILoReg) was proposed, which enables high-resolution clustering of cells into populations with subtle transcriptomic differences. In addition, two new trajectory inference methods were developed: scShaper, which is an accurate and robust method for inferring linear trajectories; and Totem, which is a user-friendly and flexible method for inferring tree-shaped trajectories. In addition, one of the works benchmarked methods for detecting cell-type-specific differential states from scRNA-seq data with multiple subjects per comparison group, requiring tailored methods to confront false discoveries.
KEYWORDS: Single-cell RNA sequencing, transcriptome, cell type identification, trajectory inference, differential expression
Single-cell RNA sequencing is a cutting-edge technology that enables computational determination of the transcriptome, that is, the expressed RNA transcripts, for a group of cells at single-cell resolution. Its development was a significant step forward from conventional bulk RNA sequencing, which measures the combined signal of thousands of cells. Bulk RNA sequencing is an important tool for biomedical researchers who want to understand how cells change their gene expression as a result of a disease, differentiation, an external stimulus, or some other event. Single-cell RNA sequencing has likewise become an important tool for researchers, and it has brought several new applications.
Single-cell RNA sequencing has the same applications as bulk RNA sequencing, but in addition it enables cells to be identified based on gene markers. Gene markers are sought with statistical methods for cell populations that machine learning methods have identified as forming mutually dissimilar groups, or clusters, at the transcriptome level. Researchers can use the cell clusters to study gene expression differences within cell types, for example between diseased and healthy individuals, and clustering can sometimes even identify new cell types. Single-cell measurements also make it possible to model cells as a trajectory, which represents how cells develop dynamically from one another during processes that require gene expression.
This thesis presents new computational methods for identifying cell types and trajectories from single-cell RNA sequencing data. It presents a new cell type identification method (ILoReg), which enables the identification of cell types that differ by subtle gene expression differences. In addition, two new trajectory inference methods were developed in the thesis: scShaper, an accurate and robust method for inferring linear trajectories, and Totem, a user-friendly and flexible method for inferring tree-shaped trajectories. Finally, the thesis compared methods for detecting gene expression differences within cell types between groups that contain multiple subjects or other biological replicates, which requires special methods to reduce false positive findings.
KEYWORDS: single-cell RNA sequencing, clustering, trajectory inference, gene expression…
A genetic algorithm approach for predicting ribonucleic acid sequencing data classification using KNN and decision tree
Malaria parasites undergo a highly variable life cycle as they spread across numerous mosquito vectors. Transcriptomes arise in thousands of diverse parasites. Ribonucleic acid sequencing (RNA-seq) is a prevalent gene expression profiling method that has led to an enhanced understanding of genetic questions. RNA-seq measures the transcripts underlying gene expression and provides methodological enhancements to machine learning procedures. Researchers have proposed several methods for evaluating and learning from biological data. In this study, a genetic algorithm (GA) is used as a feature selection process to extract relevant information from the RNA-seq Anopheles gambiae mosquito malaria vector dataset, and the results are evaluated using k-nearest neighbor (KNN) and decision tree classification algorithms. The experiments yielded classification accuracies of 88.3% and 98.3%, respectively.
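The abstract above does not say how the GA scores candidate feature subsets, so the following is only a minimal sketch under stated assumptions: a wrapper-style GA whose fitness is the leave-one-out accuracy of a small k-NN classifier, run on synthetic toy data rather than the Anopheles gambiae dataset. The function names (`knn_accuracy`, `ga_select`, `make_toy`) and all parameter values are hypothetical, and the decision tree evaluation is omitted.

```python
import random

def make_toy(n=40, n_feat=8, seed=1):
    """Synthetic two-class data; only features 0 and 1 carry signal (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_feat)]
        row[0] += 3 * label
        row[1] -= 3 * label
        X.append(row)
        y.append(label)
    return X, y

def knn_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of k-NN restricted to features where mask is 1."""
    cols = [j for j, m in enumerate(mask) if m]
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance to every other sample on the selected features.
        dists = sorted(
            (sum((X[i][j] - X[t][j]) ** 2 for j in cols), y[t])
            for t in range(len(X)) if t != i
        )
        votes = [label for _, label in dists[:k]]
        pred = max(set(votes), key=votes.count)  # majority vote
        correct += pred == y[i]
    return correct / len(X)

def ga_select(X, y, n_gen=15, pop_size=12, p_mut=0.1, seed=0):
    """Evolve binary feature masks; fitness is k-NN leave-one-out accuracy."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    fitness = lambda mask: knn_accuracy(X, y, mask)
    for _ in range(n_gen):
        # Truncation selection: the fitter half survives unchanged (elitism).
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

X, y = make_toy()
best_mask, best_acc = ga_select(X, y)
print("selected features:", [j for j, m in enumerate(best_mask) if m])
print("leave-one-out k-NN accuracy:", best_acc)
```

Scoring masks with the downstream classifier keeps the sketch short, but it also means the reported accuracy is optimistic; a held-out test set would be needed for an honest estimate.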
Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.
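The stability criterion mentioned above (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be illustrated with a toy sketch. This is not the iRF implementation: it skips the feature-weighted forest entirely, and it replaces Random Intersection Trees with a chain-shaped simplification that intersects a few randomly drawn feature sets. The feature names "A" to "H", the helper names, and all parameter values are hypothetical.

```python
import random

def make_active_sets(n=60, seed=1):
    """Toy 'active feature' sets: A and B co-occur in about 80% of them (assumption)."""
    rng = random.Random(seed)
    noise = ["C", "D", "E", "F", "G", "H"]
    sets = []
    for _ in range(n):
        s = set()
        if rng.random() < 0.8:
            s |= {"A", "B"}
        s |= {f for f in noise if rng.random() < 0.3}
        sets.append(s)
    return sets

def random_intersections(active_sets, rng, n_chains=50, depth=3):
    """Intersect `depth` randomly drawn sets per chain; keep survivors of size >= 2."""
    survivors = set()
    for _ in range(n_chains):
        current = set(rng.choice(active_sets))   # copy so the input is not mutated
        for _ in range(depth - 1):
            current &= rng.choice(active_sets)
            if not current:
                break
        if len(current) >= 2:
            survivors.add(frozenset(current))
    return survivors

def stability_scores(active_sets, n_boot=20, seed=0):
    """Fraction of bootstrap replicates in which each candidate interaction is returned."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        boot = [rng.choice(active_sets) for _ in active_sets]  # resample with replacement
        for s in random_intersections(boot, rng):
            counts[s] = counts.get(s, 0) + 1
    return {s: c / n_boot for s, c in counts.items()}

scores = stability_scores(make_active_sets())
stable = {s for s, v in scores.items() if v > 0.5}  # "more than half of replicates"
for s in sorted(stable, key=sorted):
    print(sorted(s), scores[s])
```

Feature sets that co-occur often survive repeated intersection, while chance co-occurrences are quickly intersected away, which is why the search cost does not have to grow with the order of the interaction.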