38 research outputs found
NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding.
Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades1,2, the most dramatic advances in MR have followed in the wake of critical corpus development3. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet4 was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction, and; (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus
MicroScope: ChIP-seq and RNA-seq software analysis suite for gene expression heatmaps
BACKGROUND: Heatmaps are an indispensible visualization tool for examining large-scale snapshots of genomic activity across various types of next-generation sequencing datasets. However, traditional heatmap software do not typically offer multi-scale insight across multiple layers of genomic analysis (e.g., differential expression analysis, principal component analysis, gene ontology analysis, and network analysis) or multiple types of next-generation sequencing datasets (e.g., ChIP-seq and RNA-seq). As such, it is natural to want to interact with a heatmap’s contents using an extensive set of integrated analysis tools applicable to a broad array of genomic data types. RESULTS: We propose a user-friendly ChIP-seq and RNA-seq software suite for the interactive visualization and analysis of genomic data, including integrated features to support differential expression analysis, interactive heatmap production, principal component analysis, gene ontology analysis, and dynamic network analysis. CONCLUSIONS: MicroScope is hosted online as an R Shiny web application based on the D3 JavaScript library: http://microscopebioinformatics.org/. The methods are implemented in R, and are available as part of the MicroScope project at: https://github.com/Bohdan-Khomtchouk/Microscope
Enhanced single-cell RNA-seq workflow reveals coronary artery disease cellular cross-talk and candidate drug targets
BACKGROUND AND AIMS: The atherosclerotic plaque microenvironment is highly complex, and selective agents that modulate plaque stability are not yet available. We sought to develop a scRNA-seq analysis workflow to investigate this environment and uncover potential therapeutic approaches. We designed a user-friendly, reproducible workflow that will be applicable to other disease-specific scRNA-seq datasets. METHODS: Here we incorporated automated cell labeling, pseudotemporal ordering, ligand-receptor evaluation, and drug-gene interaction analysis into a ready-to-deploy workflow. We applied this pipeline to further investigate a previously published human coronary single-cell dataset by Wirka et al. Notably, we developed an interactive web application to enable further exploration and analysis of this and other cardiovascular single-cell datasets. RESULTS: We revealed distinct derivations of fibroblast-like cells from smooth muscle cells (SMCs), and showed the key changes in gene expression along their de-differentiation path. We highlighted several key ligand-receptor interactions within the atherosclerotic environment through functional expression profiling and revealed several avenues for future pharmacological development for precision medicine. Further, our interactive web application, PlaqView (www.plaqview.com), allows lay scientists to explore this and other datasets and compare scRNA-seq tools without prior coding knowledge. CONCLUSIONS: This publicly available workflow and application will allow for more systematic and user-friendly analysis of scRNA datasets in other disease and developmental systems. Our analysis pipeline provides many hypothesis-generating tools to unravel the etiology of coronary artery disease. We also highlight potential mechanisms for several drugs in the atherosclerotic cellular environment. Future releases of PlaqView will feature more scRNA-seq and scATAC-seq atherosclerosis-related datasets to provide a critical resource for the field, and to promote data harmonization and biological interpretation
Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life
In this study, we investigate how an organism's codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement
Gaussian-Distributed Codon Frequencies of Genomes
DNA encodes protein primary structure using 64 different codons to specify 20 different amino acids and a stop signal. Frequencies of codon occurrence when ordered in descending sequence provide a global characterization of a genome’s preference (bias) for using the different codons of the redundant genetic code. Whereas frequency/rank relations have been described by empirical expressions, here we propose a statistical model in which two different forms of codon usage co-exist in a genome. We investigate whether such a model can account for the range of codon usages observed in a large set of genomes from different taxa. The differences in frequency/rank relations across these genomes can be expressed in a single parameter, the proportion of the two codon compartments. One compartment uses different codons with weak bias according to a Gaussian distribution of frequency, the other uses different codons with strong bias. In prokaryotic genomes both compartments appear to be present in a wide range of proportions, whereas in eukaryotic genomes the compartment with Gaussian distribution tends to dominate. Codon frequencies that are Gaussian-distributed suggest that many evolutionary conditions are involved in shaping weakly-biased codon usage, whereas strong bias in codon usage suggests dominance of few evolutionary conditions
A global perspective of codon usage
Codon usage in 2730 genomes is analyzed for evolutionary patterns in the usage of synonymous codons and amino acids across prokaryotic and eukaryotic taxa. We group genomes together that have similar amounts of intra-genomic bias in their codon usage, and then compare how usage of particular different codons is diversified across each genome group, and how that usage varies from group to group. Inter-genomic diversity of codon usage increases with intra-genomic usage bias, following a universal pattern. The frequencies of the different codons vary in robust mutual correlation, and the implied synonymous codon and amino acid usages drift together. This kind of correlation indicates that the variation of codon usage across organisms is chiefly a consequence of lateral DNA transfer among diverse organisms. The group of genomes with the greatest intra-genomic bias comprises two distinct subgroups, with each one restricting its codon usage to essentially one unique half of the genetic code table. These organisms include eubacteria and archaea thought to be closest to the hypothesized last universal common ancestor (LUCA). Their codon usages imply genetic diversity near the hypothesized base of the tree of life. There is a continuous evolutionary progression across taxa from the two extremely diversified usages toward balanced usage of different codons (as approached, e.g. in mammals). In that progression, codon frequency variations are correlated as expected from a blending of the two extreme codon usages seen in prokaryotes. AUTHOR SUMMARY The redundancy intrinsic to the genetic code allows different amino acids to be encoded by up to six synonymous codons. Genomes of different organisms prefer different synonymous codons, a phenomenon known as ‘codon usage bias.’ The phenomenon of codon usage bias is of fundamental interest for evolutionary biology, and is important in a variety of applied settings (e.g., transgene expression). The spectrum of codon usage biases seen in current organisms is commonly thought to have arisen by the combined actions of mutations and selective pressures. This view focuses on codon usage in specific genomes and the consequences of that usage for protein expression. Here we investigate an unresolved question of molecular genetics: are there global rules governing the usage of synonymous codons made by genomic DNA across organisms? To answer this question, we employed a data-driven approach to surveying 2730 species from all kingdoms of the ‘tree of life’ in order to classify their codon usage. A first major result was that the large majority of these organisms use codons rather uniformly on the genome-wide scale, without giving preference to particular codons among possible synonymous alternatives. A second major result was that two compartments of codon usage seem to co-exist and to be expressed in different proportions by different organisms. As such, we investigate how individual different codons are used in different organisms from all taxa. Whereas codon usage is generally believed to be the evolutionary result of both mutations and natural selection, our results suggest a different perspective: the usage of different codons (and amino acids) by different organisms follows a superposition of two distinct patterns of usage. One distinction locates to the third base pair of all different codons, which in one pattern is U or A, and in the other pattern is G or C. This result has two major implications: (1) the variation of codon usage as seen across different organisms is best accounted for by lateral gene transfer among diverse organisms; (2) the organisms that are by protein homology grouped near the base of the ‘tree of life’ comprise two genetically distinct lineages. We find that, over evolutionary time, codon usages have converged from two distinct, non-overlapping usages (e.g., as evident in bacteria and archaea) to a near-uniform, balanced usage of synonymous codons (e.g., in mammals). This shows that the variations of codon (and amino acid) biases reveal a distinct evolutionary progression. We also find that codon usage in bacteria and archaea is most diverse between organisms thought to be closest to the hypothesized last universal common ancestor (LUCA). The dichotomy in codon (and amino acid usages) present near the origin of the current ‘tree of life’ might provide information about the evolutionary development of the genetic code
SUPERmerge: ChIP-seq coverage island analysis algorithm for broad histone marks
Abstract SUPERmerge is a ChIP-seq read pileup analysis and annotation algorithm for investigating alignment (BAM) files of diffuse histone modification ChIP-seq datasets with broad chromatin domains at a single base pair resolution level. SUPERmerge allows flexible regulation of a variety of read pileup parameters, thereby revealing how read islands aggregate into areas of coverage across the genome and what annotation features they map to within individual biological replicates. SUPERmerge is especially useful for investigating low sample size ChIP-seq experiments in which epigenetic histone modifications (e.g., H3K9me1, H3K27me3) result in inherently broad peaks with a diffuse range of signal enrichment spanning multiple consecutive genomic loci and annotated features