29 research outputs found

HeatmapGenerator: high performance RNAseq and microarray visualization software suite to examine differential gene expression levels using an R and C++ hybrid computational pipeline

Author: AI Saeed
Bohdan B Khomtchouk
C Sotiriou
C Trapnell
Claes Wahlestedt
Derek J Van Booven
M Reich
R Core Team
S Anders
Publication venue: 'Springer Science and Business Media LLC'

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding.

Author: Alachram Halima
Ambite José Luis
Ananiadou Sophia
Beißbarth Tim
Chambers Brendan
Christopoulou Fenia
Evans James A
Galstyan Aram
Gao Xin
Garg Sahil
Hermjakob Ulf
Khomtchouk Bohdan B
King Ross
Li Maolin
Li Yu
Marcu Daniel
Matthew Joel
Pan Weidi
Rzhetsky Andrey
Schoene Annika M
Sheng Emily
Soldatova Larisa
Stevens Robert
Wang Kanix
Wingender Edgar
Publication venue: NPJ Syst Biol Appl
Publication date: 01/01/2021

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Goldsmiths Research Online

Directory of Open Access Journals

Chalmers Research

Apollo (Cambridge)

MicroScope: ChIP-seq and RNA-seq software analysis suite for gene expression heatmaps

Author: A Conesa
AI Saeed
AJ Saldanha
BB Khomtchouk
Bohdan B. Khomtchouk
C Kibbey
C Perez-Llamas
C Soneson
C Turkay
C Škuta
Claes Wahlestedt
G Caraux
H Shin
HM Wu
James R. Hennessy
M Reich
MD Robinson
MD Young
MH Tan
R Core Team
RGW Verhaak
S Babicki
T Bailey
T Metsalu
VT Chu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/12/2015

BACKGROUND: Heatmaps are an indispensable visualization tool for examining large-scale snapshots of genomic activity across various types of next-generation sequencing datasets. However, traditional heatmap software does not typically offer multi-scale insight across multiple layers of genomic analysis (e.g., differential expression analysis, principal component analysis, gene ontology analysis, and network analysis) or multiple types of next-generation sequencing datasets (e.g., ChIP-seq and RNA-seq). As such, it is natural to want to interact with a heatmap’s contents using an extensive set of integrated analysis tools applicable to a broad array of genomic data types. RESULTS: We propose a user-friendly ChIP-seq and RNA-seq software suite for the interactive visualization and analysis of genomic data, including integrated features to support differential expression analysis, interactive heatmap production, principal component analysis, gene ontology analysis, and dynamic network analysis. CONCLUSIONS: MicroScope is hosted online as an R Shiny web application based on the D3 JavaScript library: http://microscopebioinformatics.org/. The methods are implemented in R, and are available as part of the MicroScope project at: https://github.com/Bohdan-Khomtchouk/Microscope

Crossref

PubMed Central

University of Miami: Scholarship Miami

Gaussian-Distributed Codon Frequencies of Genomes

Author: Bohdan B. Khomtchouk
Wolfgang Nonner
Publication venue: 'Genetics Society of America'
Publication date: 01/05/2019

DNA encodes protein primary structure using 64 different codons to specify 20 different amino acids and a stop signal. Frequencies of codon occurrence when ordered in descending sequence provide a global characterization of a genome’s preference (bias) for using the different codons of the redundant genetic code. Whereas frequency/rank relations have been described by empirical expressions, here we propose a statistical model in which two different forms of codon usage co-exist in a genome. We investigate whether such a model can account for the range of codon usages observed in a large set of genomes from different taxa. The differences in frequency/rank relations across these genomes can be expressed in a single parameter, the proportion of the two codon compartments. One compartment uses different codons with weak bias according to a Gaussian distribution of frequency, the other uses different codons with strong bias. In prokaryotic genomes both compartments appear to be present in a wide range of proportions, whereas in eukaryotic genomes the compartment with Gaussian distribution tends to dominate. Codon frequencies that are Gaussian-distributed suggest that many evolutionary conditions are involved in shaping weakly-biased codon usage, whereas strong bias in codon usage suggests dominance of few evolutionary conditions

Directory of Open Access Journals

University of Miami: Scholarship Miami

A global perspective of codon usage

Author: Khomtchouk Bohdan B
Nonner Wolfgang
Wahlestedt Claes
Publication venue: Cold Spring Harbor Laboratory
Publication date: 21/09/2016

Codon usage in 2730 genomes is analyzed for evolutionary patterns in the usage of synonymous codons and amino acids across prokaryotic and eukaryotic taxa. We group genomes together that have similar amounts of intra-genomic bias in their codon usage, and then compare how usage of particular different codons is diversified across each genome group, and how that usage varies from group to group. Inter-genomic diversity of codon usage increases with intra-genomic usage bias, following a universal pattern. The frequencies of the different codons vary in robust mutual correlation, and the implied synonymous codon and amino acid usages drift together. This kind of correlation indicates that the variation of codon usage across organisms is chiefly a consequence of lateral DNA transfer among diverse organisms. The group of genomes with the greatest intra-genomic bias comprises two distinct subgroups, with each one restricting its codon usage to essentially one unique half of the genetic code table. These organisms include eubacteria and archaea thought to be closest to the hypothesized last universal common ancestor (LUCA). Their codon usages imply genetic diversity near the hypothesized base of the tree of life. There is a continuous evolutionary progression across taxa from the two extremely diversified usages toward balanced usage of different codons (as approached, e.g. in mammals). In that progression, codon frequency variations are correlated as expected from a blending of the two extreme codon usages seen in prokaryotes. AUTHOR SUMMARY The redundancy intrinsic to the genetic code allows different amino acids to be encoded by up to six synonymous codons. Genomes of different organisms prefer different synonymous codons, a phenomenon known as ‘codon usage bias.’ The phenomenon of codon usage bias is of fundamental interest for evolutionary biology, and is important in a variety of applied settings (e.g., transgene expression). 
The spectrum of codon usage biases seen in current organisms is commonly thought to have arisen by the combined actions of mutations and selective pressures. This view focuses on codon usage in specific genomes and the consequences of that usage for protein expression. Here we investigate an unresolved question of molecular genetics: are there global rules governing the usage of synonymous codons made by genomic DNA across organisms? To answer this question, we employed a data-driven approach to surveying 2730 species from all kingdoms of the ‘tree of life’ in order to classify their codon usage. A first major result was that the large majority of these organisms use codons rather uniformly on the genome-wide scale, without giving preference to particular codons among possible synonymous alternatives. A second major result was that two compartments of codon usage seem to co-exist and to be expressed in different proportions by different organisms. As such, we investigate how individual different codons are used in different organisms from all taxa. Whereas codon usage is generally believed to be the evolutionary result of both mutations and natural selection, our results suggest a different perspective: the usage of different codons (and amino acids) by different organisms follows a superposition of two distinct patterns of usage. One distinction locates to the third base pair of all different codons, which in one pattern is U or A, and in the other pattern is G or C. This result has two major implications: (1) the variation of codon usage as seen across different organisms is best accounted for by lateral gene transfer among diverse organisms; (2) the organisms that are by protein homology grouped near the base of the ‘tree of life’ comprise two genetically distinct lineages. 
We find that, over evolutionary time, codon usages have converged from two distinct, non-overlapping usages (e.g., as evident in bacteria and archaea) to a near-uniform, balanced usage of synonymous codons (e.g., in mammals). This shows that the variations of codon (and amino acid) biases reveal a distinct evolutionary progression. We also find that codon usage in bacteria and archaea is most diverse between organisms thought to be closest to the hypothesized last universal common ancestor (LUCA). The dichotomy in codon (and amino acid) usages present near the origin of the current ‘tree of life’ might provide information about the evolutionary development of the genetic code.

University of Miami: Scholarship Miami

SUPERmerge: ChIP-seq coverage island analysis algorithm for broad histone marks

Author: Booven Derek
Khomtchouk Bohdan B
Wahlestedt Claes
Publication venue: Cold Spring Harbor Laboratory
Publication date: 29/03/2017

SUPERmerge is a ChIP-seq read pileup analysis and annotation algorithm for investigating alignment (BAM) files of diffuse histone modification ChIP-seq datasets with broad chromatin domains at a single base pair resolution level. SUPERmerge allows flexible regulation of a variety of read pileup parameters, thereby revealing how read islands aggregate into areas of coverage across the genome and what annotation features they map to within individual biological replicates. SUPERmerge is especially useful for investigating low sample size ChIP-seq experiments in which epigenetic histone modifications (e.g., H3K9me1, H3K27me3) result in inherently broad peaks with a diffuse range of signal enrichment spanning multiple consecutive genomic loci and annotated features.
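
The idea of read islands aggregating into broader areas of coverage can be illustrated with a generic interval-merging sketch. This is not SUPERmerge's actual implementation, which the abstract does not describe in detail; the function name `merge_islands` and the `max_gap` parameter are hypothetical, and islands are assumed to be half-open `(start, end)` coordinates on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumed convention


def merge_islands(islands: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge islands whose separation is at most max_gap bases.

    Illustrative only: max_gap is a hypothetical tuning parameter,
    not one taken from the SUPERmerge paper.
    """
    merged: List[Interval] = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous island: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far away: start a new area of coverage.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_islands([(0, 10), (12, 20), (50, 60)], max_gap=5))
```

Raising `max_gap` in this sketch makes nearby islands fuse into fewer, broader regions, which is the kind of behaviour one would tune when peaks are inherently diffuse.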

Crossref

University of Miami: Scholarship Miami

Codon usage is a stochastic process across genetic codes of the kingdoms of life

Author: Khomtchouk Bohdan B
Nonner Wolfgang
Wahlestedt Claes
Publication venue: Cold Spring Harbor Laboratory
Publication date: 27/07/2016

DNA encodes protein primary structure using 64 different codons to specify 20 different amino acids and a stop signal. To uncover rules of codon use, ranked codon frequencies have previously been analyzed in terms of empirical or statistical relations for a small number of genomes. These descriptions fail on most genomes reported in the Codon Usage Tabulated from GenBank (CUTG) database. Here we model codon usage as a random variable. This stochastic model provides accurate, one-parameter characterizations of 2210 nuclear and mitochondrial genomes represented with > 10^4 codons/genome in CUTG. We show that ranked codon frequencies are well characterized by a truncated normal (Gaussian) distribution. Most genomes use codons in a near-uniform manner. Lopsided usages are also widely distributed across genomes but less frequent. Our model provides a universal framework for investigating determinants of codon use.
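
As a rough illustration of what "ranked codon frequencies" means, the following sketch counts in-frame codons in a coding sequence and orders them by descending relative frequency. It does not reproduce the authors' stochastic model or their CUTG processing; the function name `ranked_codon_frequencies` is hypothetical, and the input is assumed to be a DNA string read in frame from its first base.

```python
from collections import Counter
from typing import List, Tuple


def ranked_codon_frequencies(seq: str) -> List[Tuple[str, float]]:
    """Return (codon, relative frequency) pairs in descending frequency order.

    Assumes seq is an in-frame DNA coding sequence; codons containing
    characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(
        ((codon, n / total) for codon, n in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


if __name__ == "__main__":
    for codon, freq in ranked_codon_frequencies("ATGAAAAAATAG"):
        print(codon, freq)
```

A frequency/rank curve of the kind discussed in the abstract would be obtained by plotting the second element of each pair against its position in the returned list.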

Crossref

University of Miami: Scholarship Miami

Survival guide to organic chemistry bridging the gap from general chemistry

Author: Khomtchouk Bohdan B
McMahon Patrick E
Wahlestedt Claes
Publication venue: Boca Raton
Publication date: 01/01/2017

"The Survival Guide to Organic Chemistry: Bridging the Gap from General Chemistry enables organic chemistry students to bridge the gap between general chemistry and organic chemistry. It makes sense of the myriad of in-depth concepts of organic chemistry, without overwhelming them in the necessary detail often given in a complete organic chemistry text. Here, the topics covered span the entire standard organic chemistry curriculum. The authors describe subjects which require further explanation, offer alternate viewpoints for understanding and provide hands-on practical problems and solutions to help master the material. This text ultimately allows students to apply key ideas from their general chemistry curriculum to key concepts in organic chemistry."--Back cover

University of Miami: Scholarship Miami

shinyheatmap: Ultra fast low memory heatmap web interface for big data genomics.

Author: Bohdan B Khomtchouk
Claes Wahlestedt
James R Hennessy
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2017

BACKGROUND: Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets, especially across single-cell sequencing studies. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps. RESULTS: We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Also, shinyheatmap features a built-in high performance web plug-in, fastheatmap, for rapidly plotting interactive heatmaps of datasets as large as 10^5-10^7 rows within seconds, effectively shattering previous performance benchmarks of heatmap rendering speed. CONCLUSIONS: shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap.
Users can access fastheatmap directly from within the shinyheatmap web interface, and all source code has been made publicly available on GitHub: https://github.com/Bohdan-Khomtchouk/fastheatmap

Directory of Open Access Journals