Network-based visualisation and analysis of next-generation sequencing (NGS) data
Next-generation sequencing (NGS) technologies have revolutionised research into the
nature and diversity of genomes and transcriptomes. Since the initial description of
these technology platforms over a decade ago, massively parallel RNA sequencing
(RNA-seq) has driven many advances in the characterization and quantification of
transcriptomes. RNA-seq is a powerful gene expression profiling technology that
enables transcript discovery and provides a far more precise measure of the levels of
transcripts and their isoforms than other methods, e.g. microarrays.
However, the analysis of RNA-seq data remains a significant challenge for many
biologists. The data generated is large and the tools for its assembly, analysis and
visualisation are still under development. Assemblies of reads can be inspected using
tools such as the Integrative Genomics Viewer (IGV) where visualisation of results
involves ‘stacking’ the reads onto a reference genome. Whilst sufficient for many
needs, when the underlying variance of the genome or transcript assemblies is
complex, this visualisation method can be limiting; errors in assembly can be
difficult to spot and visualisation of splicing events may be challenging.
Data visualisation is increasingly recognised as an essential component of genomic
and transcriptomic data analysis, enabling large and complex datasets to be better
understood. An approach that has been gaining traction in biological research is
based on the application of network visualisation and analysis methods. Networks
consist of nodes connected by edges (lines), where nodes usually represent entities
and edges the relationships between them. These are now widely used for plotting
experimentally or computationally derived relationships between genes and proteins.
The overall aim of this PhD project was to explore the use of network-based
visualisation in the analysis and interpretation of RNA-seq data. In chapter 2, I
describe the development of a data pipeline that has been designed to go from ‘raw’
RNA-seq data to a file format which supports data visualisation as a ‘DNA assembly
graph’. In DNA assembly graphs, nodes represent sequence reads and edges denote a
homology between reads above a defined threshold. Following the mapping of reads
to a reference sequence and defining which reads map to a given locus, pairwise
sequence alignments are performed between reads using MegaBLAST. This provides
a weighted similarity score that is used to define edges between reads. Visualisation
of the resulting networks is then carried out using BioLayout Express3D, which can
render large networks in 3-D, thereby allowing a better appreciation of the often-complex
network structure. This pipeline has formed the basis for my subsequent
work on exploring and analysing alternative splicing in human RNA-seq data. In
the second half of this chapter, I provide a series of tutorials aimed at different types
of users allowing them to perform such analyses. The first tutorial is aimed at
computational novices who might want to generate networks using a web-browser
and pre-prepared data. Other tutorials are designed for use by more advanced users
who can access the code for the pipeline through GitHub or via an Amazon Machine
Image (AMI).
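The graph construction described above (reads as nodes, edges wherever pairwise similarity exceeds a defined threshold) can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: the function name `build_read_graph`, the `(read_a, read_b, score)` tuple format and the threshold value are assumptions made purely for illustration, and real MegaBLAST output would first need to be parsed into such tuples.

```python
from collections import defaultdict


def build_read_graph(hits, min_score):
    """Build an undirected, weighted read graph from pairwise similarity hits.

    hits      -- iterable of (read_a, read_b, score) tuples (hypothetical format)
    min_score -- an edge is kept only when score >= min_score (assumed cut-off)
    """
    graph = defaultdict(dict)
    for read_a, read_b, score in hits:
        if read_a == read_b:
            continue  # ignore self-hits
        if score < min_score:
            continue  # similarity too weak to define an edge
        # Keep the best score if the same pair is reported more than once.
        best = max(score, graph[read_a].get(read_b, 0.0))
        graph[read_a][read_b] = best
        graph[read_b][read_a] = best
    return dict(graph)


# Toy input: three reads, one weak hit and one self-hit.
hits = [
    ("read1", "read2", 180.0),
    ("read2", "read3", 150.0),
    ("read1", "read3", 40.0),
    ("read2", "read2", 200.0),
]
graph = build_read_graph(hits, min_score=100.0)
```

In this toy example read1 and read3 are connected only through read2, which is the kind of chain that gives well-covered transcripts their linear appearance once laid out.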
In chapter 3, the utility of network-based visualisations of RNA-seq data is explored
using data processed through the pipeline described in Chapter 2. The aim of the
work described in this chapter was to better understand the basic principles and
challenges associated with network visualisation of RNA-seq data, in particular how
it could be used to visualise transcript structure and splice-variation. These analyses
were performed on data generated from four samples of human fibroblasts taken at
different time points during their entry into cell division. One of the first challenges
encountered was the fact that the existing network layout algorithm (Fruchterman-
Reingold) implemented within BioLayout Express3D did not result in an optimal
layout of the unusual graph structures produced by these analyses. Following the
implementation of the more advanced layout algorithm FMMM within the tool,
network structure could be far better appreciated. Using this layout method, the
majority of genes sequenced to an adequate depth assemble into networks with a
linear ‘corkscrew’ appearance and, when representing single-isoform transcripts, add
little to existing views of these data. However, in a small number of cases (~5%), the
networks generated from transcripts expressed in human fibroblasts possess more
complex structures, with ‘loops’, ‘knots’ and multiple ends being observed. In a
majority of cases examined, these loops were associated with alternative splicing
events, a fact confirmed by RT-PCR analyses. Other DNA assembly networks
representing the mRNAs for genes such as MKI67 showed knot-like structures,
which were found to be due to the presence of repetitive sequence within an exon of
the gene. In another case, CENPO, the unusual structure observed was due to reads
derived from the overlapping ADCY3 gene, present on the opposite strand,
being wrongly mapped to CENPO. Finally, I explored the use of a
network reduction strategy as an approach to visualising highly expressed genes such
as GAPDH and TUBA1C. Having successfully demonstrated the utility of networks
in analysing transcript isoforms in data derived from a single cell type, I set out to
explore their utility in analysing transcript variation in tissue data where multiple
isoforms expressed by different cells within the tissue might be present in a given
sample.
In chapter 4, I explore the analysis of transcript variation in an RNA-seq dataset
derived from human tissue. The first half of this chapter describes the quality control
of these data, again using a network-based approach but this time based on the
correlation in expression between genes and samples. Of the 95 samples derived
from 27 human tissues, 77 passed the quality control. A network was constructed
using a correlation threshold of r ≥ 0.9, which comprised 6,109 nodes (genes) and
1,091,477 edges (correlations), and was then clustered. Subsequently, the profile and gene
content of each cluster were examined and enrichment of GO terms analysed. In the
second half of this chapter, the aim was to detect and analyse alternative splicing
events between different tissues using the rMATS tool. By using a false-discovery
rate (FDR) cut-off of < 0.01, I found that in comparisons of brain vs. heart, brain vs.
liver and heart vs. liver, the program reported 4,992, 4,804 and 3,990 splicing events,
respectively. Of these events, only 78 splicing events (52 genes) had an exon
inclusion level of more than 50% and an expression level above 30 FPKM. To further explore
the sometimes-complex structure of transcript diversity derived from tissue, RNA-seq
assembly networks for KLC1, SORBS2, GUK1, and TPM1 were explored. Each
of these networks showed different types of alternative splicing events and it was
sometimes difficult to determine the isoforms expressed between tissues using other
approaches. For instance, there is an issue in visualising the read assembly of long
genes such as KLC1 and SORBS2 using Sashimi plots or even Vials, simply because
of the number of exons and the size of their genomic loci. In the case of GUK1,
tissue-specific isoform expression was observed when a network of three tissues was
combined. Arguably the most complex analysis is the network of TPM1 where the
uniquification step was employed for this highly expressed gene.
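The correlation network used for quality control in this chapter (genes as nodes, with an edge wherever the expression correlation reaches r ≥ 0.9) can be sketched as follows. This is a toy illustration in plain Python under stated assumptions, not the tool or code used in the thesis: it assumes the Pearson coefficient, and the gene names, expression values and the helper names `pearson` and `correlation_edges` are all invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_edges(profiles, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with r >= threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_a], profiles[gene_b])
        if r >= threshold:
            edges.append((gene_a, gene_b, r))
    return edges


# Invented expression profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_edges(profiles, threshold=0.9)
```

Only the geneA and geneB pair passes the threshold here; geneC is anti-correlated with both and stays unconnected. The all-pairs loop is quadratic in the number of genes, so a real dataset of thousands of genes would normally use a vectorised correlation matrix instead.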
In chapter 5, I describe usability testing of the NGS Graph Generator web application
and of the visualisation of RNA-seq assemblies as networks using BioLayout Express3D. This
testing was important to ensure that the application is well received and utilised by
users. Almost all participants in the usability test agreed that the application would
encourage biologists to visualise and understand alternative splicing alongside
existing tools. The participants agreed that Sashimi plots are rather difficult to view
and interpret, and might cause some interesting features to be lost. However, participants
also identified areas where the application needs improvement, such as the capability
to analyse big networks in a short time and side-by-side analysis of networks with Sashimi
plots and Ensembl. Additional information about the network would also be necessary to
improve the understanding of alternative splicing.
In conclusion, this work demonstrates the utility of network visualisation of RNA-seq
data, where the unusual structure of these networks can be used to identify issues
in assembly, repetitive sequences within transcripts and splice variation. As such,
this approach has the potential to significantly improve our understanding of
transcript complexity. Overall, this thesis demonstrates that network-based
visualisation provides a new and complementary approach to characterise alternative
splicing from RNA-seq data and has the potential to be useful for the analysis and
interpretation of other kinds of sequencing data.
Two sides of the same coin: The roles of KLF6 in physiology and pathophysiology
The Krüppel-like factor (KLF) family of proteins controls several key biological processes, including proliferation, differentiation, metabolism, apoptosis and inflammation. Dysregulation of KLF functions has been shown to disrupt cellular homeostasis and contribute to disease development. KLF6 is a relevant example; a range of functional and expression assays have suggested that the dysregulation of KLF6 contributes to the onset of cancer, inflammation-associated diseases and cardiovascular diseases. KLF6 expression is either suppressed or elevated depending on the disease, and this is largely due to alternative splicing events producing KLF6 isoforms with specialised functions. Hence, the aim of this review is to discuss the known aspects of KLF6 biology, covering the gene and protein architecture, gene regulation, post-translational modifications and functions of KLF6 in health and disease. We put special emphasis on the equivocal roles of its full-length and spliced variants. We also deliberate on therapeutic strategies for KLF6 and its associated signalling pathways. Finally, we provide compelling basic and clinical questions to enhance the knowledge and research on elucidating the roles of KLF6 in physiological and pathophysiological processes. © 2020 by the authors. Licensee MDPI, Basel, Switzerland.
An investigation of cytokines and chemokines interaction using network analysis in covid-19
The emergence of the pandemic coronavirus disease 2019 (COVID-19), caused by infection with SARS-CoV-2, has become a threat to humanity. In terms of its pathophysiological pathways, a wide variety of biomolecules are activated as part of the immunological response. It is therefore relevant to compare the responses of human respiratory cell lines to infection with SARS-CoV-2 and with other respiratory viruses. In this study, gene expression profiles of GSE147507 from the Gene Expression Omnibus (GEO) were used to compare the transcriptional response to SARS-CoV-2 with that to other respiratory viruses, including human parainfluenza virus 3 (HPIV3), respiratory syncytial virus (RSV), and influenza A virus (IAV), in human respiratory cell lines. NetworkAnalyst 3.0 software was used to analyse these gene expression data via its intuitive web interface. Through its well-established statistical procedures and state-of-the-art data visualization techniques, it allows us to navigate large, complex gene expression data sets to determine important features, patterns, functions and connections that could lead to new biological hypotheses. The raw RNA-sequencing data undergo processing, including filtering, quality checking and normalization, before data analysis and interpretation. The edgeR package was used to identify differentially expressed genes (DEGs) in respiratory virus-infected human lung epithelium-derived cell lines, such as lung alveolar cells (A549), A549 cells expressing ACE2 (A549-ACE2) and cultured human airway epithelial cells (NHBE). P |2| were set as thresholds for identifying DEGs. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were applied for functional annotation and pathway analysis.
Gene expression data were interpreted through visualization of volcano plots and analysis of protein-protein interactions (PPIs), using the STRING interactome to reveal the functional associations between proteins on a genome-wide scale.
Draft genome sequence of Pasteurella multocida subsp. multocida strain PMTB, isolated from a buffalo
Pasteurella multocida serotypes B:2 and E:2 are the main causative agents of ruminant hemorrhagic septicemia in Asia and Africa, respectively. Pasteurella multocida strain PMTB was isolated from a buffalo with hemorrhagic septicemia and has been determined to be serotype B:2. Here we report the draft genome sequence of strain PMTB.
Integration of RNA-Seq and proteomics data identifies glioblastoma multiforme surfaceome signature
Background: Glioblastoma multiforme (GBM) is a highly lethal, stage IV brain tumour with a prevalence of approximately 2 per 10,000 people globally. The cell surface proteins, or surfaceome, serve as an information gateway in many oncogenic signalling pathways and are important in modulating cancer phenotypes. Dysregulation of surfaceome expression and activity has been shown to promote tumorigenesis. The expression of the GBM surfaceome is a case in point; OMICS screening in a cell-based system identified that this sub-proteome is largely perturbed in GBM. Additionally, since these cell surface proteins are ‘directly’ accessible to drugs, they are appealing targets for cancer therapy. However, a comprehensive GBM surfaceome landscape has not yet been fully defined. Thus, this study aimed to define GBM-associated surfaceome genes and identify key cell-surface genes that could potentially be developed as novel GBM biomarkers for therapeutic purposes. Methods: We integrated RNA-Seq data from the TCGA GBM (n = 166) and GTEx normal brain cortex (n = 408) databases to identify the significantly dysregulated surfaceome in GBM. This was followed by an integrative analysis combining transcriptomics, proteomics and protein-protein interaction network data to prioritize the high-confidence GBM surfaceome signature. Results: Of the 2381 significantly dysregulated genes in GBM, 395 genes were classified as surfaceome. Via the integrative analysis, we identified a high-confidence GBM molecular signature of six genes, HLA-DRA, CD44, SLC1A5, EGFR, ITGB2 and PTPRJ, which were significantly upregulated in GBM. The expression of these genes was validated in an independent transcriptomics database, which confirmed their upregulated expression in GBM. Importantly, high expression of CD44, PTPRJ and HLA-DRA is significantly associated with poor disease-free survival.
Lastly, using the DrugBank database, we identified several clinically approved drugs targeting the GBM molecular signature, suggesting potential drug repurposing. Conclusions: In summary, we identified and highlighted the key GBM surface-enriched repertoires that could be biologically relevant in supporting GBM pathogenesis. These genes could be further interrogated experimentally in future studies that could lead to efficient diagnostic/prognostic markers or potential treatment options for GBM.
Deep Transcriptome Sequencing of Pediatric Acute Myeloid Leukemia Patients at Diagnosis, Remission and Relapse: Experience in 3 Malaysian Children in a Single Center Study
Among the many types of leukemia, acute myeloid leukemia (AML) accounts for 20% of diagnosed hematological malignancies in pediatric patients (Meshinchi and Arceci, 2007; de Rooij et al., 2015). The standard chemotherapy regimen remains the first-line treatment for pediatric AML; however, nearly 40% of AML patients may suffer a relapse and eventually die from the disease (de Rooij et al., 2015). Similarly, it has been reported that 50% of pediatric AML patients relapsed within 12–18 months of diagnosis and that 45% of those who relapsed were not expected to survive (Creutzig et al., 2014). Despite advances in cytogenetic analysis through fluorescence in situ hybridization and multiplex PCR, there is still a need for better and more comprehensive molecular profiling. For instance, microarrays have long been used to study the gene expression profiles of AML patients. Differences in gene expression profiles have enabled clinicians to tailor better treatment for patients and to predict whether patients have a tendency to relapse (Goswami et al., 2009). In a recent study, Handschuh et al. (2018) reported that three genes, ANXA3, S100A9, and WT1, can differentiate between different prognostic types of AML. This outcome was in agreement with a study by Shimada et al. (2012), in which high expression of the WT1 gene showed prognostic impact in pediatric AML. Another study, by Jo et al. (2015), reported that high expression of EVI1 and MEL1 could predict the prognosis of pediatric AML. However, none of the biomarkers identified in these studies has been translated into clinical use. Therefore, the search continues for additional promising biomarkers, notably novel transcripts, novel fusion genes and non-coding RNAs, which are not represented on the microarray platform.
Transcriptome sequencing through next-generation sequencing represents an effective approach to discovering new genetic information on gene expression that may contribute to tumorigenesis. Notably, several novel and rare fusion transcripts have been identified in AML patients via RNA sequencing (Padella et al., 2015). A recent study combining whole genome sequencing, whole exome sequencing and RNA sequencing in pediatric cancers identified 240 pathogenic variants with increased sensitivity (Rusch et al., 2018). Previous studies in relapsed AML have shown that the cells acquired additional genetic mutations that were either different or evolved from subclones of diagnostic blast cells (Padella et al., 2015; Rusch et al., 2018). Nevertheless, little is known about the genetic changes at the transcriptomic level at the diagnosis, remission and relapse stages of the same patients, especially in the Malaysian population.
Genome Sequencing and Bioinformatic Analysis of Pasteurella multocida Serotype B:2 Strain PMTB
Pasteurella multocida is a Gram-negative bacterium that is the causative agent of a wide range of diseases in animals. This organism usually resides in the mucous membranes of the intestinal, genital and respiratory tissues and is an opportunistic pathogen that causes fowl cholera, bovine haemorrhagic septicaemia and porcine atrophic rhinitis. So far, only the complete genome of P. multocida serotype A:3 strain Pm70 has been elucidated. This study was conducted to sequence the genome of P. multocida serotype B:2 strain PMTB and to compare it with the complete genome of Pm70. A total of 7.2 million sequence reads were generated on the Illumina Genome Analyzer. De novo sequence assembly followed by comparison with the Pm70 reference sequence produced a partial, near-complete genome of PMTB with missing nucleotide sequences located in 81 gaps. The partial genome of P. multocida strain PMTB is 97.78% identical to Pm70. The estimated size of the partial genome of PMTB is 2,208,894 bp, while the Pm70 genome is 2,257,487 bp. In addition, both genomes have a similar GC content of 40 to 41%. Analysis using GeneMark software indicated that the partial genome of PMTB contains 2078 genes, while the reference genome Pm70 has 2014 genes. Gene comparison between PMTB and Pm70, in which PMTB sequences were constructed as a database and searched with BLAST against Pm70 sequences, showed that 223 unique genes are found in PMTB but absent from Pm70. These unique genes are probably specific to serotype B:2, or were not detected in Pm70 by the sequence analysis because they are located in the missing sequences in the gaps. On the other hand, a total of 49 genes are present in Pm70 but were not detected in the partial PMTB genome. Sequence analysis also showed the presence of genes with high similarity (99 to 100%) to genes from previously characterized serotype B:2 strains, genes that are also found in other P. multocida serotypes, and genes found in other bacteria, especially Haemophilus influenzae, Actinobacillus minor and Vibrio cholerae.
Based on the partial genome sequence analysis, there are probably several virulence genes and virulence-associated genes in the P. multocida PMTB genome. These include genes for adhesin proteins [type 4 fimbria (ptfA)], serotype-specific capsular polysaccharide and lipopolysaccharide, iron acquisition-related genes such as exbD and tonB, a gene associated with hemoglobin-binding protein (HgbA), a gene encoding transferrin-binding protein (tbpA), and several uncharacterized secreted enzymes and proteins that play an important role in the pathogenicity of the disease. Complete genome sequencing and genome-wide functional genomics studies of the P. multocida PMTB genome will be able to provide valuable information on the pathogenicity of haemorrhagic septicaemia in ruminants.
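As a small worked illustration of two figures quoted in this abstract, the Python sketch below shows how a GC content percentage is computed from a sequence and how the size gap between the two assemblies follows from the reported lengths. The function name `gc_content` and the toy sequence are assumptions for illustration; the sketch is not the software used in the study, and only the two genome sizes are taken from the text.

```python
def gc_content(sequence):
    """Return the GC content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted, so gap or
    ambiguity characters such as N do not distort the result.
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Toy sequence (invented): 4 of its 10 countable bases are G or C.
toy_gc = gc_content("ATGCNNATGCAT")

# Genome sizes as reported in the abstract above, in base pairs.
pmtb_size = 2_208_894
pm70_size = 2_257_487
size_gap = pm70_size - pmtb_size  # bases by which the partial PMTB genome is shorter
```

Note that the size gap is a different quantity from the 97.78% sequence identity reported above, which comes from comparing the aligned sequences rather than their lengths.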
An In Silico Design of Peptides Targeting the S1/S2 Cleavage Site of the SARS-CoV-2 Spike Protein
SARS-CoV-2, responsible for the COVID-19 pandemic, invades host cells via its spike protein, which includes critical binding regions, such as the receptor-binding domain (RBD), the S1/S2 cleavage site, the S2 cleavage site, and heptad-repeat (HR) sections. Peptides targeting the RBD and HR1 inhibit binding to host ACE2 receptors and the formation of the fusion core. Other peptides target proteases, such as TMPRSS2 and cathepsin L, to prevent the cleavage of the S protein. However, research has largely ignored peptides targeting the S1/S2 cleavage site. In this study, bioinformatics was used to investigate the binding of the S1/S2 cleavage site to host proteases, including furin, trypsin, TMPRSS2, matriptase, cathepsin B, and cathepsin L. Peptides targeting the S1/S2 site were designed by identifying binding residues. Peptides were docked to the S1/S2 site using HADDOCK (High-Ambiguity-Driven protein–protein DOCKing). Nine peptides with the lowest HADDOCK scores and strong binding affinities were selected, which was followed by molecular dynamics simulations (MDSs) for further investigation. Among these peptides, BR582 and BR599 stand out. They exhibited relatively high interaction energies with the S protein at −1004.769 ± 21.2 kJ/mol and −1040.334 ± 24.1 kJ/mol, respectively. It is noteworthy that the binding of these peptides to the S protein remained stable during the MDSs. In conclusion, this research highlights the potential of peptides targeting the S1/S2 cleavage site as a means to prevent SARS-CoV-2 from entering cells, and contributes to the development of therapeutic interventions against COVID-19.
Biosensors based on Sphyraena barracuda muscle cholinesterase inhibition for insecticides (chlorpyrifos, malathion, diazinon and dimethoate) detection
Fish are living organisms that can be used as sensitive biomarkers, while their enzymes, such as cholinesterase (ChE), can act as biosensors able to detect the presence of organophosphates in water. This study was carried out to extract and purify the ChE enzyme from the muscle tissue of Sphyraena barracuda. The muscle tissue was homogenized and the enzyme was then purified by ion exchange chromatography using DEAE cellulose as the column matrix. The Ellman assay and the Bradford assay were carried out to determine the enzyme activity and protein content, respectively. The Sphyraena barracuda ChE was successfully purified with a total recovery of 32.74 % and a 2.26-fold purification. The optimal conditions for ChE activity were pH 9 in Tris-HCl buffer and a temperature of 25 °C. The enzyme was also specific to the ATC substrate, as shown by its high catalytic efficiency. All organophosphates tested at a concentration of 1 mg/L were able to inhibit the activity of ChE: chlorpyrifos (100 %), malathion (100 %), diazinon (91.5 %) and dimethoate (35.8 %). These findings show that the partially purified ChE from the muscle tissue of Sphyraena barracuda is an alternative bioindicator candidate for the detection of organophosphates in water.
Antioxidant screening of Garcinia forbesii originated from Sabah
Garcinia forbesii is a wild plant that has long been used traditionally, with broad utility in several fields such as medicine, cosmetics, food and nutraceuticals. Increasing awareness worldwide of the use of phytochemicals and other plant-derived products has broadened the study of bioactivities across several industrial sectors. Therefore, the present study aims to screen the antioxidant and antimicrobial properties of the fruits and leaves of Garcinia forbesii. The methanolic, hexanic and ethyl acetate extracts were obtained by maceration, then isolated and purified by thin-layer chromatography (TLC) and column chromatography (CC). The highest extraction yields were obtained from the methanol extracts of both fruits and leaves, at 9.26 ± 0.34 g and 7.04 ± 0.21 g, respectively. The antimicrobial activity of the extracts against Escherichia coli was determined using the Kirby-Bauer method. The disc diffusion assay showed that inhibition of E. coli growth was attributable entirely to the hexanic extracts of the fruits and leaves of G. forbesii, while all other extracts displayed minimal or no inhibition zones. Among all the fractions, the hexanic extract of the leaves produced the largest inhibition zone (11.83 ± 1.04 mm), larger than that of the fruits (9.33 ± 1.53 mm). The antioxidant activities of the leaf extracts of G. forbesii in the reducing power, FRAP (ferric reducing antioxidant power) and DPPH assays followed the same order: methanolic > ethyl acetate > hexanic. Meanwhile, the hexane, methanol and ethyl acetate extracts of the fruits of G. forbesii showed IC50 values of 1.05%, 3.35% and 4.44%, while those of the leaves were 1.19%, 2.4% and 8.94%, respectively. Methanol is therefore a better solvent for extracting most of the antioxidant components from G. forbesii leaves.