
    Bioinformatics Applications Based On Machine Learning

    The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all related to the application of IT-related techniques within the bioinformatics sector: from new applications created by adapting and applying existing techniques to the creation of new methodologies to solve existing problems.

    Metagenomic characterisation of the viral community of Lough Neagh, the largest freshwater lake in Ireland

    Lough Neagh is the largest and the most economically important lake in Ireland. It is also one of the most nutrient rich amongst the world's major lakes. In this study, 16S rRNA analysis of total metagenomic DNA from the water column of Lough Neagh has revealed a high proportion of Cyanobacteria and low levels of Actinobacteria, Acidobacteria, Chloroflexi, and Firmicutes. The planktonic virome of Lough Neagh has been sequenced and 2,298,791 2×300 bp Illumina reads analysed. Comparison with previously characterised lakes demonstrates that the Lough Neagh viral community has the highest level of sequence diversity. Only about 15% of reads had homologs in the RefSeq database and tailed bacteriophages (Caudovirales) were identified as a major grouping. Within the Caudovirales, the Podoviridae and Siphoviridae were the two most dominant families (34.3% and 32.8% of the reads with sequence homology to the RefSeq database), while ssDNA bacteriophages constituted less than 1% of the virome. Putative cyanophages were found to be abundant. 66,450 viral contigs were assembled with the largest one being 58,805 bp; its existence, and that of another 34,467 bp contig, in the water column was confirmed. Analysis of the contigs confirmed the high abundance of cyanophages in the water column

    Multi-omics molecular profiling of lung tumours

    Lung Cancer (LC) is one of the most common malignancies and is the leading cause of cancer death worldwide among both men and women. Current LC classifications are based on histopathological features which poorly reflect the molecular diversity of these tumours. Consequently, primary and secondary drug resistance are very frequent, and a high mortality is usual in LC patients. Despite the fact that LC has been intensively studied, there is a lack of effective biomarkers for early detection, stratification and prognosis. Integration of omics data is a powerful approach that can be used to identify molecular subgroups relevant in the clinical setting. This thesis addresses this challenge by characterising the molecular alterations accompanying LC at the genetic and DNA methylation level, using a combination of Whole-Exome Sequencing (WES), Targeted Capture Sequencing (TCS), Single Nucleotide Polymorphism (SNP) genotyping, Whole-Genome Bisulfite Sequencing and RNA-sequencing. The integration of different types of omics data first validated previous molecular alterations in frequently diagnosed LC tumours. This allowed comparison of the genomic and epigenomic landscapes between these common and rarer LC subtypes. Next, novel molecular subgroups of Non-Small Cell Lung Cancer (NSCLC) tumours with poor prognosis, as well as subgroups of Lung Carcinoids (L-CDs, an understudied LC subtype), have been identified and their molecular alterations and signatures characterised. Significant associations with histological features and gene expression programmes have been found by using several bioinformatic tools. These results show the value of multi-omics approaches to better understand the molecular mechanisms underlying LC and to identify new biomarkers. Importantly, some of these findings may be translatable and are likely to improve the detection, monitoring and stratification for targeted therapies in LC patients.

    Transcriptional Regulation of Cell-type Specific Expression in the Arabidopsis Root

    Characterizing transcription factor interactions with their corresponding binding sites is crucial for understanding how gene expression is regulated by DNA sequence. A more comprehensive understanding of this process could have benefits in synthetic promoter design and creation of genetically modified organisms. Herein, the promoters of genes exhibiting cell-type specific expression within a single layer of the Arabidopsis root are analyzed to identify cis-regulatory motifs implicated in cell-type specific expression. De novo motif prediction identifies multiple motif candidates overrepresented in the promoter sequences of co-expressed genes specific for epidermal, cortex, and endodermal expression. Several endodermal specific putative motifs are further analyzed for positional biases and tested in planta. A priori mapping of known cis-regulatory motifs catalogued in publicly available databases is also performed. Results show that cell types contain different statistically significant enrichment patterns of both predicted and known cis-regulatory motifs. These results will help future research in designing cell-type specific synthetic promoters.

    The immunopeptidome landscape associated with T cell infiltration, inflammation and immune editing in lung cancer.

    One key barrier to improving efficacy of personalized cancer immunotherapies that are dependent on the tumor antigenic landscape remains patient stratification. Although patients with CD3+CD8+ T cell-inflamed tumors typically show better response to immune checkpoint inhibitors, it is still unknown whether the immunopeptidome repertoire presented in highly inflamed and noninflamed tumors is substantially different. We surveyed 61 tumor regions and adjacent nonmalignant lung tissues from 8 patients with lung cancer and performed deep antigen discovery combining immunopeptidomics, genomics, bulk and spatial transcriptomics, and explored the heterogeneous expression and presentation of tumor (neo)antigens. In the present study, we associated diverse immune cell populations with the immunopeptidome and found a relatively higher frequency of predicted neoantigens located within HLA-I presentation hotspots in CD3+CD8+ T cell-excluded tumors. We associated such neoantigens with immune recognition, supporting their involvement in immune editing. This could have implications for the choice of combination therapies tailored to the patient's mutanome and immune microenvironment.

    Interruptional Activity and Simulation of Transposable Elements

    Transposable elements (TEs) are interspersed DNA sequences that can move or copy to new positions within a genome. The active TEs along with the remnants of many transposition events over millions of years constitute 46.69% of the human genome. TEs are believed to promote speciation and their activities play a significant role in human disease. The 22 AluY and 6 AluS TE subfamilies have been the most active TEs in recent human history, and their transposition has been implicated in several inherited human diseases and in various forms of cancer through integration into genes. Therefore, understanding the transposition activities is very important. Recently, some work has been done to quantify the activity levels of active Alu transposable elements based on variation in the sequence. Here, given this activity data, an analysis of TE activity based on the position of mutations is conducted. Two different methods/simulations are created to computationally predict so-called harmful mutation regions in the consensus sequence of a TE; that is, mutations that occur in these regions decrease the transposition activities dramatically. The methods are applied to AluY, the youngest and most active Alu subfamily, to identify the harmful regions lying in its consensus, and verifications are presented using the activity of AluY elements and the secondary structure of the AluYa5 RNA, providing evidence that the method is successfully identifying harmful mutation regions. A supplementary simulation also shows that the identified harmful regions covering the AluYa5 RNA functional regions are not occurring by chance. Therefore, mutations within the harmful regions alter the mobile activity levels of active AluY elements. One of the methods is then applied to two additional TE families, the Alu family and the L1 family, to detect the harmful regions in these elements computationally.
Understanding and predicting the evolution of these TEs is of interest in understanding their powerful evolutionary force in shaping their host genomes. In this thesis, a formal model of TE fragments and their interruptions is devised that provides definitions that are compatible with biological nomenclature, while still providing a suitable formal foundation for computational analysis. Essentially, this model is used for fixing terminology that was misleading in the literature, and it helps to describe further TE problems in a precise way. Indeed, later chapters include two other models built on top of this model: the sequential interruption model and the recursive interruption model, both used to analyze TE activity throughout evolution. The sequential interruption model is defined between TEs that occur in a genomic sequence to estimate how often TEs interrupt other TEs, which has been shown to be useful in predicting their ages and their activity throughout evolution. Here, this prediction from the sequential interruptions is shown to be closely related to a classic matrix optimization problem: the Linear Ordering Problem (LOP). By applying a well-studied method of solving the LOP, Tabu search, to the sequential interruption model, a relative age order of all TEs in the human genome is predicted from a single genome. A comparison of the TE ordering between Tabu search and the method used in [47] shows that Tabu search solves the TE problem far more efficiently, while still achieving a more accurate result. As a result of the improved efficiency, a prediction on all human TEs is constructed, whereas it was previously only predicted for a minority of the human TEs. When many insertions occurred throughout the evolution of a genomic sequence, the interruptions nest in a recursive pattern. The nested TEs are very helpful in revealing the age of the TEs, but cannot be fully represented by the sequential interruption model.
In the recursive interruption model, a specific context-free grammar is defined, describing a general and simple way to capture the recursive nature in which TEs nest themselves into other TEs. Then, each production of the context-free grammar is associated with a probability to convert the context-free grammar into a stochastic context-free grammar that maximizes the applications of the productions corresponding to TE interruptions. A modified version of an algorithm to parse context-free grammars, the CYK algorithm, that takes into account these probabilities is then used to find the most likely parse tree(s) predicting the TE nesting in an efficient fashion. The recursive interruption model produces small parse trees representing local TE interruptions in a genome. These parse trees are a natural way of grouping TE fragments in a genomic sequence together to form interruptions. Next, some tree adjustment operations are given to simplify these parse trees and obtain more standard evolutionary trees. Then an overall TE-interaction network is created by merging these standard evolutionary trees into a weighted directed graph. This TE-interaction network is a rich representation of the predicted interactions between all TEs throughout evolution and is a powerful tool to predict the insertion evolution of these TEs. It is applied to the human genome, but can be easily applied to other genomes. Furthermore, it can also be applied to multiple related genomes where common TEs exist in order to study the interactions between TEs and the genomes. Lastly, a simulation of TE transpositions throughout evolution is developed. This is especially helpful in understanding the dynamics of how TEs evolve and impact their host genomes. Also, it is used as a verification technique for the previous theoretical models in the thesis.
By feeding the simulated TE remnants and activity data into the theoretical models, a relative age order is predicted using the sequential interruption model, and a quantified correlation between this predicted order and the input age order in the simulation can be calculated. Then, a TE-interaction network is constructed using the recursive interruption model on the simulated data, which can also be converted into a linear age order by feeding the adjacency matrix of the network to Tabu search. Another correlation is calculated between the predicted age order from the recursive interruption model and the input age order. An average correlation of ten simulations is calculated for each model, which suggests that in general, the recursive interruption model performs better than the sequential interruption model in predicting a correct relative age order of TEs. Indeed, the recursive interruption model achieves an average correlation value of ρ = 0.939 with the correct simulated answer
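
The link between interruption counts and the Linear Ordering Problem described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy matrix `counts`, the convention that `counts[i][j]` is the number of times TE `i` interrupts TE `j` (so `i` is presumed younger than `j`), the function names, and the use of a Spearman rank correlation to compare orderings. Exhaustive search stands in for Tabu search here and is only feasible for a handful of TEs.

```python
from itertools import permutations

# Hypothetical interruption counts: counts[i][j] = times TE i interrupts TE j.
counts = [
    [0, 5, 7],
    [1, 0, 4],
    [0, 2, 0],
]

def lop_score(matrix, order):
    """LOP objective: sum of matrix[a][b] over all pairs with a placed before b."""
    total = 0
    for pos_a in range(len(order)):
        for pos_b in range(pos_a + 1, len(order)):
            total += matrix[order[pos_a]][order[pos_b]]
    return total

def best_order_bruteforce(matrix):
    """Try every permutation and keep the one with the highest LOP score.

    This is NOT Tabu search; it is an exhaustive stand-in for tiny inputs.
    """
    return max(permutations(range(len(matrix))),
               key=lambda order: lop_score(matrix, order))

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    n = len(order_a)
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    d2 = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    predicted = best_order_bruteforce(counts)   # youngest-to-oldest under the assumed convention
    print(predicted, lop_score(counts, predicted))
    print(spearman_rho(list(predicted), [0, 1, 2]))
```

With a real genome the number of TEs makes exhaustive search impossible, which is why a heuristic such as Tabu search is needed; the objective being maximised would stay the same under these assumptions.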

    Genome architecture in the fungal kingdom

    Previous studies have suggested that the location of genes in genomes is not random; instead they may be organized in a way that is beneficial to cellular processes and the organism. While a few studies have investigated the organization of genes on a whole genome scale, they were limited in the functions of genes used in the search and in the number and type of genomes searched. With the recent explosion of available fungal genomes and tools to automatically annotate many genes in a short period of time, it is now possible to obtain a global view of the level of clustering in the genomes of an entire kingdom. To find gene clusters in many genomes, we have constructed a robust and flexible algorithm that runs in trivial time. In parallel, we have annotated 72 fungal genomes using four automated annotation tools that provide information about protein function, protein targeting, involvement in biochemical pathways and paralogous gene families. We used the clustering algorithm to search for clusters from the four annotation categories. We discovered that all the genomes contained clusters of related genes, and that in several cases the clusters included genes involved in processes that were specific to the species in which they are found. This has dramatically expanded our knowledge of both the types of clusters and the number of genomes known to contain clusters. This study has generated information that will assist researchers in addressing many questions central to molecular and cell biology as well as evolutionary studies

    Genome bioinformatics of tomato and potato

    In the past two decades genome sequencing has developed from a laborious and costly technology employed by large international consortia to a widely used, automated and affordable tool used worldwide by many individual research groups. Genome sequences of many food animals and crop plants have been deciphered and are being exploited for fundamental research and applied to improve their breeding programs. The developments in sequencing technologies have also impacted the associated bioinformatics strategies and tools, both those that are required for data processing, management, and quality control, and those used for interpretation of the data. This thesis focuses on the application of genome sequencing, assembly and annotation to two members of the Solanaceae family, tomato and potato. Potato is the economically most important species within the Solanaceae, and its tubers contribute to dietary intake of starch, protein, antioxidants, and vitamins. Tomato fruits are the second most consumed vegetable after potato, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. The chapters in this thesis document the generation, exploitation and interpretation of genomic sequence resources for these two species and shed light on the contents, structure and evolution of their genomes. Chapter 1 introduces the concepts of genome sequencing, assembly and annotation, and explains the novel genome sequencing technologies that have been developed in the past decade. These so-called Next Generation Sequencing platforms display considerable variation in chemistry and workflow, and as a consequence the throughput and data quality differ by orders of magnitude between the platforms. The currently available sequencing platforms produce a vast variety of read lengths and facilitate the generation of paired sequences with an approximately fixed distance between them.
The choice of sequencing chemistry and platform combined with the type of sequencing template demands specifically adapted bioinformatics for data processing and interpretation. Irrespective of the sequencing and assembly strategy that is chosen, the resulting genome sequence, often represented by a collection of long linear strings of nucleotides, is of limited interest by itself. Interpretation of the genome can only be achieved through sequence annotation, that is, the identification and classification of all functional elements in a genome sequence. Once these elements have been annotated, sequence alignments between multiple genomes of related accessions or species can be utilized to reveal the genetic variation on both the nucleotide and the structural level that underlies the difference between these species or accessions. Chapter 2 describes BlastIf, a novel software tool that exploits sequence similarity searches with BLAST to provide a straightforward annotation of long nucleotide sequences. Generally, two problems are associated with the alignment of a long nucleotide sequence to a database of short gene or protein sequences: (i) the large number of similar hits that can be generated due to database redundancy; and (ii) the relationships implied between aligned segments within a hit that in fact correspond to distinct elements on the sequence such as genes. BlastIf generates a comprehensible BLAST output for long nucleotide sequences by reducing the number of similar hits while revealing most of the variation present between hits. It is a valuable tool for molecular biologists who wish to get a quick overview of the genetic elements present in a newly sequenced segment of DNA, prior to more elaborate efforts of gene structure prediction and annotation. In Chapter 3 a first genome-wide comparison between the emerging genomic sequence resources of tomato and potato is presented.
Large collections of BAC end sequences from both species were annotated through repeat searches, transcript alignments and protein domain identification. In-depth comparisons of the annotated sequences revealed remarkable differences in both gene and repeat content between these closely related genomes. The tomato genome was found to be more repetitive than the potato genome, and substantial differences in the distribution of Gypsy and Copia retrotransposable elements as well as microsatellites were observed between the two genomes. A higher gene content was identified in the potato sequences, and in particular several large gene families including cytochrome P450 mono-oxygenases and serine-threonine protein kinases were significantly overrepresented in potato compared to tomato. Moreover, the cytochrome P450 gene family was found to be expanded in both tomato and potato when compared to Arabidopsis thaliana, suggesting an expanded network of secondary metabolic pathways in the Solanaceae. Together these findings present a first glimpse into the evolution of Solanaceous genomes, both within the family and relative to other plant species. Chapter 4 explores the physical and genetic organization of tomato chromosome 6 through integration of BAC sequence analysis, High Information Content Fingerprinting, genetic analysis, and BAC-FISH mapping data. A collection of BACs spanning substantial parts of the short and long arm euchromatin and several dispersed regions of the pericentromeric heterochromatin were sequenced and assembled into several tiling paths spanning approximately 11 Mb. Overall, the cytogenetic order of BACs was in agreement with the order of BACs anchored to the Tomato EXPEN 2000 genetic map, although a few striking discrepancies were observed. The integration of BAC-FISH, sequence and genetic mapping data furthermore provided a clear picture of the borders between eu- and heterochromatin on chromosome 6.
Annotation of the BAC sequences revealed that, although the majority of protein-coding genes were located in the euchromatin, the highly repetitive pericentromeric heterochromatin displayed an unexpectedly high gene content. Moreover, the short arm euchromatin was relatively rich in repeats, but the ratio of Gypsy and Copia retrotransposons across the different domains of the chromosome clearly distinguished euchromatin from heterochromatin. The ongoing whole-genome sequencing effort will reveal if these properties are unique to tomato chromosome 6, or a more general property of the tomato genome. Chapter 5 presents the potato genome, the first genome sequence of an Asterid. To overcome the problems associated with genome assembly due to the high level of heterozygosity that is observed in commercial tetraploid potato varieties, a homozygous doubled-monoploid potato clone was exploited to sequence and assemble 86% of the 844 Mb genome. This potato reference genome sequence was complemented with re-sequencing of a heterozygous diploid clone, revealing the form and extent of sequence polymorphism both between different genotypes and within a single heterozygous genotype. Gene presence/absence variants and other potentially deleterious mutations were found to occur frequently in potato and are a likely cause of inbreeding depression. Annotation of the genome was supported by deep transcriptome sequencing of both the doubled-monoploid and the heterozygous potato, resulting in the prediction of more than 39,000 protein coding genes. Transcriptome analysis provided evidence for the contribution of gene family expansion, tissue specific expression, and recruitment of genes to new pathways to the evolution of tuber development. The sequence of the potato genome has provided new insights into Eudicot genome evolution and has provided a solid basis for the elucidation of the evolution of tuberisation.
Many traits of interest to plant breeders are quantitative in nature and the potato sequence will simplify both their characterization and deployment to generate novel cultivars. The outstanding challenges in plant genome sequencing are addressed in Chapter 6. The high concentration of repetitive elements and the heterozygosity and polyploidy of many interesting crop plant species currently pose a barrier for the efficient reconstruction of their genome sequences. Nonetheless, the completion of a large number of new genome sequences in recent years and the ongoing advances in sequencing technology provide many exciting opportunities for plant breeding and genome research. Current sequencing platforms are being continuously updated and improved, and novel technologies are being developed and implemented in third-generation sequencing platforms that sequence individual molecules without need for amplification. While these technologies create exciting opportunities for new sequencing applications, they also require robust software tools to process the data produced through them efficiently. The ever increasing amount of available genome sequences creates the need for an intuitive platform for the automated and reproducible interrogation of these data in order to formulate new biologically relevant questions on datasets spanning hundreds or thousands of genome sequences.