49 research outputs found

    Applications in computer-assisted biology

    Biology is becoming a data-rich science driven by the development of high-throughput technologies like next-generation DNA sequencing. This is fundamentally changing biological research. The genome sequences of many species are becoming available, as well as the genetic variation within a species, and the activity of the genes in a genome under various conditions. With the opportunities that these new technologies offer, comes the challenge to effectively deal with the large volumes of data that they produce. Bioinformaticians have an important role to play in organising and analysing this data to extract biological information and gain knowledge. Also for experimental biologists computers have become essential tools. This has created a strong need for software applications aimed at biological research. The chapters in this thesis detail my contributions to this area. Together with molecular biologists, plant breeders, immunologists, and microbiologists, I have developed several software tools and performed computational analyses to study biological questions. Chapter 2 is about Primer3Plus, a web tool that helps biologists to design DNA primers for their experiments. These primers are typically short stretches of DNA (~20 nucleotides) that direct the DNA replication machinery to copy a selected region of a DNA molecule. The specificity of a primer is determined by several chemical and physical properties and therefore designing good primers is best done with the help of a computer program. Primer3Plus offers a user-friendly task-oriented web interface to the popular primer3 primer design program. Primer3Plus clearly fulfils a need in the biological research community as already over 400 scientific articles have cited the Primer3Plus publication. Single nucleotide differences or polymorphisms (SNPs) that are present within a species can be used as markers to link phenotypic observations to locations on the genome. 
Chapter 3 discusses QualitySNPng, which is a stand-alone software tool for finding SNPs in high-throughput sequencing data. QualitySNPng was inspired by the QualitySNP pipeline for SNP detection that was published in 2006 and it uses similar filtering criteria to distinguish SNPs from technical artefacts like sequence read errors. In addition, the SNPs are used to predict haplotypes. QualitySNPng has a graphical user interface that allows the user to run the SNP detection and evaluate the results. It has already been successfully used in several projects on marker detection for plant breeding. Single nucleotide polymorphisms can lead to single amino acid changes in protein sequences. These single amino acid polymorphisms (SAPs) play a key role in graft-versus-host (GVH) effects that often accompany tissue transplantations. A beneficial variant of GVH is the graft-versus-leukaemia (GVL) effect that is sometimes witnessed after bone marrow transplantation in leukaemia patients. When the GVL effect occurs, the donor’s immune cells actively destroy residual tumour cells in the patient. The GVL effect can already be elicited by a single amino acid difference between the patient and the donor. Currently, a small number of SAPs that can elicit a GVL effect are known and these are used to select the right bone marrow donor for a leukaemia patient. Together with researchers at the Leiden University Medical Center I developed a database to aid in the discovery of more such SAPs. We called this database the “Human Short Peptide Variation database” or HSPVdb. It is described in chapter 4. The work described in chapter 5 is focused on the regions in bacterial genomes that are involved in gene regulation, the promoters. Intrigued by anecdotal evidence that duplication of bacterial promoters can activate or silence genes, we investigated how often promoter duplication occurs in bacterial genomes. 
Using the large number of bacterial genomes that are currently available, we looked for clusters of highly similar promoter regions. Since duplication assumes some sort of mobility, we termed the duplicated promoters putative mobile promoters (PMPs). We found over 4,000 clusters of PMPs in 1,043 genomes. Most of the clusters consist of two members, indicating a single duplication event, but we also found much larger clusters of PMPs within some genomes. A number of PMPs are present in multiple species, even in very distantly related bacterial species, suggesting perhaps that these were subjected to horizontal gene transfer. The mobile promoters could play an important role in the rapid rewiring of gene regulatory networks. Chapter 6 discusses how current biological research can adapt to make full use of the opportunities offered by the high-throughput technologies by following three different approaches. The first approach empowers biologists with user-friendly software that allows them to analyse the large volumes of genome-scale data without requiring expert computer skills. In the second approach the biologist teams up with a bioinformatician to combine in-depth biological knowledge with expert computational skills. The third approach combines the biologist and the bioinformatician in one person by teaching the biologist computational skills. Each of these three approaches has its merits and shortcomings, so I do not expect any of them to become dominant in the near future. Looking further ahead, it seems inevitable that any biologist will have to learn at least the basics of computational methods and that this should be an integral part of biology education. Bioinformatics might in time cease to exist as a separate field and instead become an intrinsic aspect of most biological research disciplines.
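The abstract above notes that primer specificity depends on several chemical and physical properties. As a rough illustration only, the sketch below computes two commonly quoted properties of a primer: its GC content and a rule-of-thumb melting temperature (the Wallace rule, 2 °C per A/T and 4 °C per G/C). This is not how Primer3Plus or primer3 evaluate primers; the function names and the example sequence are hypothetical.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in degrees Celsius.

    Wallace rule: 2 * (A + T) + 4 * (G + C). It is only a crude
    estimate for short oligonucleotides, used here for illustration.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-nucleotide primer, not taken from the thesis.
example_primer = "ATGCGTACGTTAGCCTAGGA"
example_gc = gc_content(example_primer)
example_tm = wallace_tm(example_primer)
```

Dedicated primer design programs use considerably more detailed thermodynamic models; the sketch only shows why such checks are easier to automate than to do by hand.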

    A gene co-expression network predicts functional genes controlling the re-establishment of desiccation tolerance in germinated Arabidopsis thaliana seeds.

    MAIN CONCLUSION: During re-establishment of desiccation tolerance (DT), early events promote initial protection and growth arrest, while late events promote stress adaptation and contribute to survival in the dry state. Mature seeds of Arabidopsis thaliana are desiccation tolerant, but they lose desiccation tolerance (DT) while progressing to germination. Yet, there is a small developmental window during which DT can be rescued by treatment with abscisic acid (ABA). To gain temporal resolution and identify relevant genes in this process, data from a time series of microarrays were used to build a gene co-expression network. The network has two regions, namely early response (ER) and late response (LR). Genes in the ER region are related to biological processes, such as dormancy, acquisition of DT and drought, amplification of signals, growth arrest and induction of protection mechanisms (such as LEA proteins). Genes in the LR region lead to inhibition of photosynthesis and primary metabolism, promote adaptation to stress conditions and contribute to seed longevity. Phenotyping of 12 hubs in relation to re-establishment of DT with T-DNA insertion lines indicated a significant increase in the ability to re-establish DT compared with the wild-type in the lines cbsx4, at3g53040 and at4g25580, suggesting the operation of redundant and compensatory mechanisms. Moreover, we show that re-establishment of DT by polyethylene glycol and ABA occurs through partially overlapping mechanisms. Our data confirm that co-expression network analysis is a valid approach to examine data from time series of transcriptome analysis, as it provides promising insights into biologically relevant relations that help to generate new information about the roles of certain genes for DT
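The abstract does not state how the co-expression network was constructed. A minimal sketch of one common approach, assuming that genes are linked when the absolute Pearson correlation of their expression profiles reaches a threshold and that hubs are the genes with the most links, could look like this. The gene names, expression values and the 0.9 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_degrees(profiles, threshold=0.9):
    """Link genes whose |r| >= threshold and return each gene's degree."""
    degree = {gene: 0 for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if abs(pearson(profiles[g1], profiles[g2])) >= threshold:
            degree[g1] += 1
            degree[g2] += 1
    return degree


# Hypothetical expression values over four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 1.0, 3.0],
}
degrees = coexpression_degrees(profiles)
# Genes with the highest degree are hub candidates.
top_degree = max(degrees.values())
hubs = sorted(g for g, d in degrees.items() if d == top_degree)
```

In this toy data set geneA, geneB and geneC are mutually (anti-)correlated and therefore share the highest degree, while geneD stays unconnected.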

    Multi-netclust: an efficient tool for finding connected clusters in multi-parametric networks

    Summary: Multi-netclust is a simple tool that allows users to extract connected clusters of data represented by different networks given in the form of matrices. The tool uses user-defined threshold values to combine the matrices, and uses a straightforward, memory-efficient graph algorithm to find clusters that are connected in all or in either of the networks. The tool is written in C/C++ and is available either as a form-based or as a command-line-based program running on Linux platforms. The algorithm is fast: processing a network of > 10^6 nodes and 10^8 edges takes only a few minutes on an ordinary computer.
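The idea of combining thresholded networks and extracting connected clusters can be sketched with a union-find structure. This is an illustrative Python sketch, not the tool's C/C++ implementation: the function names, the edge-dictionary input format (instead of matrices) and the convention that a missing edge has weight 0.0 are all assumptions.

```python
def find(parent, node):
    """Return the cluster root of node, compressing the path on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_clusters(n_nodes, networks, thresholds, mode="all"):
    """Cluster nodes linked by edges that pass the thresholds.

    networks: list of dicts mapping an edge (i, j) to a weight.
    mode="all" keeps an edge only if it passes in every network;
    mode="any" keeps it if it passes in at least one network.
    """
    combine = all if mode == "all" else any
    parent = list(range(n_nodes))
    # Every edge that occurs in at least one network is a candidate.
    for edge in set().union(*networks):
        passes = [net.get(edge, 0.0) >= t for net, t in zip(networks, thresholds)]
        if combine(passes):
            root_i, root_j = find(parent, edge[0]), find(parent, edge[1])
            parent[root_i] = root_j
    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(parent, node), []).append(node)
    return sorted(clusters.values())


# Two hypothetical networks over five nodes.
net_a = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.2}
net_b = {(0, 1): 0.7, (1, 2): 0.1, (3, 4): 0.9}
clusters_all = connected_clusters(5, [net_a, net_b], [0.5, 0.5], mode="all")
clusters_any = connected_clusters(5, [net_a, net_b], [0.5, 0.5], mode="any")
```

Union-find needs only one integer per node besides the edge list, which is one plausible way to keep memory use low on large networks.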

    Promoter propagation in prokaryotes

    Transcriptional activation or 'rewiring' of silent genes is an important, yet poorly understood, phenomenon in prokaryotic genomes. Anecdotal evidence from experimental evolution studies in bacterial systems has shown how promptly adaptation occurs under appropriate selective pressure. In many cases, a partial or complete promoter is mobilized to silent genes from elsewhere in the genome. We hereafter term such recruited regulatory sequences Putative Mobile Promoters (PMPs) and we hypothesize that they have a large impact on rapid adaptation of novel or cryptic functions. Querying all publicly available prokaryotic genomes (1362) uncovered >4000 families of highly conserved PMPs (50 to 100 nt long with ≥80% nt identity) in 1043 genomes from 424 different genera. The genomes with the largest number of PMP families are Anabaena variabilis (28 families), Geobacter uraniireducens (27 families) and Cyanothece PCC7424 (25 families). Family size varied from 2 to 93 homologous promoters (in Desulfurivibrio alkaliphilus). Some PMPs are present only in particular species, but some are conserved across distant genera. The identified PMPs represent a conservative dataset of very recent or conserved events of mobilization of non-coding DNA and thus they constitute evidence of an extensive reservoir of recyclable regulatory sequences for rapid transcriptional rewiring.
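The abstract does not describe how promoter families were built. As a simplified sketch of grouping sequences by an identity cut-off, the code below compares sequences position by position without gaps (a real analysis would use a proper alignment or similarity search) and groups them by single linkage. The function names and the toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped identity over the shorter of two sequences, in percent."""
    length = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / length


def group_families(seqs, min_identity=80.0):
    """Single-linkage grouping of sequences; singletons are dropped."""
    families = []
    for name, seq in seqs.items():
        # Families containing at least one sufficiently similar member.
        linked = [
            fam for fam in families
            if any(percent_identity(seq, seqs[m]) >= min_identity for m in fam)
        ]
        merged = [name]
        for fam in linked:
            merged.extend(fam)
            families.remove(fam)
        families.append(sorted(merged))
    return sorted(fam for fam in families if len(fam) > 1)


# Hypothetical 10-nt toy sequences; real PMPs are far longer.
toy_seqs = {
    "p1": "ACGTACGTAC",
    "p2": "ACGTACGTAA",
    "p3": "TTTTTTTTTT",
    "p4": "ACGTACGAAA",
}
toy_families = group_families(toy_seqs)
```

With these toy sequences p1, p2 and p4 end up in one family, while p3 has no partner and is discarded.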

    QualitySNPng: a user-friendly SNP detection and visualization tool

    QualitySNPng is a new software tool for the detection and interactive visualization of single-nucleotide polymorphisms (SNPs). It uses a haplotype-based strategy to identify reliable SNPs; it is optimized for the analysis of current RNA-seq data; but it can also be used on genomic DNA sequences derived from next-generation sequencing experiments. QualitySNPng does not require a sequenced reference genome and delivers reliable SNPs for di- as well as polyploid species. The tool features a user-friendly interface, multiple filtering options to handle typical sequencing errors, support for SAM and ACE files and interactive visualization. QualitySNPng produces high-quality SNP information that can be used directly in genotyping by sequencing approaches for application in QTL and genome-wide association mapping as well as to populate SNP arrays. The software can be used as a stand-alone application with a graphical user interface or as part of a pipeline system like Galaxy. Versions for Windows, Mac OS X and Linux, as well as the source code, are available fro

    A multi-parent recombinant inbred line population of C. elegans allows identification of novel QTLs for complex life history traits

    Background: The nematode Caenorhabditis elegans has been extensively used to explore the relationships between complex traits, genotypes, and environments. Complex traits can vary across different genotypes of a species, and the genetic regulators of trait variation can be mapped on the genome using quantitative trait locus (QTL) analysis of recombinant inbred lines (RILs) derived from genetically and phenotypically divergent parents. Most RILs have been derived from crossing two parents from globally distant locations. However, the genetic diversity among local C. elegans populations can be as high as that among global populations and could thus provide a means of identifying genetic variation associated with complex traits relevant on a broader scale. Results: To investigate the effect of local genetic variation on heritable traits, we developed a new RIL population derived from 4 parental wild isolates collected from 2 closely located sites in France: Orsay and Santeuil. We crossed these 4 genetically diverse parental isolates to generate a population of 200 multi-parental RILs and used RNA-seq to obtain sequence polymorphisms, identifying almost 9000 SNPs variable between the 4 genotypes with an average spacing of 11 kb, doubling the mapping resolution relative to currently available RIL panels for many loci. The SNPs were used to construct a genetic map to facilitate QTL analysis. We measured life history traits such as lifespan, stress resistance, developmental speed, and population growth in different environments, and found substantial variation for most traits. We detected multiple QTLs for most traits, among them novel QTLs not found in previous QTL analyses, such as those for lifespan and pathogen responses. This shows that recombining genetic variation across C. elegans populations that are in close geographical proximity provides ample variation for QTL mapping. 
Conclusion: Taken together, we show that constructing a RIL population from more than the classical two parental genotypes facilitates the detection of QTLs, as does the use of wild isolates. The use of multi-parent RIL populations can further enhance our understanding of local adaptation and life history trade-offs.

    ProRepeat: an integrated repository for studying amino acid tandem repeats in proteins

    ProRepeat (http://prorepeat.bioinformatics.nl/) is an integrated curated repository and analysis platform for in-depth research on the biological characteristics of amino acid tandem repeats. ProRepeat collects repeats from all proteins included in the UniProt knowledgebase, together with 85 completely sequenced eukaryotic proteomes contained within the RefSeq collection. It contains non-redundant perfect tandem repeats, approximate tandem repeats and simple, low-complexity sequences, covering the majority of the amino acid tandem repeat patterns found in proteins. The ProRepeat web interface allows querying the repeat database using repeat characteristics like repeat unit and length, number of repetitions of the repeat unit and position of the repeat in the protein. Users can also search for repeats by the characteristics of repeat-containing proteins, such as entry ID, protein description, sequence length, gene name and taxon. ProRepeat offers powerful analysis tools for finding biologically interesting properties of repeats, such as the strong position bias of leucine repeats in the N-terminus of eukaryotic protein sequences, the differences in repeat abundance among proteomes, the functional classification of repeat-containing proteins and the GC content constraints of the repeats' corresponding codons.
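To make the notion of a perfect tandem repeat concrete, the sketch below finds them in a protein sequence with a back-referencing regular expression and reports the repeat characteristics mentioned above (position, repeat unit, number of repetitions). It is an illustration only and not ProRepeat's detection method; the function name, the default limits and the example sequence are hypothetical.

```python
import re


def perfect_tandem_repeats(protein, min_copies=3, max_unit=10):
    """Return (start, unit, copies) for each perfect tandem repeat.

    The lazy group tries the shortest unit first, and the back-reference
    requires at least min_copies consecutive copies of that unit.
    """
    pattern = re.compile(r"(.{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(protein)
    ]


# Hypothetical sequence with a leucine run and a QP dipeptide repeat.
example_protein = "MK" + "L" * 6 + "A" + "QP" * 4 + "GS"
example_repeats = perfect_tandem_repeats(example_protein)
```

Approximate repeats and low-complexity regions, which the repository also covers, need different algorithms and are outside the scope of this sketch.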

    Beyond genomic variation - comparison and functional annotation in three Brassica rapa genotypes: a turnip, a rapid cycling and a Chinese cabbage

    Background - Brassica rapa is an economically important crop species. During its long breeding history, a large number of morphotypes have been generated, including leafy vegetables such as Chinese cabbage and pakchoi, turnip tuber crops and oil crops. Results - To investigate the genetic variation underlying this morphological variation, we re-sequenced, assembled and annotated the genomes of two B. rapa subspecies, a turnip and a rapid cycling type. We then analysed the two resulting genomes together with the Chinese cabbage Chiifu reference genome to obtain an impression of the B. rapa pan-genome. The number of genes with protein-coding changes between the three genotypes was lower than that among different accessions of Arabidopsis thaliana, which can be explained by the smaller effective population size of B. rapa due to its domestication. Based on orthology to a number of non-Brassica species, we estimated the date of divergence among the three B. rapa morphotypes at approximately 250,000 YA, far predating Brassica domestication (5,000-10,000 YA). Conclusions - By analysing genes unique to turnip we found evidence for copy number differences in peroxidases, pointing to a role for the phenylpropanoid biosynthesis pathway in the generation of morphological variation. The estimated date of divergence among the three B. rapa morphotypes implies that prior to domestication there was already considerable divergence among B. rapa genotypes. Our study thus provides two new B. rapa reference genomes, delivers a set of computer tools to analyse the resulting pan-genome and uses these to shed light on genetic drivers behind the rich morphological variation found in B. rapa.

    4Pipe4 – A 454 data analysis pipeline for SNP detection in datasets with no reference sequence or strain information

    This work was fully supported by projects SOBREIRO/0036/2009 (under the framework of the Cork Oak ESTs Consortium), PTDC/BIA-BEC/098783/2008 and PTDC/AGR-GPL/119943/2010 from Fundação para a Ciência e Tecnologia (FCT) – Portugal. F. Pina-Martins was funded by FCT grant SFRH/BD/51411/2011, under the PhD program “Biology and Ecology of Global Changes”, Univ. Aveiro & Univ. Lisbon, Portugal. D. Batista was funded by FCT grant SFRH/BPD/104629/2014

    HSPVdb—the Human Short Peptide Variation Database for improved mass spectrometry-based detection of polymorphic HLA-ligands

    T cell epitopes derived from polymorphic proteins or from proteins encoded by alternative reading frames (ARFs) play an important role in (tumor) immunology. Identification of these peptides is successfully performed with mass spectrometry. In a mass spectrometry-based approach, the recorded tandem mass spectra are matched against hypothetical spectra generated from known protein sequence databases. Commonly used protein databases contain a minimal level of redundancy, and thus, are not suitable data sources for searching polymorphic T cell epitopes, either in normal or ARFs. At the same time, however, these databases contain much non-polymorphic sequence information, thereby complicating the matching of recorded and theoretical spectra, and increasing the potential for finding false positives. Therefore, we created a database with peptides from ARFs and peptide variation arising from single nucleotide polymorphisms (SNPs). It is based on the human mRNA sequences from the well-annotated reference sequence (RefSeq) database and associated variation information derived from the Single Nucleotide Polymorphism Database (dbSNP). In this process, we removed all non-polymorphic information. Investigation of the frequency of SNPs in the dbSNP revealed that many SNPs are non-polymorphic “SNPs”. Therefore, we removed those from our dedicated database, and this resulted in a comprehensive high quality database, which we coined the Human Short Peptide Variation Database (HSPVdb). The value of our HSPVdb is shown by identification of the majority of published polymorphic SNP- and/or ARF-derived epitopes from a mass spectrometry-based proteomics workflow, and by a large variety of polymorphic peptides identified as potential T cell epitopes in the HLA-ligandome presented by the Epstein–Barr virus cells
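How a single nucleotide polymorphism gives rise to a variant peptide can be sketched as follows: apply the SNP to a coding sequence, translate both versions with the standard genetic code, and cut out a short window around the changed residue. This is a simplified illustration under stated assumptions, not the HSPVdb construction pipeline: it ignores alternative reading frames, and the function names, the window size and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def variant_peptide(cds, position, alt_base, flank=4):
    """Return (reference, variant) peptides around a SNP.

    position is the 0-based index of the SNP in the coding sequence.
    Returns None when the SNP is synonymous or when the affected residue
    lies beyond the end of either translation (e.g. a new stop codon).
    """
    mutated = cds[:position] + alt_base + cds[position + 1:]
    ref, var = translate(cds), translate(mutated)
    residue = position // 3
    if residue >= min(len(ref), len(var)) or ref[residue] == var[residue]:
        return None
    start, end = max(0, residue - flank), residue + flank + 1
    return ref[start:end], var[start:end]


# Hypothetical coding sequence encoding the peptide MAEKLVHP.
example_cds = "ATGGCTGAAAAACTGGTTCATCCG"
# T -> C at index 13 changes codon CTG (Leu) to CCG (Pro).
example_pair = variant_peptide(example_cds, 13, "C", flank=2)
```

Only SNPs that change an amino acid yield a peptide pair; synonymous changes return `None`, which mirrors the abstract's point that non-polymorphic sequence information is left out of the database.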