65 research outputs found

    Presence / Absence Marker Discovery in RAD Markers for Multiplexed Samples in the Context of Next-Generation Sequencing

    Recent improvements in sequencing technologies have given rise to various interesting problems. The availability of millions of read sequences as the final product of genome sequencing, at a lower cost than in the microarray era, has encouraged scientists to enhance previous methods in various areas of bioinformatics. Genotyping and generating genetic maps to study inherited genotypes, in order to analyze specific traits in a population, is one field of bioinformatics that involves generating different genetic markers and identifying polymorphisms among the individuals of a population. Presence/absence markers are the main focus of this thesis. This is a type of Restriction site Associated DNA (RAD) marker that is present in some samples and absent in others, and it signals variation in the cut site of a restriction enzyme. However, the counts of markers in an experiment are highly correlated, and calling true presence and absence is not straightforward: a marker with zero count is not necessarily absent in the sample under study, and a marker with non-zero count is not necessarily present. A good model that fits the data is able to make true calls. We propose two different contexts for designing such models as a solution to this problem and investigate their performance. In addition, exploiting the features of next-generation sequencing technology even more efficiently requires the ability to multiplex a high number of samples in a single experiment run. In that case, appropriate barcoding that is robust to various sources of noise in the machine becomes paramount. Designing such barcodes efficiently is a challenging task, which is addressed in detail as another problem of this thesis. We make two contributions. One, we propose an algorithm for barcoding multiplexed RADSeq samples. 
Two, we propose an algorithm for the statistical selection of presence/absence markers on the basis of RADSeq data on two related individuals. Operating characteristics of our methods are explored using both simulated and real data.
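The abstract does not describe the barcoding algorithm itself, so the following is only a minimal sketch of one common way to make sample barcodes robust to sequencing noise: choosing codes with a minimum pairwise Hamming distance and decoding each read prefix to the nearest code. The function names (`greedy_barcodes`, `assign`), the greedy strategy, and the chosen lengths and distances are illustrative assumptions, not the thesis's method.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length, min_dist, limit):
    """Greedily pick up to `limit` DNA barcodes of the given length whose
    pairwise Hamming distance is at least `min_dist` (hypothetical helper)."""
    chosen = []
    for cand in product("ACGT", repeat=length):
        code = "".join(cand)
        # Keep the candidate only if it is far enough from every chosen code.
        if all(hamming(code, c) >= min_dist for c in chosen):
            chosen.append(code)
            if len(chosen) == limit:
                break
    return chosen


def assign(read_prefix, barcodes, min_dist):
    """Decode a read prefix to its nearest barcode, or None if the nearest
    barcode is outside the error-correcting radius (min_dist - 1) // 2."""
    radius = (min_dist - 1) // 2
    best = min(barcodes, key=lambda b: hamming(read_prefix, b))
    return best if hamming(read_prefix, best) <= radius else None


if __name__ == "__main__":
    codes = greedy_barcodes(length=5, min_dist=3, limit=4)
    print(codes)
    print(assign("AACCG", codes, min_dist=3))  # one substitution away from a code
    print(assign("ACGTA", codes, min_dist=3))  # too far from every code
```

With a minimum distance of 3, any single substitution error still leaves the read closer to its true barcode than to any other. A practical design would also filter candidates for homopolymer runs and GC balance, which this sketch omits.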

    Characterizing freshwater macroinvertebrates of Bangladesh using metagenetic techniques

    The degradation of freshwater ecosystems has become a global concern; in particular, the critical conditions of rivers in Bangladesh demand a monitoring programme based on the assessment of bioindicator organisms. Macroinvertebrates are prominent bioindicators and are widely used for assessing the health of aquatic ecosystems. Recent technological advances have enabled routine assessment through the genomic characterization of macroinvertebrates using different metagenetic techniques, such as DNA barcoding for individual specimen identification, metabarcoding for multi-species identification of bulk samples, and mitochondrial metagenomics for extraction of mitogenomes from mixed samples. In this thesis, I begin by generating Cytochrome Oxidase subunit I (COI) barcodes for Bangladeshi freshwater macroinvertebrates belonging to the Ephemeroptera, Plecoptera, Trichoptera, Coleoptera, Hemiptera, Odonata, Diptera, Gastropoda and Bivalvia. These barcodes can be used as a DNA reference library for species identification in metabarcoding of macroinvertebrates. I also aim to explore complete mitogenomes from selected macroinvertebrates using a mitochondrial metagenomic pipeline. I carry out phylogenetic analysis with protein-coding genes that reveals the evolutionary relationships of Bangladeshi macroinvertebrate lineages and also supports deeper-level identification of barcodes by placing them into the phylogenetic tree (chapter 2). In chapter 3, I assess some methodological aspects of the metabarcoding pipeline required for diversity estimation from complex bulk samples of macroinvertebrates in large-scale biomonitoring programmes. These include preparation of bulk macroinvertebrate samples, optimization of the homogenization procedure required for DNA extraction, strategies for DNA pooling from these extracts, choice of robust universal primers, and viable OTU clustering for reliable diversity estimation. 
The results have implications for the optimization and standardization of these steps in metabarcoding of freshwater macroinvertebrates. In chapter 4, I apply the metabarcoding technique to establish the macroinvertebrate diversity and the impact of various types of anthropogenic disturbance on freshwater macroinvertebrates in highland and lowland rivers. The results document high diversity, local endemicity and pronounced responses to disturbance in largely unexplored but threatened habitats of Bangladesh. My investigations demonstrate the viability of metagenetic techniques for applied conservation management as a step towards building a biomonitoring system in freshwater ecosystems globally.

    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

    Magister Scientiae - MSc. INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence; the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies supported by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have aided oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a robust machine-learning-based breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. 
A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis and selection of the top 100 genes as informative features for classification purposes. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. 
The universal application of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests.
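To make the barcoding and stability-based feature selection described in this abstract concrete, here is a minimal sketch. It assumes per-gene z-scores computed across samples and a threshold of 0 for calling a gene "present"; the abstract gives neither detail, so the threshold, the function names (`zscore_barcode`, `stable_state`, `select_features`) and the toy data are hypothetical. Only the 85% stability cut-off comes from the abstract.

```python
from statistics import mean, pstdev


def zscore_barcode(expr, threshold=0.0):
    """Convert expression values to 0/1 barcodes per gene.

    expr: dict mapping gene -> list of expression values, one per sample.
    A sample gets 1 when its z-score for that gene exceeds `threshold`
    (the threshold value is an assumption for illustration).
    """
    barcodes = {}
    for gene, values in expr.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            # No variation across samples: nothing can be called "present".
            barcodes[gene] = [0] * len(values)
        else:
            barcodes[gene] = [1 if (v - mu) / sigma > threshold else 0 for v in values]
    return barcodes


def stable_state(bits, min_fraction=0.85):
    """Return 1 or 0 if at least `min_fraction` of the bits share that
    state, otherwise None (the gene is not stable in this group)."""
    frac_on = sum(bits) / len(bits)
    if frac_on >= min_fraction:
        return 1
    if 1 - frac_on >= min_fraction:
        return 0
    return None


def select_features(barcodes, labels, class_a, class_b, min_fraction=0.85):
    """Keep genes that are stable within each class but in opposite states."""
    selected = []
    for gene, bits in barcodes.items():
        in_a = [b for b, lab in zip(bits, labels) if lab == class_a]
        in_b = [b for b, lab in zip(bits, labels) if lab == class_b]
        state_a = stable_state(in_a, min_fraction)
        state_b = stable_state(in_b, min_fraction)
        if state_a is not None and state_b is not None and state_a != state_b:
            selected.append(gene)
    return selected


if __name__ == "__main__":
    # Toy data: three normal samples followed by three tumour samples.
    expr = {"G1": [1, 1, 1, 9, 9, 9], "G2": [5, 5, 5, 5, 5, 5], "G3": [1, 9, 1, 9, 1, 9]}
    labels = ["normal"] * 3 + ["tumour"] * 3
    codes = zscore_barcode(expr)
    print(codes)
    print(select_features(codes, labels, "normal", "tumour"))
```

The selected genes could then feed any standard classifier, such as the SVM mentioned in the abstract.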

    Emergent quality issues in the supply of Chinese medicinal plants: A mixed methods investigation of their contemporary occurrence and historical persistence

    Quality issues that emerged centuries ago in Chinese medicinal plants (CMP) were investigated to explore why they still persist in an era of advanced analytical testing and extensive legislation, so that a solution to improve CMP quality could be proposed. This is important for the 85% of the world’s population who rely on medicinal plants (MP) for primary healthcare, considering the adverse events, including fatalities, that arise from such quality issues. CMP are the most prevalent medicinal plants globally. This investigation used mixed methods, including 15 interviews with CMP expert key informants (KI), together with thematic analysis that identified the main CMP quality issues, why they persisted, and informed solutions. An unexplained case example, Eleutherococcus nodiflorus (EN), was analysed by collection of 106 samples of EN, its known toxic adulterant Periploca sepium (PS), and a related substitute, Eleutherococcus senticosus (ES), across mainland China, Taiwan and the UK. Authenticity of the samples was determined using high-performance thin-layer chromatography. Misidentification, adulteration, substitution and toxicity were the main CMP quality issues identified. Adulteration was found to be widespread globally, with 57.4% of EN found authentic and 24.6% adulterated with cardiotoxic PS, mostly at markets and traditional pharmacies. The EN study further highlighted that CMP quality issues persisted because of the laboratory-bound nature of the analytical methods and testing currently used, which leave gaps in detection throughout much of the supply chain. CMP quality could be more effectively tested with patented analytical technology (PAT) and simpler field-based testing, including indicator strip tests. Education highlighting the long-term economic value and communal benefit of delivering better quality CMP to consumers was recommended in favour of the financial motivation for actions that lead to the persistence of well-known and recurrent CMP quality issues.

    ACARORUM CATALOGUS IX. Acariformes, Acaridida, Schizoglyphoidea (Schizoglyphidae), Histiostomatoidea (Histiostomatidae, Guanolichidae), Canestrinioidea (Canestriniidae, Chetochelacaridae, Lophonotacaridae, Heterocoptidae), Hemisarcoptoidea (Chaetodactylidae, Hyadesiidae, Algophagidae, Hemisarcoptidae, Carpoglyphidae, Winterschmidtiidae)

    The 9th volume of the series Acarorum Catalogus contains lists of mites of 13 families, 225 genera and 1268 species of the superfamilies Schizoglyphoidea, Histiostomatoidea, Canestrinioidea and Hemisarcoptoidea. Most of these mites live on insects or other animals (as parasites, phoretics or commensals); some inhabit rotten plant material, dung or fungi. Mites of the families Chetochelacaridae and Lophonotacaridae are specialised to live with myriapods (Diplopoda). The peculiar aquatic or intertidal mites of the families Hyadesiidae and Algophagidae are also included.

    Genetic Food Diagnostics - Approaches and Limitations of Species Level Diagnostics in Flowering Plants

    A vast variety of exotic plants used in traditional medicine and cuisine await introduction into the European market. Systematic information and genetic approaches are evaluated to establish reliable authentication of the species contained in the respective products. Limitations of methods (e.g. sequence vs. PCR diagnostics) and the most commonly used species concept are discussed. Using the genus Dracocephalum, an in-depth study of genetic authentication from the genus to the species level is presented.

    Next generation DNA sequencing based strategies; towards a new era for the traceability of endangered species and genetically modified organisms

    Food products are often composed of multiple ingredients that are, in addition, generally heavily processed, which makes it very challenging to determine the ingredient composition. Traditional molecular biological techniques, such as specific PCR followed by Sanger sequencing or TaqMan PCR, are most frequently applied to identify species/varieties in food/feed products. In the last decade, next generation sequencing (NGS) technologies have been developed and have been widely applied in medical science and other areas, such as agricultural and environmental sciences. The aim of this thesis was to use detailed genetic differences to identify species/varieties in feed/food products based on advanced analytical NGS-based strategies. The study focused on the identification of two target groups: (a) endangered species and (b) GMOs. Elucidating genetic composition was subdivided into three main topics: enrichment, NGS-based strategy and identification. For both applications, novel molecular assays were developed and coupled to an apt NGS technology; data analysis was performed with dedicated bioinformatics pipelines that were developed for the specific needs of each application. With respect to endangered species identification, in chapter 2 it was shown that no dedicated method was available to identify endangered plant and animal species in real-life samples. To address this issue, in chapter 3, a multi-locus DNA metabarcoding approach was developed comparing 12 plant and animal barcode and mini-barcode markers, and the method was validated across 16 laboratories. The results showed that the approach was sensitive enough to identify species present at 1%, and consistent and reproducible results were observed across the laboratories for all the analysed experimental mixtures and real-life samples. 
The combination of multiple barcodes enabled the identification of all the species used in the experimental mixtures, and additionally increased the quality assurance for detection. Furthermore, in chapter 4 the applicability of the multi-locus DNA metabarcoding approach was evaluated on 18 traditional medicines (TMs) belonging to different matrices. It was shown that an adequate DNA clean-up system is necessary to remove impurities from real-life samples; in the metabarcoding analysis of the TMs, mini-barcodes mainly accounted for the identification of the taxa. Regarding the species identified in the TMs, only a few of the species declared on the label could be identified across the TMs; however, many undeclared species were identified, including the endangered species Ursus arctos. The conclusion for the first part of the thesis was that a combination of universal plant and animal barcode and mini-barcode markers can provide high resolution for species detection, without being limited by matrix, DNA integrity or species composition of a sample. With respect to the identification of GMOs, the AM-SEQ NGS-based GMO screening approach was developed and evaluated (chapter 6). The results obtained from the NGS-based screening were compared to the currently applied two-step TaqMan PCR-based GMO screening. This comparison showed that high-abundance targets could be detected similarly; however, low-abundance targets could not always be detected by one of the two methods. With the use of a broader NGS-based screening strategy, more GMOs and related targets could be identified compared to the more limited two-step TaqMan PCR-based GMO screening. Additionally, some identified low-abundance targets could not be explained, which might indicate the presence of unknown GMOs (UGMOs) or, alternatively, the donor organism. 
To identify the unknown sequence of a UGMO, a genome walking (GW) approach is necessary. In chapter 5 the available GW approaches were summarised, and from this literature review it was concluded that at that moment no GW method was available to fulfil the requirements of UGMO identification, such as a 0.1% detection limit and enrichment of UGMO targets in a background of GMOs. To address these issues, in chapter 7, the Amplification of Linearly-enriched Fragments (ALF) approach was developed and combined with PacBio SMRT NGS technology. The ALF approach was subsequently evaluated on real-life mimicking samples, where sequences related to GMOs present at 1% could be identified. The longest enriched fragment was around 2.5 kbp, and a data analysis model was used to distinguish the sequences belonging to known GMOs from the unknown sequences by a sequence of data mapping steps. With the data analysis model, previously unknown sequence information of a GMO was obtained, showing that the ALF approach can be used to identify the unknown sequence of a UGMO in real-life samples. For the second part of the thesis it was concluded that NGS-based GMO screening is an accurate and reliable screening method for GMOs; additionally, the combination of a genome walking approach and NGS is sensitive enough to identify previously unknown sequences for GMOs present at low abundance. In general, it can be concluded that the use of NGS-based screening methods can provide accurate and reliable information on the detailed genetic differences of species/varieties present in complex food/feed products. Using enrichment of known targets, both well-known species and known and unknown GM sequences could be identified, without being limited by the complexity of a sample. 
The results of this thesis show that NGS-based approaches have the potential to be effectively used for food composition screening, and the developed methods can aid Customs, regulatory agencies, and food industries in monitoring food and feed samples.

    Snails as intermediate hosts for parasitic infections: host-parasite relationships and intervention strategies

    A fundamental prerequisite in the fight against medically and veterinary important parasites transmitted by intermediate host snails is a good knowledge of their life cycles, host specificity and geographical distribution. With scientists around the world collecting material from the wild and generating vast amounts of sequencing data, there is a huge opportunity to expand our knowledge of host-parasite relationships from the comfort of an office chair. With these motivations in mind, a bioinformatics tool was developed that has proven to be time efficient and accurate for the rapid identification of hidden parasites in publicly available datasets. Several dozen hidden parasite infections were discovered from the 2150 gastropod datasets tested, and some of these relationships have not yet been described. With our better understanding and the rapid progress in development of molecular and genetic methods, new avenues are opening for the control and eradication of diseases caused by vector-borne parasites. To study crucial parasite-snail interactions and eventually try to interfere with the infection, it is desirable to edit the host genome. Thus, in the framework of this work, preliminary experiments for the development of the CRISPR/Cas9 protocol in Biomphalaria glabrata, the intermediate host of the dangerous blood fluke Schistosoma mansoni, were also performed. The most significant findings in this case are the proof-of-concept of cultivation of B. glabrata embryos in glass capillaries using natural egg fluid and the demonstration that dilution of this fluid or complete replacement by other culture media are not suitable for successful cultivation. I also show that the Diaphanous gene, which has been used in the past to optimize CRISPR/Cas9 in another snail model, is not suitable for our model. 
The ultimate goal of the development of this molecular-genetic toolbox is the eradication of schistosomiasis by replacing susceptible populations in nature with resistant populations using gene drive technology. Although disrupted by the COVID-19 pandemic, this work’s contribution to progress in the fight against helminthic parasitic infections is considerable.

    A computational framework for transcriptome assembly and annotation in non-model organisms: the case of Venturia inaequalis

    Philosophiae Doctor - PhD. In this dissertation three computational approaches are presented that enable optimization of reference-free transcriptome reconstruction. The first addresses the selection of bona fide reconstructed transcribed fragments (transfrags) from de novo transcriptome assemblies and annotation with a multiple domain co-occurrence framework. We showed that selected transfrags are functionally relevant and represented over 94% of the information derived from annotation by transference. The second approach relates to quality score based RNA-seq sub-sampling and the description of a novel sequence similarity-derived metric for quality assessment of de novo transcriptome assemblies. A detailed systematic analysis of the side effects induced by quality score based trimming and/or filtering on artefact removal and transcriptome quality is described. Aggressive trimming produced incompletely reconstructed and missing transfrags. This approach was applied in generating an optimal transcriptome assembly for a South African isolate of V. inaequalis. The third approach deals with the computational partitioning of transfrags assembled from RNA-Seq of mixed host and pathogen reads. We used this strategy to correct a publicly available transcriptome assembly for V. inaequalis (Indian isolate). We binned 50% of the latter to apple transfrags and identified putative immunity transcript models. Comparative transcriptomic analysis between fungal transfrags from the Indian and South African isolates reveals effectors or transcripts that may be expressed in planta upon morphogenic differentiation. These studies have successfully identified V. inaequalis-specific transfrags that can facilitate gene discovery. The unique access to an in-house draft genome assembly allowed us to provide a preliminary description of genes that are implicated in pathogenesis. Gene prediction with bona fide transfrags produced 11,692 protein-coding genes. 
We identified two hydrophobin-like genes and six accessory genes of the melanin biosynthetic pathway that are implicated in the invasive action of the appressorium. The cazyome reveals an impressive repertoire of carbohydrate-degrading enzymes and carbohydrate-binding modules, amongst which are six polysaccharide lyases and the largest number of carbohydrate esterases (twenty-eight) known in any fungus sequenced to date.

    Pan-genome Search and Storage

    Holley G. Pan-genome Search and Storage. Bielefeld: Universität Bielefeld; 2018. High Throughput Sequencing (HTS) technologies are constantly improving and making genome sequencing more affordable. However, HTS sequencers can only produce short overlapping genome fragments that are erroneous and cover the sequenced genomes unevenly. These genome fragments are assembled based on their overlaps to produce larger contiguous sequences. Since de novo genome assembly is computationally intensive, some species have a reference genome used as a guide for assembling genome fragments from the same species or as a basis for comparative genomics methods. Yet, assembling a genome is an error-prone process depending on the quality of the sequencing data and the heuristics used during the assembly. Furthermore, analyses based on a reference are biased towards the reference. Finally, a single reference cannot reflect the dynamics and diversity of a population of genomes. Overcoming these issues requires moving away from the single-genome, reference-centric paradigm and taking advantage of the multiple sequenced genomes available for each species. For this purpose, pan-genomes were introduced as sets of genomes from different strains of the same species. A pan-genome is represented by a multi-genome index exploiting the similarity and redundancy of the genomes it contains. Still, pan-genomes are more difficult to analyze than single genomes because of the large amount of data to be stored and indexed. Current data structures for pan-genome indexing do not fulfill all requirements for pan-genome analysis. Indeed, these data structures are often immutable while the size of a pan-genome grows constantly with newly sequenced genomes. Frequently, these data structures consider only assemblies as input, while unassembled genome fragments abound in databases. 
Also, indexing variants and similarities between the genomes of a pan-genome usually requires time- and memory-consuming algorithms such as sequence alignments. Sometimes, pan-genome analysis tools just assume variants and similarities are provided as input. While data structures already exist for pan-genome indexing, no solution is currently proposed for genome fragment compression in a pan-genome context. Indeed, it is often of interest to transmit and store all genome fragments of a pan-genome. However, HTS-specific compression tools are not dynamic and cannot update a compressed archive of genome fragments with new fragments of a genome without decompression. Hence, those tools are poorly adapted to the transmission and storage of genome fragments in a pan-genome context. In this thesis, we aim to provide scalable solutions for pan-genome indexing and storage. We first address the problem of pan-genome indexing by proposing a new alignment-free, reference-free and incremental data structure that accepts genome fragments as well as assemblies as input: the Bloom Filter Trie (BFT). The BFT is a tree data structure representing a colored de Bruijn graph in which k-mers, words of length k from the input genomes, are associated with sets of colors representing the genomes in which they occur. The BFT makes extensive use of Bloom filters to navigate in the tree and optimize the graph traversal. A "bursting" method is employed to perform an efficient path and level compaction of the tree. We show that the BFT outperforms a data structure that has similar features but is based on an approximation of the set of indexed k-mers. Secondly, we address the problem of genome fragment compression in a pan-genome context by proposing a new abstract data structure, the guided de Bruijn graph. It augments the de Bruijn graph with k-mer partitions such that the graph traversal is guided to reconstruct exactly the genome fragments when decompressing. 
Different techniques are proposed to optimize the storage of fragments in the graph and the partition encoding. We show that the BFT described previously has all the features required to index a guided de Bruijn graph and is used in the implementation of our compression method, named DARRC. The evaluation of DARRC on a large pan-genome dataset, compared to state-of-the-art HTS-specific and general-purpose compression tools, shows a 30% compression ratio improvement over the second-best-performing tool of this evaluation.
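As a rough illustration of the colored de Bruijn graph idea described in this abstract (k-mers associated with the set of genomes in which they occur), the sketch below uses a plain Python dictionary. It is not the Bloom Filter Trie: it has no Bloom filters, no trie and no compaction, and the class and method names (`ColoredKmerIndex`, `add_genome`, `genomes_containing`, `successors`) are hypothetical. It only shows the mapping and the incremental, alignment-free insertion of new genomes.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class ColoredKmerIndex:
    """Toy colored k-mer index: each k-mer maps to the set of genome names
    ("colors") in which it occurs. A stand-in for illustration only."""

    def __init__(self, k):
        self.k = k
        self.colors = defaultdict(set)

    def add_genome(self, name, sequences):
        """Insert a genome incrementally; `sequences` may be assembled
        contigs or unassembled fragments."""
        for seq in sequences:
            for km in kmers(seq, self.k):
                self.colors[km].add(name)

    def genomes_containing(self, kmer):
        """Return the set of genomes whose sequences contain this k-mer."""
        return set(self.colors.get(kmer, ()))

    def successors(self, kmer):
        """Implicit de Bruijn edges: indexed k-mers overlapping by k-1."""
        suffix = kmer[1:]
        return [suffix + base for base in "ACGT" if suffix + base in self.colors]


if __name__ == "__main__":
    index = ColoredKmerIndex(k=3)
    index.add_genome("strain1", ["ACGTA"])
    index.add_genome("strain2", ["CGTT"])  # added later, no rebuild needed
    print(index.genomes_containing("CGT"))
    print(index.successors("CGT"))
```

A dictionary of sets grows quickly in memory on real data, which is the kind of cost that compact structures such as the BFT are designed to reduce.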