
    Next-generation sequencing (NGS) in the microbiological world : how to make the most of your money

    The Sanger sequencing method produces relatively long DNA sequences of unmatched quality and was long considered the gold standard for sequencing DNA. Successive improvements of the Sanger method, culminating in fluorescent dyes coupled with automated capillary electrophoresis, enabled the sequencing of the first genomes. Nevertheless, using this technology to sequence whole genomes was costly, laborious and time-consuming, even for relatively small genomes. A major technological advance was the introduction of next-generation sequencing (NGS), pioneered by 454 Life Sciences in the early part of the 21st century. NGS allowed scientists to sequence thousands to millions of DNA molecules in a single machine run. Since then, new NGS technologies have emerged and existing NGS platforms have been improved, enabling the production of genome sequences at an unprecedented rate and broadening the spectrum of NGS applications. The current affordability of generating genomic information, especially for microbial samples, has created a false sense of simplicity that belies the fact that many researchers still treat these technologies as a black box. In this review, our objective is to identify and discuss four steps that we consider crucial to the success of any NGS-related project: (1) the definition of the research objectives beyond sequencing, and appropriate experimental planning; (2) library preparation; (3) sequencing; and (4) data analysis. The goal of this review is to give an overview of the process, from sample to analysis, and to discuss how to optimize your resources to get the most from your NGS-based research. Regardless of how sequencing technologies evolve and improve, these four steps will remain relevant.
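
    Step (1) above includes deciding how much raw sequencing a project actually needs. As a minimal planning sketch, the classic Lander-Waterman estimate relates read count, read length and genome size to the expected average coverage (the numbers below are hypothetical):

```python
def expected_coverage(read_count, read_length, genome_size):
    """Lander-Waterman estimate: average number of reads covering each base."""
    return read_count * read_length / genome_size

# hypothetical planning numbers for a small bacterial genome
print(expected_coverage(read_count=1_000_000, read_length=150,
                        genome_size=5_000_000))  # 30.0, i.e. ~30x coverage
```

Running the estimate backwards, solving for the read count at a target coverage, is a common way to size a sequencing order during experimental planning.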

    Tools and strategies for RNA-sequencing data analysis

    RNA sequencing (RNA-seq) has enabled the in-depth study of the transcriptome and has become a primary research method in molecular biology. The typical aim of RNA-seq is to quantify and detect differentially expressed (DE) and differentially spliced (DS) genes. Numerous methodologies and tools have been developed in recent years to assist in analyzing RNA-seq data. However, it is difficult for researchers to decide which methods or strategies to adopt to optimize the analysis of their datasets. In this thesis, in Study I, we applied the gene-level DE analysis approach to detect androgen-regulated genes between cancerous and benign samples from 48 primary prostate cancer patients. Combined with other measurements from the same samples, our analysis indicated that patients with the TMPRSS2-ERG gene fusion had distinct intratumoral androgen profiles compared to TMPRSS2-ERG-negative tumors. However, DE can remain undetected when expression varies across the gene, for example due to alternative splicing. To account for this problem, an alternative analysis approach has been suggested in which statistical testing is first performed at lower feature levels (e.g. transcripts, transcript compatibility counts, or exons), and the results are then aggregated to the gene level. In Study II, we tested this alternative approach on these lower feature levels and compared the results to those from the conventional gene-level approach. Two methods, the Lancaster method and the empirical Brown's method (EBM), were tested for aggregating the feature-level results to gene-level results. Our results suggest that exon-level estimates improve the detection of DE genes when the EBM method is used for aggregation. Accordingly, the R/Bioconductor package EBSEA was developed using the winning approach. RNA-seq data can also be used to find DS events between conditions.
    However, the detection of DS is more challenging than the detection of DE. In Study III, a comprehensive comparison of ten DS tools was performed. We concluded that exon-based and event-based methods (rMATS and MAJIQ) performed best overall across the different evaluation metrics considered. Furthermore, we observed overall low concordance between the results reported by the different tools, making it advisable to use more than one tool when performing DS analysis and to concentrate on the overlapping results.
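
    The feature-to-gene aggregation tested in Study II can be illustrated with the Lancaster method: each exon-level p-value is mapped to a chi-square deviate whose degrees of freedom equal its weight (for example, exon read counts), and the sum is tested against a chi-square with the total degrees of freedom. A minimal sketch (the weights here are hypothetical, and EBSEA's actual implementation may differ):

```python
from scipy.stats import chi2

def lancaster(pvalues, weights):
    """Lancaster p-value combination: a weighted generalization of
    Fisher's method (setting every weight to 2 recovers Fisher exactly)."""
    stat = sum(chi2.isf(p, df=w) for p, w in zip(pvalues, weights))
    return chi2.sf(stat, df=sum(weights))

# exon-level p-values for one gene, weighted by hypothetical read counts
print(lancaster([0.01, 0.20, 0.80], weights=[120, 45, 30]))
```

Weighting lets well-covered exons contribute more evidence to the gene-level call than poorly covered ones.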

    SQANTI : extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification

    High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform an extensive evaluation of ToFU PacBio transcripts by PCR, revealing that a substantial number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, more frequently resulting in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the accurate quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomic detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
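
    SQANTI-style structural classification boils down to comparing a transcript's chain of splice junctions against the reference annotation. A simplified sketch of the four main categories (full splice match, incomplete splice match, novel in catalog, novel not in catalog), using made-up junction identifiers:

```python
def classify_transcript(junctions, reference_chains):
    """Assign a SQANTI-style structural category from splice junctions.
    junctions: tuple of junction IDs for the query transcript.
    reference_chains: set of junction-chain tuples from the annotation."""
    known = {j for chain in reference_chains for j in chain}
    if junctions in reference_chains:
        return "FSM"   # full splice match: identical junction chain
    for chain in reference_chains:
        for i in range(len(chain) - len(junctions) + 1):
            if chain[i:i + len(junctions)] == junctions:
                return "ISM"   # incomplete splice match: contiguous sub-chain
    if all(j in known for j in junctions):
        return "NIC"   # novel in catalog: known junctions, novel combination
    return "NNC"       # novel not in catalog: at least one novel junction

ref = {("j1", "j2", "j3")}
print(classify_transcript(("j1", "j3"), ref))  # NIC
```

The abstract's key finding maps onto this scheme: most curated novel transcripts fall into the "novel combination of existing splice sites" bucket rather than containing entirely novel junctions.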

    A Guide to Carrying Out a Phylogenomic Target Sequence Capture Project

    High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing effort on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing coverage. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth, and it has proven to produce powerful, large multi-locus DNA sequence datasets suitable for phylogenetic analyses. However, target capture requires careful consideration of design choices that may greatly affect the success of experiments. Here we provide a simple flowchart for designing phylogenomic target capture experiments. We discuss the necessary decisions, from the identification of target loci to the final bioinformatic processing of sequence data, and outline challenges and solutions related to the taxonomic scope, sample quality, and available genomic resources of target capture projects. We hope this review will serve as a useful roadmap for designing and carrying out successful phylogenetic target capture studies. © 2020 Andermann, Torres Jiménez, Matos-Maraví, Batista, Blanco-Pastor, Gustafsson, Kistler, Liberal, Oxelman, Bacon and Antonelli.

    Identification of unique neoantigen qualities in long-term survivors of pancreatic cancer

    Pancreatic ductal adenocarcinoma is a lethal cancer with fewer than 7% of patients surviving past 5 years. T-cell immunity has been linked to the exceptional outcome of the few long-term survivors [1,2], yet the relevant antigens remain unknown. Here we use genetic, immunohistochemical and transcriptional immunoprofiling, computational biophysics, and functional assays to identify T-cell antigens in long-term survivors of pancreatic cancer. Using whole-exome sequencing and in silico neoantigen prediction, we found that tumours with both the highest neoantigen number and the most abundant CD8+ T-cell infiltrates, but neither alone, stratified patients with the longest survival. Investigating the specific neoantigen qualities promoting T-cell activation in long-term survivors, we discovered that these individuals were enriched in neoantigen qualities defined by a fitness model, and neoantigens in the tumour antigen MUC16 (also known as CA125). A neoantigen quality fitness model conferring greater immunogenicity to neoantigens with differential presentation and homology to infectious disease-derived peptides identified long-term survivors in two independent datasets, whereas a neoantigen quantity model ascribing greater immunogenicity to increasing neoantigen number alone did not. We detected intratumoural and lasting circulating T-cell reactivity to both high-quality and MUC16 neoantigens in long-term survivors of pancreatic cancer, including clones with specificity to both high-quality neoantigens and predicted cross-reactive microbial epitopes, consistent with neoantigen molecular mimicry. Notably, we observed selective loss of high-quality and MUC16 neoantigenic clones on metastatic progression, suggesting neoantigen immunoediting. Our results identify neoantigens with unique qualities as T-cell targets in pancreatic ductal adenocarcinoma.
More broadly, we identify neoantigen quality as a biomarker for immunogenic tumours that may guide the application of immunotherapies.
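
    The quality model described above multiplies two ingredients: an amplitude term for differential MHC presentation of the mutant versus wild-type peptide, and a recognition term for homology to infectious disease-derived epitopes. The toy sketch below captures that product form only; the functional shapes and constants are illustrative assumptions, not the published model's fitted parameters:

```python
import math

def neoantigen_quality(kd_wildtype, kd_mutant, alignment_score,
                       midpoint=26.0, slope=4.87):
    """Toy score: amplitude (relative MHC binding of mutant vs wild-type
    peptide, dissociation constants in nM) times a logistic recognition
    probability in the alignment score to known pathogen epitopes.
    The logistic constants here are illustrative, not fitted values."""
    amplitude = kd_wildtype / kd_mutant
    recognition = 1.0 / (1.0 + math.exp(-slope * (alignment_score - midpoint)))
    return amplitude * recognition

# a mutant peptide that binds MHC 10x better than wild type and closely
# resembles a known pathogen epitope scores highly
print(neoantigen_quality(kd_wildtype=500.0, kd_mutant=50.0, alignment_score=30.0))
```

Under this product form, a high score requires both improved presentation and a pathogen-like sequence; either term near zero suppresses the other, which is why neither neoantigen count nor infiltrate abundance alone stratified survival.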

    Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity

    In biomedical research, applied machine learning and bioinformatics are essential disciplines for translating data-driven findings into medical practice, a task accomplished largely by developing computational tools and algorithms that assist in detecting and clarifying the underlying causes of disease. Continuous advancements in high-throughput technologies, coupled with recently promoted data sharing policies, have produced a massive wealth of data with remarkable potential to improve human health care. In line with this boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biologists can be broadly divided into molecular and conventional clinical data. The aim of this thesis was to develop novel statistical and machine learning tools, and to incorporate existing state-of-the-art methods, for analyzing bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches on clinical decision making by improving patient risk stratification and the prediction of disease outcomes. This thesis comprises five studies covering method development for (1) genomic data, (2) conventional clinical data and (3) the integration of genomic and clinical data. For genomic data, the main focus is the detection of differentially expressed genes, the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. To demonstrate the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC).
In addition to previously known markers, novel genes with a potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble-based predictive models are developed to provide clinical decision support in the treatment of patients with metastatic castration-resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and real-world patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate the importance of including genomic data in the predictive ability of clinical models. Again utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors. Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex diseases with high heterogeneity. In the case of cancer, the interpretability of clinical models strongly depends on predictive markers with high reproducibility supported by validation data. The discovery of such markers would increase the chance of early detection and improve prognosis assessment and treatment choice.
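
    The ensemble-based predictive models mentioned above combine several weak learners into one decision. A minimal majority-vote sketch in plain Python, using trivial threshold rules on hypothetical clinical variables (the thresholds are illustrative, not those of the published models):

```python
def majority_vote(classifiers, patient):
    """Ensemble prediction: each base classifier votes 0/1 and the
    majority label is returned."""
    votes = sum(clf(patient) for clf in classifiers)
    return 1 if votes > len(classifiers) / 2 else 0

# illustrative threshold rules on hypothetical clinical variables
base = [
    lambda p: 1 if p["psa"] > 20 else 0,    # high PSA level
    lambda p: 1 if p["ecog"] >= 2 else 0,   # poor performance status
    lambda p: 1 if p["alp"] > 150 else 0,   # elevated alkaline phosphatase
]
print(majority_vote(base, {"psa": 35, "ecog": 1, "alp": 200}))  # 1 (high risk)
```

Real ensemble learners weight and train the base models on data, but the core idea of pooling diverse weak predictors into a more robust decision is the same.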

    tGBS® genotyping-by-sequencing enables reliable genotyping of heterozygous loci

    Conventional genotyping-by-sequencing (cGBS) strategies suffer from high rates of missing data and genotyping errors, particularly at heterozygous sites. tGBS® genotyping-by-sequencing is a novel method of genome reduction that employs two restriction enzymes to generate overhangs in opposite orientations, to which (single-stranded) oligos rather than (double-stranded) adaptors are ligated. This strategy ensures that only double-digested fragments are amplified and sequenced. The use of oligos avoids the need to prepare adaptors and the problems associated with inter-adaptor annealing/ligation. Hence, the tGBS protocol simplifies the preparation of high-quality GBS sequencing libraries. During polymerase chain reaction (PCR) amplification, selective nucleotides included at the 3'-end of the PCR primers result in additional genome reduction compared to cGBS. By adjusting the number of selective bases, different numbers of genomic sites are targeted for sequencing. Therefore, for equivalent amounts of sequencing, more reads per site are available for SNP calling. Hence, compared to cGBS, tGBS delivers higher SNP calling accuracy (>97–99%), even at heterozygous sites, less missing data per marker across a population of samples, and an enhanced ability to genotype rare alleles. tGBS is particularly well suited for genomic selection, which often requires the ability to genotype populations of individuals that are heterozygous at many loci.
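
    The claim that more reads per site improve heterozygous calls can be made concrete with a maximum-likelihood genotype call from allele counts, where each candidate genotype implies a different expected alt-allele fraction. This is a generic sketch, not the tGBS pipeline itself, and the sequencing error rate is an assumption:

```python
from scipy.stats import binom

def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Pick the genotype maximizing the binomial likelihood of the
    observed alt-allele count given total read depth."""
    depth = ref_count + alt_count
    alt_prob = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    return max(alt_prob, key=lambda g: binom.pmf(alt_count, depth, alt_prob[g]))

print(call_genotype(ref_count=9, alt_count=11))   # het
print(call_genotype(ref_count=20, alt_count=0))   # hom_ref
```

At low depth the heterozygous and homozygous likelihoods are close and calls flip easily; concentrating reads on fewer sites, as tGBS does, widens the gap between them.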

    Genomic integrative analysis to improve fusion transcript detection, liquid association and biclustering

    More data provide more possibilities. The growing volume of genomic data provides new perspectives for understanding complex biological problems. Many single-study algorithms have been developed; however, their results are unstable for small sample sizes or are overwhelmed by study-specific signals. Taking advantage of high-throughput genomic data from multiple cohorts, in this dissertation we detect novel fusion transcripts, explore complex gene regulation and discover disease subtypes within an integrative analysis framework. In the first project, we evaluated 15 fusion transcript detection tools for paired-end RNA-seq data. Though no single method outperformed the others across the board, several top tools were selected according to their F-measures. We further developed a fusion meta-caller algorithm that combines the top methods to re-prioritize candidate fusion transcripts. The results showed that our meta-caller can successfully balance precision and recall compared to any single fusion detection tool. In the second project, we extended liquid association to two meta-analytic frameworks (MetaLA and MetaMLA). Liquid association is the dynamic gene-gene correlation depending on the expression level of a third gene. MetaLA and MetaMLA provided stronger detection signals and more consistent and stable results than single-study analysis. When we applied our method to five yeast datasets related to environmental changes, the genes in the top triplets were highly enriched in fundamental biological processes corresponding to those changes. In the third project, we extended the plaid model from single-study analysis to multiple cohorts for bicluster detection. Our meta-biclustering algorithm can successfully discover biclusters with higher Jaccard accuracy under high noise and small sample sizes. We also introduced the concept of the gap statistic for pruning parameter estimation.
In addition, biclusters detected from five breast cancer mRNA expression cohorts successfully selected genes highly associated with many breast cancer-related pathways and split samples into groups with significantly different survival behaviors. In conclusion, we improved fusion transcript detection, liquid association analysis and bicluster discovery through integrative analysis frameworks. These results provide strong evidence of gene-fusion structural variation, three-way gene regulation and disease subtype detection, ultimately contributing to a better understanding of complex disease mechanisms.
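
    Liquid association, extended in the second project, scores how the co-expression of two genes X and Y changes with the level of a third gene Z. The classic single-study statistic is the expected triple product of the standardized expression profiles; a minimal sketch on synthetic data:

```python
import numpy as np

def liquid_association(x, y, z):
    """LA(X, Y | Z) estimated as the mean triple product of z-scored vectors."""
    zx, zy, zz = ((v - v.mean()) / v.std() for v in (x, y, z))
    return float(np.mean(zx * zy * zz))

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = rng.normal(size=5000)
y = x * z + 0.1 * rng.normal(size=5000)  # x-y correlation flips with sign of z
print(liquid_association(x, y, z))       # strongly positive: co-expression tracks z
```

The meta-analytic frameworks described above aggregate this per-study statistic across cohorts, which stabilizes the estimate against the small-sample noise a single study suffers from.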