16 research outputs found

    How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?

    Abstract. Background: Next-generation sequencing technologies generate large numbers of short reads that are used to address a variety of biological questions. However, sequencing reads often have low quality at the 3’ end or originate from repetitive regions of a genome, and it is unclear how different alignment programs perform in these cases. To investigate this question, we use both real and simulated data with these issues to evaluate the performance of four commonly used algorithms: SOAP2, Bowtie, BWA, and Novoalign. Methods: The performance of the alignment algorithms is measured in terms of concordance between any pair of aligners (for real sequencing data without known truth) and the accuracy of simulated read alignment. Results: Our results show that, for sequencing data with reads that have relatively good quality or that have had low-quality bases trimmed off, all four alignment programs perform similarly. We also demonstrate that trimming off low-quality ends markedly increases the number of aligned reads and improves the consistency among aligners, especially for low-quality data. Novoalign, however, is more sensitive to improvements in data quality: trimming off low-quality ends significantly increases the concordance between Novoalign and the other aligners. As for reads from repetitive regions, our simulated data show that such reads tend to be aligned incorrectly and that suppressing reads with multiple hits can improve alignment accuracy. Conclusions: This study provides a systematic comparison of commonly used alignment algorithms in the context of sequencing data with varying qualities and from repetitive regions. Our approach can be applied to sequencing data sets generated on different platforms. It can also be used to study the performance of other alignment programs.
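As a rough illustration of two ideas in the abstract above, the following Python sketch trims low-quality bases from the 3' end of a read and computes a pairwise concordance between two aligners. It is not the authors' code: the function names, the Phred+33 encoding, the quality threshold of 20, and the definition of concordance (the fraction of reads aligned by both programs that are placed at the same position) are all assumptions made for this example.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose quality is below min_q (hypothetical threshold)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def concordance(hits_a, hits_b):
    """Fraction of reads aligned by both aligners that share the same placement.

    hits_a and hits_b map read IDs to a (chromosome, position) tuple.
    This definition is an assumption, not necessarily the paper's.
    """
    shared = hits_a.keys() & hits_b.keys()
    if not shared:
        return 0.0
    same = sum(1 for read in shared if hits_a[read] == hits_b[read])
    return same / len(shared)


if __name__ == "__main__":
    seq, quals = trim_3prime("ACGTAC", phred33("IIII&#"))
    print(seq, quals)
    aligner_a = {"r1": ("chr1", 100), "r2": ("chr1", 200)}
    aligner_b = {"r1": ("chr1", 100), "r2": ("chr2", 5), "r3": ("chr3", 7)}
    print(concordance(aligner_a, aligner_b))
```

Real trimming tools typically use sliding windows or running-sum criteria rather than this simple per-base cutoff; the cutoff is used here only to keep the sketch short.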

    Development of SRADE tool and analysis of quality scores of the reads of Next-Generation Sequencing data

    Capillary Electrophoresis (CE) based on Sanger sequencing made it possible to extract and interpret the genetic information of any given biological system. Although it was a major breakthrough in biology, its limitations in speed, throughput, scaling and resolution paved the way for a new technology, Next-Generation Sequencing (NGS). NGS has provided considerable insight into the genomes, transcriptomes and epigenomes of many species. Over time, a large amount of information has been generated with NGS, and new methods have been developed, each with its merits and demerits. Among the most popular sequencing methods are Illumina sequencing, 454 pyrosequencing, SOLiD sequencing and Ion Torrent semiconductor sequencing. The information generated by these methods is stored in databases, one of the most important of which is the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI). Sequencing data are downloaded from the Sequence Read Archive through a web interface and converted into the required format using the SRA toolkit provided by NCBI. Using the OS architecture of the SRA toolkit, data stored in '.sra' format are converted into tab-delimited text and saved to a text file with a '.txt' extension. These files contain a lot of redundant information, and only particular data are required for analysis. To reduce the redundant information and obtain only the desired data, an algorithm was developed that operates through a User Interface (UI) in which the user selects the data wanted for analysis. This reduces computational time and memory use while maintaining high accuracy. The User Interface is named SRADE (Sequence Read Archive Data Extractor).
The data obtained from the SRA files include the sequence reads, the quality of the reads, their position and their length, which can be used for mapping. The information obtained from different sequencing methods, and the quality of the reads, may differ. Therefore, the quality of the results from multiple runs of the same sequencing method, as well as from different sequencing methods, is compared in order to find the differences, to identify the best method for sequencing the genes, and to find a cost-effective way to determine reads with high and low quality scores. For the comparison, whole-exome sequencing data from the 1000 Genomes project are studied: data from four Illumina runs, along with whole-exome data from Illumina and AB_SOLiD.

    Fine mapping of a QTL for white markings on chromosome 23 in Scandinavian Fjord horses.

    In Fjord horses, extensive white markings on the head or legs are undesirable. White markings tend to be multifactorial, pleiotropic traits whose phenotypic effects are not restricted to coat colour but include a variety of other syndromes that can have a negative impact on the horse. In this thesis, 328 Fjord horses underwent a GWAS for white markings based on 67k SNP chip data. Of these 328 horses, 19 had white markings on the head or body. The GWAS identified 2 regions on chromosome 23 with significant association to white markings. Following this, 16 horses, evenly split between cases and controls, underwent resequencing, culminating in variant calling using the GATK and FreeBayes. Further analysis was carried out on the peaks identified in the GWAS, as well as on regions containing known white marking genes. This further analysis comprised differences in allele frequency, Weir and Cockerham’s Fst and variant annotation. No clear mutations were identified in a protein-coding gene; however, an association with U6 spliceosomal RNA was indicated following filtering on variant annotation.

    Discovery of novel plant viruses and study of intrahost viral diversity from data generated by high-throughput sequencing

    Dissertation (master's), Universidade de Brasília, Departamento de Biologia Celular, Programa de Pós-Graduação em Biologia Molecular, 2018. High-throughput sequencing (HTS) technologies allow the genomic characterization of viral communities present in plant and animal tissues and in environmental samples with high sensitivity and accuracy. Because many genomic sequences are sequenced simultaneously, the technique is also useful for studying the high intrahost genetic diversity of RNA viruses. In this work, we studied and established a pipeline for plant virome analysis using the cucumber model, and we report the discovery of two novel grapevine viruses, Grapevine enamovirus-1 (GEV-1) and Grapevine virga-like virus (GVLV). After rapid amplification of cDNA ends (RACE) assays of the 5' end of the GEV-1 genome, we obtained the near-complete genomic sequence of this virus (6227 bp), enabling its classification as a member of the genus Enamovirus (family Luteoviridae) based on its genomic organization, phylogenetic studies and the criteria established by the International Committee on Taxonomy of Viruses (ICTV).
However, the genome of GVLV remains only partially sequenced, in two parts: a 3348 bp contig containing the methyltransferase (Met) and helicase (Hel) domains, and a 1272 bp contig corresponding to a partial RNA-dependent RNA polymerase (RdRp). Based on phylogenetic studies, we were not able to classify this novel virus, which shows low identity with viruses in both the Virgaviridae and Bromoviridae families. Additionally, this work presents a study of the intrahost genetic diversity of Grapevine leafroll-associated viruses (GLRaVs), focusing on the polyprotein of GLRaV-2 and -3 (genera Closterovirus and Ampelovirus, respectively), as well as the in silico detection of a defective RNA molecule of GLRaV-4 (Ampelovirus) from HTS data. The intrahost populations of two isolates of GLRaV-2 showed only 11 single nucleotide polymorphisms (SNPs) in common (~14% of the SNPs found in each isolate). The intrahost genetic diversity found in two isolates of GLRaV-3 was low compared with that of the GLRaV-2 isolates.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

    Analysis of quantitative and qualitative genetic features in the pathogenesis of hereditary solid tumors.

    Cancer is the second most common cause of death in the Czech Republic. Carriers of mutations in genes predisposing to hereditary cancers represent a small but clinically very significant group of high-risk individuals. Today, dozens of predisposing genes for hereditary tumor syndromes are known, and targeted next-generation sequencing (NGS) has become the method of first choice for their analysis. NGS markedly accelerates the identification of causal mutations in high-risk individuals.
To identify mutations in genes predisposing to hereditary cancers, we designed a panel NGS analysis, including subsequent bioinformatics processing, that allows reliable identification of single nucleotide variants, short insertions/deletions, and large intragenic rearrangements. The bioinformatics procedures described in this thesis were used to validate the panel NGS, but also to identify alterations in specific genes, which revealed previously undescribed associations with hereditary cancers. The bioinformatics analyses have become the basis for the unified processing of large datasets from the CZECANCA consortium and enable the construction of a population-specific database of variant frequencies that serves to improve clinical diagnostics of cancer predisposition in Czech patients. The versatility of NGS also allows its use for RNA (cDNA-based) analyses of splicing variants in the...
Institute of Biochemistry and Experimental Oncology, First Faculty of Medicine, Charles University

    Multi-omics analysis of early molecular mechanisms of type 1 diabetes

    Type 1 diabetes (T1D) is a complicated autoimmune disease with largely unknown disease mechanisms. The diagnosis is preceded by a long asymptomatic period of autoimmune activity in the insulin-producing pancreatic islets. Currently the only clinical markers used for T1D prediction are islet autoantibodies, which are a sign of already-broken immune tolerance. The focus of this dissertation is on the early asymptomatic period preceding seroconversion to islet autoantibody positivity. The genetic risk of type 1 diabetes has been thoroughly mapped in genome-wide association studies, but the environmental factors and molecular mechanisms that mediate the risk are less well understood. According to the hygiene hypothesis, the risk of immune-mediated disorders is increased by the lack of exposure to pathogens in modern environments. Within a study on the hygiene hypothesis, we compared umbilical cord blood gene expression patterns between children born in environments with contrasting standards of living and type 1 diabetes incidences (Finland, Russia, and Estonia). The differentially expressed genes were associated with innate immunity and immune maturation. Our results suggest that the environment influences immune system development already in utero. Furthermore, we analyzed genome-wide DNA methylation and gene expression profiles in samples collected prospectively from Finnish children and newborn infants at risk of type 1 diabetes. Bisulfite sequencing analysis did not show any association of neonatal DNA methylation with later progression to T1D. However, an antiviral type I interferon response in early childhood was found to be a risk factor for T1D. This transcriptomic signature was detectable in the peripheral blood before the appearance of islet autoantibodies, and the main observations were confirmed in an independent German study. These results contributed to the hypothesis that virus infections might play a role in T1D.
Additionally, this dissertation contributed to transcriptomic and epigenomic data analysis workflows. Simple probe-level analysis of exon array data was shown to improve the reproducibility, specificity, and sensitivity of detected differential exon inclusion events. The type I error rate was markedly reduced by permutation-based significance assessment of differential methylation in bisulfite sequencing studies.
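The permutation-based significance assessment mentioned at the end of this abstract can be illustrated with a generic two-group permutation test. The sketch below is a minimal example under stated assumptions, not the dissertation's method: the function name, the use of a difference in mean methylation as the test statistic, the number of permutations, and the sample values are all hypothetical.

```python
import random


def permutation_pvalue(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_diff(pooled))
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_diff(pooled)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical methylation fractions at one site in two groups.
    cases = [0.90, 0.80, 0.85, 0.95]
    controls = [0.10, 0.20, 0.15, 0.05]
    print(permutation_pvalue(cases, controls))
```

Because the null distribution is built from the data themselves, this kind of test avoids distributional assumptions that can inflate false positives, which is consistent with the reduction in type I error reported in the abstract.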

    Characterization of progeny derived from disomic alien addition lines from intersubgeneric cross between Glycine max and Glycine tomentella

    Disomic alien addition lines (DAALs, 2n=42) were obtained from an intersubgeneric cross between Glycine max [L.] Merr. cv. Dwight (2n=40, G1G1) and Glycine tomentella Hayata (PI 441001, 2n=78, D3D3CC). They are morphologically uniform but distinct from either of the parents. These DAALs were all derived from the same monosomic alien addition line (MAAL, 2n=41), and theoretically they should breed true because they had a pair of homologous chromosomes from G. tomentella and 40 soybean chromosomes. However, in some selfed progenies of DAALs the extra G. tomentella chromosomes were eliminated, resulting in plants with 2n=40 chromosomes. These progeny lines (2n=40) vary widely in phenotype. The objective of this research was to document the phenotypic and chromosomal variation among the progeny of these DAALs, and to understand the genetics behind this phenomenon. In the replicated field study, variation was observed among the disomic progenies for qualitative traits such as flower, seed coat, hilum, pod, and pubescence color, and stem termination, as well as for the quantitative traits protein and oil concentration, plant height, lodging, and time of maturity. Three disomic lines had protein concentrations significantly higher than either the DAAL or Dwight. Studying the plant transcriptome via RNA-sequencing documented that many genes that are critical to fundamental plant growth processes and related to stress and defense responses were differentially expressed between the DAAL (LG13-7552) and one of the disomic progeny (LG12-7063). RNA-sequencing data indicated that the gray pubescence of LG12-7063 was not due to a sequence change from the T- to the tt genotype, but the result of altered gene expression. The expression of G. tomentella sequences and higher expression of transposable elements (TEs) in the DAAL were also documented.

    Strategies to detect genetic diversity in plants

    Next-generation sequencing can provide access to the genomic sequence of even large and complex plant genomes. Three major strategies exist to assess the genomic information of a species at different scales and complexity levels: transcriptome, target capture and whole-genome shotgun sequencing. The scope of this thesis was to evaluate each concept, ascertain its potential for the discovery of genetic diversity, and develop methods for their improvement. With these objectives, the economically important crops rye, maize and barley were investigated to reveal novel insights into their genetic diversity. The study constructed the first rye transcriptome reference that was utilized for variant discovery revealing ~18,000 single nucleotide variants (SNVs) in coding regions. Subsequently, this resource was converted into a genotyping assay (RYE5k) for application e.g. in breeding programs. The identification of genomic variants requires a high degree of accuracy. Two methods were developed to increase the accuracy in the process of variant discovery: the ‘combinatorial variant calling’ and the approach of ‘k-mer repeat investigation’. With the first method, the reliability of variant calling was increased by the interlaced support and analysis of multiple detection procedures. The approach was successfully applied to determine the diversity in biomass-related genes of maize. Hereby, the applied capture sequencing approach revealed 86,875 SNVs in coding regions. The second method was motivated by the complexity of the large and repetitive barley genome. Therefore, k-mer analyses were used to gain knowledge of repetitive features and this resulted in greater precision in variant calling. The positive effect was shown in a genome-wide diversity study of barley. As a result, more than 15 million high-quality SNVs were identified in five cultivars and a wild progenitor of cultivated barley. 
The study successfully revealed novel insights into the genetic diversity of barley.

    Natural and engineered resistance triggered by TAL effectors of Xanthomonas oryzae

    Xanthomonas plant pathogenic bacteria cause yield-limiting disease in several important crops. Some species promote infection by secreting transcription activator-like (TAL) effectors directly into host cells where they interact with eukaryotic cellular apparatus to transactivate plant genes. Specific recognition occurs through direct, predictable interactions between hypervariable amino acid residues in the central DNA binding domain and adjacent nucleotides in the sense strand of the gene promoter, thus defining the length and sequence of the effector binding element (EBE). Activation of host susceptibility genes promotes disease, whereas induction of executor resistance (R) genes leads to plant defense. The vascular pathogen Xanthomonas oryzae pv. oryzae (Xoo) and the mesophyll pathogen Xanthomonas oryzae pv. oryzicola (Xoc) are causal agents of the devastating rice (Oryza sativa) diseases bacterial blight and bacterial leaf streak, respectively. To investigate whether executor R genes can be engineered for broader resistance, we added six predicted EBEs corresponding to TAL effectors from Xoo and Xoc to the promoter of Xa27. This modification resulted in specific activation of Xa27 in transgenic rice by Xoo, Xoc and each of the corresponding TAL effectors individually, as measured by quantitative Real Time RT-PCR (qPCR). It expanded the resistance of Xa27 to include additional strains of Xoo and all tested strains of Xoc. A bioinformatics analysis of sequences amended to the Xa27 promoter suggests the likely introduction of unwanted regulatory elements, highlighting the importance of EBE design to guard against spurious gene activation. During a screen of Xoc TAL effectors, we observed a hypersensitive reaction (HR) triggered by Tal2a when it was expressed heterologously in rice leaves by another Xanthomonas strain. The response was Tal2a-specific and dependent on gene activation, suggesting an executor R gene mechanism. 
EBE prediction, qPCR and next-generation RNA sequencing studies identified three rice genes activated specifically in response to Tal2a. One, a ubiquitin carboxy-terminal hydrolase (UCH), was activated with designer TAL effectors (dTALEs) but was not sufficient to cause the HR. Testing of the remaining three genes through dTALE activation is ongoing. Expression from high- and low-copy plasmids points to a dose-dependent avirulence effect of Tal2a in Xoo and Xoc.