45 research outputs found

    DiNAMO: Exact method for degenerate IUPAC motifs discovery, characterization of sequence-specific errors

    Next-generation sequencing technologies are still associated with relatively high error rates, about 1%, which corresponds to thousands of errors at the scale of a complete genome. Each region therefore needs to be sequenced several times, and variants are usually filtered based on depth criteria. The significant number of artifacts that remain despite those filters shows the limits of conventional approaches and indicates that some sequencing artifacts are recurrent. This recurrence suggests that sequencing errors can depend on the upstream nucleotide sequence context. Our goal is to search for overrepresented motifs that tend to induce sequencing errors. Previous studies showed that some motifs, such as GGT [1,2], induce sequencing errors in Illumina technologies. However, these studies were restricted to exact motifs and did not take approximate motifs into account, which limits their statistical power. On the other hand, tools such as FIRE [3], DREME [4] and Discrover [5] were developed to search for degenerate motifs over the 15-letter IUPAC alphabet in the context of ChIP-seq studies. However, these tools use greedy algorithms, which implies a lack of sensitivity. We therefore developed an exact algorithm that searches for degenerate motifs by enumerating all possible IUPAC motifs. This algorithm is based on mutual information and uses hash tables with a graph data structure to store the motifs. It is independent of the sequencing technology. Experimental results on real data show that there are many overrepresented motifs upstream of sequencing artifacts. The artifacts are identified through the strand bias between forward and reverse reads. The length-3 homopolymer CCC seems to be sufficient to induce errors on IonTorrent. On Illumina, motifs are mainly composed of GGC followed by GGT (for example, TGGCNGGT) or of homopolymers. We also observed a drop in base quality after the detected motifs. Our exact algorithm requires less than one minute (Intel Core i5-4570 CPU, 3.20 GHz) and less than 2 GB of RAM to search for fully degenerate motifs of length 6 on a dataset of approximately 24,000 sequences extracted from 11 exomes sequenced on IonTorrent Proton.
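
    The enumeration-and-scoring idea described in this abstract can be illustrated with a minimal Python sketch. It is not the DiNAMO implementation: it brute-forces every IUPAC motif of a given length and scores each one by the mutual information between motif presence and a sequence label, without the hash tables and graph structure the authors mention. The function names, and the split into "positive" sequences (upstream of artifacts) and "negative" background sequences, are assumptions made for illustration.

```python
from itertools import product
from math import log2

# The 15-letter IUPAC nucleotide alphabet: each code maps to the bases it matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def contains(motif, seq):
    """Return True if the degenerate motif matches anywhere in seq."""
    width = len(motif)
    return any(
        all(seq[start + i] in IUPAC[code] for i, code in enumerate(motif))
        for start in range(len(seq) - width + 1)
    )


def mutual_information(motif, positives, negatives):
    """Mutual information (in bits) between motif presence and the sequence label."""
    n_pos, n_neg = len(positives), len(negatives)
    n = n_pos + n_neg
    hit_pos = sum(contains(motif, s) for s in positives)
    hit_neg = sum(contains(motif, s) for s in negatives)
    # Joint counts indexed by (motif present?, sequence is positive?).
    joint = {
        (True, True): hit_pos, (False, True): n_pos - hit_pos,
        (True, False): hit_neg, (False, False): n_neg - hit_neg,
    }
    present = {True: hit_pos + hit_neg, False: n - hit_pos - hit_neg}
    label = {True: n_pos, False: n_neg}
    mi = 0.0
    for (x, y), count in joint.items():
        if count:  # empty cells contribute nothing to the sum
            mi += (count / n) * log2(count * n / (present[x] * label[y]))
    return mi


def rank_motifs(positives, negatives, length, top=5):
    """Score all 15**length IUPAC motifs exhaustively; return the best (motif, score) pairs."""
    scored = (
        ("".join(codes), mutual_information("".join(codes), positives, negatives))
        for codes in product(IUPAC, repeat=length)
    )
    return sorted(scored, key=lambda pair: -pair[1])[:top]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    positives = ["ACCCT", "GCCCA"]
    negatives = ["ACGTA", "GTTGA"]
    for motif, score in rank_motifs(positives, negatives, length=3):
        print(motif, round(score, 3))
```

    The search space grows as 15^k for motifs of length k, so this brute-force version is only practical for very short motifs; it is meant to show what "exact" enumeration means, not to reproduce the running times reported in the abstract.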

    Qatar genome: Insights on genomics from the Middle East

    Despite recent biomedical breakthroughs and the growing momentum of large genomic studies, the Middle Eastern population, home to over 400 million people, is underrepresented in human genome variation databases. Here we describe insights from Phase 1 of the Qatar Genome Program, in which 6,047 individuals from Qatar were whole-genome sequenced. We identified more than 88 million variants, of which 24 million are novel and 23 million are singletons. Consistent with the high consanguinity and founder effects in the region, we found that several rare deleterious variants were more common in the Qatari population, while others seem to provide protection against diseases and have shaped the genetic architecture of adaptive phenotypes. These results highlight the value of our data as a resource to advance genetic studies in the Arab and neighboring Middle Eastern populations, and they will significantly boost current efforts to improve our understanding of global patterns of human variation, human history, and genetic contributions to health and disease in diverse populations. The Qatar Genome Program (QGP) and Qatar Biobank (QBB) are both Research and Development entities within Qatar Foundation for Education, Science and Community Development. The authors are thankful to everyone who contributed to this endeavor, including the QGP and QBB team members, in addition to our partners at Hamad Medical Corporation (HMC), Sidra Medicine and other national stakeholders. The authors would especially like to thank all participants in this study for their continuous support.

    Exome-wide association study to identify rare variants influencing COVID-19 outcomes : Results from the Host Genetics Initiative

    Publisher Copyright: © 2022 Butler-Laporte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Host genetics is a key determinant of COVID-19 outcomes. Previously, the COVID-19 Host Genetics Initiative genome-wide association study used common variants to identify multiple loci associated with COVID-19 outcomes. However, variants with the largest impact on COVID-19 outcomes are expected to be rare in the population. Hence, studying rare variants may provide additional insights into disease susceptibility and pathogenesis, thereby informing therapeutics development. Here, we combined whole-exome and whole-genome sequencing from 21 cohorts across 12 countries and performed rare variant exome-wide burden analyses for COVID-19 outcomes. In an analysis of 5,085 severe disease cases and 571,737 controls, we observed that carrying a rare deleterious variant in the SARS-CoV-2 sensor toll-like receptor TLR7 (on chromosome X) was associated with a 5.3-fold increase in severe disease (95% CI: 2.75–10.05, p = 5.41 × 10⁻⁷). This association was consistent across sexes. These results further support TLR7 as a genetic determinant of severe disease and suggest that larger studies on rare variants influencing COVID-19 outcomes could provide additional insights. Peer reviewed.

    Author Correction: Multi-ancestry genome-wide association analyses improve resolution of genes and pathways influencing lung function and chronic obstructive pulmonary disease risk


    Multi-ancestry genome-wide association analyses improve resolution of genes and pathways influencing lung function and chronic obstructive pulmonary disease risk

    Lung-function impairment underlies chronic obstructive pulmonary disease (COPD) and predicts mortality. In the largest multi-ancestry genome-wide association meta-analysis of lung function to date, comprising 580,869 participants, we identified 1,020 independent association signals implicating 559 genes supported by ≥2 criteria from a systematic variant-to-gene mapping framework. These genes were enriched in 29 pathways. Individual variants showed heterogeneity across ancestries, age and smoking groups, and collectively as a genetic risk score showed strong association with COPD across ancestry groups. We undertook phenome-wide association studies for selected associated variants as well as trait and pathway-specific genetic risk scores to infer possible consequences of intervening in pathways underlying lung function. We highlight new putative causal variants, genes, proteins and pathways, including those targeted by existing drugs. These findings bring us closer to understanding the mechanisms underlying lung function and COPD, and should inform functional genomics experiments and potentially future COPD therapies.

    Controversy and consensus on the management of elevated sperm DNA fragmentation in male infertility: A global survey, current guidelines, and expert recommendations

    Purpose: Sperm DNA fragmentation (SDF) has been associated with male infertility and poor outcomes of assisted reproductive technology (ART). The purpose of this study was to investigate global practices related to the management of elevated SDF in infertile men, summarize the relevant professional society recommendations, and provide expert recommendations for managing this condition. Materials and Methods: An online global survey on clinical practices related to SDF was disseminated to reproductive clinicians, according to the CHERRIES checklist criteria. Management protocols for various conditions associated with SDF were captured and compared to the relevant recommendations in professional society guidelines and the appropriate available evidence. Expert recommendations and consensus on the management of infertile men with elevated SDF were then formulated and adapted using the Delphi method. Results: A total of 436 experts from 55 different countries submitted responses. As an initial approach, 79.1% of reproductive experts recommend lifestyle modifications for infertile men with elevated SDF, and 76.9% prescribe empiric antioxidants. Regarding antioxidant duration, 39.3% recommend 4–6 months and 38.1% recommend 3 months. For men with unexplained or idiopathic infertility, and couples experiencing recurrent miscarriages associated with elevated SDF, most respondents refer to ART 6 months after failure of conservative and empiric medical management. Infertile men with clinical varicocele, normal conventional semen parameters, and elevated SDF are offered varicocele repair immediately after diagnosis by 31.4% of respondents, and after failure of antioxidants and conservative measures by 40.9%. Sperm selection techniques and testicular sperm extraction are also management options for couples undergoing ART. For most questions, heterogeneous practices were demonstrated. Conclusions: This paper presents the results of a large global survey on the management of infertile men with elevated SDF and reveals a lack of consensus among clinicians. Furthermore, it demonstrates the scarcity of professional society guidelines in this regard and attempts to highlight the relevant evidence. Expert recommendations are proposed to help guide clinicians.

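    Caractérisation des erreurs de séquençage non aléatoires, application aux mosaïques et tumeurs hétérogènes [Characterization of non-random sequencing errors, application to mosaics and heterogeneous tumors]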

    The advent of next-generation DNA sequencing technologies has revolutionized the field of personalized genomics through their resolution and low cost. However, these new technologies are associated with a relatively high error rate, which varies between 0.1% and 1% for second-generation sequencers. This value is problematic when searching for variants with a low allelic ratio, as observed in heterogeneous tumors. Indeed, such an error rate can lead to thousands of false positives. Each region of the studied DNA must therefore be sequenced several times, and the variants are then filtered according to criteria based on their depth. Despite these filters, the number of errors remains significant, showing the limits of conventional approaches and indicating that some sequencing errors are not random. In this thesis, we developed an exact algorithm for discovering over-represented degenerate DNA motifs located upstream of non-random sequencing errors and thus potentially linked to their appearance. This algorithm was implemented in a software tool called DiNAMO, which was tested on sequencing data from IonTorrent and Illumina technologies. The experimental results revealed several motifs specific to each of these two technologies. We then showed that taking these motifs into account in the analysis significantly reduced the false-positive rate. DiNAMO can therefore be used downstream of each analysis as an additional filter to improve the identification of variants, especially variants with a low allelic ratio.
