Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads.

Author: Baid Gunjan
Carnevali Paolo
Carroll Andrew
Chang Pi-Chuan
Eizenga Jordan M
Goel Sidharth
Jain Miten
Kolesnikov Alexey
Kolmogorov Mikhail
Miga Karen H
Nattestad Maria
Paten Benedict
Pesout Trevor
Shafin Kishwar
Publication venue: eScholarship, University of California
Publication date: 01/11/2021

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).

PubMed Central

eScholarship - University of California

A draft human pangenome reference

Author: Abel Haley J.
Abou Tayoun Ahmad
Antonacci-Fulton Lucinda L.
Asri Mobin
Baid Gunjan
Baker Carl A.
Belyaeva Anastasiya
Billis Konstantinos
Bourque Guillaume
Buonaiuto Silvia
Carroll Andrew
Chaisson Mark
Chang Pi-Chuan
Chang Xian H.
Cheng Haoyu
Chu Justin
Cody Sarah
Colonna Vincenza
Cook Daniel E.
Cook-Deegan Robert M.
Cornejo Omar E.
Diekhans Mark
Doerr Daniel
Ebert Peter
Ebler Jana
Eichler Evan E.
Eizenga Jordan
Fairley Susan
Fedrigo Olivier
Felsenfeld Adam L.
Feng Xiaowen
Fischer Christian
Flicek Paul
Formenti Giulio
Frankish Adam
Fulton Robert S.
Gao Yan
Garg Shilpa
Garrison Erik
Garrison Nanibaa' A.
Giron Carlos Garcia
Green Richard E.
Groza Cristian
Guarracino Andrea
Haggerty Leanne
Hall Ira M.
Harvey William T.
Haukness Marina
Haussler David
Heumos Simon
Hickey Glenn
Hoekzema Kendra
Hourlier Thibaut
Howe Kerstin
Jain Miten
Jarvis Erich
Ji Hanlee P.
Kenny Eimear E.
Koenig Barbara A.
Kolesnikov Alexey
Korbel Jan O.
Kordosky Jennifer
Koren Sergey
Lee HoJoon
Lewis Alexandra P.
Li Heng
Liao Wen-Wei
Lu Shuangjia
Lu Tsung-Yu
Lucas Julian K.
Magalhães Hugo
Marco-Sola Santiago
Marijon Pierre
Markello Charles
Marschall Tobias
Martin Fergal J.
McCartney Ann
McDaniel Jennifer
Miga Karen H.
Mitchell Matthew W.
Monlong Jean
Mountcastle Jacquelyn
Munson Katherine M.
Mwaniki Moses Njagi
Nattestad Maria
Novak Adam M.
Nurk Sergey
Olsen Hugh E.
Olson Nathan D.
Paten Benedict
Pesout Trevor
Phillippy Adam M.
Popejoy Alice B.
Porubsky David
Prins Pjotr
Puiu Daniela
Rautiainen Mikko
Regier Allison A.
Rhie Arang
Sacco Samuel
Sanders Ashley D.
Schneider Valerie A.
Schultz Baergen I.
Shafin Kishwar
Sibbesen Jonas A.
Sirén Jouni
Smith Michael W.
Sofia Heidi J.
Thibaud-Nissen Françoise
Tomlinson Chad
Tricomi Francesca Floriana
Villani Flavia
Vollger Mitchell R.
Wagner Justin
Walenz Brian
Wang Ting
Wood Jonathan M. D.
Zimin Aleksey V.
Zook Justin M.
Publication date: 01/01/2023

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

Diposit Digital de Documents de la UAB

PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions

Author: Ahsan Mian Umair
Arslan Elif
Baid Gunjan
Boja Emily
Bourgey Mathieu
Bourque Guillaume
Brown Richard
Brueffer Christian
Budak Gungor
Carroll Andrew
Catreux Severine
Chang Pi-Chuan
Chen Luoqi
Demirkaya-Budak Sinem
Dolgoborodov Alexey
DU YuanPing
Eveleigh Robert
Fang Li Tai
Feng Hanying
Flores Carlos
Goel Sidharth
Hung Calvin
Jain Amit
Jain Chirag
Jain Miten
Jain Varun
Johanson Elaine
Johnson Ivan J.
Jáspez David
Kabakci-Zorlu Duygu
Kalay Özem
Kolesnikov Alexey
Kyriakidis Konstantinos
Lajoie Bryan
Li Gen
Li Zhipan
Liu Qian
Lorenzo-Salazar José M.
MA ChouXian
Maier Ezekiel J.
Malousi Andigoni
McDaniel Jennifer
Mehio Rami
Mohiyuddin Marghoob
Morata Jordi
Muñoz-Barrera Adrián
Narcı Kübra
Nattestad Maria
Olson Nathan D.
Parra Genís
Paten Benedict
Pesout Trevor
Prasanna Anish G.
Roddey Cooper
Rubio-Rodríguez Luis A.
Ruehle Mike
Sahraeian Sayed Mohammad Ebrahim
Sedlazeck Fritz J.
Semenyuk Vladimir
Serang Omar
Shafin Kishwar
Stephens Sarah H.
Tang LinQi
Tetikol H. Serhat
Tonda Raúl
Trotta Jean-Rémi
Turgut Deniz
Wagner Justin
Wang Kai
Westreich Samuel T.
Yang Howard
Zhang ShaoWei
Zook Justin M
Publication venue: Elsevier BV
Publication date: 27/04/2022

The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.

Lund University Publications

PubMed Central

precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions

Author: Ahsan Mian Umair
Arslan Elif
Baid Gunjan
Boja Emily
Bourgey Mathieu
Bourque Guillaume
Brown Richard
Brueffer Christian
Budak Gungor
Carroll Andrew
Catreux Severine
Chang Pi-Chuan
Chen Luoqi
Demirkaya-Budak Sinem
Dolgoborodov Alexey
DU YuanPing
Eveleigh Robert
Fang Li Tai
Feng Hanying
Flores Carlos
Goel Sidharth
Hung Calvin
Jain Amit
Jain Chirag
Jain Miten
Jain Varun
Johanson Elaine
Johnson Ivan J.
Jáspez David
Kabakci-Zorlu Duygu
Kalay Özem
Kolesnikov Alexey
Kyriakidis Konstantinos
Lajoie Bryan
Li Gen
Li Zhipan
Liu Qian
Lorenzo-Salazar José M.
MA ChouXian
Maier Ezekiel J.
Malousi Andigoni
McDaniel Jennifer
Mehio Rami
Mohiyuddin Marghoob
Morata Jordi
Muñoz-Barrera Adrián
Narcı Kübra
Nattestad Maria
Olson Nathan D.
Parra Genís
Paten Benedict
Pesout Trevor
Prasanna Anish G.
Roddey Cooper
Rubio-Rodríguez Luis A.
Ruehle Mike
Sahraeian Sayed Mohammad Ebrahim
Sedlazeck Fritz J.
Semenyuk Vladimir
Serang Omar
Shafin Kishwar
Stephens Sarah H.
Tang LinQi
Tetikol H. Serhat
Tonda Raúl
Trotta Jean-Rémi
Turgut Deniz
Wagner Justin
Wang Kai
Westreich Samuel T.
Yang Howard
Zhang ShaoWei
Zook Justin M
Publication venue: Cold Spring Harbor Laboratory
Publication date: 15/11/2020

The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods outperformed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.

Lund University Publications

Gaps and complex structurally variant loci in phased genome assemblies

Author: Abel Haley J
Antonacci-Fulton Lucinda L
Asri Mobin
Baid Gunjan
Baker Carl A
Belyaeva Anastasiya
Billis Konstantinos
Bourque Guillaume
Buonaiuto Silvia
Carroll Andrew
Chaisson Mark JP
Chang Pi-Chuan
Chang Xian H
Cheng Haoyu
Chu Justin
Cody Sarah
Colonna Vincenza
Consortium Human Pangenome Reference
Cook Daniel E
Cook-Deegan Robert M
Cornejo Omar E
Diekhans Mark
Doerr Daniel
Ebert Peter
Ebler Jana
Eichler Evan E
Eizenga Jordan M
Fairley Susan
Fedrigo Olivier
Felsenfeld Adam L
Feng Xiaowen
Fischer Christian
Flicek Paul
Formenti Giulio
Frankish Adam
Fulton Robert S
Gao Yan
Garg Shilpa
Garrison Erik
Garrison Nanibaa’ A
Giron Carlos Garcia
Green Richard E
Groza Cristian
Guarracino Andrea
Haggerty Leanne
Hall Ira M
Harvey William T
Hasenfeld Patrick
Haukness Marina
Haussler David
Heumos Simon
Hickey Glenn
Hoekzema Kendra
Hourlier Thibaut
Howe Kerstin
Jain Miten
Jarvis Erich D
Ji Hanlee P
Kenny Eimear E
Koenig Barbara A
Kolesnikov Alexey
Korbel Jan O
Kordosky Jennifer
Koren Sergey
Lee HoJoon
Lewis Alexandra P
Li Heng
Liao Wen-Wei
Lu Shuangjia
Lu Tsung-Yu
Lucas Julian K
Magalhães Hugo
Marco-Sola Santiago
Marijon Pierre
Markello Charles
Marschall Tobias
Martin Fergal J
McCartney Ann
McDaniel Jennifer
Miga Karen H
Mitchell Matthew W
Monlong Jean
Mountcastle Jacquelyn
Munson Katherine M
Mwaniki Moses Njagi
Nattestad Maria
Novak Adam M
Nurk Sergey
Paten Benedict
Porubsky David
Rozanski Allison N
Sanders Ashley D
Stober Catherine
Vollger Mitchell R
Publication venue: eScholarship, University of California
Publication date: 01/04/2023

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.

eScholarship - University of California

A draft human pangenome reference

Author: Abel Haley J.
Abou Tayoun Ahmad N.
Antonacci-Fulton Lucinda L.
Asri Mobin
Baid Gunjan
Baker Carl A.
Belyaeva Anastasiya
Billis Konstantinos
Bourque Guillaume
Buonaiuto Silvia
Carroll Andrew
Chaisson Mark J.P.
Chang Pi Chuan
Chang Xian H.
Cheng Haoyu
Chu Justin
Cody Sarah
Colonna Vincenza
Cook Daniel E.
Cook-Deegan Robert M.
Cornejo Omar E.
Diekhans Mark
Doerr Daniel
Ebert Peter
Ebler Jana
Eichler Evan E.
Eizenga Jordan M.
Fairley Susan
Fedrigo Olivier
Felsenfeld Adam L.
Feng Xiaowen
Fischer Christian
Flicek Paul
Formenti Giulio
Frankish Adam
Fulton Robert S.
Gao Yan
Garg Shilpa
Garrison Erik
Garrison Nanibaa’ A.
Giron Carlos Garcia
Green Richard E.
Groza Cristian
Guarracino Andrea
Haggerty Leanne
Hall Ira M.
Harvey William T.
Haukness Marina
Haussler David
Heumos Simon
Hickey Glenn
Hoekzema Kendra
Hourlier Thibaut
Howe Kerstin
Jain Miten
Jarvis Erich D.
Ji Hanlee P.
Kenny Eimear E.
Koenig Barbara A.
Kolesnikov Alexey
Korbel Jan O.
Kordosky Jennifer
Koren Sergey
Lee Ho Joon
Lewis Alexandra P.
Li Heng
Liao Wen Wei
Lu Shuangjia
Lu Tsung Yu
Lucas Julian K.
Magalhães Hugo
Marco-Sola Santiago
Marijon Pierre
Markello Charles
Marschall Tobias
Martin Fergal J.
McCartney Ann
McDaniel Jennifer
Miga Karen H.
Mitchell Matthew W.
Monlong Jean
Mountcastle Jacquelyn
Munson Katherine M.
Mwaniki Moses Njagi
Nattestad Maria
Novak Adam M.
Nurk Sergey
Olsen Hugh E.
Olson Nathan D.
Paten Benedict
Pesout Trevor
Phillippy Adam M.
Popejoy Alice B.
Porubsky David
Prins Pjotr
Puiu Daniela
Rautiainen Mikko
Regier Allison A.
Rhie Arang
Sacco Samuel
Sanders Ashley D.
Schneider Valerie A.
Schultz Baergen I.
Shafin Kishwar
Sibbesen Jonas A.
Sirén Jouni
Smith Michael W.
Sofia Heidi J.
Thibaud-Nissen Françoise
Tomlinson Chad
Tricomi Francesca Floriana
Villani Flavia
Vollger Mitchell R.
Wagner Justin
Walenz Brian
Wang Ting
Wood Jonathan M.D.
Zimin Aleksey V.
Zook Justin M.
Publication date: 01/01/2023

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

Online Research Database In Technology

A draft human pangenome reference.

Author: Abel Haley J.
Abou Tayoun Ahmad N.
Antonacci-Fulton Lucinda L.
Asri Mobin
Baid Gunjan
Baker Carl A.
Belyaeva Anastasiya
Billis Konstantinos
Bourque Guillaume
Buonaiuto Silvia
Carroll Andrew
Chaisson Mark J.P.
Chang Pi Chuan
Chang Xian H.
Cheng Haoyu
Chu Justin
Cody Sarah
Colonna Vincenza
Cook Daniel E.
Cook-Deegan Robert M.
Cornejo Omar E.
Diekhans Mark
Doerr Daniel
Ebert Peter
Ebler Jana
Eichler Evan E.
Eizenga Jordan M.
Fairley Susan
Fedrigo Olivier
Felsenfeld Adam L.
Feng Xiaowen
Fischer Christian
Flicek Paul
Formenti Giulio
Frankish Adam
Fulton Robert S.
Gao Yan
Garg Shilpa
Garrison Erik
Garrison Nanibaa’ A.
Giron Carlos Garcia
Green Richard E.
Groza Cristian
Guarracino Andrea
Haggerty Leanne
Hall Ira M.
Harvey William T.
Haukness Marina
Haussler David
Heumos Simon
Hickey Glenn
Hoekzema Kendra
Hourlier Thibaut
Howe Kerstin
Jain Miten
Jarvis Erich D.
Ji Hanlee P.
Kenny Eimear E.
Koenig Barbara A.
Kolesnikov Alexey
Korbel Jan O.
Kordosky Jennifer
Koren Sergey
Lee Ho Joon
Lewis Alexandra P.
Li Heng
Liao Wen Wei
Lu Shuangjia
Lu Tsung Yu
Lucas Julian K.
Magalhães Hugo
Marco-Sola Santiago
Marijon Pierre
Markello Charles
Marschall Tobias
Martin Fergal J.
McCartney Ann
McDaniel Jennifer
Miga Karen H.
Mitchell Matthew W.
Monlong Jean
Mountcastle Jacquelyn
Munson Katherine M.
Mwaniki Moses Njagi
Nattestad Maria
Novak Adam M.
Nurk Sergey
Olsen Hugh E.
Olson Nathan D.
Paten Benedict
Pesout Trevor
Phillippy Adam M.
Popejoy Alice B.
Porubsky David
Prins Pjotr
Puiu Daniela
Rautiainen Mikko
Regier Allison A.
Rhie Arang
Sacco Samuel
Sanders Ashley D.
Schneider Valerie A.
Schultz Baergen I.
Shafin Kishwar
Sibbesen Jonas A.
Sirén Jouni
Smith Michael W.
Sofia Heidi J.
Thibaud-Nissen Françoise
Tomlinson Chad
Tricomi Francesca Floriana
Villani Flavia
Vollger Mitchell R.
Wagner Justin
Walenz Brian
Wang Ting
Wood Jonathan M.D.
Zimin Aleksey V.
Zook Justin M.
Publication venue: eScholarship, University of California
Publication date: 01/01/2023

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

eScholarship - University of California

Diposit Digital de Documents de la UAB

Online Research Database In Technology