559 research outputs found

BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies

Author: Alkan Can
Hach Faraz
Numanagić Ibrahim
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)
Publication date: 01/01/2021

The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural parts, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while introducing further 8-24x speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications as far back as 90 million years.

Dagstuhl Research Online Publication Server

Short tandem repeats, segmental duplications, gene deletion, and genomic instability in a rapidly diversified immune gene family

Using population admixture to help complete maps of the human genome

Author: A Kong
A Sırmacı
AG Hinch
AL Price
Alkes L Price
Amelia M Lindgren
AP Reiner
Bogdan Pasaniuc
C Alkan
CA Winkler
Cynthia C Morton
D Botstein
D Reich
D Wegmann
DA Benson
David Reich
DM Church
DP Ryan
EE Eichler
ES Lander
G Golfier
Giulio Genovese
H Donis-Keller
H Lango Allen
H Li
H Stefansson
HA Taylor Jr.
HC Mefford
Heng Li
J Christiansen
J Martin
J Weissenbach
J Zhang
JA Bailey
James G Wilson
JC Venter
JI Kim
JK Pickrell
JM Kidd
JM Korn
JT Robinson
K Musunuru
Kimberly Chambert
M Guipponi
M Ruault
MA DePristo
Martin R Pollak
MF Seldin
MM Mahtani
MY Dennis
N Brunetti-Pierri
NA Doggett
Nicolas Altemose
PH Sudmant
R Li
R Lyle
RE Handsaker
Robert E Handsaker
RV Samonte
S Gnerre
S Kirsch
S Levy
Steven A McCarroll
X She
YS Ju
Publication venue: Springer Science and Business Media LLC
Publication date: 01/04/2013

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

eScholarship - University of California

Single haplotype assembly of the human genome from a hydatidiform mole

Author: Agarwala Richa
Church Deanna M.
Eichler Evan E.
Fulton Robert S.
Graves-Lindsay Tina A.
Huddleston John
Meltz Steinberg Karyn
Morgulis Aleksandr
Schneider Valerie A.
Shiryev Sergey A.
Surti Urvashi
Warren Wesley C.
Wilson Richard K.
Publication venue: Digital Commons@Becker
Publication date: 01/01/2014

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Inversion variants in human and primate genomes

Author: Antonacci Francesca
Archidiacono Nicoletta
Bitonto Miriana
Capozzi Oronzo
Catacchio Claudia Rita
D'Addabbo Pietro
Eichler Evan E
Maggiolini Flavia Angela Maria
Miroballo Mattia
Signorile Martina Lepore
Ventura Mario
Publication venue: Cold Spring Harbor Laboratory
Publication date: 01/01/2018

For many years, inversions have been proposed to be a direct driving force in speciation since they suppress recombination when heterozygous. Inversions are the most common large-scale differences among humans and great apes. Nevertheless, they represent large events easily distinguishable by classical cytogenetics, whose resolution, however, is limited. Here, we performed a genome-wide comparison between human, great ape, and macaque genomes using the net alignments for the most recent releases of genome assemblies. We identified a total of 156 putative inversions, between 103 kb and 91 Mb, corresponding to 136 human loci. Combining literature, sequence, and experimental analyses, we analyzed 109 of these loci and found 67 regions inverted in one or multiple primates, including 28 newly identified inversions. These events overlap with 81 human genes at their breakpoints, and seven correspond to sites of recurrent rearrangements associated with human disease. This work doubles the number of validated primate inversions larger than 100 kb, beyond what was previously documented. We identified 74 sites of errors, where the sequence has been assembled in the wrong orientation, in the reference genomes analyzed. Our data serve two purposes: First, we generated a map of evolutionary inversions in these genomes representing a resource for interrogating differences among these species at a functional level; second, we provide a list of misassembled regions in these primate genomes, involving over 300 Mb of DNA and 1978 human genes. Accurately annotating these regions in the genome references has immediate applications for evolutionary and biomedical studies on primates.

Multi-platform discovery of haplotype-resolved structural variation in human genomes

Author: Guryev Victor
Lansdorp Peter
Porubský David
Spierings Diana
Publication venue
Publication date: 23/09/2017

The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long- and short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent-child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,181 indel variants (<50 bp) and 31,599 structural variants (≥50 bp) per human genome, a sevenfold increase in structural variation compared to previous reports, including from the 1000 Genomes Project. We also discovered 156 inversions per genome, most of which previously escaped detection, as well as large unbalanced chromosomal rearrangements. We provide near-complete, haplotype-resolved structural variation for three genomes that can now be used as a gold standard for the scientific community, and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.

Proceedings - University of Groningen

Dissertations of the University of Groningen

The Genomes of Oryza sativa: A History of Duplications

Author: Bao Jingyue
Bu Dongbo
Cao Mengliang
Chen Chen
Chen Huan
Chen Peng
Cong Lijuan
Deng Yajun
Dong Lijun
Dong Lingli
Dong Wei
Fang Lijun
Fang Lin
Gao Lei
Geng Jianing
Han Yujun
Hao Bailin
He Ximiao
Hu Songnian
Hu Wei
Huang Haiyan
Huang Xiangang
Huang Yanqing
Ji Jia
Ji Zhendong
Jiao Yongzhi
Jin Jiao
Lei Meng
Lei Tingting
Li Changfeng
Li Dawei
Li Guangyuan
Li Haihong
Li Heng
Li Jinhong
Li Jun
Li Long
Li Na
Li Ruiqiang
Li Shengting
Li Shuangli
Li Shuting
Li Songgang
Li Wenjie
Li Xianran
Li Yuanzhe
Liang Xiaohu
Lin Liang
Lin Wei
Liu Bin
Liu Dongyuan
Liu Jinsong
Liu Juan
Liu Siqi
Lv Hong
McDermott Jason
Ni Peixiang
Qi Qiuhui
Ran Longhua
Ren Xiaoyu
Samudrala Ram
Shi Jianping
Shi Xiaoli
Su Zhixi
Sun Yongqiao
Tan Jianlong
Tian Xiangjun
Tong Wei
Tong Zongzhong
Wang Jian
Wang Jing
Wang Jingqiang
Wang Jun
Wang Lishun
Wang Wen
Wang Xiaoling
Wang Xiyin
Wei Haibin
Wei Shulin
Wong Gane Ka-Shu
Wu Qingfa
Wu Shuming
Xi Yan
Xiao Ying
Xu Hao
Xu Huayong
Xu Jingyi
Xu Zhao
Xu Zuyuan
Yang Huanming
Yang Li
Ye Chen
Ye Jia
Yin Jianning
Yu Hong
Yu Jun
Yu Yingpu
Yuan Longping
Zeng Changqing
Zhang Bing
Zhang Bo
Zhang Feng
Zhang Jianguo
Zhang Jingfen
Zhang Xiaowei
Zhang Yanling
Zhang Yong
Zhang Yunze
Zhang Zengjin
Zhang Zhenpeng
Zhao Caifeng
Zhao Wenming
Zheng Hongkun
Zheng Weimou
Zhou Jun
Zhou Yan
Zhuang Shulin
Publication venue: Public Library of Science
Publication date: 01/01/2005

We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000–40,000. Only 2%–3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.

FigShare

Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse

Author: A Morgulis
A Spiess
A Touré
A Valouev
A Varki
Ana C. Marques
B Charlesworth
B Oh
Brian Teague
Carol J. Bult
Chris P. Ponting
Christopher Churas
CM Laukaitis
D Kipling
D Nguyen
D Söderlund
Daniel Forrest
David C. Schwartz
Deanna M. Church
Donna Maglott
E Whitelaw
EC Salido
ER Liman
ET Dimalanta
Evan E. Eichler
FS Collins
H Huang
H Iida
H Skaletsky
IA Maksakova
J Eid
J Perry
J Ponjavic
J Rossant
JA Bailey
James Amos-Landgraf
JC Stevens
JC Venter
JHM Lammers
Jill Herschleb
JL Mueller
JM Young
Joshua L. Cherry
K Lindblad-Toh
KD Pruitt
Kerstin Lindblad-Toh
Konstantinos Potamousis
L Armengol
L Chittenden
L Goodstadt
LaDeana W. Hillier
Leo Goodstadt
LL Jacobs
LN Reynard
M Clamp
M Jackson
MF Bolliger
Michael C. Zody
Michael DiCuccio
Michael Place
MJ Justice
MM Abd El-Aziz
MT Ross
P Carninci
P Pevzner
Peter Meric
PJI Ellis
RA Gibbs
RD Emes
RD Martin
RH Waterston
Richa Agarwala
Richard J. Roberts
Ron Runnheim
S Aluru
S Dadé
S Griffiths-Jones
S Ohno
S Rouquier
S Tu
S Zhou
SC Grubb
SF Altschul
SG Gregory
Shiguo Zhou
Steve Goldstein
T Marques-Bonet
Tina Graves
TJ Hudson
TJ Nicholas
TS Mikkelsen
WF Dietrich
WJ Murphy
Wratko Hlavina
X She
Xinwei She
Y Okazaki
Yuri Kapustin
Z Birtle
Ze Cheng
Zoë Birtle
Publication venue: Public Library of Science
Publication date: 01/01/2009

A finished clone-based assembly of the mouse genome reveals extensive sequence duplication during recent evolution and rodent-specific expansion of certain gene families. Newly assembled duplications contain protein-coding genes that are mostly involved in reproductive function.

CiteSeerX

Cold Spring Harbor Laboratory Institutional Repository

Directory of Open Access Journals

Oxford University Research Archive

Reference genome and comparative genome analysis for the WHO reference strain for Mycobacterium bovis BCG Danish, the present tuberculosis vaccine

Author: Borgers Katlyn
Callewaert Nico
Festjens Nele
Lin Yao-Cheng
Michielsen Gitte
Ou Jheng-Yang
Plets Evelyn
Tiels Petra
Van Hecke Annelies
Zheng Po-Xing
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

Background: Mycobacterium bovis bacillus Calmette-Guerin (M. bovis BCG) is the only vaccine available against tuberculosis (TB). In an effort to standardize vaccine production, three substrains, i.e. BCG Danish 1331, Tokyo 172-1 and Russia BCG-1, were established as the WHO reference strains. Reference genomes exist for both BCG Tokyo 172-1 and Russia BCG-1, but not for BCG Danish. In this study, we set out to determine the completely assembled genome sequence for BCG Danish and to establish a workflow for genome characterization of engineering-derived vaccine candidate strains. Results: By combining second (Illumina) and third (PacBio) generation sequencing in an integrated genome analysis workflow for BCG, we could construct the completely assembled genome sequence of BCG Danish 1331 (07/270) (and an engineered derivative that is studied as an improved vaccine candidate, a SapM KO), including the resolution of the analytically challenging long duplication regions. We report the presence of a DU1-like duplication in BCG Danish 1331, while this tandem duplication was previously thought to be exclusively restricted to BCG Pasteur. Furthermore, comparative genome analyses of publicly available data for BCG substrains showed the absence of a DU1 in certain BCG Pasteur substrains and the presence of a DU1-like duplication in some BCG China substrains. By integrating publicly available data, we provide an update to the genome features of the commonly used BCG strains. Conclusions: We demonstrate how this analysis workflow enables the resolution of genome duplications and of the genome of engineered derivatives of the BCG Danish vaccine strain. The BCG Danish WHO reference genome will serve as a reference for future engineered strains and the established workflow can be used to enhance BCG vaccine standardization.