
    Efficient algorithms for analyzing segmental duplications with deletions and inversions in genomes

    Background: Segmental duplications, or low-copy repeats, are common in mammalian genomes. In the human genome, most segmental duplications are mosaics composed of multiple duplicated fragments. This complex genomic organization complicates analysis of the evolutionary history of these sequences. One model proposed to explain these mosaic patterns is repeated aggregation and subsequent duplication of genomic sequences. Results: We describe a polynomial-time exact algorithm to compute duplication distance, a genomic distance defined as the most parsimonious way to build a target string by repeatedly copying substrings of a fixed source string. This distance models the process of repeated aggregation and duplication. We also describe extensions of this distance to include certain types of substring deletions and inversions. Finally, we provide a description of a sequence of duplication events as a context-free grammar (CFG). Conclusion: These new genomic distances will permit more biologically realistic analyses of segmental duplications in genomes.
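To make the idea of building a target string from copied substrings concrete, here is a minimal sketch of a deliberately simplified variant. It is not the algorithm from the abstract above: it assumes copies may only be appended to the end of the growing string, whereas a model that lets copies be inserted inside the string can need fewer operations. The count returned here is therefore only an upper bound on such a distance. The function name `append_only_copy_count` is hypothetical.

```python
def append_only_copy_count(source: str, target: str) -> int:
    """Fewest substrings of `source` whose concatenation equals `target`.

    Simplifying assumption (not from the abstract): each copy is appended
    to the end of the string built so far. Because every prefix of a
    substring of `source` is itself a substring, greedily taking the
    longest possible match at each step gives the minimum count.
    """
    count = 0
    i = 0
    while i < len(target):
        # Extend the match while target[i : i + length + 1] occurs in source.
        length = 0
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        if length == 0:
            raise ValueError(f"character {target[i]!r} does not occur in source")
        i += length
        count += 1
    return count


if __name__ == "__main__":
    # "abc" followed by "ab": two copy operations.
    print(append_only_copy_count("abcde", "abcab"))
```

The repeated `in` checks keep the sketch short at the cost of speed; a suffix automaton over the source would be the usual choice for long sequences.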

    Genome maps across 26 human populations reveal population-specific patterns of structural variation.

    Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome

    Finding genomic differences from whole-genome assemblies using SyRI

    Genomic differences can range from single nucleotide differences (SNPs) to large complex structural rearrangements. Current methods can typically annotate sequence differences such as SNPs and large indels accurately, but they do not unravel the full complexity of structural rearrangements, which include inversions, translocations, and duplications. Structural rearrangements involve changes in location, orientation, or copy number between highly similar sequences and have been reported to be associated with several biological differences between organisms. However, they are still rarely studied with sequencing technologies because identifying them accurately remains challenging. Here I present SyRI, a novel computational method for genome-wide identification of structural differences using the pairwise comparison of whole-genome chromosome-level assemblies. SyRI uses a unique approach in which it first identifies all syntenic (structurally conserved) regions between two genomes. Since all non-syntenic regions are structural rearrangements by definition, this transforms the difficult problem of rearrangement identification into the comparatively easier problem of rearrangement classification. SyRI analyses the location, orientation, and copy number of alignments between rearranged regions and selects the alignments that best represent the putative rearrangements and result in the highest total alignment score between the genomes. Next, SyRI searches for sequence differences and distinguishes them by whether they reside in syntenic or rearranged regions. This distinction is important, as rearranged regions (and sequence differences within them) do not follow the Mendelian Law of Segregation and are therefore inherited differently from syntenic regions. Using SyRI, I successfully identified rearrangements in human, A. thaliana, yeast, fruit fly, and maize genomes. Further, I experimentally validated 92% (108/117) of the predicted translocations in A. thaliana using a genetic approach.

    PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data

    Paired-End Mapper (PEMer) enables mapping of genomic structural variants at considerably enhanced sensitivity, specificity and resolution over previous approaches

    Structural Variation Discovery and Genotyping from Whole Genome Sequencing: Methodology and Applications: A Dissertation

    A comprehensive understanding of how genetic variants and mutations contribute to phenotypic variations and alterations requires experimental technologies and analytical methodologies that can detect genetic variants and mutations from various biological samples in a timely and accurate manner. High-throughput sequencing technology represents the latest achievement in a series of efforts to facilitate genetic variant discovery and genotyping, and it promises to transform the way we tackle healthcare and biomedical problems. The tremendous amount of data generated by this new technology, however, needs to be processed and analyzed accurately and efficiently in order to fully harness its potential. Structural variation (SV) encompasses a wide range of genetic variations of different sizes, generated by diverse mechanisms. Due to the technical difficulties of reliably detecting SVs, their characterization lags behind that of SNPs and indels. In this dissertation I present two novel computational methods: one for detecting transposable element (TE) transpositions and the other for detecting SVs in general using a local assembly approach. Both methods are able to pinpoint breakpoint junctions at single-nucleotide resolution and estimate variant allele frequencies in the sample. I also apply these methods to study the impact of TE transpositions on genomic stability, the inheritance patterns of TE insertions in the population, and the molecular mechanisms and potential functional consequences of somatic SVs in cancer genomes.

    Understanding The Complexity of Human Structural Genomic Variation Through Multiple Whole Genome Sequencing Platforms

    Genomic structural variants (SVs) are major sources of genome diversity and are closely related to human health, as indicated by numerous studies. In spite of recent advances in sequencing technology and discovery methodology, a considerable number of variants in the genome are still partially or completely misinterpreted. This thesis focuses on comprehensively interpreting the structural variants in human genomes by accurately defining the locations and formats of variants using different sequencing platforms. To accomplish this goal, I developed a randomized iterative approach to define all types of SVs, which has shown superior performance in accurately defining complex variants. Next, I built a recurrence-based validation pipeline to systematically validate SVs with long read sequences. I conclude with a systematic integration of SVs in multiple individuals discovered by various short-read-based detection algorithms, with supporting evidence from orthogonal technologies, which presents the most comprehensive SV map of the human genome to date and the best that current technologies allow us to do.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138462/1/xuefzhao_1.pd

    Improving CNV detection from short-read MPS data in neuromuscular disorders

    Neuromuscular disorders (NMD) are highly heterogenic with around 1000 reported different subtypes. Most are genetic in origin, and some 500 genes are currently identified to cause NMDs. Massively parallel sequencing (MPS) approaches have been widely used to increase the cost-effectiveness and diagnostic yield in the work-up of the genetic molecular diagnosis and to speed up the process. Copy number variants (CNVs), deletions and duplications larger than 50 base pairs, explain approximately 10% of the Mendelian disorders. No best practices pipelines have been developed yet for CNV analysis from MPS data. Therefore, the detection and verification of CNV findings has often involved complementary methods, such as array comparative genomic hybridization (array CGH), multiplex ligation-dependent probe amplification (MLPA) and quantitative PCR approaches. Recently, various CNV detection programs have been developed, but for widely different types of designated research settings, which complicates choosing the correct approach for NMDs. These individual programs have generally exhibited less than ideal sensitivity and specificity for CNV detection. Our aim was to develop a comprehensive pipeline for the detection and annotation of CNVs with high accuracy from targeted gene panel sequencing and whole exome sequencing (WES) data of patients with NMDs. Four different CNV analysis programs were chosen for this study: CoNIFER, XHMM, ExomeDepth and CODEX. The targeted gene panel MYOcap includes 349 genes for myopathic disorders and MNDcap 302 genes for neurogenic disorders in their current panel versions. 2359 samples were sequenced with MYOcap, 942 samples with MNDcap and 262 samples with WES. This included for the targeted gene panels 24 positive control samples with previously characterized CNVs and 31 negative control samples with certain genes verified to not have CNVs. A detection sensitivity of 100% and specificity of 100% were reached for these control samples. 
Previously undetected CNVs from MYOcap or MNDcap sequenced samples were verified as true positive detections in 36 cases with MLPA, PCR or array CGH, and eight CNVs were verified as false positive detections. These and the positive control samples were used to validate a predictive logistic regression model. In silico CNV generation into MYOcap sequenced samples provided 18,677 specific and 3892 unspecific CNV detections to initially train the model. The model was trained to differentiate true positive detections from false positive detections in order to increase the specificity of the CNV detection pipeline. The advantage of using all four CNV detection programs together, compared to using them individually or in any other combination, was demonstrated by the CNV detection sensitivity on the set of in silico CNVs. The predictive model with variables from all four programs provided the highest sensitivity (96.6%) and specificity (87.5%) for predicting CNV detections correctly, indicating an accuracy of 95.5% (95% CI 87.3–99.1%). The CNV detection pipeline together with the predictive model was validated for WES data using control samples with 235 previously characterized CNVs. For CNVs spanning at least three exons, the detection sensitivity was 97.3% and the sensitivity of the predictive model was 99.3% after adjusting the model threshold for WES data. The CNV annotation platform cnvScan was expanded to contain the most recent CNV population databases as well as in-house CNV databases for all the sequenced sample sets. CNV detection results were filtered by three criteria: < 1% frequency at 90% reciprocal overlap in the common CNV population databases; that same criterion together with < 5% frequency at 50% reciprocal overlap in the in-house CNV database; and a true positive prediction by the model. These procedures significantly decreased the workload of evaluating the CNVs further for clinical significance, with 3–13% of the original CNV detections preserved. 
The added value, i.e. the additional diagnostic yield from CNVs for both the targeted gene panel sequenced samples and WES samples was estimated to be 1.9%. Altogether 39 final genetic diagnoses were solved with these CNV findings. In addition, 18 patient cases had a likely pathogenic finding, and five had a heterozygous CNV likely pathogenic for a recessive disease without association to the patient’s phenotype. The clarified cases included six different DMD deletions or duplications causing dystrophinopathies. In three sequenced familial cases, the detected CNVs in CACNA1A, SGCD and TTN genes co-segregated with the disease. One case had two separate genetic diseases, tibial muscular dystrophy (TMD) and BMD, caused by the founder mutation FINmaj in the gene TTN and a deletion in DMD. Some of the solved cases had novel findings: the second ever reported large intragenic deletion in NEB causing dominant disease, and the first CNV, an intragenic deletion, in TIA1 in a patient diagnosed with Welander distal myopathy (WDM). Some of the genes associated with NMDs are challenging to analyze from short-read sequencing data due to homology or repetitive regions. An additional script was thus written to differentiate copy numbers of the highly homologous genes, SMN1 and SMN2. Two SMN1/SMN2 copy number 0/3 control cases were successfully recognized, and five cases were identified with a possible exon 7 conversion in SMN1 and a compatible spinal muscular atrophy phenotype. The latter findings were considered likely pathogenic and are awaiting further validation on the genomic level. Comparison of CNV detections within the in-house CNV database revealed divergences in the CNV detections within the triplicate repetitive region of NEB with potentially clinically significant changes. One array CGH validated change correlated well with the nemaline rod pathology observed in the patient. 
CNV analysis utilizing MPS data from targeted gene panels and WES samples provided increased diagnostic yield, as also reported in other studies on NMDs. Our multi-algorithm, multi-platform approach decreased the workload in variant analysis and provided more insight into the many difficult-to-analyze genomic regions involved in NMDs. In the future, whole genome sequencing and long-read sequencing will likely provide higher resolution for CNV detection and reveal an even wider spectrum of structural genomic variants, together with other emerging comprehensive methods, such as optical mapping.
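The frequency filter in the abstract above relies on reciprocal overlap between a CNV call and database entries. The following is a minimal sketch of that idea, not the thesis pipeline or the cnvScan implementation. It assumes half-open integer intervals on a single chromosome and database entries given as `(start, end, frequency)` tuples; the function names and the default thresholds (taken from the 1% frequency and 90% overlap figures quoted above) are illustrative.

```python
def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Smaller of the two fractions: overlap / length(a) and overlap / length(b).

    Intervals are assumed half-open [start, end) on the same chromosome.
    """
    len_a = a_end - a_start
    len_b = b_end - b_start
    if len_a <= 0 or len_b <= 0:
        raise ValueError("intervals must have positive length")
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / len_a, overlap / len_b)


def keep_call(call, db_entries, max_freq=0.01, min_ro=0.90):
    """Keep a call unless a sufficiently overlapping database entry is common.

    `call` is a hypothetical (start, end) tuple and `db_entries` a list of
    hypothetical (start, end, frequency) tuples.
    """
    start, end = call
    for db_start, db_end, freq in db_entries:
        matches = reciprocal_overlap(start, end, db_start, db_end) >= min_ro
        if matches and freq >= max_freq:
            return False  # matches a common population CNV, so filter it out
    return True


if __name__ == "__main__":
    database = [(100, 200, 0.05), (1000, 5000, 0.002)]
    print(keep_call((100, 200), database))    # matches a common entry
    print(keep_call((1000, 5000), database))  # matches only a rare entry
```

Requiring the overlap to be large relative to both intervals prevents a small call from matching a much larger database entry that merely contains it.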

    Detection of copy number variants in sequencing data.

    In this work, a program for detecting CNVs in sequencing data based on depth of coverage (DOC) was implemented in C++ (copyDOC). The individual pipeline steps (acquisition of DOC signals in windows, event calling, and merging) are implemented using generic programming techniques that enable the future integration of other algorithms into the pipeline. Furthermore, a testing environment, the copySim platform, was implemented for testing and evaluating different algorithms. CopyDOC was successfully applied to synthetic and real data using constant-sized windows. Dynamic windows that adapt to the local mappability of the sequence are implemented in the pipeline but could not be tested in this work. They might be advantageous in datasets that contain uniquely mapped reads. However, CNVs have been shown to be overrepresented in segmental duplications (Nguyen et al. 2006; Cooper et al. 2007), and with a general exclusion of multireads those CNVs might be difficult to ascertain. In the application of copyDOC to a 1000 Genomes dataset, the overlap of predicted variants was considerably higher using multireads than using uniquely mapped reads. Thus there is a requirement for tools that can handle multireads. Further improvements of copyDOC might be made to the CNV calling algorithm and the merging step. For example, the program workflow could be tested with a direct comparison of the DOC signals in two datasets via log ratios instead of applying a t-test to the DOC signals in the two datasets. CopyDOC and copySim could be used as a platform for the implementation and evaluation of further CNV detection algorithms.
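As an illustration of the log-ratio comparison suggested at the end of this abstract, here is a minimal sketch in Python. It is not copyDOC's C++ implementation. It assumes that read start positions are already available as integers, uses read counts per constant-sized window as a proxy for depth of coverage, and normalizes each dataset by its total read count. The function names and the pseudocount are hypothetical choices.

```python
import math


def window_counts(read_starts, genome_length, window_size):
    """Count reads per constant-sized window, a simple proxy for DOC."""
    n_windows = math.ceil(genome_length / window_size)
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window_size] += 1
    return counts


def log2_ratios(sample_counts, control_counts, pseudocount=0.5):
    """Per-window log2 ratio of library-size-normalized counts.

    The pseudocount (an assumption of this sketch) avoids taking the log of
    zero or dividing by zero in empty windows.
    """
    if len(sample_counts) != len(control_counts):
        raise ValueError("both datasets must use the same windows")
    sample_total = sum(sample_counts) or 1
    control_total = sum(control_counts) or 1
    ratios = []
    for s, c in zip(sample_counts, control_counts):
        s_norm = (s + pseudocount) / sample_total
        c_norm = (c + pseudocount) / control_total
        ratios.append(math.log2(s_norm / c_norm))
    return ratios


if __name__ == "__main__":
    sample = window_counts([0, 5, 10, 15, 25], genome_length=30, window_size=10)
    control = window_counts([1, 11, 12, 21, 22], genome_length=30, window_size=10)
    print(sample, control)
    print(log2_ratios(sample, control))
```

Windows with a clearly positive ratio are candidate gains and clearly negative ones candidate losses. Choosing thresholds and merging adjacent windows into events would be separate steps.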