117 research outputs found

    Algorithms for the description of molecular sequences

    Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for extracting HGVS descriptions from two DNA sequences. Our algorithm can compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time, and the resulting descriptions are relatively small. Additional applications include updating the contents of gene variant databases and reference sequence liftovers. Next, we adapted our method to extract descriptions for protein sequences, in particular for describing frame-shifted variants. We propose an addition to the HGVS nomenclature to accommodate the (complex) frame-shifted variants that can be described with our method. Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant that can be added to the existing HGVS nomenclature. The final chapter takes an explorative approach to classification in large cohort studies. We provide a "cross-sectional" investigation of these data to see the relative power of the different groups. Algorithms and the Foundations of Software technolog
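    The idea of describing an observed sequence relative to a reference can be illustrated with a toy sketch. This is not the algorithm proposed in the thesis: it assumes a single change, trims the common prefix and suffix, and emits only the body of an HGVS-like description (no reference accession or "g." prefix). The function name describe_variant is hypothetical, and duplications, inversions, repeats and multiple variants are deliberately not handled.

```python
def describe_variant(ref: str, obs: str) -> str:
    """Toy HGVS-like description of a single change (assumption: one variant only)."""
    if ref == obs:
        return "="  # no change

    # Trim the longest common prefix.
    p = 0
    while p < min(len(ref), len(obs)) and ref[p] == obs[p]:
        p += 1

    # Trim the longest common suffix that does not overlap the prefix.
    s = 0
    while s < min(len(ref), len(obs)) - p and ref[-1 - s] == obs[-1 - s]:
        s += 1

    deleted = ref[p:len(ref) - s]
    inserted = obs[p:len(obs) - s]
    first, last = p + 1, len(ref) - s  # 1-based, inclusive reference positions

    if len(deleted) == 1 and len(inserted) == 1:
        return f"{first}{deleted}>{inserted}"  # substitution
    span = f"{first}" if first == last else f"{first}_{last}"
    if not inserted:
        return f"{span}del"  # deletion
    if not deleted:
        # An insertion needs a flanking base on both sides in this sketch.
        if p == 0 or p == len(ref):
            raise ValueError("insertion at a sequence end is not handled")
        return f"{p}_{p + 1}ins{inserted}"  # insertion between p and p+1
    return f"{span}delins{inserted}"  # deletion-insertion
```

    For example, describe_variant("GGCACACATT", "GGCACATT") returns "7_8del". Trimming the prefix first places the deletion at the last repeat unit, which happens to agree with the HGVS convention of shifting variants towards the 3' end.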

    Mutalyzer 2: next generation HGVS nomenclature checker

    Motivation: Unambiguous variant descriptions are of utmost importance in clinical genetic diagnostics, scientific literature and genetic databases. The Human Genome Variation Society (HGVS) publishes a comprehensive set of guidelines on how variants should be correctly and unambiguously described. We present the implementation of the Mutalyzer 2 tool suite, designed to apply the HGVS guidelines automatically so that users can check and correct their variant descriptions without having to deal with the HGVS intricacies explicitly. Results: Mutalyzer is widely used by the community, having processed over 133 million descriptions since its launch. Over a five-year period, Mutalyzer reported a correct input in ~50% of cases. In 41% of cases either a syntactic or semantic error was identified, and in ~7% of cases Mutalyzer was able to correct the description automatically. Molecular Technology and Informatics for Personalised Medicine and Healt

    Computational analysis of human genomic variants and lncRNAs from sequence data

    High-throughput sequencing technologies have been developed and applied to human genome studies for nearly 20 years. These technologies have enabled numerous research applications and have significantly expanded our knowledge of the human genome. In this thesis, computational methods that use sequence data to study human genomic variants and transcripts were evaluated and developed. Indels (insertions and deletions) are two common types of genomic variant that are widespread in the human genome. Detecting indels is a crucial step in diagnosing indel-related genomic disorders and may identify novel indel markers for studying certain diseases. Compared with previous techniques, high-throughput sequencing technologies, especially next-generation sequencing (NGS), make it possible to detect indels accurately and efficiently across wide ranges of the genome. In the first part of the thesis, tools with indel-calling abilities are evaluated with an assortment of indels and different NGS settings. The results show that the choice of tool and of NGS settings significantly affects indel detection, which provides guidance for tool selection and future development. In bioinformatics analysis, an indel's position can be marked inconsistently on the reference genome, so that one indel has different but equivalent representations; this causes trouble for downstream analysis, such as the evaluation of indel calls. The problem is related to the complex sequence context of indels, for example short tandem repeats (STRs), where the same short stretch of nucleotides is repeated. In the second part of the thesis, a novel computational tool, VarSCAT, is described; it has various functions for annotating the sequence context of variants, including ambiguous positions, STRs, and other sequence context features. Analysis of several high-confidence human variant sets with VarSCAT reveals that a large number of genomic variants, especially indels, have sequence features associated with STRs. In the human genome, not all genes and their transcripts are translated into proteins; long non-coding ribonucleic acid (lncRNA) is a typical example. Sequence recognition built on machine learning models has improved significantly in recent years. In the last part of the thesis, several machine learning-based lncRNA prediction tools were evaluated on their predictions of the coding potential of transcripts. The results suggest that tools based on deep learning identify lncRNAs best.
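    The ambiguity of indel positions inside tandem repeats can be made concrete with a minimal sketch. It is not taken from VarSCAT; the function name deletion_shift_range and the 0-based coordinates are assumptions made for illustration. Given a deletion, it finds the leftmost and rightmost start positions that produce exactly the same resulting sequence.

```python
from typing import Tuple


def deletion_shift_range(ref: str, start: int, length: int) -> Tuple[int, int]:
    """Return the (leftmost, rightmost) 0-based start positions at which deleting
    `length` bases gives the same sequence as deleting ref[start:start + length]."""
    # Shifting the window one base left keeps the result unchanged when the
    # base entering on the left equals the base leaving on the right.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shifting one base right keeps the result unchanged when the base
    # leaving on the left equals the base entering on the right.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    return left, right


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Delete `length` bases from `ref` beginning at 0-based `start`."""
    return ref[:start] + ref[start + length:]
```

    For the reference GGCACACATT, deleting one CA unit at start 4 gives deletion_shift_range("GGCACACATT", 4, 2) == (2, 6): every start position from 2 to 6 yields the same sequence GGCACATT, which is why callers that report different positions can still describe the same indel.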

    Detection, prioritization and analysis of variants of unknown significance in familial breast cancer genes

    Currently, Molecular Diagnostics Laboratories in Ontario sequence coding and adjacent intronic regions of BRCA1 and BRCA2 in patients with a family history of breast cancer. At LHSC it is estimated that ~15% of patients have BRCA1 or BRCA2 variants of clinical significance and ~15-20% of patients have variants of unknown clinical significance (VUS), while the remaining patients have variants of no clinical significance, making patient prognosis difficult to ascertain. To elucidate VUS and improve deleterious variant detection, my study has three aims: 1) assess the effects of VUS on splicing using bioinformatics and transfection assays; 2) investigate the limitations of routine BRCA1 and BRCA2 sequencing in deleterious variant detection, and expand that detection by sequencing seven breast cancer-associated genes in 21 familial breast cancer patients; and 3) prioritize detected variants in silico for effects on splicing, transcription factor binding, mRNA structure, miRNA binding and amino acids.

    Detecting PKD1 variants in polycystic kidney disease patients by single-molecule long-read sequencing

    A genetic diagnosis of autosomal-dominant polycystic kidney disease (ADPKD) is challenging due to allelic heterogeneity, high GC content, and homology of the PKD1 gene with six pseudogenes. Short-read next-generation sequencing approaches, such as whole-genome sequencing and whole-exome sequencing, often fail to characterize complex regions such as PKD1 reliably. However, long-read single-molecule sequencing has been shown to be an alternative strategy that could overcome PKD1 complexities and discriminate between homologous regions of PKD1 and its pseudogenes. In this study, we present the increased power of resolution for complex regions using long-read sequencing to characterize a cohort of 19 patients with ADPKD. Our approach provided high sensitivity in identifying PKD1 pathogenic variants, diagnosing 94.7% of the patients. We show that direct long-read sequencing now makes it possible to screen ADPKD patients reliably in a single test, without interference from PKD1 homologous sequences, which is commonly introduced by residual amplification of PKD1 pseudogenes. This strategy can be implemented in diagnostics and is highly suitable for sequencing and resolving complex genomic regions of clinical relevance.

    Critical points for an accurate human genome analysis


    The use of sequencing technologies for enhanced understanding of molecular determinants in renal diseases

    Renal diseases have a high impact on the economy of every health care system worldwide. In addition, patient numbers have been steadily increasing over the past decades, with more than 500,000 new cases of end-stage renal disease (ESRD) worldwide every year. ESRD is the final stage of chronic kidney disease (CKD), whose leading causes are diabetes and hypertension, as well as glomerulonephritis, urolithiasis, autosomal dominant polycystic kidney disease (ADPKD), and progression of acute kidney injury (AKI), among others. However, in many cases the mechanisms by which these diseases affect the kidney and its function are poorly understood, or the diseases are difficult to diagnose. Within this study, we used newer technologies, methodologies, and data analysis approaches to shed light on the pathomechanisms of CKD and AKI and, moreover, to potentially improve the diagnostic value of already existing diagnostic assays (e.g. for ADPKD). In the past years, advances in DNA sequencing technologies have revolutionized the field of clinical research and diagnostics. High-throughput sequencing such as next-generation sequencing (NGS) is used because of its high quality and accuracy when analysing DNA samples. Other sequencing technologies have also shown their value, such as long-read sequencing, which is used because of its longer sequencing reads and its accuracy in resolving low-complexity sequences, such as repetitive regions or regions with high GC content. Within the scope of this thesis, we applied several cutting-edge sequencing approaches to clinical research on renal disease in order to: 1. Improve the diagnostic value of already existing diagnostic assays for ADPKD. ADPKD is an inherited disease that accounts for 5% to 10% of ESRD. However, screening of the main ADPKD gene, PKD1, is challenging because of its multi-exon structure, allelic heterogeneity, high homology with six PKD1 pseudogenes, and extremely high GC content. Using direct long-read sequencing, we showed that ADPKD diagnostics without interference from PKD1 homologous sequences is possible. 2. Characterize the expression profile of AKI and the underlying mechanisms using RNA sequencing. Patients undergoing major surgery may develop AKI, which has been associated with higher mortality risk, reduced renal function, and a high risk of progression to CKD. Some evidence points to the tubular system being at the centre of this pathophysiology and of the subsequent recovery. However, the factors involved in this recovery are still poorly understood. In this context, we characterized renal messenger RNA expression profiles of AKI in wild-type and Gdf15 knockout constructs. Gdf15 was found to be associated with lower tubular damage, suggesting a protective role in the kidney. We identified 89 transcription factors that potentially drive the response mechanisms in AKI, as well as 13 further transcription factors possibly linked to the protective mechanisms of Gdf15. 3. Evaluate the practical boundaries of RNA sequencing for characterizing glomerular diseases in CKD. The analysis of renal biopsies is very informative for determining a patient's glomerular disease stage and progression rate. However, fresh frozen biopsies are limited or non-existent compared to the more abundant formalin-fixed, paraffin-embedded (FFPE) renal biopsies. FFPE tissue can easily be stored for long periods of time, allowing large sample archives with many years of clinical data collection and follow-up. We showed that characterizing glomerular disease expression profiles by RNA sequencing from FFPE samples is possible. However, our data suggest that the required number of glomeruli in cross sections may be higher than the number of glomeruli present in a usual renal biopsy. Finally, we elaborated on the future impact of the results obtained in the context of clinical research, and on their value for the understanding or diagnosis of renal diseases.

    Cohort-based association study of germline genetic variants with acute and chronic health complications of childhood cancer and its treatment: Genetic Risks for Childhood Cancer Complications Switzerland (GECCOS) study protocol

    INTRODUCTION: Childhood cancer and its treatment may lead to various health complications. The related impairment in quality of life, excess deaths and accumulated healthcare costs are substantial. Genetic variations are thought to contribute to the wide inter-individual variability of complications but have only rarely been used to risk-stratify treatment and follow-up care. This study aims to identify germline genetic variants associated with acute and late complications of childhood cancer. METHODS AND ANALYSIS: The Genetic Risks for Childhood Cancer Complications Switzerland (GECCOS) study is a nationwide cohort study. Eligible are patients and survivors who were diagnosed with childhood cancer or Langerhans cell histiocytosis before the age of 21 years, have been registered in the Swiss Childhood Cancer Registry (SCCR) since 1976, and have consented to the Paediatric Biobank for Research in Haematology and Oncology, Geneva, host of the national Germline DNA Biobank Switzerland for Childhood Cancer and Blood Disorders (BISKIDS). GECCOS uses demographic and clinical data from the SCCR and the associated Swiss Childhood Cancer Survivor Study. Clinical outcome data consist of organ function testing, health conditions diagnosed by physicians, second primary neoplasms and self-reported information from participants. Germline genetic samples and sequencing data are collected in BISKIDS. We will perform association analyses, primarily using whole-exome or whole-genome sequencing, to identify genetic variants associated with specified health conditions. We will use clustering and machine-learning techniques and assess multiple health conditions in different models. DISCUSSION: GECCOS will improve knowledge of germline genetic variants associated with childhood cancer-associated health conditions and help to further individualise cancer treatment and follow-up care, potentially resulting in improved efficacy and reduced side effects. ETHICS AND DISSEMINATION: The Geneva Cantonal Commission for Research Ethics has approved the GECCOS study. Research findings will be disseminated through national and international conferences, publications in peer-reviewed journals and in lay language online.