9 research outputs found

    Tools and annotations for variation

    Since the completion of the Human Genome Project, many next-generation sequencing (NGS), or high-throughput sequencing, platforms have emerged. One application of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored either in large central databases such as dbSNP or in gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases, with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and performance assessment of such prediction methods. We studied quality-related aspects of variant databases and benchmark datasets. The online tool VariOtator was developed to aid the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality, and using an ontology for variant annotation will help improve it. BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. Where available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed a wide spectrum of both, and an uneven distribution of protein variants across the BTK domains. The VariSNP database, which contains datasets of neutral (non-pathogenic) variants, was generated by selecting variants from dbSNP and filtering out those found in the ClinVar, PhenCode and SwissProt databases, since variants in these three databases are considered to be disease-related. VariSNP contains 13 datasets following the functional classification of dbSNP and is updated on a regular basis. To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON-P2. Large differences were found in the proportions of harmful, benign and unknown variants, along with distinctive patterns of SAAS types in both the original and the variant amino acids. Representativeness is one quality aspect of variation benchmark datasets; it relates to how well a dataset represents the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets and in the neutral datasets. At the higher levels of the CATH and EC classifications all datasets were unbiased, but at the lower levels and for the other features all datasets were biased.
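
The phrase "all possible single amino acid substitutions" in the abstract above can be made concrete with a minimal Python sketch. It assumes the 20 standard amino acids and enumerates the 19 alternatives at each position. The function name `all_saas` and the toy sequence are illustrative assumptions, not taken from the thesis, and the pathogenicity prediction step (PON-P2) is not reproduced here.

```python
# The 20 standard amino acids in one-letter code (assumption: no
# non-standard residues are considered).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_saas(protein_seq):
    """Yield (position, original, variant) for every single amino acid
    substitution of a protein sequence. Positions are 1-based."""
    for pos, original in enumerate(protein_seq, start=1):
        for variant in AMINO_ACIDS:
            if variant != original:
                yield pos, original, variant


# Toy three-residue sequence, purely for illustration.
substitutions = list(all_saas("MKT"))
print(len(substitutions))  # 3 residues x 19 alternatives each
```

Each protein of length n therefore contributes 19 × n candidate substitutions to be scored by a predictor.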

    Computational analysis of human genomic variants and lncRNAs from sequence data

    High-throughput sequencing technologies have been developed and applied to human genome studies for nearly 20 years. These technologies have enabled numerous research applications and have significantly expanded our knowledge of the human genome. In this thesis, computational methods that use sequence data to study human genomic variants and transcripts were evaluated and developed. Indels (insertions and deletions) are two common types of genomic variants that are widespread in the human genome. Detecting indels in human genomes is a crucial step in diagnosing indel-related genomic disorders and may identify novel indel markers for studying certain diseases. Compared with previous techniques, high-throughput sequencing technologies, especially next-generation sequencing (NGS), allow indels to be detected accurately and efficiently across wide ranges of the genome. In the first part of the thesis, tools with indel calling abilities are evaluated with an assortment of indels and different NGS settings. The results show that the choice of tools and NGS settings significantly affects indel detection, which provides guidance for tool selection and future development. In bioinformatics analysis, an indel's position can be marked inconsistently on the reference genome, so that one indel has different but equivalent representations, which causes problems for downstream analysis. This problem is related to the complex sequence context of indels, for example short tandem repeats (STRs), where the same short stretch of nucleotides is repeated. In the second part of the thesis, a novel computational tool, VarSCAT, is described, which has various functions for annotating the sequence context of variants, including ambiguous positions, STRs, and other sequence context features. 
Analysis of several high-confidence human variant sets with VarSCAT reveals that a large number of genomic variants, especially indels, have sequence features associated with STRs. In the human genome, not all genes and their transcripts are translated into proteins; long non-coding ribonucleic acid (lncRNA) is a typical example. Sequence recognition based on machine learning models has improved significantly in recent years. In the last part of the thesis, several machine learning-based lncRNA prediction tools were evaluated on their predictions of the coding potential of transcripts. The results suggest that tools based on deep learning identify lncRNAs best.
Finnish abstract, "Ihmisen genomivarianttien ja lncRNA:iden laskennallinen analyysi sekvenssiaineistosta" (translated): High-throughput sequencing technologies have been developed and applied to human genome research for nearly 20 years. These technologies have enabled wide-ranging study of the human genome and significantly increased our knowledge of it. In this dissertation, computational methods that use sequence data were evaluated and developed for studying human genomic variants and transcripts. Indel is a collective term for insertion and deletion variants, which occur throughout the genome. Identifying indels is crucial for diagnosing genetic abnormalities and for finding new indel markers associated with various diseases. Compared with earlier technologies, high-throughput sequencing technologies, especially next-generation sequencing (NGS), allow indels to be detected more accurately and efficiently across wider genomic regions. In the first part of the dissertation, computational tools for indel calling were evaluated using a wide assortment of indels and different NGS settings. The results showed that the choice of tools and NGS settings significantly affected indel detection, and they can therefore guide tool selection and development. In bioinformatic analysis, the position of the same indel can be marked at different places on the reference genome, which can cause problems for downstream analysis such as the evaluation of indel calls. This problem is related to sequence context, because variants can lie within short tandem repeats (STRs), where the same short nucleotide stretch is repeated. In the second part of the dissertation, the computational tool VarSCAT was developed; its functions include examining ambiguous positions, flanking regions and STR regions. Analysis of high-confidence human variant sets with VarSCAT revealed that the features of many genetic variants, especially indels, are associated with STR regions. Not all human genes and their gene products are translated into protein; long non-coding RNAs (lncRNAs) are one example. Machine learning methods and sequence recognition have improved considerably in recent years. In the last part of the dissertation, the predictions of several machine learning-based lncRNA prediction tools were evaluated. The results suggest that tools based on deep learning identify lncRNAs best.
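
The ambiguous indel positions mentioned in the abstract above can be illustrated with a short Python sketch. It takes a deletion inside a repeat and slides it left and right for as long as the resulting sequence stays identical, which gives the full range of equivalent start positions. This is a generic illustration of the idea and not VarSCAT's implementation; the function name `equivalent_deletion_range`, the 0-based coordinates and the toy reference sequence are assumptions made for the example.

```python
def equivalent_deletion_range(ref, start, length):
    """Return (leftmost, rightmost) 0-based start positions at which deleting
    `length` bases from `ref` yields the same sequence as deleting
    ref[start:start + length]."""
    # Shift left: the deletion can move one base left when the base entering
    # the window on the left equals the base leaving it on the right.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    # Shift right: the mirror-image condition.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def apply_deletion(ref, start, length):
    """Delete `length` bases from `ref` beginning at 0-based `start`."""
    return ref[:start] + ref[start + length:]


# Toy reference with a CA dinucleotide repeat (illustrative only).
ref = "GCACACAT"
left, right = equivalent_deletion_range(ref, 3, 2)
print(left, right)
# Every start position in the range produces the same altered sequence.
print({apply_deletion(ref, s, 2) for s in range(left, right + 1)})
```

Reporting the whole range, or always normalising to one end of it, is what lets two call sets that placed the same indel differently be compared.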

    Understanding and visualising the variation in HLA

    The HLA region is a segment of the human genome containing immune system genes, which orchestrate the defence against infection. A key aspect distinguishing the HLA genes from other genes in the human genome is their extensive variation. This variation increases the depth and breadth of the weaponry used against pathogens, and helps impede the spread of infection within families and communities. Analysing and understanding these high levels of variation gives an insight into how the human immune system has developed and into the breadth of variation seen in the human population. This is possible using data generated over the last twenty years from the sequencing of millions of individuals across the world and deposited in public databases. These data give an unprecedented opportunity to compare more than 15,000 sequences and to distinguish the aspects of the variation that are important for immune function from those that are not. This thesis examines methods used to assess variation and develops new methods to both catalogue and visualise the data. Initial analysis focusses on the antigen recognition domain of the HLA class I genes and is then expanded to the full sequence of both the HLA class I and II genes. Non-human primate orthologs of HLA are also analysed. The analysis reveals extensive variation at both the nucleotide and protein levels and identifies mechanisms responsible for generating it. This allows evolutionary lineages to be identified, along with a minimum set of 42 core HLA class I alleles and 47 HLA class II alleles. It also allows the total number of HLA alleles in the worldwide population to be estimated, with the potential for 2-3 million alleles of a single gene in HLA class I and up to 1.7 million in HLA class II.

    Contribution to the management of gastrointestinal tumors in dogs and cats

    Locus specific databases: integrating sequence, structural and clinical analysis in relation to disorders of the coagulation and complement systems.

    Following the success of the Human Genome Mutation Database, increasing numbers of locus-specific mutation databases are being compiled in order to present more detailed descriptions of the mutations in specific proteins and diseases. However, until now, many coagulation or complement disorders have had no up-to-date, accurate repository of information. Structural analyses of disease-associated mutations can provide important new tools for identifying the underlying biochemical defect of a disease. Type I mutations may affect the correct folding, secretion, expression or degradation of the protein, whereas Type II mutations can disrupt the catalytic or substrate-binding functional site of the protein. Alternatively, a mutation may not be the causative factor for disease. Combining sequence and structural information to propose consensus domain structures can help unravel these effects of mutations. Central repositories that combine structural, sequence and phenotypic information on mutations and proteins can facilitate the diagnosis and understanding of the associated diseases, and define the molecular consequences of disease. This thesis describes interactive web databases of genotypic, phenotypic, clinical and structural information on mutations associated with complement Factor H (FH) and coagulation Factor XI (FXI). Mutations within FH are associated with aHUS (atypical Haemolytic Uraemic Syndrome), MPGN (Membranoproliferative Glomerulonephritis) and AMD (Age-related Macular Degeneration), whereas mutations within FXI are associated with a FXI deficiency bleeding disorder. The methods used within the FH and FXI databases were extended and combined with a database management system to design a mutation database for the five Vitamin K dependent serine proteases of coagulation, in order to study the effects of mutations within conserved domains. The database management system incorporates new tools designed to automatically scan full-length references for text describing mutations. This significantly reduces the time and expertise required to maintain the databases.
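
The abstract above does not say how its tools scan references for mutation descriptions. As a rough illustration of the general idea only, the following Python sketch uses regular expressions to pick out two common notations: protein-level substitutions written with three-letter amino acid codes and cDNA-level single-base substitutions. The patterns, the function name `find_mutations` and the example sentence are all hypothetical and are not taken from the thesis.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Protein-level substitution such as "Arg100Cys" or "p.Arg100Cys".
PROTEIN_SUB = re.compile(rf"\b(?:p\.)?(?:{AA3})\d+(?:{AA3})\b")
# cDNA-level single-base substitution such as "c.298C>T".
CDNA_SUB = re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b")


def find_mutations(text):
    """Return mutation-like strings found in `text`: protein-level matches
    first, then cDNA-level matches, each in order of appearance."""
    hits = [m.group(0) for m in PROTEIN_SUB.finditer(text)]
    hits += [m.group(0) for m in CDNA_SUB.finditer(text)]
    return hits


# Invented sentence, for illustration only.
sentence = "The patient carried Arg100Cys (c.298C>T) in the catalytic domain."
print(find_mutations(sentence))
```

A production curation tool would need far broader coverage (one-letter codes, deletions, insertions, frameshifts, legacy numbering), so patterns like these are only a starting point for manual review.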