48 research outputs found

    SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data

    There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Diseases) and HPO (Human Phenotype Ontology). This data curation is usually time-consuming and performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon-delimited format) to a target coding system (uploaded as an Excel spreadsheet, or in OWL (Web Ontology Language) or OBO (Open Biomedical Ontologies) format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA's applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects.
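
    The n-gram shortlisting step described in this abstract can be illustrated with a minimal sketch. It assumes character bigrams and a Dice-style similarity score; the abstract does not specify SORTA's actual scoring or its Lucene configuration, and the function names and the codes CODE_A to CODE_C are hypothetical.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of character bigrams per lowercased word, padded with ^ and $."""
    grams = Counter()
    for word in text.lower().split():
        padded = f"^{word}$"
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram multisets, expressed as a percentage."""
    ga, gb = bigrams(a), bigrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 100.0 * 2 * shared / total


def shortlist(value: str, codes: dict, top_n: int = 3) -> list:
    """Rank target codes by similarity of their labels to a free-text value."""
    scored = [(similarity(value, label), code, label) for code, label in codes.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


# Hypothetical target coding system (assumption, not real MET codes).
CODES = {"CODE_A": "running", "CODE_B": "swimming", "CODE_C": "cycling"}

if __name__ == "__main__":
    for score, code, label in shortlist("runing", CODES):
        print(f"{score:5.1f}  {code}  {label}")
```

    In this sketch a human curator would still pick the final code from the ranked shortlist, which mirrors the semi-automatic workflow the abstract describes.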

    Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration

    BACKGROUND: To gain statistical power or to allow fine mapping, researchers typically want to pool data before meta-analyses or genotype imputation. However, the necessary harmonization of genetic datasets is currently error-prone because of the many different file formats and a lack of clarity about which genomic strand is used as reference. FINDINGS: Genotype Harmonizer (GH) is a command-line tool to harmonize genetic datasets by automatically solving issues concerning genomic strand and file format. GH solves the unknown strand issue by aligning ambiguous A/T and G/C SNPs to a specified reference, using linkage disequilibrium patterns without prior knowledge of the used strands. GH supports many common GWAS/NGS genotype formats including PLINK, binary PLINK, VCF, SHAPEIT2 & Oxford GEN. GH is implemented in Java and a large part of the functionality can also be used as the Java 'Genotype-IO' API. All software is open source under license LGPLv3 and available from http://www.molgenis.org/systemsgenetics. CONCLUSIONS: GH can be used to harmonize genetic datasets across different file formats and can be easily integrated as a step in routine meta-analysis and imputation pipelines.
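
    To show why A/T and G/C SNPs are singled out in this abstract, the following minimal sketch classifies a SNP against a reference by allele comparison alone. It is not GH's method: GH resolves the ambiguous cases with linkage disequilibrium patterns, which this sketch does not attempt, and the function names are hypothetical.

```python
# Watson-Crick complement of each nucleotide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """True for A/T and G/C SNPs, whose alleles look identical on both strands."""
    return COMPLEMENT[allele1] == allele2


def classify_against_reference(alleles: tuple, ref_alleles: tuple) -> str:
    """Return 'ambiguous', 'same', 'flip' or 'mismatch' for a biallelic SNP."""
    if is_strand_ambiguous(*alleles):
        # Allele comparison cannot decide the strand; extra evidence is needed.
        return "ambiguous"
    observed, reference = set(alleles), set(ref_alleles)
    if observed == reference:
        return "same"
    if {COMPLEMENT[a] for a in observed} == reference:
        return "flip"
    return "mismatch"


if __name__ == "__main__":
    print(classify_against_reference(("A", "G"), ("T", "C")))
    print(classify_against_reference(("A", "T"), ("A", "T")))
```

    Only the 'ambiguous' cases require the additional evidence that the abstract attributes to linkage disequilibrium; the other cases can be decided from the alleles themselves.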

    CAPICE: a computational method for Consequence-Agnostic Pathogenicity Interpretation of Clinical Exome variations

    Exome sequencing is now mainstream in clinical practice. However, identification of pathogenic Mendelian variants remains time-consuming, in part because the limited accuracy of current computational prediction methods requires manual classification by experts. Here we introduce CAPICE, a new machine-learning-based method for prioritizing pathogenic variants, including SNVs and short InDels. CAPICE outperforms the best general (CADD, GAVIN) and consequence-type-specific (REVEL, ClinPred) computational prediction methods, for both rare and ultra-rare variants. CAPICE is easily added to diagnostic pipelines as a pre-computed score file or command-line software, or through the online MOLGENIS web service with API. CAPICE is free and open source (LGPLv3) and can be downloaded at https://github.com/molgenis/capice.

    Proficiency testing of virus diagnostics based on bioinformatics analysis of simulated in silico high-throughput sequencing data sets

    Quality management and independent assessment of high-throughput sequencing-based virus diagnostics have not yet been established as a mandatory approach for ensuring comparable results. The sensitivity and specificity of viral high-throughput sequence data analysis are highly affected by bioinformatics processing using publicly available and custom tools and databases and thus differ widely between individuals and institutions. Here we present the results of the COMPARE [Collaborative Management Platform for Detection and Analyses of (Re-) emerging and Foodborne Outbreaks in Europe] in silico virus proficiency test. An artificial, simulated in silico data set of Illumina HiSeq sequences was provided to 13 different European institutes for bioinformatics analysis to identify viral pathogens in high-throughput sequence data. Comparison of the participants’ analyses shows that the use of different tools, programs, and databases for bioinformatics analyses can impact the correct identification of viral sequences from a simple data set. The identification of slightly mutated and highly divergent virus genomes has been shown to be most challenging. Furthermore, the interpretation of the results, together with a fictitious case report, by the participants showed that in addition to the bioinformatics analysis, the virological evaluation of the results can be important in clinical settings. External quality assessment and proficiency testing should become an important part of validating high-throughput sequencing-based virus diagnostics and could improve the harmonization, comparability, and reproducibility of results. There is a need for the establishment of international proficiency testing, like that established for conventional laboratory tests such as PCR, for bioinformatics pipelines and the interpretation of such results

    BiobankUniverse: Automatic matchmaking between datasets for biobank data discovery and integration

    Motivation: Biobanks are indispensable for large-scale genetic/epidemiological studies, yet it remains difficult for researchers to determine which biobanks contain data matching their research questions. Results: To overcome this, we developed a new matching algorithm that identifies pairs of related data elements between biobanks and research variables with high precision and recall. It integrates lexical comparison, Unified Medical Language System ontology tagging and semantic query expansion. The result is BiobankUniverse, a fast matchmaking service for biobanks and researchers. Biobankers upload their data elements and researchers upload their desired study variables; BiobankUniverse then automatically shortlists matching attributes between them. Users can quickly explore matching potential and search for biobanks/data elements matching their research. They can also curate matches and define personalized data-universes.

    Modulation of Androgen Receptor Signaling in Hormonal Therapy-Resistant Prostate Cancer Cell Lines

    Background: Prostate epithelial cells depend on androgens for survival and function. In (early) prostate cancer (PCa) androgens also regulate tumor growth, which is exploited by hormonal therapies in metastatic disease. The aim of the present study was to characterize the androgen receptor (AR) response in hormonal therapy-resistant PC346 cells and identify potential disease markers. Methodology/Principal Findings: Human 19K oligoarrays were used to establish the androgen-regulated expression profile of androgen-responsive PC346C cells and its derivative therapy-resistant sublines: PC346DCC (vestigial AR levels), PC346Flu1 (AR overexpression) and PC346Flu2 (T877A AR mutation). In total, 107 transcripts were differentially-expressed in PC346C and derivatives after R1881 or hydroxyflutamide stimulations. The AR-regulated expression profiles reflected the AR modifications of respective therapy-resistant sublines: AR overexpression resulted in stronger and broader transcriptional response to R1881 stimulation, AR down-regulation correlated with deficient response of AR-target genes and the T877A mutation resulted in transcriptional response to both R1881 and hydroxyflutamide. This AR-target signature was linked to multiple publicly available cell line and tumor derived PCa databases, revealing that distinct functional clusters were differentially modulated during PCa progression. Differentiation and secretory functions were up-regulated in primary PCa but repressed i

    Bypass Mechanisms of the Androgen Receptor Pathway in Therapy-Resistant Prostate Cancer Cell Models

    Background: Prostate cancer is initially dependent on androgens for survival and growth, making hormonal therapy the cornerstone treatment for late-stage tumors. However, despite initial remission, the cancer will inevitably recur. The present study was designed to investigate how androgen-dependent prostate cancer cells eventually survive and resume growth under androgen-deprived and antiandrogen-supplemented conditions. As a model system, we used the androgen-responsive PC346C cell line and its therapy-resistant sublines: PC346DCC, PC346Flu1 and PC346Flu2. Methodology/Principal Findings: Microarray technology was used to analyze differences in gene expression between the androgen-responsive and therapy-resistant PC346 cell lines. Microarray analysis revealed 487 transcripts differentially expressed between the androgen-responsive and the therapy-resistant cell lines. Most of these genes were common to all three therapy-resistant sublines and only a minority (<5%) was androgen-regulated. Pathway analysis revealed enrichment in functions involving cellular movement, cell growth and cell death, as well as association with cancer and reproductive system disease. PC346DCC expressed residual levels of androgen receptor (AR) and showed significant down-regulation of androgen-regulated genes (p-value = 10^-27). Up-regulation of the VAV3 and TWIST1 oncogenes and repression of the DKK3 tumor suppressor were observed in PC346DCC, suggesting a potential AR bypass mechanism. Subsequent validation of these three genes in patient samples confirmed that expression was deregulated during prostate cancer progression.