70 research outputs found

    More efficient manual review of automatically transcribed tabular data

    Machine learning methods have proven useful in transcribing historical data. However, results from even highly accurate methods require manual verification and correction. Such manual review can be time-consuming and expensive, so the objective of this paper was to make it more efficient. Previously, we used machine learning to transcribe 2.3 million handwritten occupation codes from the Norwegian 1950 census with high accuracy (97%). We manually reviewed the 90,000 (3%) codes with the lowest model confidence. We allocated those 90,000 codes to human reviewers, who used our annotation tool to review them. To assess reviewer agreement, some codes were assigned to multiple reviewers. We then analyzed the review results to understand the relationship between accuracy improvements and effort. Additionally, we interviewed the reviewers to improve the workflow. The reviewers corrected 62.8% of the labels and agreed with the model label in 31.9% of cases. About 0.2% of the images could not be assigned a label, while for 5.1% the reviewers were uncertain or assigned an invalid label. 9,000 images were independently reviewed by multiple reviewers, resulting in an agreement of 86.43% and a disagreement of 8.96%. We learned that our automatic transcription is biased towards the most frequent codes, with a higher degree of misclassification for the lowest-frequency codes. Our interview findings show that the reviewers did internal quality control and found our custom tool well-suited. We conclude that only one reviewer is needed, but they should report uncertainty. Comment: 19 pages, 5 figures, 1 table
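    The selection step described in this abstract (sending the lowest-confidence 3% of predictions to human reviewers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_for_review`, the list-of-floats input, and the rounding rule are all assumptions made for the example.

```python
def select_for_review(confidences, fraction=0.03):
    """Return the indices of the lowest-confidence predictions.

    `confidences` is a sequence of model confidence scores, one per
    transcribed code. `fraction` is the share of items to send to
    manual review (3% in the abstract; the name is hypothetical).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_review = round(len(confidences) * fraction)
    # Rank item indices from least to most confident.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:n_review]


if __name__ == "__main__":
    # Toy data: 100 predictions with strictly decreasing confidence.
    scores = [1 - i / 100 for i in range(100)]
    print(select_for_review(scores))
```

    Ranking by confidence rather than applying a fixed threshold keeps the review workload predictable, which matters when reviewer time is the main cost.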

    Occode: an end-to-end machine learning pipeline for transcription of historical population censuses

    Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end machine learning pipeline that scales to the dataset size, and a model that achieves high accuracy with few manual transcriptions. In addition, the correctness of the model results must be verified. This paper describes our lessons learned developing, tuning, and using the Occode end-to-end machine learning pipeline for transcribing 7.3 million rows with handwritten occupation codes in the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our result matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned are useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-code
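    One simple way to compare the code distribution of the transcribed result with that of the training data is the total variation distance between the two relative-frequency tables. The abstract does not say which comparison the authors used, so the sketch below is only an assumed illustration; the function names and the toy occupation codes are hypothetical.

```python
from collections import Counter


def relative_frequencies(codes):
    """Map each distinct code to its share of the (non-empty) list."""
    if not codes:
        raise ValueError("codes must not be empty")
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(codes_a, codes_b):
    """Return a value in [0, 1]: 0 for identical distributions,
    1 for distributions that share no codes."""
    freq_a = relative_frequencies(codes_a)
    freq_b = relative_frequencies(codes_b)
    all_codes = set(freq_a) | set(freq_b)
    return 0.5 * sum(
        abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in all_codes
    )


if __name__ == "__main__":
    training = ["111", "111", "111", "234"]   # toy, made-up codes
    predicted = ["111", "234"]
    print(total_variation_distance(training, predicted))
```

    A value close to zero would suggest the predicted codes follow the training distribution; how small is small enough is a project-specific decision.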

    Implicating genes, pleiotropy, and sexual dimorphism at blood lipid loci through multi-ancestry meta-analysis

    Publisher Copyright: © 2022, The Author(s). Background: Genetic variants within nearly 1000 loci are known to contribute to modulation of blood lipid levels. However, the biological pathways underlying these associations are frequently unknown, limiting understanding of these findings and hindering downstream translational efforts such as drug target discovery. Results: To expand our understanding of the underlying biological pathways and mechanisms controlling blood lipid levels, we leverage a large multi-ancestry meta-analysis (N = 1,654,960) of blood lipids to prioritize putative causal genes for 2286 lipid associations using six gene prediction approaches. Using phenome-wide association (PheWAS) scans, we identify relationships of genetically predicted lipid levels to other diseases and conditions. We confirm known pleiotropic associations with cardiovascular phenotypes and determine novel associations, notably with cholelithiasis risk. We perform sex-stratified GWAS meta-analysis of lipid levels and show that 3–5% of autosomal lipid-associated loci demonstrate sex-biased effects. Finally, we report 21 novel lipid loci identified on the X chromosome. Many of the sex-biased autosomal and X chromosome lipid loci show pleiotropic associations with sex hormones, emphasizing the role of hormone regulation in lipid metabolism. Conclusions: Taken together, our findings provide insights into the biological mechanisms through which associated variants lead to altered lipid levels and potentially cardiovascular disease risk. Peer reviewed

    Implicating genes, pleiotropy, and sexual dimorphism at blood lipid loci through multi-ancestry meta-analysis

    Funding: GMP, PN, and CW are supported by NHLBI R01HL127564. GMP and PN are supported by R01HL142711. AG acknowledges support from the Wellcome Trust (201543/B/16/Z), European Union Seventh Framework Programme FP7/2007–2013 under grant agreement no. HEALTH-F2-2013–601456 (CVGenes@Target) & the TriPartite Immunometabolism Consortium [TrIC]-Novo Nordisk Foundation’s Grant number NNF15CC0018486. JMM is supported by American Diabetes Association Innovative and Clinical Translational Award 1–19-ICTS-068. SR was supported by the Academy of Finland Center of Excellence in Complex Disease Genetics (Grant No 312062), the Finnish Foundation for Cardiovascular Research, the Sigrid Juselius Foundation, and University of Helsinki HiLIFE Fellow and Grand Challenge grants. EW was supported by the Finnish innovation fund Sitra (EW) and Finska Läkaresällskapet. CNS was supported by American Heart Association Postdoctoral Fellowships 15POST24470131 and 17POST33650016. Charles N Rotimi is supported by Z01HG200362. Zhe Wang, Michael H Preuss, and Ruth JF Loos are supported by R01HL142302. NJT is a Wellcome Trust Investigator (202802/Z/16/Z), is the PI of the Avon Longitudinal Study of Parents and Children (MRC & WT 217065/Z/19/Z), is supported by the University of Bristol NIHR Biomedical Research Centre (BRC-1215–2001) and the MRC Integrative Epidemiology Unit (MC_UU_00011), and works within the CRUK Integrative Cancer Epidemiology Programme (C18281/A19169). Ruth E Mitchell is a member of the MRC Integrative Epidemiology Unit at the University of Bristol funded by the MRC (MC_UU_00011/1). Simon Haworth is supported by the UK National Institute for Health Research Academic Clinical Fellowship. Paul S. de Vries was supported by American Heart Association grant number 18CDA34110116. Julia Ramierz acknowledges support by the People Programme of the European Union’s Seventh Framework Programme grant n° 608765 and Marie Sklodowska-Curie grant n° 786833. Maria Sabater-Lleal is supported by a Miguel Servet contract from the ISCIII Spanish Health Institute (CP17/00142) and co-financed by the European Social Fund. Jian Yang is funded by the Westlake Education Foundation. Olga Giannakopoulou has received funding from the British Heart Foundation (BHF) (FS/14/66/3129). CHARGE Consortium cohorts were supported by R01HL105756. Study-specific acknowledgements are available in the Additional file 32: Supplementary Note. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services. Peer reviewed

    First Genome-Wide Association Study of Latent Autoimmune Diabetes in Adults Reveals Novel Insights Linking Immune and Metabolic Diabetes

    OBJECTIVE: Latent autoimmune diabetes in adults (LADA) shares clinical features with both type 1 and type 2 diabetes; however, there is ongoing debate regarding the precise definition of LADA. Understanding its genetic basis is one potential strategy to gain insight into appropriate classification of this diabetes subtype. RESEARCH DESIGN AND METHODS: We performed the first genome-wide association study of LADA in case subjects of European ancestry versus population control subjects (n = 2,634 vs. 5,947) and compared against both case subjects with type 1 diabetes (n = 2,454 vs. 968) and type 2 diabetes (n = 2,779 vs. 10,396). RESULTS: The leading genetic signals were principally shared with type 1 diabetes, although we observed positive genetic correlations genome-wide with both type 1 and type 2 diabetes. Additionally, we observed a novel independent signal at the known type 1 diabetes locus harboring PFKFB3, encoding a regulator of glycolysis and insulin signaling in type 2 diabetes and inflammation and autophagy in autoimmune disease, as well as an attenuation of key type 1-associated HLA haplotype frequencies in LADA, suggesting that these are factors that distinguish childhood-onset type 1 diabetes from adult autoimmune diabetes. CONCLUSIONS: Our results support the need for further investigations of the genetic factors that distinguish forms of autoimmune diabetes as well as more precise classification strategies. Peer reviewed

    Automated Coding of Historical Danish Cause of Death Data Using String Similarity

    The study of causes of death has been central to some of the most influential studies of the modern mortality decline in the nineteenth and twentieth centuries. The digitization of individual-level cause-of-death data has been game-changing; however, the data present a major challenge: how do we code the thousands of unique strings for analysis in an efficient way? This paper aims to see how far we can get with automated coding based on string similarity. We do this by applying a Jaro-Winkler string similarity algorithm in Python (pyjarowinkler) that codes our cause-of-death data from the Copenhagen Burial Register 1861-1911 to DK1875, a contemporary coding and classification system from nineteenth-century Denmark. We then compare the performance of the algorithm to that of a manual (historian) coder in three different ways: at the level of each unique cause-of-death string, at the level of each cause-of-death group, and for the overall cause-of-death pattern for all burials in Copenhagen 1861-1911. Our results show that a minimum-effort algorithm coded approximately half of the causes of death correctly compared to the manually coded dataset. This means that the method applied here is not accurate enough to use for actual data analysis of mortality patterns, as it is not possible to examine individual causes within larger causal groups. However, the results are promising for different uses of the method as an aid to the manual coder. A way forward could be to use cut-off points of the Jaro-Winkler scores, coding only those causes where the string similarity match is relatively certain, or to use the automated method to catch most of the initial cases of a certain disease with a very fixed phrasing, such as cancer. In both cases, the remainder of the unique cause-of-death strings could then be coded by a manual coder.
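    The cut-off idea mentioned at the end of this abstract can be sketched as below: each raw string receives the code of its most similar reference term only when the Jaro-Winkler score reaches a threshold, and is otherwise left for the manual coder. The abstract names the pyjarowinkler package, but to stay self-contained this sketch uses its own pure-Python Jaro-Winkler rather than that package's interface. The function `code_cause`, the 0.9 cut-off, and the two-entry codebook with codes "C01" and "T01" are hypothetical and are not taken from DK1875.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def code_cause(raw, codebook, cutoff=0.9):
    """Return (code, score); code is None when the best match is below cutoff."""
    text = raw.strip().lower()
    best_term, best_score = None, 0.0
    for term in codebook:
        score = jaro_winkler(text, term.lower())
        if score > best_score:
            best_term, best_score = term, score
    if best_term is not None and best_score >= cutoff:
        return codebook[best_term], best_score
    return None, best_score  # leave this string for the manual coder


if __name__ == "__main__":
    codebook = {"cancer": "C01", "tuberculosis": "T01"}  # hypothetical codes
    for cause in ["Cancer", "tuberkulosis", "drowned"]:
        print(cause, code_cause(cause, codebook))
```

    Raising the cut-off trades coverage for precision: fewer strings are coded automatically, but those that are coded are more likely to be correct.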

    The causal role of smoking on the risk of hip or knee replacement due to primary osteoarthritis: a Mendelian randomisation analysis of the HUNT study

    Objective: Smoking has been associated with a reduced risk of hip and knee osteoarthritis (OA) and subsequent joint replacement. The aim of the present study was to assess whether the observed association is likely to be causal. Method: 55,745 participants of a population-based cohort were genotyped for the rs1051730 C > T single-nucleotide polymorphism (SNP), a proxy for smoking quantity among smokers. A Mendelian randomization analysis was performed using rs1051730 as an instrument to evaluate the causal role of smoking on the risk of hip or knee replacement (combined as total joint replacement (TJR)). Association between rs1051730 T alleles and TJR was estimated by hazard ratios (HRs) and 95% confidence intervals (CIs). All analyses were adjusted for age and sex. Results: Smoking quantity (no. of cigarettes) was inversely associated with TJR (HR 0.97, 95% CI 0.97–0.98). In the Mendelian randomization analysis, rs1051730 T alleles were associated with reduced risk of TJR among current smokers (HR 0.84, 95% CI 0.76–0.98, per T allele); however, we found no evidence of association among former (HR 0.97, 95% CI 0.88–1.07) and never smokers (HR 0.97, 95% CI 0.89–1.06). Neither adjusting for body mass index (BMI) and cardiovascular disease (CVD) nor accounting for the competing risk of mortality substantially changed the results. Conclusion: This study suggests that smoking may be causally associated with the reduced risk of TJR. Our findings add support to the inverse association found in previous observational studies. More research is needed to further elucidate the underlying mechanisms of this causal association. © 2017. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
