97 research outputs found

    Meta Matters - Enriching & Exploiting Your Metadata

    Introduction
    Data is nothing without context: if you don't know how, when or why a variable was gathered, it's nigh impossible to draw conclusions from it. This presentation discusses different sorts of metadata and how they can be gathered, stored, and used to enrich data, drawing examples from our biobank.
    Objectives and Approach
    Each data item has two types of metadata: variable-level and value-level. For example, consider a questionnaire. The variable-level metadata covers each question: exact wording, validation rules for the answers, etc. The value-level metadata covers each individual answer: details of the questioner, date and time of response, and so on. We also have database-level metadata: datasets which list every dataset or every field in the database. While some of this information needs to be gathered alongside the data itself, much can be extracted or imputed from results or documentation. We present some generalizable examples.
    Results
    Like any other data, metadata is only worth having if you're using it. We will present principles and examples of applications that we have developed for it:
    • Data management – deriving useful variables and tables, and helping to make your data easier to parse, extract, and validate.
    • Presentation – making your data more human-readable by labelling variables and decoding values.
    • Documentation – metadata tables make ideal repositories for granular institutional knowledge about your data: known issues, potential pitfalls, or explanations for missing values.
    • Analysis – identifying which metadata variables are most valuable for analysts, and how best to provide them.
    • Automation – using the metadata to generate code that can automatically produce summary statistics, tables, graphs… and more metadata!
    Conclusion/Implications
    Every dataset comes with some metadata. When examined and built upon, it can deepen understanding of the data within, as well as becoming a powerful resource in its own right.
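The "Presentation" application above — labelling variables and decoding values from a metadata table — can be sketched in a few lines. This is a minimal illustration, not the biobank's actual system; the metadata table, variable names and codes below are all hypothetical.

```python
# Hypothetical variable-level metadata: a display label for each variable,
# plus a decoder mapping stored codes to human-readable values.
METADATA = {
    "sex":    {"label": "Sex of participant", "values": {1: "Male", 2: "Female"}},
    "smoker": {"label": "Current smoker",     "values": {0: "No", 1: "Yes"}},
}

def decode_record(record):
    """Replace coded values with human-readable labels using the metadata table."""
    decoded = {}
    for var, value in record.items():
        meta = METADATA.get(var)
        if meta is None:
            decoded[var] = value  # no metadata for this variable: pass through
        else:
            decoded[meta["label"]] = meta["values"].get(value, value)
    return decoded

print(decode_record({"sex": 2, "smoker": 0}))
# {'Sex of participant': 'Female', 'Current smoker': 'No'}
```

Because the decoding logic is driven entirely by the table, extending it to new variables means editing metadata rather than code — which is the point the abstract makes about automation.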

    Validation & Vindication – Comparing Electronic Health Records with Hospital Notes

    Introduction
    No doubt your Electronic Health Records have been meticulously gathered, imported, validated and standardised. However, if you want to be certain that they are an accurate representation of reality, you can't beat physically going to hospitals and cross-checking their records against yours. Our biobank did exactly this.
    Objectives and Approach
    Our validation exercise encompassed all reported cases in our follow-up data of three key conditions: stroke, heart disease, and cancer. Key data about each hospitalisation were extracted and exported to tablet computers running custom software. Our staff then visited each hospital in this dataset seeking the corresponding medical notes, and collected additional data from those that they found, including photographs of key documents. These results were then adjudicated by specialist physicians to determine the accuracy of the diagnosis and to identify disease phenotypes of interest. Finally, all these results were merged back into our follow-up data.
    Results
    Not only was gathering the data a huge logistical and technical challenge; integrating it back into the database presented its own difficulties. Our initial plan was to assign each sought event a status of 'validated', 'corrected' or 'unfound'. However, this proved inadequate for addressing the complexities of the data, as we will discuss, with examples. Our solution was initially to treat the retrieved hospital notes as simply another source of follow-up data. We were thus able to use our existing systems for validating, standardising and aggregating events, and so produce validated endpoints that were meaningfully comparable to our reported endpoints. We could then implement and test definitions of the required validation statuses at a participant level for each disease of interest.
    Conclusion/Implications
    This validation project was a huge and daunting undertaking, but it repaid our investment with proof that our Electronic Health Records were generally very reliable, and also with much richer data about disease diagnosis and phenotyping. Other projects using Electronic Health Records may wish to adopt this approach.

    Long-term ambient air pollution exposure and cardio-respiratory disease in China: findings from a prospective cohort study

    Background
    Existing evidence on long-term ambient air pollution (AAP) exposure and risk of cardio-respiratory diseases in China mainly concerns mortality, and is based on area-average concentrations from fixed-site monitors as proxies for individual exposures. Substantial uncertainty persists, therefore, about the shape and strength of the relationship when assessed using more personalised individual exposure data. We aimed to examine the relationships between AAP exposure and risk of cardio-respiratory diseases using predicted local levels of AAP.
    Methods
    This prospective study included 50,407 participants aged 30–79 years from Suzhou, China, with concentrations of nitrogen dioxide (NO2), sulphur dioxide (SO2), fine (PM2.5) and inhalable (PM10) particulate matter, ozone (O3) and carbon monoxide (CO), and incident cases of cardiovascular disease (CVD) (n = 2,563) and respiratory disease (n = 1,764) recorded during 2013–2015. Cox regression models with time-dependent covariates were used to estimate adjusted hazard ratios (HRs) for diseases associated with local-level concentrations of AAP exposure, estimated using Bayesian spatio-temporal modelling.
    Results
    The study period of 2013–2015 included a total of 135,199 person-years of follow-up for CVD. There was a positive association of AAP, particularly SO2 and O3, with risk of major cardiovascular and respiratory diseases. Each 10 µg/m3 increase in SO2 was associated with adjusted HRs of 1.07 (95% CI: 1.02, 1.12) for CVD, 1.25 (1.08, 1.44) for COPD and 1.12 (1.02, 1.23) for pneumonia. Similarly, each 10 µg/m3 increase in O3 was associated with adjusted HRs of 1.02 (1.01, 1.03) for CVD, 1.03 (1.02, 1.05) for all stroke, and 1.04 (1.02, 1.06) for pneumonia.
    Conclusions
    Among adults in urban China, long-term exposure to ambient air pollution is associated with a higher risk of cardio-respiratory disease.
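The hazard ratios above are quoted per 10 µg/m3 increase. Under the log-linear Cox models the study describes, an HR for any other exposure increment follows by exponent scaling; a small sketch (the 1.07 figure is from the abstract, the 20 µg/m3 increment is just an example):

```python
import math

def rescale_hr(hr_per_10, delta):
    """Hazard ratio for an exposure increase of `delta` µg/m3, given the
    HR per 10 µg/m3, assuming a log-linear (proportional hazards) model."""
    beta = math.log(hr_per_10) / 10.0  # log-hazard per 1 µg/m3
    return math.exp(beta * delta)

# HR of 1.07 per 10 µg/m3 SO2 for CVD implies, for a 20 µg/m3 increase:
print(round(rescale_hr(1.07, 20), 3))  # 1.145, i.e. 1.07 squared
```

Note this rescaling is only as good as the log-linearity assumption; it says nothing about the shape of the concentration-response curve, which the abstract flags as a key uncertainty.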

    Thermal and tectonic consequences of India underthrusting Tibet

    The Tibetan Plateau is the largest orogenic system on Earth, and has been influential in our understanding of how the continental lithosphere deforms. Beneath the plateau are some of the deepest (~100 km) earthquakes observed within the continental lithosphere, which have been pivotal in ongoing debates about the rheology and behaviour of the continents. We present new observations of earthquake depths from the region, and use thermal models to suggest that all of them occur in material at temperatures of ≲600 °C. Thermal modelling, combined with experimentally derived flow laws, suggests that if the Indian lower crust is anhydrous it will remain strong beneath the entire southern half of the Tibetan plateau, as is also suggested by dynamic models. In northwest Tibet, the strong underthrust Indian lower crust abuts the rigid Tarim Basin, and may be responsible for both the clockwise rotation of Tarim relative to stable Eurasia and the gradient of shortening along the Tien Shan.

    Using routine data to monitor inequalities in an acute trust: a retrospective study

    Background
    Reducing inequalities is one of the priorities of the National Health Service. However, there is no standard system for monitoring inequalities in the care provided by acute trusts. We explore the feasibility of monitoring inequalities within an acute trust using routine data.
    Methods
    A retrospective study of hospital episode statistics from one acute trust in London over three years (2007 to 2010). Waiting times, length of stay and readmission rates were described for seven common surgical procedures. Inequalities by age, sex, ethnicity and social deprivation were examined using multiple logistic regression, adjusting for the other socio-demographic variables and comorbidities. Sample size calculations were computed to estimate how many years of data would be needed for this analysis.
    Results
    This study found that even in a large acute trust, there was not enough power to detect differences between subgroups. There was little evidence of inequalities for the outcome and process measures examined: statistically significant differences by age, sex, ethnicity or deprivation were found in only 11 out of 80 analyses. Bariatric surgery patients who were black African or Caribbean were more likely than white patients to experience a prolonged wait (longer than 64 days; aOR = 2.47, 95% CI: 1.36–4.49). Following a coronary angioplasty, patients from more deprived areas were more likely to have had a prolonged length of stay (aOR = 1.66, 95% CI: 1.25–2.20).
    Conclusions
    This study found difficulties in using routine data to identify inequalities at a trust level. Little evidence of inequalities in waiting time, length of stay or readmission rates by sex, ethnicity or social deprivation was identified, although some differences were found which warrant further investigation. Even with three years of data from a large trust there was little power to detect inequalities by procedure. Data will therefore need to be pooled from multiple trusts to detect inequalities.
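The power problem the study describes can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. This is a generic sketch of that calculation, not the study's own method, and the proportions and error rates below are illustrative, not taken from the paper:

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group to detect a difference
    between two proportions with two-sided significance `alpha`."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_b = z.inv_cdf(power)          # critical value for the desired power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

# Detecting, say, a 10% vs 15% readmission rate between two subgroups:
print(n_per_group(0.10, 0.15))
```

Small absolute differences between subgroups demand hundreds of patients per subgroup per procedure, which is why even three years of single-trust data fell short and pooling across trusts is proposed.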

    Changes in SARS-CoV-2 Spike versus Nucleoprotein Antibody Responses Impact the Estimates of Infections in Population-Based Seroprevalence Studies

    SARS-CoV-2-specific antibody responses to the Spike (S) protein monomer, the S protein native trimeric form, or the nucleocapsid (N) protein were evaluated in cohorts of individuals with acute infection (n=93) and individuals enrolled in a post-infection seroprevalence population study (n=578) in Switzerland. Commercial assays specific for the S1 monomer and the N protein, and a newly developed Luminex assay using the S protein trimer, were found to be equally sensitive in antibody detection in the acute infection phase samples. Interestingly, as compared to anti-S antibody responses, those against the N protein appear to wane in the post-infection cohort. Seroprevalence in a 'positive patient contacts' group (n=177) was underestimated by N protein assays by 10.9 to 32.2%, and in the 'randomly selected' general population group (n=311) it was reduced by up to 45% relative to S protein assays. The overall reduction in seroprevalence when targeting only anti-N antibodies for the total cohort ranged from 9.4 to 31%. Of note, the use of the S protein in its native trimeric form was significantly more sensitive as compared to monomeric S proteins. These results indicate that the assessment of anti-S IgG antibody responses against the native trimeric S protein should be implemented to estimate SARS-CoV-2 infections in population-based seroprevalence studies.
    IMPORTANCE
    In the present study, we have determined SARS-CoV-2-specific antibody responses in sera of acute and post-infection phase subjects. Our results indicate that antibody responses against viral S and N proteins were equally sensitive in the acute phase of infection, but that responses against N appear to wane in the post-infection phase while those against the S protein persist over time. The most sensitive serological assay in both acute and post-infection phases used the native S protein trimer as binding antigen, which has significantly greater conformational epitopes for antibody binding compared to the S1 monomer protein used in other assays. We believe that these results are extremely important in order to generate correct estimates of SARS-CoV-2 infections in the general population. Furthermore, the assessment of antibody responses against the trimeric S protein will be critical to evaluate the durability of the antibody response and for the characterization of a vaccine-induced antibody response.
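The "underestimated by X%" figures above are relative reductions between the two assays' seroprevalence estimates. A trivial sketch makes the arithmetic explicit; the example prevalence values are illustrative, not the study's:

```python
def relative_reduction(s_prev, n_prev):
    """Percent by which an N-assay seroprevalence estimate falls short
    of the S-assay estimate, expressed relative to the S-assay figure."""
    return 100.0 * (s_prev - n_prev) / s_prev

# If an S-trimer assay finds 20% seropositive but an N assay finds only 14%:
print(round(relative_reduction(0.20, 0.14), 1))  # 30.0 (a 30% underestimate)
```

The same absolute gap in prevalence produces a larger relative reduction in low-prevalence groups, which is consistent with the bigger underestimates reported for the randomly selected general population than for the positive-contacts group.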

    The Eco-Epidemiology of Pacific Coast Tick Fever in California

    Rickettsia philipii (type strain “Rickettsia 364D”), the etiologic agent of Pacific Coast tick fever (PCTF), is transmitted to people by the Pacific Coast tick, Dermacentor occidentalis. Following the first confirmed human case of PCTF in 2008, 13 additional human cases have been reported in California, more than half of which were pediatric cases. The most common features of PCTF are the presence of at least one necrotic lesion known as an eschar (100%), fever (85%), and headache (79%); four case-patients required hospitalization and four had multiple eschars. Findings presented here implicate the nymphal or larval stages of D. occidentalis as the primary vectors of R. philipii to people. Peak transmission risk from ticks to people occurs in late summer. Rickettsia philipii DNA was detected in D. occidentalis ticks from 15 of 37 California counties. Similarly, non-pathogenic Rickettsia rhipicephali DNA was detected in D. occidentalis in 29 of 38 counties with an average prevalence of 12.0% in adult ticks. In total, 5,601 ticks tested from 2009 through 2015 yielded an overall R. philipii infection prevalence of 2.1% in adults, 0.9% in nymphs and a minimum infection prevalence of 0.4% in larval pools. Although most human cases of PCTF have been reported from northern California, acarological surveillance suggests that R. philipii may occur throughout the distribution range of D. occidentalis.
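The "minimum infection prevalence" quoted for larval pools reflects pooled testing: larvae are tested in batches, and a positive pool is assumed to contain at least one infected tick, giving a lower bound on prevalence. A minimal sketch of that calculation (the counts are illustrative, not the study's raw data):

```python
def minimum_infection_prevalence(positive_pools, total_ticks):
    """Lower-bound prevalence (%) from pooled testing: assume exactly one
    infected tick per positive pool, divided by all ticks tested."""
    return 100.0 * positive_pools / total_ticks

# e.g. 4 positive pools out of 1,000 pooled larvae:
print(minimum_infection_prevalence(4, 1000))  # 0.4 (%)
```

This is why the larval figure is reported as a minimum while the adult and nymphal figures, from individually tested ticks, are reported as plain prevalences.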

    Vitamin D and cause-specific vascular disease and mortality: a Mendelian randomisation study involving 99,012 Chinese and 106,911 European adults


    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents covering a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium, consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
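One of the baselines named above, Term Frequency-Inverse Document Frequency with cosine similarity, can be sketched in pure Python. This is a toy illustration of the idea only — production systems (and the BM25 and PubMed Related Articles baselines) use more refined weighting — and the tokenised "abstracts" below are made up:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One TF-IDF vector (dict: term -> weight) per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [
    "air pollution cardiovascular risk cohort".split(),      # seed document
    "pollution exposure cardiovascular disease risk".split(),  # related
    "tick fever rickettsia california".split(),                # unrelated
]
vecs = tfidf_vectors(docs)
# The seed should score the related abstract above the unrelated one:
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Benchmarks like RELISH exist precisely to test whether refinements over this sort of baseline actually move relevant documents up the ranking.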