15 research outputs found
The classification of gene products in the molecular biology domain: Realism, objectivity, and the limitations of the Gene Ontology
Background: Controlled vocabularies in the molecular biology domain exist to facilitate data integration across database resources. One such tool is the Gene Ontology (GO), a classification designed to act as a universal index for gene products from any species. The Gene Ontology is used extensively in annotating gene products and analysing gene expression data, yet very little research exists from a library and information science perspective exploring the design principles, philosophy and social role of ontologies in biology.
Aim: To explore how molecular biologists, in creating the Gene Ontology, devised guidelines and rules for determining which scientific concepts are included in the ontology, and the criteria for how these concepts are represented.
Methods: A domain analysis approach was used to devise a mixed methodology to study the design of the Gene Ontology. Concept analysis of a GO term and a critical discourse analysis of GO developer mailing list texts were used to test whether ontological realism is a tenable basis for constructing objective ontologies. A comparison of the current GO vocabulary construction guidelines and a study of the reasons why GO terms are removed from the ontology further explored the justifications for the design of the Gene Ontology. Finally, a content analysis of published GO papers examined how authors use and cite GO data and terminology.
Results: Gene Ontology terms can be presented according to different epistemologies for concepts, indicating that ontological realism is not the only way objective ontologies can be designed. Social roles and the exercise of power were found to play an important role in determining ontology content, and poor synonym control, a lack of clear warrant for deciding terminology and arbitrary decisions to delete and invent new terms undermine the objectivity and universal applicability of the Gene Ontology. Authors exhibited poor compliance with GO data citation policies, and in re-wording and misquoting GO terminology, risk exacerbating the semantic problems this controlled vocabulary was designed to solve.
Conclusions: The failure of the Gene Ontology to define what is meant by a molecular function, the exercise of power by GO developers in clearing contentious concepts from the ontology, and the strict adherence to ontological realism, which marginalises social and subjective ways of classifying scientific concepts, limit the utility of the ontology as a tool to unify the molecular biology domain. These limitations of the Gene Ontology design could be overcome with the development of lighter, pluralistic, user-controlled "open ontologies" for gene products that can work alongside more traditional, "top-down" developed vocabularies.
Survey: Leakage and Privacy at Inference Time
Leakage of data from publicly available Machine Learning (ML) models is an
area of growing significance as commercial and government applications of ML
can draw on multiple sources of data, potentially including users' and clients'
sensitive data. We provide a comprehensive survey of contemporary advances on
several fronts, covering involuntary data leakage which is natural to ML
models, potential malevolent leakage which is caused by privacy attacks, and
currently available defence mechanisms. We focus on inference-time leakage, as
the most likely scenario for publicly available models. We first discuss what
leakage is in the context of different data, tasks, and model architectures. We
then propose a taxonomy across involuntary and malevolent leakage, available
defences, followed by the currently available assessment metrics and
applications. We conclude with outstanding challenges and open questions,
outlining some promising directions for future research.
Final Report of the Independent Expert Group for the Unlocking the Value of Data Programme
This report is the final output of the Independent Expert Group on the Unlocking the Value of Data programme, to the Scottish Government. This report is a Ministerial commission, and was originally commissioned by the former Minister for Business, Trade, Tourism and Enterprise.
Use of "Hidden in Plain Sight" de-identification methodology in electronic healthcare data provides minimal risk of misidentification: Results from the iCAIRD Safe Haven Artificial Intelligence Platform.
Objectives
To determine the risk of misidentification when using a "Hidden In Plain Sight (HIPS)" Named Entity Recognition (NER) de-identification methodology applied to Scottish healthcare data within The Industrial Centre for Artificial Intelligence Research in Digital Diagnostics (iCAIRD) Safe Haven Artificial Intelligence Platform (SHAIP).
Approach
Rather than the traditional redaction of potential identifiable information in routinely collected healthcare data, our HIPS methodology utilises an NER "find and replace" approach to de-identification that keeps the structure of text intact. This ensures that context is maintained, key to the interpretation of free text information and potential Artificial Intelligence applications.
To our knowledge, these methods have not previously been tested on Scottish healthcare data. We therefore assessed the potential risk of misidentification when HIPS is applied to structured Scottish data deployed in SHAIP as part of the iCAIRD programme.
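The find-and-replace idea described in this Approach can be sketched as follows. This is a minimal illustration, not the iCAIRD implementation: the `toy_ner` lookup stands in for a real NER model, and the surrogate name pools, the label names, and the sample note are all invented for the example.

```python
import random
import re

# Hypothetical surrogate pools; a real system would draw on far larger lists.
SURROGATES = {
    "FORENAME": ["Alex", "Morgan", "Jamie"],
    "SURNAME": ["Smith", "Brown", "Wilson"],
}

# Hypothetical stand-in for an NER model: a fixed lookup of known names.
KNOWN_NAMES = {"Fiona": "FORENAME", "Campbell": "SURNAME"}


def toy_ner(text):
    """Return (start, end, label) spans for every known name in the text."""
    pattern = r"\b(" + "|".join(map(re.escape, KNOWN_NAMES)) + r")\b"
    return [(m.start(), m.end(), KNOWN_NAMES[m.group()])
            for m in re.finditer(pattern, text)]


def hips_replace(text, entities, rng):
    """Swap each detected span for a surrogate of the same type.

    Unlike redaction, the surrounding text is copied through unchanged,
    so the sentence structure is preserved.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])            # untouched context
        pieces.append(rng.choice(SURROGATES[label]))  # surrogate value
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Patient Fiona Campbell was reviewed in clinic."
deidentified = hips_replace(note, toy_ner(note), random.Random(0))
print(deidentified)
```

Because every detected name is replaced by a plausible surrogate, any name the detector misses is indistinguishable from a surrogate in the output, which is the property the abstract refers to as reducing the ability to detect leaked identifiable data.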
Results
Five individual cohorts, with a total of 169,964 patients were included. For each cohort the HIPS approach was applied, and then compared to actual patient information from within the same region, in order to determine the risk of misidentification. The following fields were included: Forename, Surname, Previous Name, Gender, Date of Birth (DOB), and Postcode.
Across the five cohorts and varying combinations of identifiable data fields there were a total of 94 instances of potential misidentification (0.06%). 85/94 (90.4%) of these were for the combination of Gender, Date of Birth and Postcode. Across the five cohorts there were only 3 instances (0.002%) of Forename/Surname/DOB, and 5 instances (0.003%) of Forename/Surname/Postcode potential misidentification amongst the 169,964 patients.
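The percentages quoted above follow directly from the reported counts. As a quick check, assuming the denominator for each rate is the full 169,964 patients (and 94 for the Gender/DOB/Postcode share), the arithmetic can be reproduced like this:

```python
TOTAL_PATIENTS = 169_964  # total across the five cohorts, from the abstract


def pct(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


rates = {
    "any combination": pct(94, TOTAL_PATIENTS),
    "Forename/Surname/DOB": pct(3, TOTAL_PATIENTS),
    "Forename/Surname/Postcode": pct(5, TOTAL_PATIENTS),
}
# Share of the 94 instances that involve Gender, DOB and Postcode.
share_gender_dob_postcode = pct(85, 94)

for label, value in rates.items():
    print(f"{label}: {value:.4f}%")
print(f"Gender/DOB/Postcode share: {share_gender_dob_postcode:.1f}%")
```

Rounded to the precision used in the abstract, these give 0.06%, 0.002%, 0.003% and 90.4% respectively.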
Conclusions
The iCAIRD NER HIPS Methodology provides an acceptably low misidentification rate. Further work is now required to determine the recall and precision rates. Benefits of this approach include retaining the structure of free text, as well as reducing the ability to detect any potential leaked identifiable data.
Barriers and facilitators of cross-sectoral data linkage to inform healthy public policy and practice: lessons from three case study projects in Scotland.
Objectives
We sought to describe barriers and facilitators faced by three research projects aiming to link routinely-collected data across various sectors, to produce evidence to inform healthy public policy. We conducted these case studies as a part of a wider research project on cross-sectoral sharing and linkage of secondary data.
Approach
We selected the case studies to cover a range of target populations and datasets. The chosen projects investigated (1) the health of care-experienced children; (2) the intersection of homelessness, justice involvement, drug use, and severe mental illness; (3) multi-morbidity among adults receiving social care. Information about timelines and governance processes was collected from lead investigators, including specific barriers and facilitators encountered, using a standardised pro forma and follow-up interviews. Thematic analysis was carried out by the research team, informed by themes identified in a parallel scoping review of existing literature on evidence use for healthy public policy and practice.
Results
Each project involved between 6 and 11 agencies, with co-ordination across multiple institutions and geographies proving challenging. Due to challenges encountered, all projects had to amend their original geographical or demographic scope. Forty-four barriers and facilitators to sharing and linkage of cross-sectoral routinely-collected data for public health research were identified. These included but were not limited to: integration of current data in an ever-changing linkage landscape; the need for timely feedback in undertaking the study; standardisation of information governance processes; highlighting the resourcing and funding issues for data linkage projects; the need for data controllers to recognise the value of such projects; and issues relating to staff turnover and workload pressures.
Conclusion
The interconnected nature of barriers and facilitators identified by the case studies suggests the importance of a whole-systems approach to cross-sectoral linkage. While the literature offers relatively few case studies of cross-sectoral linkage for health research, the value of their insight into the linkage landscape derived from real-life experience is substantial.
A National Network of Safe Havens: A Scottish Perspective
For over a decade, Scotland has implemented and operationalized a system of Safe Havens, which provides secure analytics platforms for researchers to access linked, deidentified electronic health records (EHRs) while managing the risk of unauthorized reidentification. In this paper, a perspective is provided on the state-of-the-art Scottish Safe Haven network, including its evolution, to define the key activities required to scale the Scottish Safe Haven network's capability to facilitate research and health care improvement initiatives. A set of processes related to EHR data and their delivery in Scotland have been discussed. An interview with each Safe Haven was conducted to understand their services in detail, as well as their commonalities. The results show how Safe Havens in Scotland have protected privacy while facilitating the reuse of the EHR data. This study provides a common definition of a Safe Haven and promotes a consistent understanding among the Scottish Safe Haven network and the clinical and academic research community. We conclude by identifying areas where efficiencies across the network can be made to meet the needs of population-level studies at scale.
Masses, radii, and orbits of small Kepler planets: The transition from gaseous to rocky planets
We report on the masses, sizes, and orbits of the planets orbiting 22 Kepler stars. There are 49 planet candidates around these stars, including 42 detected through transits and 7 revealed by precise Doppler measurements of the host stars. Based on an analysis of the Kepler brightness measurements, along with high-resolution imaging and spectroscopy, Doppler spectroscopy, and (for 11 stars) asteroseismology, we establish low false-positive probabilities (FPPs) for all of the transiting planets (41 of 42 have an FPP under 1%), and we constrain their sizes and masses. Most of the transiting planets are smaller than three times the size of Earth. For 16 planets, the Doppler signal was securely detected, providing a direct measurement of the planet's mass. For the other 26 planets we provide either marginal mass measurements or upper limits to their masses and densities; in many cases we can rule out a rocky composition. We identify six planets with densities above 5 g cm⁻³, suggesting a mostly rocky interior for them. Indeed, the only planets that are compatible with a purely rocky composition are smaller than 2 R⊕. Larger planets evidently contain a larger fraction of low-density material (H, He, and H2O).
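The density threshold mentioned above rests on the standard relation between mass, radius and bulk density. A minimal sketch, assuming masses and radii are expressed in Earth units and taking Earth's mean density as roughly 5.51 g cm⁻³; the two example planets are illustrative values, not measurements from the paper:

```python
EARTH_DENSITY_G_CM3 = 5.51  # approximate mean density of Earth


def bulk_density(mass_earths, radius_earths):
    """Bulk density in g cm^-3 for a planet given in Earth masses and radii.

    Density scales as mass / radius^3, so working in Earth units only
    requires multiplying that ratio by Earth's own mean density.
    """
    return EARTH_DENSITY_G_CM3 * mass_earths / radius_earths ** 3


# Illustrative values only (hypothetical planets, same mass, different size).
compact = bulk_density(5.0, 1.5)
puffy = bulk_density(5.0, 2.5)
print(round(compact, 2), round(puffy, 2))
```

With the same mass, the smaller planet comes out well above 5 g cm⁻³ and the larger one well below it, which is the kind of contrast the abstract uses to separate mostly rocky interiors from planets with substantial low-density material.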
SteatoSITE: an Integrated Gene-to-Outcome Data Commons for Precision Medicine Research in NAFLD
Nonalcoholic fatty liver disease (NAFLD) is the commonest cause of chronic liver disease worldwide and a growing healthcare burden. The pathobiology of NAFLD is complex, disease progression is variable and unpredictable, and there are no qualified prognostic biomarkers or licensed pharmacotherapies that can improve clinical outcomes; it represents an unmet precision medicine challenge. We established a retrospective multicentre national cohort of 940 patients, across the complete NAFLD spectrum, integrating quantitative digital pathology, hepatic RNA-sequencing and 5.67 million days of longitudinal electronic health record follow-up into a secure, searchable, open resource (SteatoSITE) to inform rational biomarker and drug development and facilitate personalised medicine approaches for NAFLD. A complementary web-based gene browser was also developed. Here, our initial analysis uncovers disease stage-specific gene expression signatures, pathogenic hepatic cell subpopulations and master regulator networks associated with disease progression in NAFLD. Additionally, we construct novel transcriptional risk prediction tools for the development of future hepatic decompensation events.
An integrated gene-to-outcome multimodal database for metabolic dysfunction-associated steatotic liver disease
Metabolic dysfunction-associated steatotic liver disease (MASLD) is the commonest cause of chronic liver disease worldwide and represents an unmet precision medicine challenge. We established a retrospective national cohort of 940 histologically defined patients (55.4% men, 44.6% women; median body mass index 31.3; 32% with type 2 diabetes) covering the complete MASLD severity spectrum, and created a secure, searchable, open resource (SteatoSITE). In 668 cases and 39 controls, we generated hepatic bulk RNA sequencing data and performed differential gene expression and pathway analysis, including exploration of gender-specific differences. A web-based gene browser was also developed. We integrated histopathological assessments, transcriptomic data and 5.67 million days of time-stamped longitudinal electronic health record data to define disease-stage-specific gene expression signatures, pathogenic hepatic cell subpopulations and master regulator networks associated with adverse outcomes in MASLD. We constructed a 15-gene transcriptional risk score to predict future hepatic decompensation events (area under the receiver operating characteristic curve 0.86, 0.81 and 0.83 for 1-, 3- and 5-year risk, respectively). Additionally, thyroid hormone receptor beta regulon activity was identified as a critical suppressor of disease progression. SteatoSITE supports rational biomarker and drug development and facilitates precision medicine approaches for patients with MASLD.