8 research outputs found

    Computational toolbox towards evolutionary domain mapping of membrane proteins

    Academic year 2012-2013. Membrane proteins account for about 20% to 30% of all proteins encoded in a typical genome. They play central roles in multiple cellular processes, mediating the interaction of the cell with its surroundings. Over 60% of all drug targets contain a membrane domain. The experimental difficulty of obtaining a crystal structure severely limits our ability to understand membrane protein function. Computational evolutionary studies of proteins are crucial for the prediction of 3D structures. In this project, we construct a tool able to quantify the evolutionary positive selective pressure on each residue of membrane proteins through maximum likelihood phylogeny reconstruction. The conservation plot, combined with a structural homology model, is also a potent tool for predicting those residues that have essential roles in the structure and function of a membrane protein, and can be very useful in the design of validation experiments. Supervisors: Mireia Olivella and Alex Peràlvare
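The per-residue conservation plot mentioned above can be sketched in a few lines. The following is a minimal illustration (not the project's actual toolbox), scoring each alignment column by one minus its normalised Shannon entropy, so fully conserved positions score 1.0:

```python
import math
from collections import Counter

def per_residue_conservation(alignment):
    """Per-column conservation scores (1 - normalised Shannon entropy)
    for a list of equal-length aligned protein sequences. Gaps ('-')
    are ignored when counting residues."""
    length = len(alignment[0])
    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for i in range(length):
        column = [seq[i] for seq in alignment if seq[i] != "-"]
        if not column:  # all-gap column: no information
            scores.append(0.0)
            continue
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(column).values())
        scores.append(1.0 - entropy / max_entropy)
    return scores

# Toy alignment: columns 1 and 3 fully conserved, column 2 variable.
print(per_residue_conservation(["MKL", "MQL", "MRL"]))
```

A real analysis would read a multiple sequence alignment (e.g. from FASTA) and plot the scores against residue number; maximum likelihood estimates of selective pressure would come from dedicated phylogenetics software rather than this entropy proxy.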

    Characterising the source of errors for metagenomic taxonomic classification

    Characterising microbial communities enables a better understanding of their complexity and their contribution to the environment. Metagenomics has been a rapidly expanding field since the revolution of next generation sequencing began, and it has a wide range of applications including medicine, agriculture, forensics, archaeology and even domestic use [Sarkar et al., 2021, Holman et al., 2017, Khodakova et al., 2014, Santiago-Rodriguez et al., 2017, Vilanova et al., 2015]. Sequencing amplicon data, such as 16S rRNA, is now commonly used to characterise the microbiome in a variety of biological samples. However, correct taxonomic identification still remains a challenge, and short reads are often assigned, correctly or not, at ranks of the taxonomic tree other than species or subspecies level. Every metagenomic study is designed for specific needs, and it is often complicated to find a suitable bioinformatics pipeline and reference database. There is currently a lack of systematic benchmarking of in-house methods for metagenomics. The work presented in this thesis aims to establish an approach for the in silico validation of 16S rRNA metagenomic data. A method to generate realistic in silico metagenome data that resembles project-specific sequencing data is presented, including a new process to generate synthetic negative controls for amplicon data, which can be employed regularly to assess the appropriateness and optimisation of methods for specific metagenomic projects. To aid the benchmarking process, new metrics have been defined based on a measure of taxonomic distance. A k-mer based method with the lowest common ancestor approach was selected to investigate a range of factors that influence meta-taxonomic classification success. This includes a comparison of databases quality-filtered at various levels, as well as a comparison of different taxonomic annotation methodologies.
The experimental findings reveal the importance of having highly curated taxonomic annotations of the genetic sequences in the database, and show that a missing fraction of the tree of life can lead to misclassification of related or unrelated organisms. In some cases, longer reads are shown to improve assignment, with mutations and sequencing errors having a relatively low negative impact. The marker gene 16S rRNA has well-defined conserved and variable regions, which help to distinguish species. These regions were therefore studied and recalculated using information theory, to investigate which parts of the sequence are discriminative for metagenomic taxonomic identification. In addition, a linguistics method, Term Frequency-Inverse Document Frequency (TF-IDF) coupled with multinomial naive Bayes, is shown to provide insight into genetic signatures and is applied to create a new method for taxonomically classifying metagenomic short reads. Biological samples were taken from the cattle respiratory tract, and DNA was extracted and sequenced to provide metagenomic data. Two sets of experiments were carried out: (i) to compare sampling and extraction methods, and (ii) to characterise the microbial community observed in young cattle in the different lung lobes and nose. The data reveal that the composition of the microbial community observed is highly dependent on the sampling method.
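The TF-IDF plus multinomial naive Bayes idea can be illustrated with a toy sketch: reads are tokenised into overlapping k-mers (the "words"), TF-IDF weights the k-mers, and naive Bayes classifies. The reads, taxon labels, and k-mer size below are invented for illustration and are not the thesis data or method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def kmers(seq, k=4):
    """Tokenise a read into overlapping k-mers, space-separated so a
    text vectoriser can treat each k-mer as a word."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical labelled reads (two made-up taxa).
reads = ["ACGTACGTACGT", "ACGTACGAACGT", "TTGGCCTTGGCC", "TTGGCATTGGCC"]
taxa  = ["Taxon_A", "Taxon_A", "Taxon_B", "Taxon_B"]

# TF-IDF over k-mer "words", coupled with multinomial naive Bayes.
model = make_pipeline(TfidfVectorizer(analyzer="word"), MultinomialNB())
model.fit([kmers(r) for r in reads], taxa)

# Classify an unseen read by its k-mer profile.
print(model.predict([kmers("ACGTACGTACGA")]))
```

In practice the reference set would be a curated 16S rRNA database and the vocabulary far larger, but the pipeline shape, k-mer tokenisation feeding a TF-IDF-weighted naive Bayes classifier, is the same.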

    GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs)

    TREs are widely, and increasingly, used to support statistical analysis of sensitive data across a range of sectors (e.g., health, police, tax and education) as they enable secure and transparent research whilst protecting data confidentiality. There is an increasing desire from academia and industry to train AI models in TREs. The field of AI is developing quickly, with applications including spotting human errors, streamlining processes, task automation and decision support. These complex AI models require more information to describe and reproduce, increasing the possibility that sensitive personal data can be inferred from such descriptions. TREs do not have mature processes and controls against these risks. This is a complex topic, and it is unreasonable to expect all TREs to be aware of all risks or for TRE researchers to have addressed these risks through AI-specific training. GRAIMATTER has developed a draft set of usable recommendations for TREs to guard against the additional risks when disclosing trained AI models from TREs. The development of these recommendations has been funded by the GRAIMATTER UKRI DARE UK sprint research project. This version of our recommendations was published at the end of the project in September 2022. During the course of the project, we identified many areas for future investigation to expand and test these recommendations in practice; we therefore expect this document to evolve over time. The GRAIMATTER DARE UK sprint project has also developed a minimum viable product (MVP), a suite of attack simulations that can be applied by TREs, which can be accessed here: https://github.com/AI-SDC/AI-SDC. If you would like to provide feedback or learn more, please contact Smarti Reel ([email protected]) and Emily Jefferson ([email protected]). A summary of our recommendations for a general public audience can be found at DOI: 10.5281/zenodo.708951
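The kind of risk these recommendations address can be illustrated with a toy confidence-gap test, loosely in the spirit of a membership inference attack. This is a hypothetical sketch on synthetic data, not the GRAIMATTER/AI-SDC implementation: if a trained model is systematically more confident on its training records than on unseen records, releasing the model leaks information about who was in the training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a sensitive dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Top predicted-class probability per record: a simple signal an
# attacker could threshold to guess training-set membership.
conf_members = model.predict_proba(X_tr).max(axis=1)
conf_nonmembers = model.predict_proba(X_te).max(axis=1)
print(f"mean confidence: members {conf_members.mean():.2f}, "
      f"non-members {conf_nonmembers.mean():.2f}")
```

A large gap between the two means is one symptom of overfitting-driven leakage; the AI-SDC suite linked above packages more rigorous attack simulations for TREs to run before release.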

    AI-SDC

    <p>Changes:</p> <ul> <li>Fix a bug related to the <code>rules.json</code> path when running from package (<a href="https://github.com/AI-SDC/AI-SDC/pull/247">#247</a>)</li> <li>Update user stories (<a href="https://github.com/AI-SDC/AI-SDC/pull/247">#247</a>)</li> </ul>

    SACRO: Semi-Automated Checking of Research Outputs

    This project aimed to address a major bottleneck in conducting research on confidential data: the final stage of "Output Statistical Disclosure Control" (OSDC). This is where staff in a Trusted Research Environment (TRE) conduct manual checks to ensure that outputs a researcher wishes to take out, such as tables, plots, and statistical and/or AI models, do not put any individual's privacy at risk. To tackle this bottleneck, we proposed to:
    - Produce a consolidated framework with a rigorous statistical basis that provides guidance for TREs to agree consistent, standard processes to assist in quality assurance.
    - Design and implement a semi-automated system for checks on common research outputs, with increasing levels of support for other types such as AI.
    - Work with a range of different types of TRE in different sectors and organisations to ensure wide applicability.
    - Work with the public and patients to explore what is needed for public trust, e.g., that any automation acts as "an extra pair of eyes": supporting, not supplanting, TRE staff.
    Supported by funding from DARE UK (Data and Analytics Research Environments UK), we met these aims through the production of documentation, open-source code repositories, and a 'Consensus' statement embodying principles organisations should uphold when deploying any sort of automated disclosure control. Looking forward, we are now ready for extensive user testing and refinement of the resources produced. Following a series of presentations to national and international audiences, a range of different organisations are in the process of trialling the SACRO toolkits. We are delighted that DARE UK has awarded funding to support a Community of Interest (CoI) group.
This will address ongoing support and the user-led creation of 'soft' resources (such as user guides, 'help desks', and mentoring schemes) to remove blocks to adoption: both for TREs, and crucially for researchers. There are two other areas where we are now ready to make significant advances: applying SACRO to allow principles-based OSDC for 'conceptual data spaces' (e.g. via data pooling or federated analytics), and expanding the scope of risk assessment of AI/Machine Learning models to more complex models and types of data. This work is funded by UK Research and Innovation [Grant Number MC_PC_23006], as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).
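One classic rule such a semi-automated checker can apply is a minimum cell count on frequency tables: small cells risk identifying individuals. The following is a hypothetical sketch of that single rule (the threshold and table are invented; it is not the SACRO codebase):

```python
# Illustrative minimum-cell-count disclosure check for a frequency
# table represented as {cell_label: count}. Zero counts are allowed
# (an empty cell reveals nothing about any individual); small
# non-zero counts are flagged for the TRE output checker to review.
MIN_CELL_COUNT = 10  # threshold is illustrative; each TRE sets its own

def check_table(table):
    """Return (passes, flagged_cells) for a dict of cell -> count."""
    flagged = {cell: n for cell, n in table.items()
               if 0 < n < MIN_CELL_COUNT}
    return (len(flagged) == 0, flagged)

ok, flagged = check_table({("male", "smoker"): 42,
                           ("female", "smoker"): 3})
print(ok, flagged)
```

A real checker combines many such rules (dominance, class disclosure, differencing) and presents flagged cells to a human reviewer, keeping the automation as "an extra pair of eyes" rather than a replacement.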
