31 research outputs found

    Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data

    Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generator MWEM PGM can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
    Comment: arXiv admin note: text overlap with arXiv:2106.1024
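
    The abstract above refers to several definitions of fairness without naming them. As a purely illustrative sketch, the snippet below computes one widely used definition, the demographic parity difference (the gap in positive-prediction rates between groups). The function name, the toy predictions, and the group labels are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return the largest gap in positive-prediction rate between groups.

    Illustrative only: demographic parity is one common fairness
    definition; the abstract does not state which definitions it uses.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    # Positive-prediction rate per group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs for two groups, "a" and "b".
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_difference(preds, groups)
print(f"demographic parity difference: {gap:.2f}")
```

    A value near zero means the classifier assigns the positive label at similar rates across groups. Comparing this gap for models trained on real versus synthetic data is one way such a fairness comparison could be set up.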

    Emotions and Strategies for Preparation of Emotional Speech Database

    The exploration of how we as human beings react to the world and interact with it and each other remains one of the greatest challenges. The ability to recognize the emotional states of a person is perhaps the most important for successful interpersonal social interaction. An automatic emotional speech recognition system can be characterized by the features used, the emotional categories investigated, the methods used to collect speech utterances, the languages, and the type of classifier used in the experiment. A well-defined database is a necessary precondition for improving the performance of automatic emotional speech recognition systems. This paper explores the theories that explain the social and cognitive roles of emotions and mental states and their expression in human behavior and communication. The paper describes the planning and construction of a native-language database of acted emotional speech, covering the number of speakers, recording strategies, conversion, etc.; an alternative approach is also briefly addressed. Such a database would also contribute to research on intonation and emotion.

    Predicting locations of cryptic pockets from single protein structures using the PocketMiner graph neural network

    Cryptic pockets expand the scope of drug discovery by enabling targeting of proteins currently considered undruggable because they lack pockets in their ground state structures. However, identifying cryptic pockets is labor-intensive and slow. The ability to accurately and rapidly predict if and where cryptic pockets are likely to form from a structure would greatly accelerate the search for druggable pockets. Here, we present PocketMiner, a graph neural network trained to predict where pockets are likely to open in molecular dynamics simulations. Applying PocketMiner to single structures from a newly curated dataset of 39 experimentally confirmed cryptic pockets demonstrates that it accurately identifies cryptic pockets (ROC-AUC: 0.87) >1,000-fold faster than existing methods. We apply PocketMiner across the human proteome and show that predicted pockets open in simulations, suggesting that over half of proteins thought to lack pockets based on available structures likely contain cryptic pockets, vastly expanding the potentially druggable proteome.
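
    The ROC-AUC figure quoted above is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting. The labels and scores are made-up toy values and have nothing to do with PocketMiner's actual predictions.

```python
def roc_auc(labels, scores):
    """ROC-AUC by pair counting: the fraction of (positive, negative)
    pairs in which the positive is scored higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-residue labels (1 = part of a pocket) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC-AUC: {auc:.3f}")
```

    Pair counting is quadratic in the number of examples, so it suits small illustrations only; library implementations use a sort-based method instead.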

    A novel ML-driven test case selection approach for enhancing the performance of grammatical evolution

    Computational cost in metaheuristics such as Evolutionary Algorithms (EAs) is often a major concern, particularly regarding their ability to scale. In data-based training, traditional EAs typically use a significant portion, if not all, of the dataset for model training and fitness evaluation in each generation. This makes EAs suffer from high computational costs incurred during the fitness evaluation of the population, particularly when working with large datasets. To mitigate this issue, we propose a Machine Learning (ML)-driven Distance-based Selection (DBS) algorithm that reduces the fitness evaluation time by optimizing test cases. We test our algorithm by applying it to 24 benchmark problems from the Symbolic Regression (SR) and digital circuit domains and then using Grammatical Evolution (GE) to train models using the reduced dataset. We use GE to test DBS on SR and produce a system flexible enough to test it further on digital circuit problems. The quality of the solutions is tested and compared against state-of-the-art and conventional training methods to measure the coverage of training data selected using DBS, i.e., how well the subset matches the statistical properties of the entire dataset. Moreover, the effect of optimized training data on run time and the effective size of the evolved solutions is analyzed. Experimental and statistical evaluations of the results show that our method enabled GE to yield solutions superior or comparable to the baseline (using the full datasets) with smaller sizes, and that it is computationally efficient in terms of speed.
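
    The abstract does not spell out how DBS chooses its test cases. To illustrate the general idea of distance-based subset selection, the sketch below uses greedy farthest-point sampling, a swapped-in standard technique and not the authors' algorithm. The function name, the starting index, and the toy data are assumptions.

```python
import math


def farthest_point_subset(cases, k):
    """Greedily pick k indices whose cases are spread far apart.

    Assumption: Euclidean distance between numeric test-case vectors.
    This is generic farthest-point sampling, not the DBS method itself.
    """
    if not 0 < k <= len(cases):
        raise ValueError("k must be between 1 and len(cases)")
    chosen = [0]  # arbitrary starting case
    # Distance from every case to its nearest already-chosen case.
    min_dist = [math.dist(c, cases[0]) for c in cases]
    while len(chosen) < k:
        candidates = [i for i in range(len(cases)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, c in enumerate(cases):
            min_dist[i] = min(min_dist[i], math.dist(c, cases[nxt]))
    return chosen


# Hypothetical 2-D test cases forming three rough clusters.
cases = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
subset = farthest_point_subset(cases, 3)
print(subset)
```

    Fitness evaluation would then run only on the selected indices, which is where a reduction in evaluation time would come from.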

    Identifying human interactors of SARS-CoV-2 proteins and drug targets for COVID-19 using network-based label propagation

    Motivated by the critical need to identify new treatments for COVID-19, we present a genome-scale, systems-level computational approach to prioritize drug targets based on their potential to regulate host-virus interactions or their downstream signaling targets. We adapt and specialize network label propagation methods to this end. We demonstrate that these techniques can predict human-SARS-CoV-2 protein interactors with high accuracy. The top-ranked proteins that we identify are enriched in host biological processes that are potentially coopted by the virus. We present cases where our methodology generates promising insights such as the potential role of HSPA5 in viral entry. We highlight the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. We identify tubulin proteins involved in ciliary assembly that are targeted by anti-mitotic drugs. Drugs that we discuss are already undergoing clinical trials to test their efficacy against COVID-19. Our prioritized list of human proteins and drug targets is available as a general resource for biological and clinical researchers who are repositioning existing and approved drugs or developing novel therapeutics as anti-COVID-19 agents.
    First author draf
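
    For readers unfamiliar with network label propagation, the sketch below shows a generic textbook form of the technique: scores start at known seed nodes and diffuse to neighbors through the update f <- alpha * W f + (1 - alpha) * y, where W is the row-normalized adjacency matrix. This is not the adapted method from the paper. The node names, the alpha value, and the iteration count are all hypothetical.

```python
def propagate_labels(adjacency, seeds, alpha=0.8, iterations=50):
    """Generic label propagation with restart on an undirected graph.

    adjacency: dict mapping each node to a list of its neighbors.
    seeds: set of nodes with known positive labels.
    Returns a dict of propagated scores; higher means closer to the seeds.
    """
    nodes = list(adjacency)
    y = {n: 1.0 if n in seeds else 0.0 for n in nodes}
    f = dict(y)
    for _ in range(iterations):
        new_f = {}
        for n in nodes:
            neighbors = adjacency[n]
            # Average of neighbor scores (row-normalized adjacency).
            avg = sum(f[m] for m in neighbors) / len(neighbors) if neighbors else 0.0
            new_f[n] = alpha * avg + (1 - alpha) * y[n]
        f = new_f
    return f


# Hypothetical path-shaped network P1 - P2 - P3 - P4 with P1 as the only seed.
network = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
}
prop_scores = propagate_labels(network, seeds={"P1"})
ranking = sorted(prop_scores, key=prop_scores.get, reverse=True)
print(ranking)
```

    Ranking nodes by propagated score is how a prioritized list of candidate interactors could be produced from a set of known ones.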

    Comparing human–Salmonella with plant–Salmonella protein–protein interaction predictions

    Salmonellosis is the most frequent foodborne disease worldwide and can be transmitted to humans by a variety of routes, especially via animal and plant products. Salmonella bacteria are believed to use not only animal and human but also plant hosts despite their evolutionary distance. This raises the question of whether Salmonella employs similar mechanisms when infecting these diverse hosts. Given that most of our understanding comes from its interaction with human hosts, we investigate here to what degree knowledge of Salmonella–human interactions can be transferred to the Salmonella–plant system. Recent publications on the analysis and prediction of Salmonella–host interactomes are reviewed. Putative protein–protein interactions (PPIs) between Salmonella and its human and Arabidopsis hosts were retrieved using two kinds of methods: purely interolog-based approaches, in which predictions are inferred from available sequence and domain information of known PPIs, and machine learning approaches that integrate a larger set of useful information from different sources. Transfer learning is an especially suitable machine learning technique to predict plant host targets from the knowledge of human host targets. A comparison of the prediction results with transcriptomic data shows a clear overlap between the host proteins predicted to be targeted by PPIs and their gene ontology enrichment in both host species and regulation of gene expression. In particular, the cellular processes Salmonella interferes with in plants and humans are catabolic processes. The details of how these processes are targeted, however, are quite different between the two organisms, as expected based on their evolutionary and habitat differences. Possible implications of this observation for the evolution of host–pathogen communication are discussed.