
    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Full text link
    Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy to use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from academia, the community and industry. The code and dataset are available at https://github.com/tgxs002/HPSv2.
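The scoring idea described above can be sketched in a few lines. This is a hypothetical illustration, not the actual HPS v2 implementation: the real model is a fine-tuned CLIP, whereas here the embeddings are plain vectors and the 100x scaling and the `preference_score` / `pairwise_accuracy` names are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def preference_score(image_emb, prompt_emb):
    # An HPS-style score: scaled cosine similarity between image and
    # prompt embeddings of a (hypothetical) preference-tuned encoder.
    return 100.0 * cosine(image_emb, prompt_emb)

def pairwise_accuracy(pairs, prompt_emb, human_choices):
    # Fraction of image pairs where the model's preferred image matches
    # the human annotator's choice (0 = first image, 1 = second).
    correct = 0
    for (emb_a, emb_b), choice in zip(pairs, human_choices):
        model_choice = 0 if preference_score(emb_a, prompt_emb) >= preference_score(emb_b, prompt_emb) else 1
        correct += int(model_choice == choice)
    return correct / len(pairs)
```

Pairwise agreement with human choices is the quantity a dataset like HPD v2 lets one measure at scale.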

    Gene Regulatory Network Reconstruction Using Dynamic Bayesian Networks

    Get PDF
    High-content technologies such as DNA microarrays can provide a system-scale overview of how genes interact with each other in a network context. Various mathematical methods and computational approaches have been proposed to reconstruct gene regulatory networks (GRNs), including Boolean networks, information theory, differential equations and Bayesian networks. GRN reconstruction faces huge intrinsic challenges on both experimental and theoretical fronts, because the inputs and outputs of the molecular processes are unclear and the underlying principles are unknown or too complex. In this work, we focused on improving the accuracy and speed of GRN reconstruction with a Dynamic Bayesian Network-based method. A commonly used structure-learning algorithm is based on REVEAL (Reverse Engineering Algorithm). However, this method has some limitations when it is used for reconstructing GRNs. For instance, the two-stage temporal Bayes network (2TBN) cannot be well recovered by application of REVEAL; it has low accuracy and speed for high-dimensional networks with more than a hundred nodes; and it cannot even accomplish the task of reconstructing a network with 400 nodes. We implemented an algorithm for DBN structure learning with Friedman's score function to replace REVEAL, tested it on reconstruction of both synthetic networks and real yeast networks, and compared it with REVEAL in the absence or presence of the preprocessed network generated by Zou and Conzen's algorithm. The new score metric improved the precision and recall of GRN reconstruction. Networks of gene interactions were reconstructed using a Dynamic Bayesian Network (DBN) approach and were analyzed to identify the mechanism of chemical-induced reversible neurotoxicity in earthworms, using tools that map relevant genes from non-model-organism pathways onto model-organism pathways.
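Score-based DBN structure learning of the kind described above evaluates candidate parent sets with a decomposable score; Friedman's score functions are typically of the BIC/BDe family. A minimal BIC sketch for a single family in a 2TBN, assuming discrete variables (the function name and the binary-state default are illustrative assumptions, not the thesis's actual code):

```python
import math
from collections import Counter

def bic_score(child_next, parents_prev, n_states=2):
    # BIC score of one family in a two-slice temporal Bayes net (2TBN):
    # a child variable at time t+1 with a candidate parent set at time t.
    # child_next: child values at t+1; parents_prev: tuples of parent
    # values at t (same length). Higher scores indicate better families.
    n = len(child_next)
    joint = Counter(zip(parents_prev, child_next))   # counts of (parent config, child value)
    marg = Counter(parents_prev)                     # counts of parent configs alone
    # Maximized log-likelihood of the conditional distribution.
    loglik = sum(c * math.log(c / marg[p]) for (p, _), c in joint.items())
    # One free parameter per parent configuration per non-reference state.
    n_params = len(marg) * (n_states - 1)
    return loglik - 0.5 * n_params * math.log(n)
```

A structure learner would compute this score for each candidate parent set of each node and keep the highest-scoring families, which is what makes the search decomposable and parallelizable.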

    Data Sharing Frames: How Scientists Understand the Work of Sharing Scientific Data

    Get PDF
    The curation of data is fundamental to their wider dissemination and use. This paper investigates the frames of workers who perform data curation in scientific contexts. We view data curation as a sense-making practice, where workers collaborate to disseminate meaningful data to a broad set of prospective users. Previous Information Systems investigations have suggested that data-related activities depend on workers’ understanding of their local work context. We expand this with an evolving and long-term view. We use a stepwise-deductive induction method to examine how scientists understand the work involved in curating scientific data for public sharing. We draw on frames as the theoretical lens of the study, which enables us to identify three data sharing frames – the object, curation, and aligning frames – that shape how scientists curate data for public sharing. Our analysis provides a deeper understanding of the nuances of managing scientific data for public access. Our main contribution is the articulation of an evolving and long-term view of how workers approach their tasks in getting data ready for long-term public use.

    DC-ATLAS: a systems biology resource to dissect receptor specific signal transduction in dendritic cells

    Get PDF
    BACKGROUND: The advent of Systems Biology has been accompanied by the blooming of pathway databases. Currently, pathways are defined generically with respect to the organ or cell type where a reaction takes place. The cell type specificity of the reactions is the foundation of immunological research, and capturing this specificity is of paramount importance when using pathway-based analyses to decipher complex immunological datasets. Here, we present DC-ATLAS, a novel and versatile resource for the interpretation of high-throughput data generated by perturbing the signaling network of dendritic cells (DCs). RESULTS: Pathways are annotated using a novel data model, the Biological Connection Markup Language (BCML), an SBGN-compliant data format developed to store the large amount of information collected. The application of DC-ATLAS to pathway-based analysis of the transcriptional program of DCs stimulated with agonists of the toll-like receptor family allows an integrated description of the flow of information from the cellular sensors to the functional outcome, capturing the temporal series of activation events by grouping sets of reactions that occur at different time points in well-defined functional modules. CONCLUSIONS: The initiative significantly improves our understanding of DC biology and regulatory networks. Developing a systems biology approach for the immune system holds the promise of translating knowledge on the immune system into more successful immunotherapy strategies.

    Microarray tools and analysis methods to better characterize biological networks

    Get PDF
    To accurately model a biological system (e.g. a cell), we first need to characterize each of its distinct networks. While omics data has given us unprecedented insight into the structure and dynamics of these networks, the associated analysis routines are more involved, and the accuracy and precision of the experimental technologies have not been sufficiently examined. The main focus of our research has been to develop methods and tools to better manage and interpret microarray data. How can we improve methods to store and retrieve microarray data from a relational database? What experimental and biological factors most influence our interpretation of a microarray's measurements? By accounting for these factors, can we improve the accuracy and precision of microarray measurements? It is essential to address these last two questions before using omics data for downstream analyses, such as inferring transcription regulatory networks from microarray data. While answers to such questions are vital to microarray research in particular, they are equally relevant to systems biology in general. We developed three studies to investigate aspects of these questions when using Affymetrix expression arrays. In the first study, we develop the Data-FATE framework to improve the handling of large scientific data sets. In the next two studies, we developed methods and tools that allow us to examine the impact of physical and technical factors known or suspected to dramatically alter the interpretation of a microarray experiment. In the second study, we develop ArrayInitiative -- a tool that simplifies the process of creating custom CDFs -- so that we can easily re-design the array specifications for Affymetrix 3' IVT expression arrays. This tool is essential for testing the impact of the various factors, and for making the framework easy to communicate and re-use. We then use ArrayInitiative in a case study to illustrate the impact of several factors known to distort microarray signals. In the third study, we systematically and exhaustively examine the effect of physical and technical factors -- both generally accepted and novel -- on our interpretation of dozens of experiments using hundreds of E. coli Affymetrix microarrays.
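The core operation behind a custom CDF is re-mapping individual probes to redefined probe sets before summarization. A minimal sketch, assuming a plain dict as a stand-in for the CDF and median summarization (the real ArrayInitiative tool and Affymetrix CDF format are far richer than this):

```python
import statistics

def summarize(probe_intensities, custom_cdf):
    # Summarize probe-level intensities into probe-set values under a
    # custom probe-to-probe-set mapping (a simplified stand-in for an
    # Affymetrix CDF). Probes absent from the mapping are dropped,
    # which is how a custom CDF can mask cross-hybridizing probes.
    grouped = {}
    for probe, value in probe_intensities.items():
        probeset = custom_cdf.get(probe)
        if probeset is not None:
            grouped.setdefault(probeset, []).append(value)
    return {ps: statistics.median(vals) for ps, vals in grouped.items()}
```

Swapping in a different `custom_cdf` mapping is precisely the kind of re-design whose downstream impact the second and third studies evaluate.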

    The value of standards for health datasets in artificial intelligence-based applications

    Get PDF
    Artificial intelligence as a medical device is increasingly being applied to healthcare for diagnosis, risk stratification and resource allocation. However, a growing body of evidence has highlighted the risk of algorithmic bias, which may perpetuate existing health inequity. This problem arises in part because of systemic inequalities in dataset curation, unequal opportunity to participate in research and inequalities of access. This study aims to explore existing standards, frameworks and best practices for ensuring adequate data diversity in health datasets. Exploring the body of existing literature and expert views is an important step towards the development of consensus-based guidelines. The study comprises two parts: a systematic review of existing standards, frameworks and best practices for healthcare datasets; and a survey and thematic analysis of stakeholder views of bias, health equity and best practices for artificial intelligence as a medical device. We found that the need for dataset diversity was well described in the literature, and experts generally favored the development of a robust set of guidelines, but there were mixed views about how these could be implemented practically. The outputs of this study will be used to inform the development of standards for transparency of data diversity in health datasets (the STANDING Together initiative).

    Documenting the emergence of bio-ontologies: or, why researching bioinformatics requires HPSSB.

    Get PDF
    This paper reflects on the analytic challenges emerging from the study of bioinformatic tools recently created to store and disseminate biological data, such as databases, repositories, and bio-ontologies. I focus my discussion on the Gene Ontology, a term that defines three entities at once: a classification system facilitating the distribution and use of genomic data as evidence towards new insights; an expert community specialised in the curation of those data; and a scientific institution promoting the use of this tool among experimental biologists. These three dimensions of the Gene Ontology can be clearly distinguished analytically, but are tightly intertwined in practice. I suggest that this is true of all bioinformatic tools: they need to be understood simultaneously as epistemic, social, and institutional entities, since they shape the knowledge extracted from data and at the same time regulate the organisation, development, and communication of research. This viewpoint has one important implication for the methodologies used to study these tools; that is, the need to integrate historical, philosophical, and sociological approaches. I illustrate this claim through examples of misunderstandings that may result from a narrowly disciplinary study of the Gene Ontology, as I experienced them in my own research.

    Large-Scale Differentiable Causal Discovery of Factor Graphs

    Full text link
    A common theme in causal inference is learning causal relationships between observed variables, also known as causal discovery. This is usually a daunting task, given the large number of candidate causal graphs and the combinatorial nature of the search space. Perhaps for this reason, most research has so far focused on relatively small causal graphs, with up to hundreds of nodes. However, recent advances in fields like biology enable generating experimental data sets with thousands of interventions followed by rich profiling of thousands of variables, raising the opportunity and urgent need for large causal graph models. Here, we introduce the notion of factor directed acyclic graphs (f-DAGs) as a way to restrict the search space to non-linear low-rank causal interaction models. Combining this novel structural assumption with recent advances that bridge the gap between causal discovery and continuous optimization, we achieve causal discovery on thousands of variables. Additionally, as a model for the impact of statistical noise on this estimation procedure, we study a model of edge perturbations of the f-DAG skeleton based on random graphs and quantify the effect of such perturbations on the f-DAG rank. This theoretical analysis suggests that the set of candidate f-DAGs is much smaller than the whole DAG space and thus may be more suitable as a search space in the high-dimensional regime where the underlying skeleton is hard to assess. We propose Differentiable Causal Discovery of Factor Graphs (DCD-FG), a scalable implementation of f-DAG-constrained causal discovery for high-dimensional interventional data. DCD-FG uses a Gaussian non-linear low-rank structural equation model and shows significant improvements compared to state-of-the-art methods in both simulations and a recent large-scale single-cell RNA sequencing data set with hundreds of genetic interventions.
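The structural idea behind f-DAGs is that variables interact only through a small number of factor nodes, so the dense interaction matrix has low rank. A minimal sketch of this parameterization (the function name and Gaussian initialization are illustrative assumptions; DCD-FG additionally enforces acyclicity and uses non-linear link functions):

```python
import numpy as np

def low_rank_weights(d, m, seed=0):
    # f-DAG-style low-rank parameterization: d observed variables
    # interact only through m factor nodes, so the d x d interaction
    # matrix factors as W = U @ V with U: d x m (variable -> factor)
    # and V: m x d (factor -> variable). Free parameters drop from
    # d*d to 2*d*m, which is what makes thousands of variables feasible.
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(d, m))  # variable -> factor edges
    V = rng.normal(size=(m, d))  # factor -> variable edges
    return U @ V

# E.g. 500 variables mediated by only 5 factors: rank(W) <= 5.
W = low_rank_weights(d=500, m=5)
```

For d = 500 and m = 5 this is 5,000 parameters instead of 250,000, and the rank bound is exactly the quantity the paper's perturbation analysis tracks.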

    Global data for local science: Assessing the scale of data infrastructures in biological and biomedical research

    Get PDF
    The use of online databases to collect and disseminate data is typically portrayed as crucial to the management of ‘big science’. At the same time, databases are not deemed successful unless they facilitate the re-use of data towards new scientific discoveries, which often involves engaging with several highly diverse and inherently unstable research communities. This paper examines the tensions encountered by database developers in their efforts to foster both the global circulation and the local adoption of data. I focus on two prominent attempts to build data infrastructures in the fields of plant science and cancer research over the last decade: The Arabidopsis Information Resource and the Cancer Biomedical Informatics Grid. I show how curators’ experience of the diverse and dynamic nature of biological research led them to envision databases as catering primarily for local, rather than global, science; and to structure them as platforms where methodological and epistemic diversity can be expressed and explored, rather than denied or overcome. I conclude that one way to define the scale of data infrastructure is to consider the range and scope of the biological and biomedical questions which it helps to address; and that within this perspective, databases have a larger scale than the science that they serve, which tends to remain fragmented into a wide variety of specialised projects.