23 research outputs found

    Global Unique identification LINCS Digital Research Objects to Enable Citation, Reuse, and Persistence of LINCS Data.

    The ability to cite LINCS datasets is critical for users and data producers alike. Requirements for dataset citation records have been set forth by the Joint Declaration of Data Citation Principles (JDDCP) and include attribution, a unique identifier, data persistence, verification, and interoperability. Data citation is also an important facilitator of the FAIR (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific datasets. DOIs are well aligned with journal tracking citations, and the cost is justified within the business goal. However, LINCS datasets are complex and require granular identifiers of various LINCS Digital Research Objects (DROs), including a dataset at a specific data level, a dataset group (combining all data levels from one experiment), derived datasets (e.g. computationally reprocessed by a LINCS Center or an outside group), and also various LINCS metadata. Such identifiers are needed to describe provenance. Although DOIs have been used for datasets, we preferred an open and free solution. To accomplish that, we created collections of LINCS DROs in the MIRIAM Registry to generate unique, perennial, and location-independent identifiers. Such collections include data-level specific dataset packages, dataset groups, small molecules, and cells. The identifiers.org service, which is built upon the information stored in MIRIAM, provides directly resolvable identifiers in the form of Uniform Resource Locators (URLs). This system provides a globally unique identification scheme to which any external resource can point and a resolving system that gives the owner or creator of the resource collection flexibility to update the resolving URL without changing the global identifiers. These dataset and dataset group identifiers are the central component of the LINCS dataset citation record, which further includes the authors, title, year, repository, resource type, and version.
These citation records have been incorporated into the LINCS Data Portal and can be downloaded in several formats, making it easy to cite a specific LINCS dataset or a dataset group. The LINCS provenance model provides a record of creation, manipulation, and source of the dataset and metadata that are part of a LINCS dataset package. It will provide mappings of LINCS dataset packages to corresponding records in public repositories and at the data generation centers. Persistent global identifiers of LINCS DROs, formal dataset provenance, and mappings of key LINCS metadata to external qualified references (such as ontologies) are also required for the persistence of LINCS beyond the funded project and independent from the current LINCS Centers.
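As a rough illustration of the citation record described in this abstract, the sketch below combines the listed fields (authors, title, year, repository, resource type, version) with an identifiers.org-style resolvable URL of the form `https://identifiers.org/<prefix>:<accession>`. The class name, the output layout, the prefix `lincs.data`, and the accession `LDS-0000` are assumptions made for illustration only; they are not taken from the abstract and do not reproduce the portal's actual export formats.

```python
from dataclasses import dataclass

IDENTIFIERS_ORG = "https://identifiers.org"


@dataclass(frozen=True)
class CitationRecord:
    """Hypothetical container for the citation fields named in the abstract."""
    authors: tuple
    title: str
    year: int
    repository: str
    resource_type: str
    version: str
    prefix: str      # collection prefix in the registry (assumed value)
    accession: str   # local identifier inside the collection (made up)

    @property
    def url(self) -> str:
        # Location-independent identifier: the resolver, not the citation,
        # knows where the record currently lives.
        return f"{IDENTIFIERS_ORG}/{self.prefix}:{self.accession}"

    def format(self) -> str:
        # One possible plain-text layout; the real portal formats may differ.
        authors = ", ".join(self.authors)
        return (f"{authors} ({self.year}). {self.title}. "
                f"Version {self.version}. {self.repository}. "
                f"[{self.resource_type}]. {self.url}")


record = CitationRecord(
    authors=("Doe J", "Roe R"),        # placeholder authors
    title="Example LINCS dataset",     # placeholder title
    year=2017,
    repository="LINCS Data Portal",
    resource_type="Dataset",
    version="1.0",
    prefix="lincs.data",               # assumed prefix
    accession="LDS-0000",              # made-up accession
)
print(record.format())
```

Because the citation stores only the prefix and accession, the collection owner can change the resolving URL without invalidating any citation already issued.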

    FAIR LINCS Metadata Powered by CEDAR Cloud-Based Templates and Services

    The Library of Integrated Network-based Signatures (LINCS) program generates a wide variety of cell-based perturbation-response signatures using diverse assay technologies. For example, LINCS includes large-scale transcriptional profiling of genetic and small molecule perturbations, and various proteomics and imaging datasets. We currently obtain metadata through an online platform, the metadata submission tool (MST), based on spreadsheet data templates. While functional, it remains difficult to maintain FAIR standards, specifically remaining findable and reusable, for metadata without (enforced) controlled vocabulary and internally built linkages to ontologies and metadata standards. To maintain FAIR-centric metadata, we have worked with the Center for Expanded Data Annotation and Retrieval (CEDAR) to develop modular metadata templates linked to ontologies and standards present in the NCBO BioPortal. We have also developed a new LINCS Dataset Submission Tool (DST), which links new LINCS datasets to the form-fillable CEDAR templates. This metadata management framework supports authoring, curation, validation, management, and sharing of LINCS metadata, while building upon the existing LINCS metadata standards and data-release workflows. Additionally, the CEDAR technology facilitates metadata validation and testing, enabling users to ensure their input metadata are LINCS compliant prior to submission for public release. CEDAR templates have been developed for reagent metadata, experimental metadata, assay descriptions, and global dataset attributes. By integrating the submission of all these components into one submission tool and workflow, we aim to significantly simplify and streamline the workflow of LINCS dataset submission, processing, validation, registration, and publication.
As other projects apply the same approach, many more datasets will become cross-searchable and can be linked, optimizing the metadata pathway from submission to discovery.
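The idea of validating metadata against a template with an enforced controlled vocabulary, as described in this abstract, can be sketched in a few lines. This is a plain-Python stand-in and not the CEDAR API: the template layout, the field names, and the vocabulary terms below are all hypothetical.

```python
# Hypothetical template: each field says whether it is required and,
# optionally, which controlled-vocabulary terms are allowed.
TEMPLATE = {
    "cell_line": {"required": True, "allowed": None},  # free text
    "assay_type": {
        "required": True,
        "allowed": {"transcriptomics", "proteomics", "imaging"},
    },
    "organism": {
        "required": False,
        "allowed": {"Homo sapiens", "Mus musculus"},
    },
}


def validate(record: dict, template: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, rule in template.items():
        value = record.get(field)
        if value is None or value == "":
            if rule["required"]:
                errors.append(f"{field}: missing required value")
            continue
        allowed = rule["allowed"]
        if allowed is not None and value not in allowed:
            errors.append(f"{field}: '{value}' is not in the controlled vocabulary")
    # Fields that the template does not know about are reported as well.
    for field in record:
        if field not in template:
            errors.append(f"{field}: not defined in the template")
    return errors


good = {"cell_line": "MCF7", "assay_type": "proteomics"}
bad = {"assay_type": "RNAseq", "extra": 1}
print(validate(good, TEMPLATE))
print(validate(bad, TEMPLATE))
```

Running the check before submission is what lets a submitter catch non-compliant terms early, rather than during curation.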

    Sustainable data and metadata management at the BD2K-LINCS Data Coordination and Integration Center

    The NIH-funded LINCS Consortium is creating an extensive reference library of cell-based perturbation response signatures and sophisticated informatics tools incorporating a large number of perturbagens, model systems, and assays. To date, more than 350 datasets have been generated including transcriptomics, proteomics, epigenomics, cell phenotype and competitive binding profiling assays. The large volume and variety of data necessitate rigorous data standards and effective data management including modular data processing pipelines and end-user interfaces to facilitate accurate and reliable data exchange, curation, validation, standardization, aggregation, integration, and end-user access. Deep metadata annotations and the use of qualified data standards enable integration with many external resources. Here we describe the end-to-end data processing and management at the DCIC to generate a high-quality and persistent product. Our data management and stewardship solutions enable a functioning Consortium and make LINCS a valuable scientific resource that aligns with big data initiatives such as the BD2K NIH Program and concords with emerging data science best practices including the findable, accessible, interoperable, and reusable (FAIR) principles.

    LINCS Small Molecules Standardization and Annotation to Improve Data Integration, Analysis, and Modeling

    <p>The physical properties of small molecules, in particular “drug-like” molecules, including their ability to interact with and modulate protein function, cell permeability, and metabolic stability make them powerful tools to study biological systems. Large amounts of small molecule biological activity data are publicly available and small molecules are systematically studied in the diverse profiling assays of the LINCS Consortium. Integrating LINCS data across the various assays and Centers, and with external bioactivity data, requires uniquely identifying each small molecule sample tested in an assay based on its unique “active” component. Typically, this is done based on the unique chemical structure. Although non-trivial, a unique, single fragment representation of an organic small molecule can, in most cases, be generated after removing salt counter ions and addends, considering ionization states and tautomeric forms, and canonicalizing the chemical structure representation, e.g. as a canonical SMILES or InChI. We implemented the chemical structure standardization using chemical informatics tools. Exceptions include small numbers of metal-organic and multi-component compounds, which we handled by manual expert curation. However, a significant challenge in standardizing small molecules lies in the considerable variability of reported chemical structures for the same compound, depending on the source. Typical and frequent errors include inverted or removed stereochemical centers, relative vs absolute stereochemical configuration, E/Z geometric isomerism of alkenes or imines, loss of aromaticity, changes in oxidation states and other problems. Further complexities can be introduced by different representations of compound mixtures. Public resources, such as PubChem, report many different chemical structures for the same compound, for example as identified by a common drug name.
The apparent lack of curation of small molecule chemical structures results in error propagation, for example incorrect chemical structures submitted to PubChem, which are then referenced and potentially added to another resource.</p><p>Herein we present the chemical structure standardization and registration pipeline implemented for LINCS small molecules including manual curation, automated steps, mappings to PubChem, naming, validation, and several QC and review steps. The standardization pipeline considers stereochemical representations, mixtures of stereoisomers, geometric isomers of carbon-carbon and carbon-hetero double bonds, regio-isomers, non-isomeric mixtures, ionization states, tautomeric forms, and salt forms or other addends. We illustrate typical errors and their propagation, a problem exacerbated by the lack of user-friendly tools to enable biologists to work with complex chemical information. In LINCS we work to disambiguate compound identity during the registration process using redundant information including chemical structures, drug names, vendor information and provided cross references.</p>Standardized LINCS small molecules are mapped to PubChem, ChEMBL, ChEBI, and via UniChem to many other resources. These mappings facilitate the curation and integration of diverse external annotations, such as biochemical target information. Compound standardization and mapping make it easy to integrate different LINCS signatures. The LINCS Small Molecule collection has been registered in the MIRIAM Registry and identifiers.org to provide a Persistent URL (PURL). The identifiers.org PURL for each LINCS small molecule redirects to the LINCS Data Portal, and the information is accessible via a RESTful API, in coordination with interoperable smartAPI.
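To make the salt-removal step mentioned in this abstract concrete, here is a deliberately naive sketch that keeps the largest dot-separated component of a SMILES string, judged by a regex-based atom count. It is not the authors' pipeline, which relies on chemical informatics tools and manual curation. It does not neutralize charges, canonicalize the structure, or handle tautomers, stereochemistry, or mixtures, and the function names are invented for this example.

```python
import re

# One atom token in a SMILES string: a bracket atom such as [Na+] or [O-],
# or an organic-subset atom. Two-letter symbols come first so that "Cl"
# is not counted as "C" followed by an unmatched "l".
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def count_atoms(fragment: str) -> int:
    """Count atom tokens in one SMILES fragment (implicit hydrogens excluded)."""
    return len(ATOM.findall(fragment))


def largest_fragment(smiles: str) -> str:
    """Return the dot-separated component with the most atoms.

    Ties are resolved in favor of the first component, which is one of
    many reasons this heuristic needs expert review in practice.
    """
    fragments = smiles.split(".")
    return max(fragments, key=count_atoms)


# Sodium salt of aspirin: the counter ion [Na+] is dropped, but the
# remaining fragment is still charged and not canonicalized.
print(largest_fragment("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```

Multi-component and metal-organic compounds are exactly the cases where such a rule breaks down, which is consistent with the abstract's note that they were curated by hand.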

    LINCSAnalytics: An integrated platform for the efficient query and computation across diverse LINCS signatures

    <p>The Library of Integrated Network-based Signatures (LINCS) program generates a wide variety of cell-based perturbation-response signatures using diverse assay technologies. A signature, defined as a specific cellular response to a given perturbation, can hence be expressed as a function of a set of parameters: the model system (typically a cell), the perturbation (e.g. small molecule) and the detected analytes (e.g. expressed in a transcriptional profiling assay) plus additional experimental details (such as concentration and time). In order to effectively use LINCS data for a wide variety of scientific use cases, signatures need to be readily queryable, retrievable and accessible for computation as a function of all of these dimensions.</p><br><p>Here we present a computational platform built on top of the open source Cloudera Hadoop platform allowing the distributed storage and processing of large datasets through a number of dedicated modules. LINCS signature data and standardized entity metadata are stored in the Hadoop Distributed File System. Apache Hive and Impala are responsible for the fast query and retrieval of any data point, while computation and modeling are available through Apache Spark and its sparklyr R interface. Full accessibility to the core of the platform is achieved via a set of APIs, which also allow users to build and deploy custom-made applications. As an initial demonstration, we show a simple Shiny R application to interactively query and retrieve LINCS signatures for any dimension of interest.</p>To enable the computational biology community to use LINCS data in their research via the LINCS Analytics platform, we deployed an R package that allows users to retrieve the available data and metadata for any dimension of interest. It also allows on-the-fly aggregation of replicates and filtering by desired output values.
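The view of a signature as a function of model system, perturbation, analyte, concentration, and time can be illustrated with a small in-memory sketch. This is not the platform's API or its R package: the actual system runs on Hadoop, Hive, Impala, and Spark, whereas the code below uses plain Python lists, and every name and data value in it is made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    """One measured value, indexed by the dimensions named in the abstract."""
    cell: str
    perturbation: str
    analyte: str
    concentration_um: float
    time_h: float
    value: float


def query(measurements, **criteria):
    """Keep measurements whose attributes equal every given criterion."""
    return [m for m in measurements
            if all(getattr(m, key) == wanted for key, wanted in criteria.items())]


def aggregate_replicates(measurements):
    """Average values that share all dimensions (i.e. replicates)."""
    groups = defaultdict(list)
    for m in measurements:
        key = (m.cell, m.perturbation, m.analyte, m.concentration_um, m.time_h)
        groups[key].append(m.value)
    return {key: mean(values) for key, values in groups.items()}


data = [
    Measurement("MCF7", "compound_A", "GENE1", 10.0, 24.0, 1.0),
    Measurement("MCF7", "compound_A", "GENE1", 10.0, 24.0, 3.0),  # replicate
    Measurement("MCF7", "compound_A", "GENE2", 10.0, 24.0, -0.5),
    Measurement("A549", "compound_A", "GENE1", 10.0, 24.0, 0.2),
]

hits = query(data, cell="MCF7", perturbation="compound_A")
averaged = aggregate_replicates(hits)
# Filter by output value: keep analytes whose mean response is at least 1.0
# in absolute terms (threshold chosen arbitrarily for the example).
strong = {key: value for key, value in averaged.items() if abs(value) >= 1.0}
print(strong)
```

Keeping every dimension in the grouping key is what makes the aggregation safe: two measurements are merged only when they differ in nothing but the replicate.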

    Metadata Standard and Data Exchange Specifications to Describe, Model, and Integrate Complex and Diverse High-Throughput Screening Data from the Library of Integrated Network-based Cellular Signatures (LINCS)

    The National Institutes of Health Library of Integrated Network-based Cellular Signatures (LINCS) program is generating extensive multidimensional data sets, including biochemical, genome-wide transcriptional, and phenotypic cellular response signatures to a variety of small-molecule and genetic perturbations with the goal of creating a sustainable, widely applicable, and readily accessible systems biology knowledge resource. Integration and analysis of diverse LINCS data sets depend on the availability of sufficient metadata to describe the assays and screening results and on their syntactic, structural, and semantic consistency. Here we report metadata specifications for the most important molecular and cellular components and recommend them for adoption beyond the LINCS project. We focus on the minimum required information to model LINCS assays and results based on a number of use cases, and we recommend controlled terminologies and ontologies to annotate assays with syntactic consistency and semantic integrity. We also report specifications for a simple annotation format (SAF) to describe assays and screening results based on our metadata specifications with explicit controlled vocabularies. SAF specifically serves to programmatically access and exchange LINCS data as a prerequisite for a distributed information management infrastructure. We applied the metadata specifications to annotate large numbers of LINCS cell lines, proteins, and small molecules. The resources generated and presented here are freely available.

    The LINCS Data Portal and FAIR LINCS Dataset Landing Pages

    <p>The LINCS Data Portal (LDP) presents a unified interface to access LINCS datasets and metadata with mappings to several external resources. LDP provides various options to explore, query, and download LINCS dataset packages and reagents that have been described using the LINCS metadata standards.</p><p>We recently introduced LINCS Dataset Landing Pages to provide integrated access to important content for each LINCS dataset. The landing pages provide deep metadata for each LINCS dataset including description of the assays, authors, data analysis pipelines, and standardized reagents such as small molecules, cell lines, antibodies, etc., with rich annotations. The landing pages are a key component to make LINCS data persistent and reusable, by integrating LINCS datasets, data processing pipelines, analytes, perturbations, model systems and related concepts as uniquely identifiable digital research objects.</p><p>LDP supports ontology-driven concept search, free text search, facet filtering, logical combination of filters (AND, OR), and list, table, and matrix views. LDP enables download of LINCS dataset packages, which consist of released datasets and associated metadata. LDP also provides several specialized apps, including apps for small molecule compounds and cell lines. A landing page facilitates interactive exploration of all LINCS datasets via several classifications.</p>LDP is built on a robust API, is integrated with the MetaData Registry, and interfaces with other components of the Integrated Knowledge Environment (IKE) developed in our Center. All LINCS datasets are also indexed in bioCADDIE DataMed.
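Facet filtering with AND/OR combination of filters, as listed among the portal's search features, can be sketched as follows. This is only an illustration of the general technique and says nothing about how the portal or its API implements it. The dataset records, field names, and function names are hypothetical.

```python
def matches(dataset: dict, facets: dict, mode: str = "AND") -> bool:
    """Check one dataset against facet selections.

    `facets` maps a field name to the set of accepted values. With "AND"
    every facet must match; with "OR" a single matching facet is enough.
    """
    checks = [dataset.get(field) in accepted for field, accepted in facets.items()]
    if mode == "AND":
        return all(checks)
    if mode == "OR":
        return any(checks)
    raise ValueError(f"unknown mode: {mode!r}")


def filter_datasets(datasets, facets, mode="AND"):
    """Return the datasets that satisfy the facet selections."""
    return [d for d in datasets if matches(d, facets, mode)]


datasets = [  # made-up records
    {"id": "DS-1", "assay": "transcriptomics", "center": "Center A"},
    {"id": "DS-2", "assay": "proteomics", "center": "Center A"},
    {"id": "DS-3", "assay": "imaging", "center": "Center B"},
]
facets = {"assay": {"transcriptomics", "imaging"}, "center": {"Center A"}}

and_ids = [d["id"] for d in filter_datasets(datasets, facets, "AND")]
or_ids = [d["id"] for d in filter_datasets(datasets, facets, "OR")]
print(and_ids, or_ids)
```

Note that with no facets selected, "AND" accepts every dataset while "OR" accepts none, so a real interface would need to decide how an empty selection should behave.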

    Evolving BioAssay Ontology (BAO): modularization, integration and applications

    The lack of established standards to describe and annotate biological assays and screening outcomes in the domain of drug and chemical probe discovery is a severe limitation to utilizing public and proprietary drug screening data to their maximum potential. We have created the BioAssay Ontology (BAO) project ( http://bioassayontology.org ) to develop common reference metadata terms and definitions required for describing relevant information of low- and high-throughput drug and probe screening assays and results. The main objectives of BAO are to enable effective integration, aggregation, retrieval, and analyses of drug screening data. Since we first released BAO on the BioPortal in 2010, we have considerably expanded and enhanced BAO and we have applied the ontology in several internal and external collaborative projects, for example the BioAssay Research Database (BARD). We describe the evolution of BAO with a design that enables modeling complex assays including profile and panel assays such as those in the Library of Integrated Network-based Cellular Signatures (LINCS). One of the critical questions in evolving BAO is the following: how can we provide a way to efficiently reuse and share specific parts of our ontologies among various research projects without violating the integrity of the ontology and without creating redundancies? This paper provides a comprehensive answer to this question with a description of a methodology for ontology modularization using a layered architecture. Our modularization approach defines several distinct BAO components and separates internal from external modules and domain-level from structural components. This approach facilitates the generation/extraction of derived ontologies (or perspectives) that can suit particular use cases or software applications.
We describe the evolution of BAO related to its formal structures, engineering approaches, and content to enable modeling of complex assays and integration with other ontologies and datasets.