
    Theory and Practice of Data Citation

    Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as of directing investments in science. Science is increasingly becoming "data-intensive": large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results it has yielded or what value it has. The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation. The current panorama is many-faceted, and an overall view that brings together its diverse aspects is still missing. This paper therefore aims to describe the lay of the land for data citation, from both the theoretical (the why and what) and the practical (the how) angle.
    Comment: 24 pages, 2 tables, pre-print accepted in the Journal of the Association for Information Science and Technology (JASIST), 201

    XML in Motion from Genome to Drug

    Information technology (IT) has emerged as central to the solution of contemporary genomics and drug discovery problems. Researchers involved in genomics, proteomics, transcriptional profiling, high-throughput structure determination, and other sub-disciplines of bioinformatics have a direct impact on this IT revolution. As the full genome sequences of many species and data from structural genomics, microarrays, and proteomics become available, integrating these data into a common platform requires sophisticated bioinformatics tools. Organizing these data into knowledge bases and developing appropriate software tools for analyzing them are major challenges. XML (eXtensible Markup Language) forms the backbone of biological data representation and exchange over the Internet, enabling researchers to aggregate data from heterogeneous resources. The present article gives a comprehensive view of the integration of XML with particular types of biological databases, mainly those dealing with sequence-structure-function relationships, and of its application to drug discovery. This e-medical-science approach could be applied to other scientific domains; the latest trends in semantic web applications are also highlighted
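
    As a concrete illustration of the aggregation idea, the following minimal Python sketch merges two small XML records (a sequence entry and a structure entry) into one common representation. The element names and example records are invented for illustration and do not follow any real schema such as UniProt XML or PDBML.

import xml.etree.ElementTree as ET

# Two records from different (hypothetical) sources describing the same protein.
SEQ_SOURCE = """<entry id="P12345">
  <name>Example kinase</name>
  <sequence>MKTAYIAKQRQISFVKSHFSRQ</sequence>
</entry>"""

STRUCT_SOURCE = """<structure pdb="1ABC">
  <protein accession="P12345"/>
  <resolution unit="angstrom">2.1</resolution>
</structure>"""

def aggregate(seq_xml, struct_xml):
    """Merge a sequence record and a structure record that share an accession."""
    seq = ET.fromstring(seq_xml)
    struct = ET.fromstring(struct_xml)
    record = {
        "accession": seq.get("id"),
        "name": seq.findtext("name"),
        "sequence": seq.findtext("sequence"),
    }
    if struct.find("protein").get("accession") == record["accession"]:
        record["pdb_id"] = struct.get("pdb")
        record["resolution_A"] = float(struct.findtext("resolution"))
    return record

print(aggregate(SEQ_SOURCE, STRUCT_SOURCE))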

    A network approach for managing and processing big cancer data in clouds

    Translational cancer research requires integrative analysis of multiple levels of big cancer data to identify and treat cancer. The data are decentralised, growing, and continually updated, and the content held or archived by different information sources partially overlaps, creating redundancies as well as contradictions and inconsistencies. To address these issues, we develop a data network model and technology for constructing and managing big cancer data. To support our data network approach for data processing and analysis, we employ a semantic content network approach and adopt the CELAR cloud platform. The prototype implementation shows that the CELAR cloud can satisfy the on-demand needs of various data resources for the management and processing of big cancer data
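
    A toy sketch of the general idea of organizing partially overlapping records from several sources as a network, so that redundancies and contradictions become explicit links rather than hidden duplicates. This is not the authors' data network model or the CELAR platform; the sources, fields, and the use of networkx are assumptions made purely for illustration.

import networkx as nx

# Partially overlapping records held by different (invented) sources.
records = [
    {"source": "registry_A", "patient": "P001", "variant": "KRAS G12D"},
    {"source": "registry_B", "patient": "P001", "variant": "KRAS G12V"},  # disagrees with A
    {"source": "archive_C", "patient": "P002", "variant": "TP53 R175H"},
]

g = nx.Graph()
for idx, rec in enumerate(records):
    g.add_node(idx, **rec)

# Link records describing the same patient and flag contradictions explicitly.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if records[i]["patient"] == records[j]["patient"]:
            g.add_edge(i, j, conflict=records[i]["variant"] != records[j]["variant"])

for u, v, attrs in g.edges(data=True):
    print(records[u]["source"], "<->", records[v]["source"], attrs)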

    The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web

    Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were obtained. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provides a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.
    Comment: 20 page
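
    A minimal sketch of what describing a Research Object as an RDF graph might look like with rdflib. The specific namespace URIs and properties used here (ore:aggregates from OAI-ORE, a wf4ever-style ro: namespace) and the example resources are assumptions made for illustration; the authoritative terms are defined by the RO suite of ontologies itself.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")  # OAI-ORE aggregation terms
RO = Namespace("http://purl.org/wf4ever/ro#")              # assumed RO namespace

g = Graph()
ro = URIRef("http://example.org/ro/experiment-42/")
dataset = URIRef("http://example.org/ro/experiment-42/data/expression.csv")
workflow = URIRef("http://example.org/ro/experiment-42/analysis-workflow")

# The Research Object acts as a container that aggregates data and methods
# together with basic descriptive metadata.
g.add((ro, RDF.type, RO.ResearchObject))
g.add((ro, DCTERMS.title, Literal("Gene expression analysis")))
g.add((ro, DCTERMS.creator, Literal("A. Researcher")))
for resource in (dataset, workflow):
    g.add((ro, ORE.aggregates, resource))
    g.add((resource, RDF.type, RO.Resource))

print(g.serialize(format="turtle"))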

    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work
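
    The following toy Python sketch only illustrates the underlying intuition of exploiting inter-sequence similarity: each sequence after the first is stored as the edits needed to reconstruct it from its predecessor, and the edit stream is then compressed. It is not the coil tool, and edit-tree coding works quite differently; the example sequences are invented, and on such a tiny input a generic compressor may still win, with the benefit appearing only at database scale.

import difflib
import zlib

# Invented, highly similar sequences standing in for an EST-like database slice.
sequences = [
    "ATGGCGTACGTTAGCCTAGGATCCGTAAGCTT",
    "ATGGCGTACGATAGCCTAGGATCCGTAAGCTT",
    "ATGGCGTACGTTAGCCTAGGATCCGTAAGCTA",
    "ATGGCGTACGTTAGCCTACGATCCGTAAGCTT",
]

# Baseline: compress the flat file directly.
flat = "\n".join(sequences).encode()
flat_size = len(zlib.compress(flat, 9))

# Alternative: store the first sequence verbatim and every later one as the
# edits that turn its predecessor into it, then compress the edit stream.
encoded = [sequences[0]]
for prev, cur in zip(sequences, sequences[1:]):
    ops = difflib.SequenceMatcher(a=prev, b=cur).get_opcodes()
    edits = [(tag, i1, i2, cur[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]
    encoded.append(repr(edits))
delta_size = len(zlib.compress("\n".join(encoded).encode(), 9))

print("flat + zlib :", flat_size, "bytes")
print("edits + zlib:", delta_size, "bytes")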

    Enhancing navigation in biomedical databases by community voting and database-driven text classification

    Background: The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. Results: Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion: Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu
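
    A minimal sketch of the classification idea described above, assuming scikit-learn: bagged decision trees over abstract text, producing class probability estimates of the kind that could drive a confidence heat map. The tiny training set and its labels are invented and are not PepBank data.

from sklearn.ensemble import BaggingClassifier  # default base estimator is a decision tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy abstracts and labels; the real system learns from PepBank entries.
abstracts = [
    "peptide inhibits tumor growth in a xenograft model",
    "angiogenesis-targeting peptide improves contrast in tumor imaging",
    "improved synthesis protocol for cyclic peptides",
    "stability of peptide formulations under long-term storage",
]
labels = ["cancer", "cancer", "other", "other"]

clf = make_pipeline(
    TfidfVectorizer(),
    BaggingClassifier(n_estimators=50, random_state=0),  # an ensemble of bagged decision trees
)
clf.fit(abstracts, labels)

# Class probabilities (the fraction of trees voting for each class) are the
# kind of confidence score that a retrieval interface can render as a heat map.
query = ["novel peptide probe for imaging tumor angiogenesis"]
print(dict(zip(clf.classes_, clf.predict_proba(query)[0])))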

    Automating Pharmacokinetic Predictions in Artemisia

    Pharmacokinetics (PK) is the time course of a compound in the body, which depends on the mechanisms of absorption, distribution, metabolism, and excretion (ADME). A thorough understanding of PK is essential to predict the consequences for organisms exposed to chemicals. In medicine, predictions of the PK of drugs allow us to prescribe drug treatments properly. In toxicology, PK allows us to predict potential exposure to environmental contaminants and how they may affect organisms at the time of exposure or in the future. Chemical ecology could benefit from computational predictions of PK to better understand which plants are consumed or avoided by wild herbivores. A limitation on computational predictions of PK in chemical ecology is the large number of biodiverse natural products involved in complex plant-herbivore-microbial interactions, compared with biomedical and environmental toxicology studies that focus on a select number of chemicals. The objective of this research was to automate the process of mining predicted PK of known chemical structures in plants consumed by herbivores and to use the predicted PK output to test hypotheses. The first hypothesis is that because monoterpenes have lower molecular weight and relatively high lipophilicity compared to phenolics and sesquiterpenes, they would have higher absorption, be more likely to be substrates for the efflux transporters that regulate absorption, and be more likely to inhibit metabolizing enzymes than phenolics and sesquiterpenes. The second hypothesis is that monoterpenes that are induced or avoided by foraging herbivores would have higher absorption, be less likely to be substrates for efflux transporters, and be more likely to inhibit metabolizing enzymes than the individual monoterpenes that are not induced or avoided by herbivores. This automated approach used Python packages to obtain chemical notations from the PubChem website and to mine predicted PK information for that chemical input from the SwissADME website. The PK output from SwissADME was analyzed using ANOVAs to test for differences in molecular weight and lipophilicity among chemical classes (monoterpenes, phenolics, and sesquiterpenes). Chi-squared tests were used to assess whether chemical groups had high or low absorption, were substrates of efflux transporters, or inhibited metabolizing enzymes. Mined PK data for chemicals can be used to understand drug-drug interactions in pharmacology, predict exposure to environmental contaminants in toxicology, and identify mechanisms mediating plant-microbe-herbivore interactions. However, realizing the broad benefits of mining predicted PK across disciplines requires a workforce with competency in chemistry, physiology, and computing who can validate the automation process and test hypotheses relevant to different disciplines. Course-based and Lab-based Undergraduate Research Experiences (CUREs and LUREs) have been shown not only to improve grades but also to increase engagement, diversity, and inclusion. As a graduate teaching assistant, I created and taught a PK LURE module in an undergraduate Animal Physiology and Nutrition course to create a sustainable quality-control step to validate the input of chemical structures and the PK output generated by the automated process. The course simultaneously provided students with an authentic research experience in which they integrated chemistry, pharmacology, computing, public databases, and literature searches to propose and test new hypotheses.
Students gained indispensable interdisciplinary research skills that can be transferred to jobs in veterinary and human medicine, pharmaceutics, and the natural sciences. Moreover, undergraduates used existing and new PK data to generate and test novel hypotheses that go beyond the work of any single graduate student or discipline. Overall, the integration of computing and authentic research experiences has advanced the research capacity of a diverse workforce who can predict the exposure to and consequences of chemicals in organisms
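
    A sketch of what the mining-and-testing pipeline could look like in Python. The PubChem PUG REST URL pattern and the SciPy tests are real, but the property name CanonicalSMILES is assumed to be accepted by the service; SwissADME is shown only as a placeholder because no official public API is used here, and the PK values and counts below are invented solely to illustrate the statistical tests.

import requests
from scipy import stats

def smiles_from_pubchem(name):
    """Fetch a canonical SMILES string for a compound name via PubChem PUG REST."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           f"{name}/property/CanonicalSMILES/TXT")
    return requests.get(url, timeout=30).text.strip()

print("camphor SMILES:", smiles_from_pubchem("camphor"))  # a monoterpene found in Artemisia

# Placeholder for the SwissADME step: in the real pipeline the SMILES strings
# are submitted to SwissADME and the predicted PK values are parsed from the
# results. The numbers below are invented solely to show the statistical tests.
predicted_pk = {
    "monoterpenes":   {"mol_wt": [152.2, 154.3, 136.2], "absorption": {"high": 3, "low": 0}},
    "phenolics":      {"mol_wt": [302.2, 290.3, 354.3], "absorption": {"high": 1, "low": 2}},
    "sesquiterpenes": {"mol_wt": [220.4, 204.4, 218.3], "absorption": {"high": 2, "low": 1}},
}

# ANOVA: does molecular weight differ among chemical classes?
f_stat, p_anova = stats.f_oneway(*(v["mol_wt"] for v in predicted_pk.values()))

# Chi-squared: is high vs. low predicted absorption independent of class?
table = [[v["absorption"]["high"], v["absorption"]["low"]] for v in predicted_pk.values()]
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)

print(f"ANOVA on molecular weight: F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Chi-squared on absorption: chi2 = {chi2:.2f}, p = {p_chi2:.4f}, dof = {dof}")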