181 research outputs found

    Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web

    If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view - the Global Cube - of the datasets available from the Web, building on Linked Data query approaches
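    As context for the Linked Data query approach mentioned above, the sketch below shows how multidimensional observations published with the RDF Data Cube vocabulary might be retrieved in Python via SPARQLWrapper. The endpoint URL, dataset URI, and choice of dimensions are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical sketch: querying multidimensional observations published as an
# RDF Data Cube (the vocabulary commonly used in Linked Data approaches to
# statistical data). Endpoint URL and dataset URI below are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"          # placeholder SPARQL endpoint
DATASET = "http://example.org/dataset/gdp"      # placeholder qb:DataSet URI

query = """
PREFIX qb:             <http://purl.org/linked-data/cube#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#>

SELECT ?area ?period ?value WHERE {
  ?obs a qb:Observation ;
       qb:dataSet <%s> ;
       sdmx-dimension:refArea   ?area ;
       sdmx-dimension:refPeriod ?period ;
       sdmx-measure:obsValue    ?value .
}
ORDER BY ?area ?period
""" % DATASET

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

# Print one (area, period, value) row per observation in the cube.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["area"]["value"], row["period"]["value"], row["value"]["value"])
```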

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to access and integrate these data effectively, biomedical ontologies (controlled, hierarchical vocabularies) are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods that could support ontology construction have been proposed in the past; however, good, validated systems are largely missing.

    Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential for a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, where comprehensive search is essential to determine whether alternative methods are available.

    Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system that supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text; for definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. Term generation yields high-quality terms that are also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, parent-child relations for up to 54%. No other validated system achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology was developed that contains 17,151 terms, of which 10% were newly created and 90% re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user submits a query to Go3R, the search engine expands the request using the structure and terminology of the ontology. The machine classification employed in Go3R distinguishes documents related to alternative methods from those that are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org
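    To illustrate the general idea behind term generation from text - ranking phrases that occur unusually often in a domain corpus relative to a background corpus - here is a toy Python sketch. It is a simplified illustration under that assumption, not DOG4DAG's actual algorithm; the scoring scheme and sample texts are made up.

```python
# Toy illustration: rank candidate two-word phrases by how over-represented
# they are in a domain text compared with a background text. This is a
# simplification for illustration, not DOG4DAG's term-generation method.
import math
import re
from collections import Counter

def bigrams(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]

def ranked_candidates(domain_text, background_text, top_n=5):
    dom, bg = Counter(bigrams(domain_text)), Counter(bigrams(background_text))
    dom_total, bg_total = sum(dom.values()) or 1, sum(bg.values()) or 1
    scores = {}
    for phrase, count in dom.items():
        p_dom = count / dom_total
        p_bg = (bg[phrase] + 1) / (bg_total + len(bg))   # add-one smoothing
        scores[phrase] = count * math.log(p_dom / p_bg)  # relevance-style score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

domain = "embryonic stem cell differentiation of embryonic stem cells in vitro"
background = "the weather in the city was mild and the traffic was heavy"
print(ranked_candidates(domain, background))
```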

    Organizing scientific data sets: studying similarities and differences in metadata and subject term creation

    According to Salo, the metadata entered into repositories are disorganized and the metadata schemes underlying repositories are arcane. This creates a challenging repository environment with regard to personal information management (PIM) and knowledge organization systems (KOSs). This dissertation research is a step towards addressing the need to study the information organization of scientific data in more detail.

    METHODS: A concurrent triangulation mixed methods approach was used to study the descriptive metadata and subject term application of information professionals and scientists when working with two data sets (the bird data set and the hunting data set). Quantitative and qualitative methods were used in combination during study design, data collection, and analysis.

    RESULTS: A total of 27 participants, 11 information professionals and 16 scientists, took part in this study. Descriptive metadata results indicate that information professionals were more likely to use standardized metadata schemes, whereas scientists did not use library-based standards to organize data in their own collections. Nearly all scientists mentioned how central software was to their overall data organization processes. Subject term application results suggest that the Integrated Taxonomic Information System (ITIS) was the best vocabulary for describing scientific names, while the Library of Congress Subject Headings (LCSH) were best for describing topical terms. The two groups applied 45 topical terms to the bird data set and 49 topical terms to the hunting data set. Term overlap, meaning the same terms were applied by both groups, was close to 25% for each data set (27% for the bird data set and 24% for the hunting data set); unique terms, those applied by only one of the groups, were more widely dispersed.

    CONCLUSIONS: While there were similarities between the two groups, the differences were the most apparent. Based on this research, it is recommended that general repositories use metadata created by information professionals, while domain-specific repositories use metadata created by scientists
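    One plausible reading of the term-overlap figures quoted above is the share of all applied topical terms that both groups used. A minimal sketch of that reading, with made-up term lists (the actual terms and definition used in the dissertation may differ):

```python
# Minimal sketch of one plausible reading of "term overlap": the fraction of
# all distinct topical terms that both groups applied. Term lists are made up.
professionals = {"birds", "migration", "ornithology", "wetlands"}
scientists    = {"birds", "migration", "population survey", "banding"}

overlap = len(professionals & scientists) / len(professionals | scientists)
print(f"term overlap: {overlap:.0%}")   # 2 shared of 6 distinct terms -> 33%
```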

    Modelling and measurement in synthetic biology

    Synthetic biology applies engineering principles to make progress in the study of complex biological phenomena. The aim is to develop understanding through the praxis of construction and design. The computational branch of this endeavour explicitly brings the tools of abstraction and modularity to bear. This thesis pursues two distinct lines of inquiry concerning the application of computational tools in the setting of synthetic biology: one thread traces a narrative through multi-paradigm computational simulations, interpretation of results, and quantification of biological order; the other develops computational infrastructure for describing, simulating, and discovering synthetic genetic circuits.

    The emergence of structure in biological organisms, morphogenesis, is critically important for understanding both normal and pathological development of tissues. Here, we focus on epithelial tissues because models of two-dimensional cellular monolayers are computationally tractable. We use a vertex model that consists of a potential energy minimisation process interwoven with topological changes in the graph structure of the tissue. To make this interweaving precise, we define a language for propagators from which an unambiguous description of the simulation methodology can be constructed. The vertex model is then used to reproduce laboratory results of patterning in engineered mammalian cells. The claim of reproduction is justified by a novel measure of structure on coloured graphs which we call path entropy. This measure is then extended to the setting of continuous regions and used to quantify the development of structure in house mouse (Mus musculus) embryos using three-dimensional segmented anatomical models.

    While it is recognised that DNA can be considered a powerful computational environment, it is far from obvious how to program with nucleic acids. Using rule-based modelling of modular biological parts, we develop a method for discovering synthetic genetic programs that meet a specification provided by the user. This method rests on the concept of annotation as applied to rule-based programs: we begin by annotating rules and proceed to generating entire rule-based programs from the annotations themselves. Building on these tools, we describe an evolutionary algorithm for discovering genetic circuits from specifications provided in terms of probability distributions. This strategy provides a dual benefit: stochastic simulation captures circuit behaviour at low copy numbers as well as complex properties such as oscillations, and using standard biological parts produces results that are implementable in the laboratory
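    For readers unfamiliar with vertex models of epithelial monolayers, the sketch below shows a standard single-cell energy of the kind typically minimised in such simulations (area elasticity plus perimeter contractility). It is a generic textbook-style formulation under that assumption; the thesis's exact energy terms, propagator language, and parameter values may differ, and the numbers here are made up.

```python
# A standard 2D vertex-model energy of the kind typically minimised in
# epithelial-monolayer simulations: area elasticity + perimeter contractility.
# Generic illustration only; the thesis's exact formulation may differ.
import numpy as np

def polygon_area(vertices):
    """Shoelace area of a polygon given as an (n, 2) array of vertices."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * np.abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def polygon_perimeter(vertices):
    """Sum of edge lengths around the polygon."""
    return np.sum(np.linalg.norm(np.roll(vertices, -1, axis=0) - vertices, axis=1))

def cell_energy(vertices, target_area=1.0, k_area=1.0, gamma=0.04):
    """E = k_area * (A - A0)^2 + gamma * L^2 for a single cell (made-up parameters)."""
    area = polygon_area(vertices)
    perimeter = polygon_perimeter(vertices)
    return k_area * (area - target_area) ** 2 + gamma * perimeter ** 2

# Regular hexagonal cell with unit circumradius, as a minimal usage example.
hexagon = np.array([[np.cos(a), np.sin(a)] for a in np.linspace(0, 2 * np.pi, 7)[:-1]])
print(cell_energy(hexagon))
```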

    Designing Data Spaces

    This open access book provides a comprehensive view of data ecosystems and platform economics, from methodical and technological foundations up to reports from practical implementations and applications in various industries. To this end, the book is structured in four parts: Part I “Foundations and Contexts” provides a general overview of building, running, and governing data spaces and an introduction to the IDS and GAIA-X projects. Part II “Data Space Technologies” subsequently details various implementation aspects of IDS and GAIA-X, including, for example, data usage control, the use of blockchain technologies, and semantic data integration and interoperability. Part III describes various “Use Cases and Data Ecosystems” from application areas such as agriculture, healthcare, industry, energy, and mobility. Part IV finally offers an overview of several “Solutions and Applications”, including products and experiences from companies such as Google, SAP, Huawei, T-Systems, Innopay, and many more. Overall, the book provides professionals in industry with an encompassing overview of the technological and economic aspects of data spaces, based on the International Data Spaces and Gaia-X initiatives. It presents implementations and business cases and gives an outlook on future developments. In doing so, it aims to promote the vision of a social data market economy based on data spaces that embrace trust and data sovereignty

    Preface

