47 research outputs found
Analysis and Synthesis of Metadata Goals for Scientific Data
The proliferation of discipline-specific metadata schemes contributes to artificial barriers that can impede interdisciplinary and transdisciplinary research. The authors considered this problem by examining the domains, objectives, and architectures of nine metadata schemes used to document scientific data in the physical, life, and social sciences. Using a mixed-methods content analysis and Greenberg’s (2005) metadata objectives, principles, domains, and architectural layout (MODAL) framework, they derived 22 metadata-related goals from textual content describing each metadata scheme. Relationships were identified between the domains (e.g., scientific discipline and type of data) and the categories of scheme objectives. For each strong correlation (> 0.6), a Fisher’s exact test for nonparametric data was used to determine significance (p < .05).
Significant relationships were found between the domains and objectives of the schemes. Schemes describing observational data are more likely to have “scheme harmonization” (compatibility and interoperability with related schemes) as an objective; schemes with the objective “abstraction” (a conceptual model exists separate from the technical implementation) also have the objective “sufficiency” (the scheme defines a minimal amount of information to meet the needs of the community); and schemes with the objective “data publication” do not have the objective “element refinement.” The analysis indicates that many metadata-driven goals expressed by communities are independent of scientific discipline or the type of data, although they are constrained by historical community practices and workflows as well as the technological environment at the time of scheme creation. The analysis reveals 11 fundamental metadata goals for metadata documenting scientific data in support of sharing research data across disciplines and domains. The authors report these results and highlight the need for more metadata-related research, particularly in the context of recent funding agency policy changes.
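As a hedged illustration of the statistical procedure described above (a sketch, not the authors' code), the snippet below runs a Fisher's exact test on a hypothetical 2x2 contingency table crossing a data-type domain with the presence of the "scheme harmonization" objective; the counts are invented for illustration only.

```python
# Illustrative only: hypothetical 2x2 contingency table (counts are made up),
# crossing data type (observational vs. other) with whether "scheme
# harmonization" is an objective, tested with Fisher's exact test.
from scipy.stats import fisher_exact

table = [[4, 1],   # observational data: objective present / absent
         [1, 3]]   # other data types:   objective present / absent

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")  # significant if p < .05
```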
Resources for Lothbrok: Optimizing SPARQL Queries over Decentralized Knowledge Graphs
A repository for the resources needed to reproduce the experiments in our paper "Optimizing SPARQL Queries over Decentralized Knowledge Graphs".
DBpedia RDF2Vec Graph Embeddings
Generation of RDF2Vec embeddings: DBpedia graph embeddings using RDF2Vec. The RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1]. The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021). A figure of cosine similarities between a selected set of DBpedia entities is provided in the dataset here.

Generating Embeddings
The code for generating these embeddings can be found here. Run the run.sh script, which wraps all the necessary commands to generate embeddings: bash run.sh. The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files. A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files.

Run Time of Embeddings Generation
Generating the embeddings can take more than a day, depending on the number of DBpedia files chosen to be downloaded. Below are basic run time statistics for embeddings generated on a machine with 64 GB RAM, 8 cores (AMD EPYC, 1996.221 MHz), and a 1 TB SSD.
Total: 1 day, 8 hours, 52 minutes, 41 seconds
Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds
Training: 1 day, 1 hour, 28 minutes, 5 seconds

Parameters Used
The parameters used to generate the embeddings provided here:
Number of walks per entity: 100
Depth (hops) per walk: 4
Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE
Threads: # of processors / 2
Training mode: sg
Embeddings vector dimension: 200
Minimum word2vec word count: 1
Sample rate: 0.0
Training window size: 5
Training epochs:
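A minimal sketch (not part of this deposit) of how the resulting vectors.txt could be loaded and used to compute a cosine similarity between two DBpedia entities, assuming the word2vec plain-text format (one entity per line followed by its 200 vector components) and full DBpedia IRIs as keys; the two entity IRIs below are placeholders.

```python
# Sketch: load RDF2Vec vectors and compare two DBpedia entities by cosine
# similarity. Assumes word2vec text format; adjust parsing if the output differs.
import numpy as np

def load_vectors(path="embeddings/dbpedia/vectors.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:                # skip a possible "count dim" header
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = load_vectors()
e1 = "http://dbpedia.org/resource/Copenhagen"  # placeholder entity IRIs
e2 = "http://dbpedia.org/resource/Aarhus"
print(cosine(vecs[e1], vecs[e2]))
```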
A core ontology for modeling life cycle sustainability assessment on the Semantic Web with Accompanying Database
To enable and support the uptake of semantic ontologies, we present a core ontology developed specifically to capture the data relevant for life cycle sustainability assessment. We further demonstrate the utility of the ontology by using it to integrate data relevant to sustainability assessments, such as EXIOBASE and the Yale Stocks and Flow Database, into the Semantic Web. These datasets can be accessed via a machine-readable endpoint using SPARQL, a semantic query language.
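As a hedged illustration of querying such an endpoint with SPARQL from Python, the sketch below uses the SPARQLWrapper library; the endpoint URL is a placeholder, and the query lists a few arbitrary triples rather than ontology-specific terms.

```python
# Sketch: run a simple SPARQL query against a (placeholder) endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")  # placeholder endpoint URL
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```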
Automatically Extracted SHACL Shapes for WikiData, DBpedia, YAGO-4, and LUBM & Associated Coverage Statistics
The uploaded datasets contain automatically extracted SHACL shapes for the following datasets:
WikiData (the truthy dump from September 2021, filtered by removing non-English strings) [1]
DBpedia [2]
YAGO-4 [3]
LUBM (scale factor 500) [4]
The validating shapes for these datasets are generated by a program that parses the corresponding RDF files (in `.nt` format). The extracted shapes encode various SHACL constraints, e.g., sh:minCount, sh:path, sh:class, and sh:datatype. For each shape, we encode coverage in terms of the number of entities satisfying that shape; this information is encoded using the void:entities predicate. We have provided, as an executable JAR file, the program we developed to extract these SHACL shapes. More details about the datasets used to extract these shapes and how to run the JAR are available on our GitHub repository: https://github.com/Kashif-Rabbani/validatingshapes.
[1] Vrandečić, Denny, and Markus Krötzsch. "Wikidata: A free collaborative knowledgebase." Communications of the ACM 57.10 (2014): 78-85.
[2] Auer, Sören, et al. "DBpedia: A nucleus for a web of open data." The Semantic Web. Springer, Berlin, Heidelberg, 2007. 722-735.
[3] Pellissier Tanon, Thomas, Gerhard Weikum, and Fabian Suchanek. "YAGO 4: A reason-able knowledge base." European Semantic Web Conference. Springer, Cham, 2020.
[4] Guo, Yuanbo, Zhengxiang Pan, and Jeff Heflin. "LUBM: A benchmark for OWL knowledge base systems." Journal of Web Semantics 3.2-3 (2005): 158-182.
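The following is a hedged sketch (separate from the provided JAR) of how the extracted shapes and their void:entities coverage could be inspected with rdflib, assuming the shapes are serialized as Turtle in a file named shapes.ttl and that the coverage counts are attached to node shapes.

```python
# Sketch: list node shapes, their target classes, and void:entities coverage.
from rdflib import Graph, Namespace, RDF

SH = Namespace("http://www.w3.org/ns/shacl#")
VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.parse("shapes.ttl", format="turtle")        # placeholder file name

for shape in g.subjects(RDF.type, SH.NodeShape):
    target = g.value(shape, SH.targetClass)   # class the shape targets, if any
    coverage = g.value(shape, VOID.entities)  # number of entities satisfying it
    print(shape, target, coverage)
```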
The Project of Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments
The source code, datasets, and scripts for reproducing the experiments of our paper entitled "Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments".
Tutorial for the 2022 ACM SIGMOD Conference: Spatial Data Quality in the IoT Era: Management and Exploitation
Within the rapidly expanding Internet of Things (IoT), growing amounts of spatially referenced data are being generated. Due to the dynamic, decentralized, and heterogeneous nature of the IoT, spatial IoT data (SID) quality has attracted considerable attention in academia and industry. Inventing and using technologies for managing spatial data quality and for exploiting low-quality spatial data are key challenges in the IoT. In this tutorial, we highlight the SID consumption requirements in applications and offer an overview of spatial data quality in the IoT setting. In addition, we review pertinent technologies for quality management and low-quality data exploitation, and we identify trends and future directions for quality-aware SID management and utilization. The tutorial aims not only to help researchers and practitioners better comprehend SID quality challenges and solutions, but also to offer insights that may enable innovative research and applications.
Generalized Approximate Message Passing Practical 2D Phase Transition Simulations Dataset
This deposition contains the results from a simulation of phase transitions for various practical 2D problem suites when using the Generalised Approximate Message Passing (GAMP) reconstruction algorithm. The deposition consists of:
Five HDF5 databases containing the results from the phase transition simulations (gamp_practical_2d_phase_transitions_ID_[0-4]_of_5.hdf5).
The Python script which was used to create the databases (gamp_practical_2d_phase_transitions.py).
A Python module with tools needed to run the simulations (gamp_pt_tools.py).
MD5 and SHA256 checksums of the databases and Python scripts (gamp_practical_2d_phase_transitions.MD5SUMS / gamp_practical_2d_phase_transitions.SHA256SUMS).
The HDF5 databases are licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Since the CC BY 4.0 license is not well suited for source code, the Python scripts are licensed under the BSD 2-Clause license (http://opensource.org/licenses/BSD-2-Clause). The files are provided as-is with no warranty, as detailed in the above-mentioned licenses.
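As a hedged sketch of how a downloaded database might be checked and opened (this is not one of the deposited scripts), the snippet below verifies one HDF5 file against its SHA256 checksum and lists its top-level groups with h5py; the expected checksum is a placeholder to be copied from the .SHA256SUMS file.

```python
# Sketch: verify a database file's SHA256 checksum and peek into the HDF5 structure.
import hashlib
import h5py

db_path = "gamp_practical_2d_phase_transitions_ID_0_of_5.hdf5"
expected_sha256 = "<copy the value from gamp_practical_2d_phase_transitions.SHA256SUMS>"

with open(db_path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("checksum ok:", digest == expected_sha256)

with h5py.File(db_path, "r") as db:
    print(list(db.keys()))  # top-level groups holding the simulation results
```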
Reconstruction Algorithms in Undersampled AFM Imaging - results
This data set contains numerical simulation results from experiments for the paper "Review of compressed sensing reconstruction algorithms in AFM cell imaging", submitted to IEEE Journal of Selected Topics in Signal Processing. The data set consists of an HDF5 file containing the simulation results, as well as MD5 and SHA checksums of the HDF5 database for validating the integrity of the data after download. The data set is licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Python scripts used for producing these results, as well as Python scripts for extracting images and data used in the accompanying paper from the database, can be found in the accompanying deposition http://doi.org/10.5281/zenodo.18745. The data set contains images, and reconstructed versions of these, originally published in the data set available at http://dx.doi.org/10.5281/zenodo.17573.
