
    Schema Inference for Massive JSON Datasets

    In recent years, JSON has become a very popular format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about the input data. We then present our main contribution: the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference times for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision, conciseness of inferred schemas, and scalability.
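    As a rough illustration of the kind of inference described above (not the paper's actual algorithm, which is a distributed Spark computation), the sketch below infers a structural type for each JSON document and merges types across a collection, marking fields that appear in only some documents as optional:

```python
import json

def infer_type(value):
    """Infer a simple structural type for one JSON value."""
    if isinstance(value, dict):
        return {"record": {k: infer_type(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Merge all element types into a single array element type.
        t = None
        for v in value:
            t = merge(t, infer_type(v)) if t is not None else infer_type(v)
        return {"array": t}
    return type(value).__name__  # 'str', 'int', 'float', 'bool', 'NoneType'

def merge(t1, t2):
    """Merge two inferred types; fields present on only one side become optional."""
    if t1 == t2:
        return t1
    if isinstance(t1, dict) and isinstance(t2, dict) and "record" in t1 and "record" in t2:
        r1, r2 = t1["record"], t2["record"]
        merged = {}
        for k in set(r1) | set(r2):
            if k in r1 and k in r2:
                merged[k] = merge(r1[k], r2[k])
            else:
                merged[k] = {"optional": r1.get(k) or r2.get(k)}
        return {"record": merged}
    return {"union": [t1, t2]}

docs = [json.loads(s) for s in ['{"a": 1, "b": "x"}', '{"a": 2}']]
schema = infer_type(docs[0])
for d in docs[1:]:
    schema = merge(schema, infer_type(d))
# schema now records that field "b" is optional
```

    The hard part the paper addresses is making this merge scale, which is why the fold over documents is implemented as a distributed aggregation in Spark.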

    Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data

    The current data deluge is flooding the web with large volumes of data represented in RDF, giving rise to the so-called 'Web of Data'. In this thesis we propose, first, an in-depth study that provides a global understanding of the real structure of RDF datasets, and then HDT, which addresses the efficient representation of large volumes of RDF data through structures optimized for storage and transmission over the network. HDT efficiently represents an RDF dataset by splitting it into three components: the Header, the Dictionary, and the RDF statement structure (Triples). We then focus on providing efficient structures for these components, occupying compressed space while still allowing direct access to any piece of data.
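    As a rough sketch of the Dictionary/Triples split described above (the example terms are invented, and this is not the actual HDT binary format), the following replaces RDF terms with integer IDs so that the triple structure stores only compact ID tuples:

```python
# Toy RDF dataset; terms here are illustrative, not from any real graph.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob",   "foaf:knows", "ex:alice"),
]

# Dictionary component: one integer ID per distinct term, sorted for locality.
terms = sorted({t for triple in triples for t in triple})
term_to_id = {term: i + 1 for i, term in enumerate(terms)}

# Triples component: the same statements as compact integer tuples.
id_triples = [tuple(term_to_id[t] for t in triple) for triple in triples]

# Header component: metadata about the dataset (counts, ordering, provenance).
header = {"triples": len(id_triples), "distinct_terms": len(terms)}

# Decoding reverses the dictionary, so any triple remains directly accessible.
id_to_term = {i: t for t, i in term_to_id.items()}
```

    Repeated long IRIs are stored once in the dictionary, which is where most of the compression in this style of encoding comes from.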

    A method for checking the quality of geographic metadata based on ISO 19157

    With recent advances in remote sensing, location-based services and other related technologies, the production of geospatial information has increased exponentially in recent decades. Furthermore, to facilitate discovery of and efficient access to such information, spatial data infrastructures were promoted and standardized, with the consideration that metadata are essential to describing data and services. Standardization bodies such as the International Organization for Standardization have defined well-known metadata models such as ISO 19115. However, current metadata assets exhibit heterogeneous quality levels because they are created by different producers with different perspectives. To address quality-related concerns, several initiatives have attempted to define a common framework and to test the suitability of metadata through automatic controls. Nevertheless, these controls focus on interoperability, testing the format of metadata and a set of controlled elements. In this paper, we propose a methodology for testing the quality of metadata that considers aspects beyond interoperability. The proposal adapts ISO 19157 to the metadata case and has been applied to a corpus of the Spanish Spatial Data Infrastructure. The results demonstrate that our quality check helps determine different types of errors for all metadata elements and can be almost completely automated to enhance the significance of metadata.
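    A hypothetical sketch of such an automated control, loosely in the spirit of ISO 19157 quality elements (completeness, domain consistency); the element names and rules below are illustrative, not taken from the standard or from the paper's methodology:

```python
# Illustrative per-element validity rules for a metadata record.
RULES = {
    "title":    lambda v: bool(v and v.strip()),           # completeness of content
    "language": lambda v: v in {"en", "es", "fr", "de"},   # domain consistency
    "date":     lambda v: bool(v) and len(v) == 10 and v[4] == "-" and v[7] == "-",
}

def check_record(record):
    """Return the list of (element, issue) pairs found in one metadata record."""
    issues = []
    for element, rule in RULES.items():
        value = record.get(element)
        if value is None:
            issues.append((element, "missing"))        # omission
        elif not rule(value):
            issues.append((element, "invalid value"))  # domain error
    return issues

record = {"title": "Roads of Aragon", "language": "xx", "date": "2020-01-15"}
issues = check_record(record)  # flags the invalid language code
```

    Running such checks over a whole corpus yields per-element error rates, which is the kind of quantitative quality reporting the paper automates.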

    Introduction to Linked Data

    This chapter presents Linked Data, a form of distributed data on the web that is especially suitable for manipulation by machines and for sharing knowledge. By adopting the Linked Data publication paradigm, anybody can publish data on the web, relate it to data resources published by others, and run artificial intelligence algorithms over it in a smooth manner. Open Linked Data resources may democratize future access to knowledge for the mass of internet users, either directly or mediated through algorithms. Governments have enthusiastically adopted these ideas, which is in harmony with the broader open data movement.

    Requirements for a global data infrastructure in support of CMIP6

    The World Climate Research Programme (WCRP)’s Working Group on Climate Modelling (WGCM) Infrastructure Panel (WIP) was formed in 2014 in response to the explosive growth in size and complexity of Coupled Model Intercomparison Projects (CMIPs) between CMIP3 (2005–2006) and CMIP5 (2011–2012). This article presents the WIP recommendations for the global data infrastructure needed to support CMIP design, future growth, and evolution. Developed in close coordination with those who build and run the existing infrastructure (the Earth System Grid Federation; ESGF), the recommendations are based on several principles, beginning with the need to separate requirements, implementation, and operations. Other important principles include the consideration of the diversity of community needs around data (a data ecosystem), the importance of provenance, the need for automation, and the obligation to measure costs and benefits. This paper concentrates on requirements, recognizing the diversity of the communities involved (modelers, analysts, software developers, and downstream users). Such requirements include the need for scientific reproducibility and accountability alongside the need to record and track data usage. One key element is to generate a dataset-centric rather than system-centric focus, with the aim of making the infrastructure less prone to systemic failure.
With these overarching principles and requirements, the WIP has produced a set of position papers, which are summarized in the latter pages of this document. They provide specifications for managing and delivering model output, including strategies for replication and versioning, licensing, data quality assurance, citation, long-term archiving, and dataset tracking. They also describe a new and more formal approach for specifying what data, and associated metadata, should be saved, which enables future data volumes to be estimated, particularly for well-defined projects such as CMIP6.
The paper concludes with a future-facing consideration of the evolution of the global data infrastructure that follows from the blurring of boundaries between climate and weather, and the changing nature of published scientific results in the digital age.

    Human Performance Contributions to Safety in Commercial Aviation

    In the commercial aviation domain, large volumes of data are collected and analyzed on the failures and errors that result in infrequent incidents and accidents, but in the absence of data on the behaviors that contribute to routine successful outcomes, safety management and system design decisions are based on a small sample of non-representative safety data. Analysis of aviation accident data suggests that human error is implicated in up to 80% of accidents, which has been used to justify future visions for aviation in which the roles of human operators are greatly diminished or eliminated in the interest of creating a safer aviation system. However, failure to fully consider the human contributions to successful system performance in civil aviation represents a significant and largely unrecognized risk when making policy decisions about human roles and responsibilities. Opportunities exist to leverage the vast amount of data that has already been collected, or could easily be obtained, to increase our understanding of human contributions to things going right in commercial aviation. The principal focus of this assessment was to identify current gaps and explore methods for identifying human success data generated by the aviation system, from personnel and within the supporting infrastructure.

    Large-Scale Data Management and Analysis (LSDMA) - Big Data in Science


    DARIAH and the Benelux


    The building and application of a semantic platform for an e-research society

    This thesis reviews the area of e-Research (the use of electronic infrastructure to support research) and considers how the insight gained from the development of social networking sites in the early 21st century might assist researchers in using this infrastructure. In particular it examines the myExperiment project, a website for e-Research that allows users to upload, share and annotate workflows and associated files, using a social networking framework. This Virtual Organisation (VO) supports many of the attributes required to allow a community of users to come together to build an e-Research society. The main focus of the thesis is how the emerging society that is developing out of myExperiment could use Semantic Web technologies to provide users with a significantly richer representation of their research and research processes, to better support reproducible research. One of the initial major contributions was building an ontology for myExperiment. Through this it became possible to build an API for generating and delivering this richer representation and an interface for querying it. With this richer representation in place, it has been possible to follow Linked Data principles to link up with other projects that publish this type of representation. Doing so has allowed additional data to be provided to the user and has begun to set in context the data produced by myExperiment. The way that the myExperiment project has gone about this task, and its consideration of how changes may affect existing users, is another major contribution of this thesis. Adding a semantic representation to an emergent e-Research society like myExperiment has given it the potential to provide additional applications, in particular the capability to support Research Objects: an encapsulation of a scientist's research or research process to support reproducibility.
The insight gained by adding a semantic representation to myExperiment has allowed this thesis to contribute towards the design of the architecture for these Research Objects, which use similar Semantic Web technologies. The myExperiment ontology has been designed so that it can be aligned with other ontologies. Scientific Discourse, the collaborative argumentation of different claims and hypotheses, supported by evidence from experiments, to construct, confirm or disprove theories, requires the capability to represent experiments carried out in silico. This thesis discusses how, as part of the HCLS Scientific Discourse subtask group, the myExperiment ontology has begun to be aligned with other scientific discourse ontologies to provide this capability. It also compares this alignment of ontologies with the architecture for Research Objects. This thesis also examines how myExperiment's Linked Data, and that of other projects, can be used in the design of novel interfaces. As a theoretical exercise, it considers how this Linked Data might be used to support a question-answering system that would allow users to query myExperiment's data in a more efficient and user-friendly way. It concludes by reviewing all the steps undertaken to provide a semantic platform for an emergent e-Research society, facilitating the sharing of research and its processes to support reproducible research. It assesses their contribution to enhancing the features provided by myExperiment, as well as e-Research as a whole. It considers how the contributions of this thesis could be extended to produce additional tools that will allow researchers to make greater use of the rich data that is now available, in a way that enhances their research process rather than significantly changing it or adding extra workload.
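    To give a flavour of the querying that such a triple-based semantic representation enables (the graph and the wf:/mexp: terms below are invented for illustration; a real deployment would use an RDF store and SPARQL rather than this toy matcher):

```python
# Toy RDF-like graph; every statement is a (subject, predicate, object) triple.
graph = {
    ("wf:1", "dc:creator", "user:42"),
    ("wf:1", "rdf:type",   "mexp:Workflow"),
    ("wf:2", "dc:creator", "user:42"),
}

def match(pattern, triples):
    """Yield triples matching an (s, p, o) pattern; None acts as a wildcard."""
    for t in triples:
        if all(p is None or p == v for p, v in zip(pattern, t)):
            yield t

# Everything created by user:42, regardless of resource type.
created = sorted(s for s, _, _ in match((None, "dc:creator", "user:42"), graph))
```

    Because projects publishing Linked Data share this triple model and common vocabularies, the same pattern query can span data from myExperiment and from other sources that it links to.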