1,080 research outputs found

Schema Inference for Massive JSON Datasets

Author: Dario Colazzo
Giorgio Ghelli
Houssem Ben Lahmar
Mohamed Amine Baazizi
Carlo Sartiani
Publication venue: OpenProceedings.org
Publication date: 01/01/2017

In recent years JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision, and conciseness of inferred schemas, and scalability.

Archivio della Ricerca - Università della Basilicata

Archivio della Ricerca - Università di Pisa

A Type System for Interactive JSON Schema Inference (Extended Abstract)

Author: Baazizi Mohamed-Amine
Colazzo Dario
Ghelli Giorgio
Sartiani Carlo
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)
Publication date: 01/01/2019

In this paper we present the first JSON type system that makes it possible to infer a schema by adopting different levels of precision/succinctness for different parts of the dataset, under user control. This feature lets the data analyst obtain detailed schemas for the parts of the data of greater interest, while a more succinct schema is provided for other parts. The decision can be changed as many times as needed in order to explore the schema gradually, moving the focus to different parts of the collection, without reprocessing the data and by only performing type rewriting operations on the most precise schema.

Archivio della Ricerca - Università della Basilicata

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

Author: Mior Michael J.
Publication date: 06/07/2023

Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery with similar performance. Our approach also adds significant useful additional information about data values to discovered schemas with linear scalability.
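To illustrate why monoid summaries suit a distributed setting, here is a minimal Python sketch. It is not JSONoid's actual API or algorithm: the names `FieldSummary`, `kind_of`, `summarize`, and `merge_schemas` are hypothetical, and the tracked properties (occurrence count, observed value kinds, numeric range) are assumptions chosen for illustration. Each document is mapped to a small summary, and summaries are combined with an associative merge that has an identity element, so partial results from different partitions could be combined in any grouping.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class FieldSummary:
    """Hypothetical monoid summary for one field."""
    count: int = 0                   # number of documents containing the field
    kinds: frozenset = frozenset()   # JSON value kinds observed for the field
    lo: float = float("inf")         # smallest numeric value seen
    hi: float = float("-inf")        # largest numeric value seen

    def merge(self, other):
        # Associative and commutative; FieldSummary() acts as the identity.
        return FieldSummary(
            self.count + other.count,
            self.kinds | other.kinds,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )


def kind_of(value):
    # bool is tested before int because bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def summarize(doc):
    """Map step: summarize the top-level fields of one JSON object."""
    out = {}
    for name, value in doc.items():
        kind = kind_of(value)
        is_num = kind == "number"
        out[name] = FieldSummary(
            1,
            frozenset([kind]),
            value if is_num else float("inf"),
            value if is_num else float("-inf"),
        )
    return out


def merge_schemas(a, b):
    """Reduce step: merge two per-field summary maps key by key."""
    merged = dict(a)
    for name, summary in b.items():
        merged[name] = merged.get(name, FieldSummary()).merge(summary)
    return merged


docs = [
    {"id": 1, "name": "a"},
    {"id": 7, "name": None, "tags": []},
    {"id": 3},
]
schema = reduce(merge_schemas, map(summarize, docs), {})
```

In this sketch a field whose `count` is lower than the number of documents (here `name` and `tags`) can be read as optional, and the numeric range of `id` comes out of the same pass that discovers the structure.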

arXiv.org e-Print Archive

Big Data Mining and Semantic Technologies: Challenges and Opportunities

Author: Ms. Yesha Mehta, Dr. Sanjay Buch
Publication venue: Auricle Technologies, Pvt., Ltd.
Publication date: 31/07/2015

Big data, a term coined in response to the explosion in the quantity and diversity of high-frequency digital data with potential for valuable insights, has drawn much attention in research and development. Converting big data into actionable insights requires an in-depth understanding of big data, its characteristics, its challenges, and current technological trends. The rise of big data is changing existing data storage, management, processing, and analytical mechanisms, and is leading to new architectures and ecosystems for handling big data applications. This paper covers the findings of our research study on big data characteristics, the various types of analysis associated with big data, and basic big data types. First, we present the big data study from a data mining and analysis perspective and discuss the challenges; next, we present the results of our research study on the meaningful use of big data in the context of semantic technologies. Moreover, we discuss various case studies related to social media analysis and recent development trends to identify potential research directions for big data with semantic technologies. DOI: 10.17762/ijritcc2321-8169.150711

International Journal on Recent and Innovation Trends in Computing and Communication

Dataset Discovery and Exploration: A Survey

Author: Chen Jiaoyan
Paton Norman
Wu Zhenyu
Publication date: 04/10/2023

The University of Manchester - Institutional Repository