
    Automated schema matching techniques: an exploratory study

    Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach called SemMA uses an external dictionary to facilitate automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques under the proposed approach. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
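The dictionary-assisted idea the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction, not SemMA's actual code: the `SYNONYMS` table stands in for an external dictionary such as a thesaurus, and two schema terms are matched when their synonym-expanded sets overlap.

```python
# Hypothetical sketch of dictionary-assisted schema matching: schema terms
# are expanded with synonyms from an external dictionary, and two terms
# match when their expanded sets intersect.

SYNONYMS = {  # stand-in for an external dictionary / thesaurus
    "cust": {"customer", "client"},
    "customer": {"customer", "client"},
    "zip": {"zip", "postcode", "postal_code"},
    "postcode": {"zip", "postcode", "postal_code"},
}

def expand(term: str) -> set[str]:
    """Return the term itself plus any dictionary synonyms (lower-cased)."""
    t = term.lower()
    return {t} | SYNONYMS.get(t, set())

def match_schemas(source: list[str], target: list[str]) -> list[tuple[str, str]]:
    """Pair source/target columns whose expanded term sets intersect."""
    pairs = []
    for s in source:
        for t in target:
            if expand(s) & expand(t):
                pairs.append((s, t))
    return pairs

print(match_schemas(["cust", "zip"], ["client", "postcode"]))
# [('cust', 'client'), ('zip', 'postcode')]
```

A real system would weight matches and resolve conflicts rather than accept any overlap, but the expansion-then-intersection step is the core of dictionary-based semantic matching.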

    Towards an integrated discovery system

    Previous research on machine discovery has focused on limited parts of the empirical discovery task. In this paper, we describe IDS, an integrated system that addresses both qualitative and quantitative discovery. The program represents its knowledge in terms of qualitative schemas, which it discovers by interacting with a simulated physical environment. Once IDS has formulated a qualitative schema, it uses that schema to design experiments and to constrain the search for quantitative laws. We have carried out preliminary tests in the domain of heat phenomena. In this context, the system has discovered both intrinsic properties, such as the melting point of substances, and numeric laws, such as the conservation of mass for objects going through a phase change.
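The interplay the abstract describes can be illustrated with a toy example. This is a hypothetical sketch, not IDS's actual representation: a qualitative regime label (solid, melting, liquid) segments the observations, and a simple numeric law, here a conserved quantity, is then sought within each segment rather than over the whole space.

```python
# Toy illustration of a qualitative schema constraining quantitative
# search: observations are grouped by qualitative regime, and a candidate
# law (the quantity is constant) is tested per regime.

observations = [  # (regime, mass) for an object crossing a phase change
    ("solid", 10.0), ("solid", 10.0), ("melting", 10.0), ("liquid", 10.0),
]

def conserved_within(observed, tol=1e-9):
    """Report, per qualitative regime, whether the quantity stays constant."""
    by_regime: dict[str, list[float]] = {}
    for regime, value in observed:
        by_regime.setdefault(regime, []).append(value)
    return {r: max(vs) - min(vs) <= tol for r, vs in by_regime.items()}

print(conserved_within(observations))
# {'solid': True, 'melting': True, 'liquid': True}
```

The point of the constraint is search reduction: a law only needs to be fitted inside regimes the qualitative schema has already identified.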

    Valentine: Evaluating Matching Techniques for Dataset Discovery

    Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use: schema matching now serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success depends heavily on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad hoc fashion due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to the absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods, and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight into the strengths and weaknesses of existing techniques that can serve as a guide for employing schema matching in future dataset discovery methods.
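The "building block for indicating and ranking inter-dataset relationships" can be sketched with a minimal instance-based matcher. This is illustrative only, not Valentine's code: every source/target column pair is scored by Jaccard overlap of its value sets, the kind of signal several of the matchers Valentine evaluates build on, and the pairs are ranked by score.

```python
# Illustrative sketch: rank column pairs across two tabular datasets by
# Jaccard overlap of their value sets, then sort candidates by score.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, 0.0 for two empty sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_matches(source: dict[str, list], target: dict[str, list]):
    """Score every (source, target) column pair and sort best-first."""
    scored = [
        (jaccard(set(sv), set(tv)), sc, tc)
        for sc, sv in source.items()
        for tc, tv in target.items()
    ]
    return sorted(scored, reverse=True)

src = {"city": ["NYC", "LA", "SF"], "year": [2020, 2021]}
tgt = {"location": ["LA", "SF", "Boston"], "yr": [2021, 2022]}
for score, s, t in rank_matches(src, tgt):
    print(f"{s} <-> {t}: {score:.2f}")
```

A ranked list like this, rather than a single best pairing, is exactly the shape of output dataset discovery methods consume.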

    Causal schema induction for knowledge discovery

    Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning; however, resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.

    JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

    Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON datasets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage, as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery with similar performance. Our approach also adds significant additional information about data values to discovered schemas with linear scalability.
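The monoid idea behind this is simple to sketch. The names below are illustrative, not JSONoid's API: each JSON field gets a small summary (observed types, occurrence count) whose `merge` is associative with an empty summary as identity, so partial summaries computed on any partitioning of the documents combine into the same final schema.

```python
# Minimal sketch of monoid-style per-field summaries for distributed
# JSON schema discovery: summaries combine associatively, so partitions
# of a document collection can be summarized independently and merged.

from dataclasses import dataclass, field

@dataclass
class FieldSummary:
    types: set = field(default_factory=set)  # observed JSON value types
    count: int = 0                           # documents containing the field

    def observe(self, value):
        self.types.add(type(value).__name__)
        self.count += 1
        return self

    def merge(self, other: "FieldSummary") -> "FieldSummary":
        # Monoid combine: associative, with the empty summary as identity.
        return FieldSummary(self.types | other.types, self.count + other.count)

def summarize(docs: list[dict]) -> dict[str, FieldSummary]:
    out: dict[str, FieldSummary] = {}
    for doc in docs:
        for k, v in doc.items():
            out.setdefault(k, FieldSummary()).observe(v)
    return out

def merge_summaries(a, b):
    return {k: a.get(k, FieldSummary()).merge(b.get(k, FieldSummary()))
            for k in a.keys() | b.keys()}

# Splitting the corpus and merging gives the same schema as a single pass:
part1 = summarize([{"id": 1, "tag": "x"}])
part2 = summarize([{"id": 2.5}])
merged = merge_summaries(part1, part2)
print(merged["id"].types, merged["id"].count)
```

Richer monoids (min/max, Bloom filters, sketches) slot into the same merge shape, which is what makes the metadata cheap to maintain at scale.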

    Discovery-based edit assistance for spreadsheets

    Spreadsheets can be viewed as a highly flexible end-user programming environment that enjoys widespread adoption. But spreadsheets lack many of the structured programming concepts of regular programming paradigms. In particular, the lack of data structures in spreadsheets may lead users to introduce redundancy, loss, or corruption of data during edit actions. In this paper, we demonstrate how implicit structural properties of spreadsheet data can be exploited to offer edit assistance to spreadsheet users. Our approach is based on the discovery of functional dependencies among data items, which allows automatic reconstruction of a relational database schema. From this schema, new formulas and visual objects are embedded into the spreadsheet to offer features for auto-completion, guarded deletion, and controlled insertion. Schema discovery and spreadsheet enhancement are carried out automatically in the background and do not disturb the normal user experience.
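The core pipeline, discover functional dependencies from the rows, then use them for auto-completion, can be sketched as below. The function names are hypothetical, not the tool's actual interface, and a real system would use a proper FD-mining algorithm rather than this naive pairwise check.

```python
# Illustrative sketch: discover functional dependencies (column a
# determines column b) from spreadsheet rows, then auto-complete a
# partially entered row using the discovered dependencies.

def holds(rows, a, b):
    """Check whether the functional dependency a -> b holds in the data."""
    seen = {}
    for row in rows:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False
        seen[row[a]] = row[b]
    return True

def discover_fds(rows, columns):
    """Return all column pairs (a, b) with a -> b, by exhaustive check."""
    return [(a, b) for a in columns for b in columns
            if a != b and holds(rows, a, b)]

def autocomplete(rows, fds, partial):
    """Fill in fields of a partial row that are determined by an FD."""
    lookup = {(a, b): {r[a]: r[b] for r in rows} for a, b in fds}
    for (a, b) in fds:
        if a in partial and b not in partial:
            partial.setdefault(b, lookup[(a, b)].get(partial[a]))
    return partial

rows = [
    {"dept": "sales", "building": "B1", "name": "ann"},
    {"dept": "sales", "building": "B1", "name": "bob"},
    {"dept": "eng",   "building": "B2", "name": "eve"},
]
fds = discover_fds(rows, ["dept", "building", "name"])
print(autocomplete(rows, fds, {"dept": "eng"}))
# {'dept': 'eng', 'building': 'B2'}
```

Guarded deletion and controlled insertion follow the same pattern: an edit that would violate a discovered dependency triggers a warning instead of silently corrupting the data.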

    Data linking: capturing and utilising implicit schema-level relations

    Schema-level heterogeneity represents an obstacle for automated discovery of coreference resolution links between individuals. Although there is a multitude of existing schema matching solutions, the Linked Data environment differs from the standard scenario assumed by these tools. In particular, large volumes of data are available, and repositories are connected into a graph by instance-level mappings. In this paper, we describe how these features can be utilised to produce schema-level mappings which facilitate the instance coreference resolution process. Initial experiments applying this approach to public datasets have produced encouraging results.
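One way to read "instance-level mappings produce schema-level mappings" is as a voting scheme, sketched here as a hypothetical illustration rather than the paper's actual method: each instance-level coreference link (e.g. an owl:sameAs pair) votes for the class pair of its endpoints, and class pairs with enough support become candidate schema-level mappings.

```python
# Hypothetical sketch: derive schema-level class correspondences from
# instance-level coreference links by counting which class pairs the
# links connect.

from collections import Counter

def schema_mappings(same_as, type_of_a, type_of_b, min_support=2):
    """Return (class_in_A, class_in_B, support) for well-supported pairs.

    same_as: list of (instance_in_A, instance_in_B) coreference links
    type_of_a / type_of_b: instance -> class in each repository
    """
    votes = Counter((type_of_a[x], type_of_b[y]) for x, y in same_as)
    return [(ca, cb, n) for (ca, cb), n in votes.items() if n >= min_support]

links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
ta = {"a1": "Person", "a2": "Person", "a3": "City"}
tb = {"b1": "foaf:Person", "b2": "foaf:Person", "b3": "dbo:Place"}
print(schema_mappings(links, ta, tb))
# [('Person', 'foaf:Person', 2)]
```

The derived class mappings can then prune the candidate space for further instance coreference resolution, which is the feedback loop the abstract describes.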

    Dealing with Uncertainty in Lexical Annotation

    We present ALA, a tool for the automatic lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) of structured and semi-structured data sources and the discovery of probabilistic lexical relationships in a data integration environment. ALA performs automatic lexical annotation through the use of probabilistic annotations, i.e. an annotation is associated with a probability value. By performing probabilistic lexical annotation, we discover probabilistic inter-source lexical relationships among schema elements. ALA extends the lexical annotation module of the MOMIS data integration system. However, it may be applied more generally in the context of schema mapping discovery, ontology merging, and data integration systems, and it is particularly suitable for performing “on-the-fly” data integration or probabilistic ontology matching.
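A toy version of probabilistic lexical annotation can be sketched as follows. This is an illustration, not ALA's interface: each schema element is annotated with thesaurus senses and probabilities, and the probability that two elements are synonyms is taken as the probability mass of the senses they share.

```python
# Toy sketch of probabilistic lexical annotation: elements carry sense
# distributions, and an inter-source synonym relationship gets the
# combined probability of the shared senses.

ANNOTATIONS = {  # element -> {sense: probability}; a stand-in thesaurus
    "S1.phone": {"telephone#n#1": 0.7, "phone#n#2": 0.3},
    "S2.tel":   {"telephone#n#1": 0.9, "telephone#n#2": 0.1},
}

def synonym_probability(a: str, b: str) -> float:
    """P(a SYN b) = sum over shared senses of P(sense|a) * P(sense|b)."""
    pa, pb = ANNOTATIONS[a], ANNOTATIONS[b]
    return sum(pa[s] * pb[s] for s in pa.keys() & pb.keys())

print(round(synonym_probability("S1.phone", "S2.tel"), 2))  # 0.63
```

Keeping the probabilities, instead of thresholding to a hard yes/no mapping, is what makes the relationships usable for “on-the-fly” integration, where a downstream consumer can decide its own confidence cut-off.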

    Jesus Teaching Through Discovery

    What made Jesus’ teaching effective? Jesus’ teaching was effective because it changed the hearers’ hearts and led them to apply his message to their lives. Jesus’ teaching amazed listeners; for example, after hearing the Sermon on the Mount, the crowds were amazed (Matthew 7:28). He taught ordinary, unschooled disciples for three years, and their teaching changed the entire world of their time and continues to affect our world today. The hearers of his teaching opened their “eyes and ears”. What made his teaching so successful? His teaching consisted of a set of procedures: Jesus identified the teaching moments, facilitated inquiry by posing inspiring questions, enabled audiences to formulate hypotheses through insights, and encouraged his audiences to apply their learning to practical situations. Jesus knew that learning was not simply memorizing facts or reciting the Law of Moses. Learning involved organizing new facts into existing schemas and applying that new information. His teaching was essentially a discovery learning process. The following article reviews Jesus’ teaching method through the modern lens of discovery learning.