Automated schema matching techniques: an exploratory study
Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary facilitates automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
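The dictionary-assisted matching that SemMA performs can be sketched roughly as follows; the synonym table and the `semantic_match` helper are illustrative assumptions, not the paper's actual implementation or dictionary.

```python
# A minimal sketch of dictionary-based schema term matching: two terms
# match if their synonym sets (from an external dictionary) overlap.
# The SYNONYMS table below is a hypothetical stand-in for that dictionary.

SYNONYMS = {
    "cust": {"customer", "client"},
    "customer": {"customer", "client"},
    "client": {"customer", "client"},
    "addr": {"address", "location"},
    "address": {"address", "location"},
}

def semantic_match(term_a: str, term_b: str) -> bool:
    """Two schema terms match if their synonym sets intersect."""
    a = SYNONYMS.get(term_a.lower(), {term_a.lower()})
    b = SYNONYMS.get(term_b.lower(), {term_b.lower()})
    return bool(a & b)

print(semantic_match("cust", "client"))   # True
print(semantic_match("addr", "client"))   # False
```

A real system would back the table with a full lexical resource rather than a hand-built dictionary.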
Towards an integrated discovery system
Previous research on machine discovery has focused on limited parts of the empirical discovery task. In this paper, we describe IDS, an integrated system that addresses both qualitative and quantitative discovery. The program represents its knowledge in terms of qualitative schemas, which it discovers by interacting with a simulated physical environment. Once IDS has formulated a qualitative schema, it uses that schema to design experiments and to constrain the search for quantitative laws. We have carried out preliminary tests in the domain of heat phenomena. In this context the system has discovered both intrinsic properties, such as the melting point of substances, and numeric laws, such as the conservation of mass for objects going through a phase change.
Valentine: Evaluating Matching Techniques for Dataset Discovery
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use: nowadays, schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies heavily on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to the absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods, and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight into the strengths and weaknesses of existing techniques that can serve as a guide for employing schema matching in future dataset discovery methods.
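The column-pair matching that such suites evaluate can be sketched as a pairwise scoring problem; the trigram-Jaccard similarity, the threshold, and the sample column names below are illustrative assumptions, not one of the seminal methods Valentine implements.

```python
# A minimal sketch of schema matching as ranked column correspondences:
# score every source/target column pair by name similarity and keep
# pairs above a threshold. The similarity measure here (Jaccard over
# character trigrams) is a deliberately simple illustrative choice.
from itertools import product

def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}

def match_columns(source: list, target: list, threshold: float = 0.25):
    """Return (source_col, target_col, score) triples, best first."""
    pairs = []
    for a, b in product(source, target):
        ta, tb = trigrams(a), trigrams(b)
        score = len(ta & tb) / len(ta | tb)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

print(match_columns(["customer_id", "birth_date"],
                    ["cust_id", "date_of_birth"]))
```

Real matchers add instance-level evidence (value distributions, data types) on top of such name-based signals.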
Causal schema induction for knowledge discovery
Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning; however, resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.
JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery
Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage, as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery with similar performance. Our approach also adds significant useful additional information about data values to discovered schemas with linear scalability.
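The appeal of monoids for distributed schema discovery is that per-partition summaries combine with an associative merge, so partial results can be computed in any order and then folded together. The sketch below tracks only min/max/count of a numeric JSON field; these properties and the `NumStats` type are simple illustrative examples, not JSONoid's actual monoid structures.

```python
# A minimal monoid for summarizing a numeric field across partitions:
# NumStats() is the identity element, and merge() is associative, so
# partitions can be summarized independently and merged in any order.
from dataclasses import dataclass

@dataclass
class NumStats:
    count: int = 0
    lo: float = float("inf")
    hi: float = float("-inf")

    def add(self, x: float) -> "NumStats":
        return NumStats(self.count + 1, min(self.lo, x), max(self.hi, x))

    def merge(self, other: "NumStats") -> "NumStats":
        return NumStats(self.count + other.count,
                        min(self.lo, other.lo),
                        max(self.hi, other.hi))

# Two "partitions" of a JSON field's values, summarized independently:
left = NumStats()
for x in [3, 7]:
    left = left.add(x)
right = NumStats()
for x in [1, 9]:
    right = right.add(x)

print(left.merge(right))  # same result regardless of how data was split
```

Distributed frameworks exploit exactly this shape: a commutative-enough fold over partitions with a cheap final merge.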
Discovery-based edit assistance for spreadsheets
Spreadsheets can be viewed as a highly flexible end-user programming environment which enjoys widespread adoption. But spreadsheets lack many of the structured programming concepts of regular programming paradigms. In particular, the lack of data structures in spreadsheets may lead spreadsheet users to cause redundancy, loss, or corruption of data during edit actions. In this paper, we demonstrate how implicit structural properties of spreadsheet data can be exploited to offer edit assistance to spreadsheet users. Our approach is based on the discovery of functional dependencies among data items, which allow automatic reconstruction of a relational database schema. From this schema, new formulas and visual objects are embedded into the spreadsheet to offer features for auto-completion, guarded deletion, and controlled insertion. Schema discovery and spreadsheet enhancement are carried out automatically in the background and do not disturb normal user experience.
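The functional-dependency check at the core of such schema reconstruction can be sketched as follows; the column names and sample rows are illustrative assumptions, not data from the paper.

```python
# A minimal sketch of testing a functional dependency X -> Y over
# tabular data: the dependency holds if every X value maps to exactly
# one Y value. Discovery systems search for all such dependencies.

def holds_fd(rows, x, y) -> bool:
    """True if column x functionally determines column y."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False
    return True

rows = [
    {"zip": "11111", "city": "Springfield", "name": "Ann"},
    {"zip": "11111", "city": "Springfield", "name": "Bob"},
    {"zip": "22222", "city": "Shelbyville", "name": "Cam"},
]
print(holds_fd(rows, "zip", "city"))  # True: zip determines city
print(holds_fd(rows, "zip", "name"))  # False: two names share one zip
```

From dependencies like `zip -> city`, a relational schema (here, a separate zip/city table) can be reconstructed and used to guard edits.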
Data linking: capturing and utilising implicit schema-level relations
Schema-level heterogeneity represents an obstacle for automated discovery of coreference resolution links between individuals. Although there is a multitude of existing schema matching solutions, the Linked Data environment differs from the standard scenario assumed by these tools. In particular, large volumes of data are available, and repositories are connected into a graph by instance-level mappings. In this paper, we describe how these features can be utilised to produce schema-level mappings which facilitate the instance coreference resolution process. Initial experiments applying this approach to public datasets have produced encouraging results.
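The lifting of instance-level links to schema-level mappings can be sketched as a voting scheme: each linked instance pair votes for a mapping between its classes. The class names, link data, and vote threshold below are illustrative assumptions, not the paper's actual method or datasets.

```python
# A minimal sketch: infer schema-level class mappings from
# instance-level links (e.g. owl:sameAs) between two repositories.
from collections import Counter

# Hypothetical instance -> class assignments in two repositories.
types_a = {"a1": "Person", "a2": "Person", "a3": "Place"}
types_b = {"b1": "FoafAgent", "b2": "FoafAgent", "b3": "Location"}

# Instance-level links connecting the repositories into a graph.
same_as = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

# Each link votes for a mapping between the classes of its endpoints;
# keep mappings with enough supporting evidence.
votes = Counter((types_a[i], types_b[j]) for i, j in same_as)
mappings = [pair for pair, n in votes.items() if n >= 2]
print(mappings)  # [('Person', 'FoafAgent')]
```

The resulting class mappings then narrow the candidate space for further instance coreference resolution.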
Dealing with Uncertainty in Lexical Annotation
We present ALA, a tool for the automatic lexical annotation (i.e., annotation w.r.t. a thesaurus/lexical resource) of structured and semi-structured data sources and the discovery of probabilistic lexical relationships in a data integration environment. ALA performs automatic lexical annotation through the use of probabilistic annotations, i.e., an annotation is associated with a probability value. By performing probabilistic lexical annotation, we discover probabilistic inter-source lexical relationships among schema elements. ALA extends the lexical annotation module of the MOMIS data integration system. However, it may be applied more generally in the context of schema mapping discovery, ontology merging, and data integration systems, and it is particularly suitable for performing “on-the-fly” data integration or probabilistic ontology matching.
Jesus Teaching Through Discovery
What made Jesus’ teaching effective? Jesus’ teaching was effective because it resulted in changing the hearers’ hearts and in the hearers applying his message to their lives. Jesus’ teaching amazed listeners; for example, after hearing the Sermon on the Mount the crowds were amazed (Matthew 7:28). He taught ordinary, unschooled disciples for three years, and their teaching changed the entire world of their time and continues to affect our world today. The hearers of his teaching opened their “eyes and ears”. What made his teaching so successful? His teaching consisted of a set of procedures. Jesus identified the teaching moments, facilitated inquiry by posing inspiring questions, enabled audiences to formulate hypotheses through insights, and encouraged his audiences to apply their learning to practical situations.
Jesus knew that learning was not simply memorizing facts or reciting the Law of Moses. Learning involved organizing new facts into existing schemas and applying that new information. His teaching is typically a discovery learning process. The following article will review Jesus’ teaching method through the modern lens of discovery learning.