
    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity for IT professionals and researchers alike, and it is necessary for a variety of use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
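
    As a concrete illustration of the simpler, single-column statistics the abstract mentions, the sketch below profiles one column with pandas; the pattern-extraction heuristic (digits to 9, letters to A) is an assumption made for illustration, not a technique prescribed by the survey.

    import re
    import pandas as pd

    # Reduce each value to a symbolic pattern, e.g. "2021-05-04" -> "9999-99-99".
    def value_pattern(v: str) -> str:
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))

    def profile_column(series: pd.Series, top_k: int = 3) -> dict:
        """Basic profiling metadata for a single column."""
        patterns = series.dropna().astype(str).map(value_pattern)
        return {
            "nulls": int(series.isna().sum()),
            "distinct": int(series.nunique()),
            "dtype": str(series.dtype),
            "top_patterns": patterns.value_counts().head(top_k).to_dict(),
        }

    df = pd.DataFrame({"date": ["2021-05-04", "2021-06-01", None, "2021-06-01"]})
    print(profile_column(df["date"]))
    # {'nulls': 1, 'distinct': 2, 'dtype': 'object', 'top_patterns': {'9999-99-99': 3}}

    The multi-column metadata the survey classifies (functional and inclusion dependencies, unique column combinations) require search over column combinations and are correspondingly harder to compute.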

    Generic analysis support for understanding, evaluating and comparing enterprise architecture models

    Enterprise Architecture Management (EAM) is one means of dealing with the increasing complexity of today’s IT landscapes. Architectural models are used within EAM to describe the business processes, the applications in use, the required infrastructure, and the dependencies between them. The creation of these models is expensive, since the whole organization, and with it a large amount of data, has to be considered. It is therefore important to make use of these models and to reuse them for planning purposes and decision making. The models are a solid foundation for various kinds of analyses that support their understanding, evaluation and comparison. Analyses can approximate the effects of the retirement of an application or of a server failure. It is also possible to quantify the models using metrics such as the IT coverage of business processes or the workload of a server. The generation of views sets the focus on a specific aspect of the model; an example is the limitation to the processes and applications of a specific organizational unit. Architectural models can also be used for planning purposes: the development of a target architecture is supported by identifying weak points and evaluating planning scenarios.

    Current approaches to EAM analysis are typically isolated ones, addressing only a limited subset of the different analysis goals. An integrated approach that covers the different information demands of the stakeholders is missing. Additionally, the analysis approaches are highly dependent on the utilized meta model. This is a serious problem, since the EAM domain is characterized by a large variety of frameworks and meta models.

    In this thesis, we propose a generic framework that supports the different analysis activities during EAM. We develop the required techniques for the specification and execution of analyses independently of the utilized meta model. An analysis language is implemented for the definition and customization of analyses according to the current needs of the stakeholder; in doing so, we focus on reuse and generic definitions. We utilize a generic representation format to abstract from the great variety of meta models used in the EAM domain. The execution of the analyses is done with Semantic Web technologies and data-flow-based model analysis. The framework is applied to the identification of weak points as well as to the evaluation of planning scenarios regarding consistency of changes and goal fulfillment. Two methods are developed for these tasks, and the respective analysis support, for example a change impact analysis, specific metrics, and the scoping of the architectural model according to different aspects, is identified and implemented. Finally, the coverage of the framework with respect to existing EA analysis approaches is determined with a scenario-based evaluation. The applicability and relevance of the language and of the proposed methods are demonstrated in three large case studies.
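
    The abstract mentions change impact analyses executed with Semantic Web technologies. As a rough illustration of that idea, the sketch below runs a transitive change-impact query over a toy architecture model with rdflib; the ea: vocabulary and the example model are assumptions made for this sketch, not artifacts of the thesis.

    from rdflib import Graph

    # Toy architecture model: processes depend on applications,
    # applications depend on infrastructure.
    MODEL = """
    @prefix ea: <http://example.org/ea#> .
    ea:OrderProcess   ea:dependsOn ea:CrmApp .
    ea:CrmApp         ea:dependsOn ea:AppServer1 .
    ea:BillingProcess ea:dependsOn ea:BillingApp .
    ea:BillingApp     ea:dependsOn ea:AppServer1 .
    """

    g = Graph()
    g.parse(data=MODEL, format="turtle")

    # Which elements are (transitively) affected if ea:AppServer1 fails?
    QUERY = """
    PREFIX ea: <http://example.org/ea#>
    SELECT DISTINCT ?affected WHERE {
      ?affected ea:dependsOn+ ea:AppServer1 .
    }
    """
    for row in g.query(QUERY):
        print(row.affected)
    # -> CrmApp, OrderProcess, BillingApp, BillingProcess

    The SPARQL property path dependsOn+ is what makes the impact transitive: every process or application reachable over one or more dependency edges is reported.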

    Data quality evaluation through data quality rules and data provenance

    The application and exploitation of large amounts of data play an ever-increasing role in today’s research, government, and economy. Data understanding and decision making heavily rely on high quality data; therefore, in many different contexts, it is important to assess the quality of a dataset in order to determine if it is suitable to be used for a specific purpose. Moreover, as the access to and the exchange of datasets have become easier and more frequent, and as scientists increasingly use the World Wide Web to share scientific data, there is a growing need to know the provenance of a dataset (i.e., information about the processes and data sources that led to its creation) in order to evaluate its trustworthiness. In this work, data quality rules and data provenance are used to evaluate the quality of datasets. Concerning the first topic, the applied solution consists of identifying types of data constraints that can be useful as data quality rules and of developing a software tool to evaluate a dataset on the basis of a set of rules expressed in XML. We selected some of the data constraints and dependencies already considered in the data quality field, but we also used order dependencies and existence constraints as quality rules. In addition, we developed some algorithms to discover the types of dependencies used in the tool. To deal with the provenance of data, the Open Provenance Model (OPM) was adopted, an experimental query language for querying OPM graphs stored in a relational database was implemented, and an approach to design OPM graphs was proposed.
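
    Among the constraint types the abstract singles out, order dependencies are perhaps the least familiar. The sketch below checks a simple order dependency with pandas; the check and the sample data are illustrative assumptions, since the thesis's actual XML rule format is not given here.

    import pandas as pd

    def satisfies_order_dependency(df: pd.DataFrame, a: str, b: str) -> bool:
        """True iff ordering the rows by column `a` leaves column `b` non-decreasing,
        i.e. the order dependency a -> b holds on this dataset."""
        ordered = df.sort_values(a, kind="mergesort")  # stable sort
        return ordered[b].is_monotonic_increasing

    shipments = pd.DataFrame({
        "order_date": ["2023-01-02", "2023-01-05", "2023-01-09"],
        "ship_date":  ["2023-01-03", "2023-01-06", "2023-01-04"],  # last row violates it
    })
    print(satisfies_order_dependency(shipments, "order_date", "ship_date"))  # False

    Used as a quality rule, a violated order dependency like "later orders must not ship earlier" flags rows that deserve inspection rather than silent use.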

    Learning Language from a Large (Unannotated) Corpus

    A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system directly from a large, unannotated corpus.
    Comment: 29 pages, 5 figures, research proposal
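
    As a flavor of the statistical language learning techniques the proposal builds on, the sketch below scores adjacent word pairs by pointwise mutual information, a co-occurrence statistic commonly used to seed unsupervised grammar induction; it is an illustrative assumption, not the Link Grammar/RelEx/OpenCog pipeline itself.

    import math
    from collections import Counter

    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))

    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(w1: str, w2: str) -> float:
        """log of p(w1, w2) / (p(w1) * p(w2)) over adjacent word pairs."""
        p_pair = bigrams[(w1, w2)] / n_bi
        return math.log(p_pair * n_uni * n_uni / (unigrams[w1] * unigrams[w2]))

    print(f"PMI(sat, on) = {pmi('sat', 'on'):.2f}")  # ~1.97: strongly associated pair
    print(f"PMI(on, the) = {pmi('on', 'the'):.2f}")  # ~1.28: weaker association

    High-PMI pairs are candidate "links" between words; at corpus scale, clustering such statistics is one route toward the induced dependency grammars the proposal targets.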

    NLSC: Unrestricted Natural Language-based Service Composition through Sentence Embeddings

    Current approaches for service composition (assemblies of atomic services) require developers to use: (a) domain-specific semantics to formalize services, which restrict the vocabulary for their descriptions, and (b) translation mechanisms for service retrieval to convert unstructured user requests into strongly-typed semantic representations. In our work, we argue that the effort of developing service descriptions, request translations, and matching mechanisms could be reduced by using unrestricted natural language, allowing both: (1) end-users to intuitively express their needs using natural language, and (2) service developers to develop services without relying on syntactic/semantic description languages. Although there are some natural language-based service composition approaches, they restrict service retrieval to syntactic/semantic matching. Given recent developments in machine learning and natural language processing, we motivate the use of sentence embeddings, leveraging richer semantic representations of sentences for service description, matching and retrieval. Experimental results show that service composition development effort may be reduced by more than 44% while keeping a high precision/recall when matching high-level user requests with low-level service method invocations.
    Comment: This paper will appear at SCC'19 (IEEE International Conference on Services Computing) on July 1
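
    A minimal sketch of the core matching idea, assuming an off-the-shelf model from the sentence-transformers library (the paper's actual models and services may differ): embed the user request and the service descriptions, then pick the service whose description is most similar.

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical service methods with plain natural-language descriptions.
    services = {
        "send_email(to, subject, body)": "send an email message to a recipient",
        "book_flight(origin, destination, date)": "reserve an airline ticket",
        "get_weather(city)": "look up the current weather forecast for a city",
    }

    request = "I need to email my manager about tomorrow's meeting"

    request_emb = model.encode(request, convert_to_tensor=True)
    service_embs = model.encode(list(services.values()), convert_to_tensor=True)

    scores = cos_sim(request_emb, service_embs)[0]
    best = scores.argmax().item()
    print(list(services)[best])  # -> send_email(to, subject, body)

    Because similarity is computed in embedding space rather than over keywords or a fixed ontology, neither the request nor the description has to follow a restricted vocabulary, which is the gap the paper identifies in purely syntactic/semantic matching.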