Data quality evaluation in data integration systems
This thesis deals with data quality evaluation in Data Integration Systems (DIS). Specifically, we address the problems of evaluating the quality of the data conveyed to users in response to their queries and verifying whether users' quality expectations can be achieved. We also analyze how quality measures can be used for improving the DIS and enforcing data quality. Our approach consists of studying one quality factor at a time, analyzing its impact within a DIS, proposing techniques for its evaluation, and proposing improvement actions for its enforcement. Among the quality factors that have been proposed, this thesis analyzes two of the most used ones: data freshness and data accuracy.
Data quality maintenance in Data Integration Systems
A Data Integration System (DIS) is an information system that integrates data from a set of heterogeneous and autonomous information sources and provides it to users. Quality in these systems consists of various factors that are measured in data. Some of the most commonly considered ones are completeness, accuracy, accessibility, freshness, and availability. In a DIS, quality factors are associated with the sources, with the extracted and transformed information, and with the information provided by the DIS to the user. At the same time, users can pose quality requirements associated with their data requirements. DIS quality is considered better the closer it is to the user quality requirements. DIS quality depends on the quality of the data sources, on the data transformations, and on the quality required by users. Therefore, DIS quality is a property that varies as a function of these three other properties. The general goal of this thesis is to provide mechanisms for maintaining DIS quality at a level that satisfies the user quality requirements, while minimizing the modifications to the system that are triggered by quality changes.
The proposal of this thesis allows constructing and maintaining a DIS that is tolerant to quality changes. This means that the DIS is constructed taking into account predictions of quality behavior, so that if changes occur in line with these predictions, the system is not affected by them at all. These predictions are provided by models of the quality behavior of DIS data, which must be kept up to date. With this strategy, the DIS is affected only when the quality behavior models change, instead of being affected each time there is a quality variation in the system. The thesis takes a probabilistic approach, which allows modeling the behavior of the quality factors at the sources and at the DIS, allows users to state flexible quality requirements (using probabilities), and provides tools, such as certainty and mathematical expectation, that help to decide which quality changes are relevant to DIS quality. The probabilistic models are monitored in order to detect source quality changes, a strategy that allows detecting changes in quality behavior and not only isolated quality changes. We also propose to monitor other DIS properties that affect its quality and, for each change in these properties, to decide whether it affects the behavior of DIS quality, taking into account the DIS quality models. Finally, the probabilistic approach is also applied when determining which actions to take in order to improve DIS quality. For interpreting the situation of the DIS, we propose to use statistics, which include, in particular, the history of the quality models.
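The probabilistic requirement described in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual model: the class `FreshnessModel`, the function `satisfies`, and the sample distribution are all hypothetical and serve only to show how a requirement such as "data is at most 24 hours old with probability at least 0.75" could be checked against a discrete model of a source's freshness, and how mathematical expectation could be computed from the same model.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FreshnessModel:
    """Hypothetical discrete model of a source's data freshness.

    `distribution` maps a data age (in hours) to the probability of
    observing that age when the source is queried.
    """

    distribution: Mapping[float, float]

    def __post_init__(self) -> None:
        total = sum(self.distribution.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"probabilities must sum to 1, got {total}")

    def prob_at_most(self, max_age_hours: float) -> float:
        """Probability that the data is no older than `max_age_hours`."""
        return sum(p for age, p in self.distribution.items() if age <= max_age_hours)

    def expected_age(self) -> float:
        """Mathematical expectation of the data age, in hours."""
        return sum(age * p for age, p in self.distribution.items())


def satisfies(model: FreshnessModel, max_age_hours: float, min_probability: float) -> bool:
    """Check a flexible requirement: P(age <= max_age_hours) >= min_probability."""
    return model.prob_at_most(max_age_hours) >= min_probability


# Illustrative (made-up) model: the source is usually fresh, occasionally stale.
model = FreshnessModel({1: 0.5, 12: 0.3, 48: 0.2})
print(model.prob_at_most(24), model.expected_age())
print(satisfies(model, max_age_hours=24, min_probability=0.75))
print(satisfies(model, max_age_hours=24, min_probability=0.90))
```

Under this reading, a single stale observation does not violate the requirement; only a change in the distribution itself, that is, in the quality behavior model, would call for re-evaluating the DIS.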
Quality measures for ETL processes: from goals to implementation
Extraction, transformation, and loading (ETL) processes play an increasingly important role in supporting modern business operations. These business processes are centred around artifacts with high variability and diverse lifecycles, which correspond to key business entities. The apparent complexity of these activities has been examined through the prism of business process management, mainly focusing on functional requirements and performance optimization. However, the quality dimension has not yet been thoroughly investigated, and there is a need for a more human-centric approach to bring these processes closer to business users' requirements. In this paper, we take a first step in this direction by defining a sound model for ETL process quality characteristics and quantitative measures for each characteristic, based on existing literature. Our model shows dependencies among quality characteristics and can provide the basis for subsequent analysis using goal modeling techniques. We showcase the use of goal modeling for ETL process design through a use case, where we employ a goal model that includes quantitative components (i.e., indicators) for the evaluation and analysis of alternative design decisions. Peer Reviewed. Postprint (author's final draft).
The Meaning of Memory Safety
We give a rigorous characterization of what it means for a programming
language to be memory safe, capturing the intuition that memory safety supports
local reasoning about state. We formalize this principle in two ways. First, we
show how a small memory-safe language validates a noninterference property: a
program can neither affect nor be affected by unreachable parts of the state.
Second, we extend separation logic, a proof system for heap-manipulating
programs, with a memory-safe variant of its frame rule. The new rule is
stronger because it applies even when parts of the program are buggy or
malicious, but also weaker because it demands a stricter form of separation
between parts of the program state. We also consider a number of pragmatically
motivated variations on memory safety and the reasoning principles they
support. As an application of our characterization, we evaluate the security of
a previously proposed dynamic monitor for memory safety of heap-allocated data. Comment: POST'18 final version
A unified view of data-intensive flows in business intelligence systems : a survey
Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that are still to be addressed and how current solutions can be applied to address them. Peer Reviewed. Postprint (author's final draft).
Framework to Automatically Determine the Quality of Open Data Catalogs
Data catalogs play a crucial role in modern data-driven organizations by
facilitating the discovery, understanding, and utilization of diverse data
assets. However, ensuring their quality and reliability is complex, especially
in open and large-scale data environments. This paper proposes a framework to
automatically determine the quality of open data catalogs, addressing the need
for efficient and reliable quality assessment mechanisms. Our framework can
analyze various core quality dimensions, such as accuracy, completeness,
consistency, scalability, and timeliness. It also offers several alternatives
for assessing compatibility and similarity across such catalogs, and
implements a set of non-core quality dimensions such as provenance,
readability, and licensing. The goal is to empower data-driven organizations to
make informed decisions based on trustworthy and well-curated data assets. The
source code that illustrates our approach can be downloaded from
https://www.github.com/jorge-martinez-gil/dataq/. Comment: 25 pages
Current Challenges and Visions in Music Recommender Systems Research
Music recommender systems (MRS) have experienced a boom in recent years,
thanks to the emergence and success of online streaming services, which
nowadays make available almost all music in the world at the user's fingertips.
While today's MRS considerably help users to find interesting music in these
huge catalogs, MRS research is still facing substantial challenges. In
particular, when it comes to building, incorporating, and evaluating
recommendation strategies that integrate information beyond simple user-item
interactions or content-based descriptors, and instead dig deep into the very
essence of listener needs, preferences, and intentions, MRS research becomes a
big endeavor and related publications become quite sparse.
The purpose of this trends and survey article is twofold. We first identify
and shed light on what we believe are the most pressing challenges MRS research
is facing, from both academic and industry perspectives. We review the state of
the art towards solving these challenges and discuss its limitations. Second,
we detail possible future directions and visions we contemplate for the further
evolution of the field. The article should therefore serve two purposes: giving
the interested reader an overview of current challenges in MRS research and
providing guidance for young researchers by identifying interesting, yet
under-researched, directions in the field.