8 research outputs found
Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives
How did the popularity of the Greek Prime Minister evolve in 2015? How did
the predominant sentiment about him vary during that period? Were there any
controversial sub-periods? What other entities were related to him during these
periods? To answer these questions, one needs to analyze archived documents and
data about the query entities, such as old news articles or social media
archives. In particular, user-generated content posted in social networks, like
Twitter and Facebook, can be seen as a comprehensive documentation of our
society, and thus meaningful analysis methods over such archived data are of
immense value for sociologists, historians and other interested parties who
want to study the history and evolution of entities and events. To this end, in
this paper we propose an entity-centric approach to analyze social media
archives and we define measures that allow studying how entities were reflected
in social media in different time periods and under different aspects, like
popularity, attitude, controversiality, and connectedness with other entities.
A case study using a large Twitter archive of four years illustrates the
insights that can be gained by such an entity-centric and multi-aspect
analysis.Comment: This is a preprint of an article accepted for publication in the
International Journal on Digital Libraries (2018
RDF data evolution: efficient detection and semantic representation of changes
ABSTRACT Many RDF data sources are constantly changing for both data and vocabulary (ontology) levels. Many integration tasks are impacted by these changes. In this context, it is important to develop approaches to detect and represent these changes. Many studies have focused on the detection, the representation and the management of changes at the ontology level. In this paper, we present an approach which allows to detect and represent elementary and complex changes that can be detected when we focus only on the data level. A first experiment was conducted on different versions of DBpedia
DELTA-R: a change detection approach for RDF datasets
This paper presents the DELTA-R approach that detects and
classifies the changes between two versions of a linked dataset. It contributes
to the state of the art firstly: by proposing a more granular classification of
the resource level changes, and secondly: by automatically selecting the
appropriate resource properties to identify the same resources in different
versions of a linked dataset with different URIs and similar representation.
The paper also presents the DELTA-R change model to represent the
changes detected by the DELTA-R approach. This model bridges the gap
between resource-centric and triple-centric views of changes in linked
datasets. As a result, a single change detection mechanism will be able to
support the use cases like interlink maintenance and dataset or replica
synchronization. Additionally, the paper describes an experiment conducted
to examine the accuracy of the DELTA-R approach in detecting the changes
between two versions of a linked dataset. The result indicates that the
accuracy of DELTA-R approach outperforms the state of the art approaches
by up to 4%. It is demonstrated that the proposed more granular
classification of changes helped to identifyup to 1529 additional updated
resources compered to X.By means of a case study, we demonstrate the
support of DELTA-R approach and change model for an interlink
maintenance use case. The result shows that 100% of the broken interlinks
were repaired between DBpedia person snapshot 3.7 and Freebase
Automated Knowledge Base Quality Assessment and Validation based on Evolution Analysis
In recent years, numerous efforts have been put towards sharing Knowledge Bases (KB) in the Linked Open Data (LOD) cloud. These KBs are being used for various tasks, including performing data analytics or building question answering systems. Such KBs evolve continuously: their data (instances) and schemas can be updated, extended, revised and refactored. However, unlike in more controlled types of knowledge bases, the evolution of KBs exposed in the LOD cloud is usually unrestrained, what may cause data to suffer from a variety of quality issues, both at a semantic level and at a pragmatic level. This situation affects negatively data stakeholders – consumers, curators, etc. –. Data quality is commonly related to the perception of the fitness for use, for a certain application or use case. Therefore, ensuring the quality of the data of a knowledge base that evolves is vital. Since data is derived from autonomous, evolving, and increasingly large data providers, it is impractical to do manual data curation, and at the same time, it is very challenging to do a continuous automatic assessment of data quality. Ensuring the quality of a KB is a non-trivial task since they are based on a combination of structured information supported by models, ontologies, and vocabularies, as well as queryable endpoints, links, and mappings. Thus, in this thesis, we explored two main areas in assessing KB quality: (i) quality assessment using KB evolution analysis, and (ii) validation using machine learning models. The evolution of a KB can be analyzed using fine-grained “change” detection at low-level or using “dynamics” of a dataset at high-level. In this thesis, we present a novel knowledge base quality assessment approach using evolution analysis. The proposed approach uses data profiling on consecutive knowledge base releases to compute quality measures that allow detecting quality issues. However, the first step in building the quality assessment approach was to identify the quality characteristics. Using high-level change detection as measurement functions, in this thesis we present four quality characteristics: Persistency, Historical Persistency, Consistency and Completeness. Persistency and historical persistency measures concern the degree of changes and lifespan of any entity type. Consistency and completeness measures identify properties with incomplete information and contradictory facts. The approach has been assessed both quantitatively and qualitatively on a series of releases from two knowledge bases, eleven releases of DBpedia and eight releases of 3cixty Nice. However, high-level changes, being coarse-grained, cannot capture all possible quality issues. In this context, we present a validation strategy whose rationale is twofold. First, using manual validation from qualitative analysis to identify causes of quality issues. Then, use RDF data profiling information to generate integrity constraints. The validation approach relies on the idea of inducing RDF shape by exploiting SHALL constraint components. In particular, this approach will learn, what are the integrity constraints that can be applied to a large KB by instructing a process of statistical analysis, which is followed by a learning model. We illustrate the performance of our validation approach by using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraint. The techniques of quality assessment and validation developed during this work are automatic and can be applied to different knowledge bases independently of the domain. Furthermore, the measures are based on simple statistical operations that make the solution both flexible and scalable
An evaluation of the challenges of Multilingualism in Data Warehouse development
In this paper we discuss Business Intelligence and define what is meant by support for Multilingualism in a Business Intelligence reporting context. We identify support for Multilingualism as a challenging issue which has implications for data warehouse design and reporting performance. Data warehouses are a core component of most Business Intelligence systems and the star schema is the approach most widely used to develop data warehouses and dimensional Data Marts. We discuss the way in which Multilingualism can be supported in the Star Schema and identify that current approaches have serious limitations which include data redundancy and data manipulation, performance and maintenance issues. We propose a new approach to enable the optimal application of multilingualism in Business Intelligence. The proposed approach was found to produce satisfactory results when used in a proof-of-concept environment. Future work will include testing the approach in an enterprise environmen