Graph Summarization
The continuous and rapid growth of highly interconnected datasets, which are
both voluminous and complex, calls for the development of adequate processing
and analytical techniques. One method for condensing and simplifying such
datasets is graph summarization. It denotes a series of application-specific
algorithms designed to transform graphs into more compact representations while
preserving structural patterns, query answers, or specific property
distributions. As this problem is common to several areas studying graph
topologies, different approaches, such as clustering, compression, sampling, or
influence detection, have been proposed, primarily based on statistical and
optimization methods. The focus of our chapter is to pinpoint the main graph
summarization methods, with particular attention to the most recent approaches
and novel research trends on this topic not yet covered by previous surveys.
Comment: To appear in the Encyclopedia of Big Data Technologies.
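One of the simplest structure-preserving techniques in this family is grouping nodes into super-nodes by a shared structural pattern. As an illustrative sketch (not any specific algorithm from the chapter), the following merges nodes with identical neighbor sets into a quotient graph:

```python
from collections import defaultdict

def summarize_graph(adj):
    """Group nodes with identical neighbor sets into super-nodes
    and return the resulting quotient (summary) graph."""
    # Signature of a node: the frozen set of its neighbors.
    groups = defaultdict(list)
    for node, neighbors in adj.items():
        groups[frozenset(neighbors)].append(node)

    # Map each node to the id of its super-node.
    supernode_of = {}
    for sid, group in enumerate(groups.values()):
        for node in group:
            supernode_of[node] = sid

    # A super-edge connects two super-nodes whenever some pair of
    # their members is connected in the original graph.
    super_edges = set()
    for node, neighbors in adj.items():
        for nb in neighbors:
            super_edges.add((supernode_of[node], supernode_of[nb]))

    members = {sid: sorted(g) for sid, g in enumerate(groups.values())}
    return members, super_edges

# Toy graph: a and b share the same neighbors, so they collapse,
# and so do c and d.
adj = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
members, super_edges = summarize_graph(adj)
```

Real summarizers relax the "identical neighbors" condition (e.g. to similar neighborhoods) and trade reconstruction error for compression.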
Coverage-Based Summaries for RDF KBs
As more and more data become available as linked data, the need for efficient and effective methods for their exploration becomes apparent. Semantic summaries try to extract meaning from data while reducing its size. State-of-the-art structural semantic summaries focus primarily on the graph structure of the data, trying to maximize the summary's utility for query answering, i.e. the query coverage. In this poster paper, we present an algorithm that tries to maximize the aforementioned query coverage using ideas borrowed from result diversification. The key idea of our algorithm is, instead of focusing only on the "central" nodes, to push node selection also to the perimeter of the graph. Our experiments show the potential of our algorithm and demonstrate the considerable advantages gained for answering larger fragments of user queries.
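The diversified selection strategy described above can be illustrated with a small greedy sketch. The scoring function, the `coverage` values, and the trade-off parameter `lam` are assumptions for illustration, not the paper's actual algorithm:

```python
def diversified_summary(nodes, coverage, dist, k, lam=0.5):
    """Greedily select k summary nodes, trading off query coverage
    against distance from the nodes already selected."""
    selected = []
    candidates = set(nodes)
    while candidates and len(selected) < k:
        def score(n):
            cov = coverage[n]
            if not selected:
                return cov
            # Reward nodes far from the current summary, pushing
            # selection toward the perimeter of the graph.
            div = min(dist(n, s) for s in selected)
            return lam * cov + (1 - lam) * div
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: "c" covers fewer queries than "b" but lies far from
# the first pick "a", so diversification prefers it.
coverage = {"a": 3, "b": 2, "c": 1}
pos = {"a": 0, "b": 1, "c": 5}
picked = diversified_summary(["a", "b", "c"], coverage,
                             lambda x, y: abs(pos[x] - pos[y]), k=2)
```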
RDF graph summarization: principles, techniques and applications (tutorial)
The explosion in the amount of RDF on the Web has led to the need to explore, query and understand such data sources. The task is challenging due to the complex and heterogeneous structure of RDF graphs which, unlike relational databases, do not come with a structure-dictating schema. Summarization has been applied to RDF data to facilitate these tasks. Its purpose is to extract concise and meaningful information from RDF knowledge bases, representing their content as faithfully as possible. There is no single concept of RDF summary, and not a single but many approaches to build such summaries; the summarization goal and the main computational tools employed for summarizing graphs are the main factors behind this diversity. This tutorial presents a structured analysis and comparison of existing works in the area of RDF summarization; it is based upon a recent survey which we co-authored with colleagues [3]. We present the concepts at the core of each approach and outline their main technical aspects and implementation. We conclude by identifying the most pertinent summarization method for different usage scenarios, and discussing areas where future effort is needed.
Instance-Based Lossless Summarization of Knowledge Graph With Optimized Triples and Corrections (IBA-OTC)
Knowledge graph (KG) summarization facilitates efficient information retrieval for exploring complex structural data. Fast retrieval requires processing redundant data, while the summary graph must still contain complete information; summarization also saves computational time during data retrieval, storage space, and in-memory visualization, and should preserve structure. State-of-the-art approaches summarize a given KG by preserving its structure at the cost of information loss, whereas approaches that do not preserve the underlying structure compromise the summarization ratio by focusing only on the compression of specific regions. In this way, existing approaches either fail to preserve the original facts or wrongly predict inferred information. To solve these problems, we present a novel framework for generating a lossless summary by preserving the structure through super signatures and their corresponding corrections. The proposed approach summarizes only the naturally overlapping instances while maintaining their information and preserving the underlying Resource Description Framework (RDF) graph. The resultant summary is composed of triples with positive, negative, and star corrections that are optimized by the smart calling of two novel functions, namely merge and disperse. To evaluate the effectiveness of our proposed approach, we perform experiments on nine publicly available real-world knowledge graphs and obtain a better summarization ratio than state-of-the-art approaches by a margin of 10% to 30%, while achieving completeness, correctness, and compactness. In this way, the retrieval of common events and groups by queries is accelerated in the resultant graph.
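The idea of a lossless summary with corrections can be sketched as follows: a super-edge between two node groups implies all pairwise edges, and correction sets record where the original graph deviates, so the original is exactly reconstructable. The function below is an illustrative simplification, not the paper's merge/disperse optimization:

```python
def edges_with_corrections(group_a, group_b, actual_edges):
    """Losslessly encode the edges between two node groups as one
    super-edge (implying all pairs) plus correction sets."""
    implied = {(a, b) for a in group_a for b in group_b}
    # Negative corrections: pairs the super-edge implies but that
    # are absent from the original graph.
    negative = implied - actual_edges
    # Positive corrections: edges outside the implied pattern.
    positive = actual_edges - implied
    return negative, positive

# Toy example: 3 of the 4 implied edges exist, so one negative
# correction suffices instead of storing 3 edges explicitly.
group_a = {"u1", "u2"}
group_b = {"v1", "v2"}
actual = {("u1", "v1"), ("u1", "v2"), ("u2", "v1")}
negative, positive = edges_with_corrections(group_a, group_b, actual)
```

The summary pays off whenever the correction sets are smaller than the edge set they replace.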
The Case of Wikidata
Since its launch in 2012, Wikidata has grown to become the largest open knowledge
base (KB), containing more than 100 million data items and over 6 million registered
users. Wikidata serves as the structured data backbone of Wikipedia, addressing
data inconsistencies and adhering to the motto of "serving anyone anywhere in
the world," a vision realized through the diversity of knowledge. Despite being
a collaboratively contributed platform, the Wikidata community heavily relies on
bots, automated accounts with batch and speedy editing rights, for a majority of
edits. As Wikidata approaches its first decade, the question arises: How close is
Wikidata to achieving its vision of becoming a global KB and how diverse is it in
serving the global population? This dissertation investigates the current status of
Wikidata's diversity, the role of bot interventions on diversity, and how bots can be
leveraged to improve diversity within the context of Wikidata.
The methodologies used in this study are a mapping study and content analysis, which
led to the development of three datasets: 1) Wikidata Research Articles Dataset,
covering the literature on Wikidata from its first decade of existence sourced from
online databases to inspect its current status; 2) Wikidata Requests-for-Permissions
Dataset, based on the pages requesting bot rights on the Wikidata website to explore
bots from a community perspective; and 3) Wikidata Revision History Dataset,
compiled from the edit history of Wikidata to investigate bot editing behavior and
its impact on diversity, all of which are freely available online.
The insights gained from the mapping study reveal the growing popularity of Wikidata
in the research community and its various application areas, indicative of its
progress toward the ultimate goal of reaching the global community. However, there
is currently no research addressing the topic of diversity in Wikidata, which could
shed light on its capacity to serve a diverse global population. To address this gap,
this dissertation proposes a diversity measurement concept that defines diversity in
a KB context in terms of variety, balance, and disparity and is capable of assessing
diversity in a KB from two main angles: user and data. The application of this concept
to the domains and classes of the Wikidata Revision History Dataset exposes
imbalanced content distribution across Wikidata domains, which indicates low data
diversity in Wikidata domains.
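The variety/balance/disparity decomposition mentioned above can be made concrete with a small sketch. Measuring balance as normalized Shannon entropy is an assumption here (the abstract does not give the dissertation's exact formulas), and disparity, which needs a pairwise distance between classes, is omitted:

```python
import math

def variety(counts):
    """Variety: how many distinct classes are populated."""
    return sum(1 for c in counts if c > 0)

def balance(counts):
    """Balance: normalized Shannon entropy of the class
    distribution (1.0 = perfectly even, near 0 = heavily skewed)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(probs))

# A heavily skewed distribution of items across domains yields
# low balance, mirroring the imbalance reported above.
even = balance([25, 25, 25, 25])   # -> 1.0
skewed = balance([97, 1, 1, 1])    # well below 1
```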
Further analysis discloses that bots have been active since the inception of Wikidata,
and the community embraces their involvement in content editing tasks, often
importing data from Wikipedia, which shows a low diversity of sources in bot edits.
Bots and human users engage in similar editing tasks but exhibit distinct editing patterns.
The findings of this thesis confirm that bots possess the potential to influence
diversity within Wikidata by contributing substantial amounts of data to specific
classes and domains, leading to an imbalance. However, this potential can also be
harnessed to enhance coverage in classes with limited content and restore balance,
thus improving diversity. Hence, this study proposes to enhance diversity through
automation and demonstrates the practical implementation of the recommendations
using a specific use case.
In essence, this research enhances our understanding of diversity in relation to a KB,
elucidates the influence of automation on data diversity, and sheds light on diversity
improvement within a KB context through the usage of automation.
Statistically-driven generation of multidimensional analytical schemas from linked data
The ever-increasing Linked Data (LD) initiative has given place to open, large amounts of semi-structured and rich data published on the Web. However, effective analytical tools that aid the user in his/her analysis and go beyond browsing and querying are still lacking. To address this issue, we propose the automatic generation of multidimensional analytical stars (MDAS). The success of the multidimensional (MD) model for data analysis has been in great part due to its simplicity. Therefore, in this paper we aim at automatically discovering MD conceptual patterns that summarize LD. These patterns resemble the MD star schema typical of relational data warehousing. The underlying foundation of our method is a statistical framework that takes into account both concept and instance data. We present an implementation that makes use of the statistical framework to generate the MDAS. We have performed several experiments that assess and validate the statistical approach with two well-known and large LD sets.
This research has been partially funded by the "Ministerio de Economía y Competitividad" with contract number TIN2014-55335-R. Victoria Nebot was supported by the UJI Postdoctoral Fellowship program with reference PI14490.
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages.
Validation and Evaluation
In this technical report, we present prototypical implementations of
innovative tools and methods for personalized and contextualized (multimedia)
search, collaborative ontology evolution, ontology evaluation and cost models,
and dynamic access and trends in distributed (semantic) knowledge, developed
according to the working plan outlined in Technical Report TR-B-12-04. The
prototypes complete the next milestone on the path to an integral Corporate
Semantic Web architecture based on the three pillars Corporate Ontology
Engineering, Corporate Semantic Collaboration, and Corporate Semantic Search,
as envisioned in TR-B-08-09.