121 research outputs found
Scalable Data Integration for Linked Data
Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster
Semantic Systems. The Power of AI and Knowledge Graphs
This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies
frances : cloud-based historical text mining with deep learning and parallel processing
frances is an advanced cloud-based text mining digital platform that leverages information extraction, knowledge graphs, natural language processing (NLP), deep learning, and parallel processing techniques. It has been specifically designed to unlock the full potential of historical digital textual collections, such as those from the National Library of Scotland, offering cloud-based capabilities and extended support for complex NLP analyses and data visualizations. frances enables realtime recurrent operational text mining and provides robust capabilities for temporal analysis, accompanied by automatic visualizations for easy result inspection. In this paper, we present the motivation behind the development of frances, emphasizing its innovative design and novel implementation aspects. We also outline future development directions. Additionally, we evaluate the platform through two comprehensive case studies in history and publishing history. Feedback from participants in these studies demonstrates that frances accelerates their work and facilitates rapid testing and dissemination of ideas.Postprin
Alexandria: Extensible Framework for Rapid Exploration of Social Media
The Alexandria system under development at IBM Research provides an
extensible framework and platform for supporting a variety of big-data
analytics and visualizations. The system is currently focused on enabling rapid
exploration of text-based social media data. The system provides tools to help
with constructing "domain models" (i.e., families of keywords and extractors to
enable focus on tweets and other social media documents relevant to a project),
to rapidly extract and segment the relevant social media and its authors, to
apply further analytics (such as finding trends and anomalous terms), and
visualizing the results. The system architecture is centered around a variety
of REST-based service APIs to enable flexible orchestration of the system
capabilities; these are especially useful to support knowledge-worker driven
iterative exploration of social phenomena. The architecture also enables rapid
integration of Alexandria capabilities with other social media analytics
system, as has been demonstrated through an integration with IBM Research's
SystemG. This paper describes a prototypical usage scenario for Alexandria,
along with the architecture and key underlying analytics.Comment: 8 page
ΠΠΊΡΡΠΆΠ΅ΡΠ΅ Π·Π° Π°Π½Π°Π»ΠΈΠ·Ρ ΠΈ ΠΎΡΠ΅Π½Ρ ΠΊΠ²Π°Π»ΠΈΡΠ΅ΡΠ° Π²Π΅Π»ΠΈΠΊΠΈΡ ΠΈ ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°
Linking and publishing data in the Linked Open Data format increases the interoperability
and discoverability of resources over the Web. To accomplish this, the process comprises
several design decisions, based on the Linked Data principles that, on one hand, recommend to
use standards for the representation and the access to data on the Web, and on the other hand
to set hyperlinks between data from different sources.
Despite the efforts of the World Wide Web Consortium (W3C), being the main international
standards organization for the World Wide Web, there is no one tailored formula for publishing
data as Linked Data. In addition, the quality of the published Linked Open Data (LOD) is a
fundamental issue, and it is yet to be thoroughly managed and considered.
In this doctoral thesis, the main objective is to design and implement a novel framework for
selecting, analyzing, converting, interlinking, and publishing data from diverse sources,
simultaneously paying great attention to quality assessment throughout all steps and modules
of the framework. The goal is to examine whether and to what extent are the Semantic Web
technologies applicable for merging data from different sources and enabling end-users to
obtain additional information that was not available in individual datasets, in addition to the
integration into the Semantic Web community space. Additionally, the Ph.D. thesis intends to
validate the applicability of the process in the specific and demanding use case, i.e. for creating
and publishing an Arabic Linked Drug Dataset, based on open drug datasets from selected
Arabic countries and to discuss the quality issues observed in the linked data life-cycle. To that
end, in this doctoral thesis, a Semantic Data Lake was established in the pharmaceutical domain
that allows further integration and developing different business services on top of the
integrated data sources. Through data representation in an open machine-readable format, the
approach offers an optimum solution for information and data dissemination for building
domain-specific applications, and to enrich and gain value from the original dataset. This thesis
showcases how the pharmaceutical domain benefits from the evolving research trends for
building competitive advantages. However, as it is elaborated in this thesis, a better
understanding of the specifics of the Arabic language is required to extend linked data
technologies utilization in targeted Arabic organizations.ΠΠΎΠ²Π΅Π·ΠΈΠ²Π°ΡΠ΅ ΠΈ ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° Ρ ΡΠΎΡΠΌΠ°ΡΡ "ΠΠΎΠ²Π΅Π·Π°Π½ΠΈ ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈ ΠΏΠΎΠ΄Π°ΡΠΈ" (Π΅Π½Π³.
Linked Open Data) ΠΏΠΎΠ²Π΅ΡΠ°Π²Π° ΠΈΠ½ΡΠ΅ΡΠΎΠΏΠ΅ΡΠ°Π±ΠΈΠ»Π½ΠΎΡΡ ΠΈ ΠΌΠΎΠ³ΡΡΠ½ΠΎΡΡΠΈ Π·Π° ΠΏΡΠ΅ΡΡΠ°ΠΆΠΈΠ²Π°ΡΠ΅ ΡΠ΅ΡΡΡΡΠ°
ΠΏΡΠ΅ΠΊΠΎ Web-Π°. ΠΡΠΎΡΠ΅Ρ ΡΠ΅ Π·Π°ΡΠ½ΠΎΠ²Π°Π½ Π½Π° Linked Data ΠΏΡΠΈΠ½ΡΠΈΠΏΠΈΠΌΠ° (W3C, 2006) ΠΊΠΎΡΠΈ ΡΠ° ΡΠ΅Π΄Π½Π΅
ΡΡΡΠ°Π½Π΅ Π΅Π»Π°Π±ΠΎΡΠΈΡΠ° ΡΡΠ°Π½Π΄Π°ΡΠ΄Π΅ Π·Π° ΠΏΡΠ΅Π΄ΡΡΠ°Π²ΡΠ°ΡΠ΅ ΠΈ ΠΏΡΠΈΡΡΡΠΏ ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° Π½Π° WΠ΅Π±Ρ (RDF, OWL,
SPARQL), Π° ΡΠ° Π΄ΡΡΠ³Π΅ ΡΡΡΠ°Π½Π΅, ΠΏΡΠΈΠ½ΡΠΈΠΏΠΈ ΡΡΠ³Π΅ΡΠΈΡΡ ΠΊΠΎΡΠΈΡΡΠ΅ΡΠ΅ Ρ
ΠΈΠΏΠ΅ΡΠ²Π΅Π·Π° ΠΈΠ·ΠΌΠ΅ΡΡ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°
ΠΈΠ· ΡΠ°Π·Π»ΠΈΡΠΈΡΠΈΡ
ΠΈΠ·Π²ΠΎΡΠ°.
Π£ΠΏΡΠΊΠΎΡ Π½Π°ΠΏΠΎΡΠΈΠΌΠ° W3C ΠΊΠΎΠ½Π·ΠΎΡΡΠΈΡΡΠΌΠ° (W3C ΡΠ΅ Π³Π»Π°Π²Π½Π° ΠΌΠ΅ΡΡΠ½Π°ΡΠΎΠ΄Π½Π° ΠΎΡΠ³Π°Π½ΠΈΠ·Π°ΡΠΈΡΠ° Π·Π°
ΡΡΠ°Π½Π΄Π°ΡΠ΄Π΅ Π·Π° Web-Ρ), Π½Π΅ ΠΏΠΎΡΡΠΎΡΠΈ ΡΠ΅Π΄ΠΈΠ½ΡΡΠ²Π΅Π½Π° ΡΠΎΡΠΌΡΠ»Π° Π·Π° ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΡ ΠΏΡΠΎΡΠ΅ΡΠ°
ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° Ρ Linked Data ΡΠΎΡΠΌΠ°ΡΡ. Π£Π·ΠΈΠΌΠ°ΡΡΡΠΈ Ρ ΠΎΠ±Π·ΠΈΡ Π΄Π° ΡΠ΅ ΠΊΠ²Π°Π»ΠΈΡΠ΅Ρ
ΠΎΠ±ΡΠ°Π²ΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ
ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΎΠ΄Π»ΡΡΡΡΡΡΠΈ Π·Π° Π±ΡΠ΄ΡΡΠΈ ΡΠ°Π·Π²ΠΎΡ Web-Π°, Ρ ΠΎΠ²ΠΎΡ
Π΄ΠΎΠΊΡΠΎΡΡΠΊΠΎΡ Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠΈ, Π³Π»Π°Π²Π½ΠΈ ΡΠΈΡ ΡΠ΅ (1) Π΄ΠΈΠ·Π°ΡΠ½ ΠΈ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΠ° ΠΈΠ½ΠΎΠ²Π°ΡΠΈΠ²Π½ΠΎΠ³ ΠΎΠΊΠ²ΠΈΡΠ°
Π·Π° ΠΈΠ·Π±ΠΎΡ, Π°Π½Π°Π»ΠΈΠ·Ρ, ΠΊΠΎΠ½Π²Π΅ΡΠ·ΠΈΡΡ, ΠΌΠ΅ΡΡΡΠΎΠ±Π½ΠΎ ΠΏΠΎΠ²Π΅Π·ΠΈΠ²Π°ΡΠ΅ ΠΈ ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΈΠ·
ΡΠ°Π·Π»ΠΈΡΠΈΡΠΈΡ
ΠΈΠ·Π²ΠΎΡΠ° ΠΈ (2) Π°Π½Π°Π»ΠΈΠ·Π° ΠΏΡΠΈΠΌΠ΅Π½Π° ΠΎΠ²ΠΎΠ³ ΠΏΡΠΈΡΡΡΠΏΠ° Ρ ΡΠ°ΡΠΌΠ°ΡeΡΡΡΠΊΠΎΠΌ Π΄ΠΎΠΌΠ΅Π½Ρ.
ΠΡΠ΅Π΄Π»ΠΎΠΆΠ΅Π½Π° Π΄ΠΎΠΊΡΠΎΡΡΠΊΠ° Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠ° Π΄Π΅ΡΠ°ΡΠ½ΠΎ ΠΈΡΡΡΠ°ΠΆΡΡΠ΅ ΠΏΠΈΡΠ°ΡΠ΅ ΠΊΠ²Π°Π»ΠΈΡΠ΅ΡΠ° Π²Π΅Π»ΠΈΠΊΠΈΡ
ΠΈ
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ
Π΅ΠΊΠΎΡΠΈΡΡΠ΅ΠΌΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° (Π΅Π½Π³. Linked Data Ecosystems), ΡΠ·ΠΈΠΌΠ°ΡΡΡΠΈ Ρ ΠΎΠ±Π·ΠΈΡ
ΠΌΠΎΠ³ΡΡΠ½ΠΎΡΡ ΠΏΠΎΠ½ΠΎΠ²Π½ΠΎΠ³ ΠΊΠΎΡΠΈΡΡΠ΅ΡΠ° ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°. Π Π°Π΄ ΡΠ΅ ΠΌΠΎΡΠΈΠ²ΠΈΡΠ°Π½ ΠΏΠΎΡΡΠ΅Π±ΠΎΠΌ Π΄Π° ΡΠ΅
ΠΎΠΌΠΎΠ³ΡΡΠΈ ΠΈΡΡΡΠ°ΠΆΠΈΠ²Π°ΡΠΈΠΌΠ° ΠΈΠ· Π°ΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ° Π΄Π° ΡΠΏΠΎΡΡΠ΅Π±ΠΎΠΌ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈΡ
Π²Π΅Π± ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ°
ΠΏΠΎΠ²Π΅ΠΆΡ ΡΠ²ΠΎΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠΊΠ΅ ΡΠ° ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΠΌ ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ°, ΠΊΠ°ΠΎ Π½ΠΏΡ. DBpedia-ΡΠΎΠΌ. Π¦ΠΈΡ ΡΠ΅ Π΄Π° ΡΠ΅ ΠΈΡΠΏΠΈΡΠ°
Π΄Π° Π»ΠΈ ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈ ΠΏΠΎΠ΄Π°ΡΠΈ ΠΈΠ· ΠΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ° ΠΎΠΌΠΎΠ³ΡΡΠ°Π²Π°ΡΡ ΠΊΡΠ°ΡΡΠΈΠΌ ΠΊΠΎΡΠΈΡΠ½ΠΈΡΠΈΠΌΠ° Π΄Π° Π΄ΠΎΠ±ΠΈΡΡ
Π΄ΠΎΠ΄Π°ΡΠ½Π΅ ΠΈΠ½ΡΠΎΡΠΌΠ°ΡΠΈΡΠ΅ ΠΊΠΎΡΠ΅ Π½ΠΈΡΡ Π΄ΠΎΡΡΡΠΏΠ½Π΅ Ρ ΠΏΠΎΡΠ΅Π΄ΠΈΠ½Π°ΡΠ½ΠΈΠΌ ΡΠΊΡΠΏΠΎΠ²ΠΈΠΌΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°, ΠΏΠΎΡΠ΅Π΄
ΠΈΠ½ΡΠ΅Π³ΡΠ°ΡΠΈΡΠ΅ Ρ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈ WΠ΅Π± ΠΏΡΠΎΡΡΠΎΡ.
ΠΠΎΠΊΡΠΎΡΡΠΊΠ° Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠ° ΠΏΡΠ΅Π΄Π»Π°ΠΆΠ΅ ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ»ΠΎΠ³ΠΈΡΡ Π·Π° ΡΠ°Π·Π²ΠΎΡ Π°ΠΏΠ»ΠΈΠΊΠ°ΡΠΈΡΠ΅ Π·Π° ΡΠ°Π΄ ΡΠ°
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΠΌ (Linked) ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° ΠΈ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠΈΡΠ° ΡΠΎΡΡΠ²Π΅ΡΡΠΊΠΎ ΡΠ΅ΡΠ΅ΡΠ΅ ΠΊΠΎΡΠ΅ ΠΎΠΌΠΎΠ³ΡΡΡΡΠ΅
ΠΏΡΠ΅ΡΡΠ°ΠΆΠΈΠ²Π°ΡΠ΅ ΠΊΠΎΠ½ΡΠΎΠ»ΠΈΠ΄ΠΎΠ²Π°Π½ΠΎΠ³ ΡΠΊΡΠΏΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΎ Π»Π΅ΠΊΠΎΠ²ΠΈΠΌΠ° ΠΈΠ· ΠΈΠ·Π°Π±ΡΠ°Π½ΠΈΡ
Π°ΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ°. ΠΠΎΠ½ΡΠΎΠ»ΠΈΠ΄ΠΎΠ²Π°Π½ΠΈ ΡΠΊΡΠΏ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΡΠ΅ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠΈΡΠ°Π½ Ρ ΠΎΠ±Π»ΠΈΠΊΡ Π‘Π΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΎΠ³ ΡΠ΅Π·Π΅ΡΠ°
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° (Π΅Π½Π³. Semantic Data Lake).
ΠΠ²Π° ΡΠ΅Π·Π° ΠΏΠΎΠΊΠ°Π·ΡΡΠ΅ ΠΊΠ°ΠΊΠΎ ΡΠ°ΡΠΌΠ°ΡΠ΅ΡΡΡΠΊΠ° ΠΈΠ½Π΄ΡΡΡΡΠΈΡΠ° ΠΈΠΌΠ° ΠΊΠΎΡΠΈΡΡΠΈ ΠΎΠ΄ ΠΏΡΠΈΠΌΠ΅Π½Π΅
ΠΈΠ½ΠΎΠ²Π°ΡΠΈΠ²Π½ΠΈΡ
ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ° ΠΈ ΠΈΡΡΡΠ°ΠΆΠΈΠ²Π°ΡΠΊΠΈΡ
ΡΡΠ΅Π½Π΄ΠΎΠ²Π° ΠΈΠ· ΠΎΠ±Π»Π°ΡΡΠΈ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈΡ
ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ°. ΠΠ΅ΡΡΡΠΈΠΌ, ΠΊΠ°ΠΊΠΎ ΡΠ΅ Π΅Π»Π°Π±ΠΎΡΠΈΡΠ°Π½ΠΎ Ρ ΠΎΠ²ΠΎΡ ΡΠ΅Π·ΠΈ, ΠΏΠΎΡΡΠ΅Π±Π½ΠΎ ΡΠ΅ Π±ΠΎΡΠ΅ ΡΠ°Π·ΡΠΌΠ΅Π²Π°ΡΠ΅
ΡΠΏΠ΅ΡΠΈΡΠΈΡΠ½ΠΎΡΡΠΈ Π°ΡΠ°ΠΏΡΠΊΠΎΠ³ ΡΠ΅Π·ΠΈΠΊΠ° Π·Π° ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΡ Linked Data Π°Π»Π°ΡΠ° ΠΈ ΡΡΡ
ΠΎΠ²Ρ ΠΏΡΠΈΠΌΠ΅Π½Ρ
ΡΠ° ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° ΠΈΠ· ΠΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ°
Distributed Holistic Clustering on Linked Data
Link discovery is an active field of research to support data integration in
the Web of Data. Due to the huge size and number of available data sources,
efficient and effective link discovery is a very challenging task. Common
pairwise link discovery approaches do not scale to many sources with very large
entity sets. We here propose a distributed holistic approach to link many data
sources based on a clustering of entities that represent the same real-world
object. Our clustering approach provides a compact and fused representation of
entities, and can identify errors in existing links as well as many new links.
We support a distributed execution of the clustering approach to achieve faster
execution times and scalability for large real-world data sets. We provide a
novel gold standard for multi-source clustering, and evaluate our methods with
respect to effectiveness and efficiency for large data sets from the geographic
and music domains
- β¦