765 research outputs found
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed
When things matter: A survey on data-centric Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost wireless sensor devices, and Web technologies, the Internet of Things (IoT) approach has gained momentum in connecting everyday objects to the Internet and facilitating machine-to-human and machine-to-machine communication with the physical world. IoT offers the capability to connect and integrate both digital and physical entities, enabling a whole new class of applications and services, but several significant challenges need to be addressed before these applications and services can be fully realized. A fundamental challenge centers around managing IoT data, typically produced in dynamic and volatile environments, which is not only extremely large in scale and volume, but also noisy and continuous. This paper reviews the main techniques and state-of-the-art research efforts in IoT from data-centric perspectives, including data stream processing, data storage models, complex event processing, and searching in IoT. Open research issues for IoT data management are also discussed
Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP
Extracting valuable insights from vast amounts of unstructured digital text presents significant challenges across diverse domains. This research addresses this challenge by proposing a novel pipeline-based system that generates domain-agnostic and task-agnostic text representations. The proposed approach leverages labeled property graphs (LPG) to encode contextual information, facilitating the integration of diverse linguistic elements into a unified representation. The proposed system enables efficient graph-based querying and manipulation by addressing the crucial aspect of comprehensive context modeling and fine-grained semantics. The effectiveness of the proposed system is demonstrated through the implementation of NLP components that operate on LPG-based representations. Additionally, the proposed approach introduces specialized patterns and algorithms to enhance specific NLP tasks, including nominal mention detection, named entity disambiguation, event enrichments, event participant detection, and temporal link detection. The evaluation of the proposed approach, using the MEANTIME corpus comprising manually annotated documents, provides encouraging results and valuable insights into the system\u27s strengths. The proposed pipeline-based framework serves as a solid foundation for future research, aiming to refine and optimize LPG-based graph structures to generate comprehensive and semantically rich text representations, addressing the challenges associated with efficient information extraction and analysis in NLP
ΠΠΊΡΡΠΆΠ΅ΡΠ΅ Π·Π° Π°Π½Π°Π»ΠΈΠ·Ρ ΠΈ ΠΎΡΠ΅Π½Ρ ΠΊΠ²Π°Π»ΠΈΡΠ΅ΡΠ° Π²Π΅Π»ΠΈΠΊΠΈΡ ΠΈ ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°
Linking and publishing data in the Linked Open Data format increases the interoperability
and discoverability of resources over the Web. To accomplish this, the process comprises
several design decisions, based on the Linked Data principles that, on one hand, recommend to
use standards for the representation and the access to data on the Web, and on the other hand
to set hyperlinks between data from different sources.
Despite the efforts of the World Wide Web Consortium (W3C), being the main international
standards organization for the World Wide Web, there is no one tailored formula for publishing
data as Linked Data. In addition, the quality of the published Linked Open Data (LOD) is a
fundamental issue, and it is yet to be thoroughly managed and considered.
In this doctoral thesis, the main objective is to design and implement a novel framework for
selecting, analyzing, converting, interlinking, and publishing data from diverse sources,
simultaneously paying great attention to quality assessment throughout all steps and modules
of the framework. The goal is to examine whether and to what extent are the Semantic Web
technologies applicable for merging data from different sources and enabling end-users to
obtain additional information that was not available in individual datasets, in addition to the
integration into the Semantic Web community space. Additionally, the Ph.D. thesis intends to
validate the applicability of the process in the specific and demanding use case, i.e. for creating
and publishing an Arabic Linked Drug Dataset, based on open drug datasets from selected
Arabic countries and to discuss the quality issues observed in the linked data life-cycle. To that
end, in this doctoral thesis, a Semantic Data Lake was established in the pharmaceutical domain
that allows further integration and developing different business services on top of the
integrated data sources. Through data representation in an open machine-readable format, the
approach offers an optimum solution for information and data dissemination for building
domain-specific applications, and to enrich and gain value from the original dataset. This thesis
showcases how the pharmaceutical domain benefits from the evolving research trends for
building competitive advantages. However, as it is elaborated in this thesis, a better
understanding of the specifics of the Arabic language is required to extend linked data
technologies utilization in targeted Arabic organizations.ΠΠΎΠ²Π΅Π·ΠΈΠ²Π°ΡΠ΅ ΠΈ ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° Ρ ΡΠΎΡΠΌΠ°ΡΡ "ΠΠΎΠ²Π΅Π·Π°Π½ΠΈ ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈ ΠΏΠΎΠ΄Π°ΡΠΈ" (Π΅Π½Π³.
Linked Open Data) ΠΏΠΎΠ²Π΅ΡΠ°Π²Π° ΠΈΠ½ΡΠ΅ΡΠΎΠΏΠ΅ΡΠ°Π±ΠΈΠ»Π½ΠΎΡΡ ΠΈ ΠΌΠΎΠ³ΡΡΠ½ΠΎΡΡΠΈ Π·Π° ΠΏΡΠ΅ΡΡΠ°ΠΆΠΈΠ²Π°ΡΠ΅ ΡΠ΅ΡΡΡΡΠ°
ΠΏΡΠ΅ΠΊΠΎ Web-Π°. ΠΡΠΎΡΠ΅Ρ ΡΠ΅ Π·Π°ΡΠ½ΠΎΠ²Π°Π½ Π½Π° Linked Data ΠΏΡΠΈΠ½ΡΠΈΠΏΠΈΠΌΠ° (W3C, 2006) ΠΊΠΎΡΠΈ ΡΠ° ΡΠ΅Π΄Π½Π΅
ΡΡΡΠ°Π½Π΅ Π΅Π»Π°Π±ΠΎΡΠΈΡΠ° ΡΡΠ°Π½Π΄Π°ΡΠ΄Π΅ Π·Π° ΠΏΡΠ΅Π΄ΡΡΠ°Π²ΡΠ°ΡΠ΅ ΠΈ ΠΏΡΠΈΡΡΡΠΏ ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° Π½Π° WΠ΅Π±Ρ (RDF, OWL,
SPARQL), Π° ΡΠ° Π΄ΡΡΠ³Π΅ ΡΡΡΠ°Π½Π΅, ΠΏΡΠΈΠ½ΡΠΈΠΏΠΈ ΡΡΠ³Π΅ΡΠΈΡΡ ΠΊΠΎΡΠΈΡΡΠ΅ΡΠ΅ Ρ
ΠΈΠΏΠ΅ΡΠ²Π΅Π·Π° ΠΈΠ·ΠΌΠ΅ΡΡ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°
ΠΈΠ· ΡΠ°Π·Π»ΠΈΡΠΈΡΠΈΡ
ΠΈΠ·Π²ΠΎΡΠ°.
Π£ΠΏΡΠΊΠΎΡ Π½Π°ΠΏΠΎΡΠΈΠΌΠ° W3C ΠΊΠΎΠ½Π·ΠΎΡΡΠΈΡΡΠΌΠ° (W3C ΡΠ΅ Π³Π»Π°Π²Π½Π° ΠΌΠ΅ΡΡΠ½Π°ΡΠΎΠ΄Π½Π° ΠΎΡΠ³Π°Π½ΠΈΠ·Π°ΡΠΈΡΠ° Π·Π°
ΡΡΠ°Π½Π΄Π°ΡΠ΄Π΅ Π·Π° Web-Ρ), Π½Π΅ ΠΏΠΎΡΡΠΎΡΠΈ ΡΠ΅Π΄ΠΈΠ½ΡΡΠ²Π΅Π½Π° ΡΠΎΡΠΌΡΠ»Π° Π·Π° ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΡ ΠΏΡΠΎΡΠ΅ΡΠ°
ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° Ρ Linked Data ΡΠΎΡΠΌΠ°ΡΡ. Π£Π·ΠΈΠΌΠ°ΡΡΡΠΈ Ρ ΠΎΠ±Π·ΠΈΡ Π΄Π° ΡΠ΅ ΠΊΠ²Π°Π»ΠΈΡΠ΅Ρ
ΠΎΠ±ΡΠ°Π²ΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ
ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΎΠ΄Π»ΡΡΡΡΡΡΠΈ Π·Π° Π±ΡΠ΄ΡΡΠΈ ΡΠ°Π·Π²ΠΎΡ Web-Π°, Ρ ΠΎΠ²ΠΎΡ
Π΄ΠΎΠΊΡΠΎΡΡΠΊΠΎΡ Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠΈ, Π³Π»Π°Π²Π½ΠΈ ΡΠΈΡ ΡΠ΅ (1) Π΄ΠΈΠ·Π°ΡΠ½ ΠΈ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΠ° ΠΈΠ½ΠΎΠ²Π°ΡΠΈΠ²Π½ΠΎΠ³ ΠΎΠΊΠ²ΠΈΡΠ°
Π·Π° ΠΈΠ·Π±ΠΎΡ, Π°Π½Π°Π»ΠΈΠ·Ρ, ΠΊΠΎΠ½Π²Π΅ΡΠ·ΠΈΡΡ, ΠΌΠ΅ΡΡΡΠΎΠ±Π½ΠΎ ΠΏΠΎΠ²Π΅Π·ΠΈΠ²Π°ΡΠ΅ ΠΈ ΠΎΠ±ΡΠ°Π²ΡΠΈΠ²Π°ΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΈΠ·
ΡΠ°Π·Π»ΠΈΡΠΈΡΠΈΡ
ΠΈΠ·Π²ΠΎΡΠ° ΠΈ (2) Π°Π½Π°Π»ΠΈΠ·Π° ΠΏΡΠΈΠΌΠ΅Π½Π° ΠΎΠ²ΠΎΠ³ ΠΏΡΠΈΡΡΡΠΏΠ° Ρ ΡΠ°ΡΠΌΠ°ΡeΡΡΡΠΊΠΎΠΌ Π΄ΠΎΠΌΠ΅Π½Ρ.
ΠΡΠ΅Π΄Π»ΠΎΠΆΠ΅Π½Π° Π΄ΠΎΠΊΡΠΎΡΡΠΊΠ° Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠ° Π΄Π΅ΡΠ°ΡΠ½ΠΎ ΠΈΡΡΡΠ°ΠΆΡΡΠ΅ ΠΏΠΈΡΠ°ΡΠ΅ ΠΊΠ²Π°Π»ΠΈΡΠ΅ΡΠ° Π²Π΅Π»ΠΈΠΊΠΈΡ
ΠΈ
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΡ
Π΅ΠΊΠΎΡΠΈΡΡΠ΅ΠΌΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° (Π΅Π½Π³. Linked Data Ecosystems), ΡΠ·ΠΈΠΌΠ°ΡΡΡΠΈ Ρ ΠΎΠ±Π·ΠΈΡ
ΠΌΠΎΠ³ΡΡΠ½ΠΎΡΡ ΠΏΠΎΠ½ΠΎΠ²Π½ΠΎΠ³ ΠΊΠΎΡΠΈΡΡΠ΅ΡΠ° ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΡ
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°. Π Π°Π΄ ΡΠ΅ ΠΌΠΎΡΠΈΠ²ΠΈΡΠ°Π½ ΠΏΠΎΡΡΠ΅Π±ΠΎΠΌ Π΄Π° ΡΠ΅
ΠΎΠΌΠΎΠ³ΡΡΠΈ ΠΈΡΡΡΠ°ΠΆΠΈΠ²Π°ΡΠΈΠΌΠ° ΠΈΠ· Π°ΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ° Π΄Π° ΡΠΏΠΎΡΡΠ΅Π±ΠΎΠΌ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈΡ
Π²Π΅Π± ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ°
ΠΏΠΎΠ²Π΅ΠΆΡ ΡΠ²ΠΎΡΠ΅ ΠΏΠΎΠ΄Π°ΡΠΊΠ΅ ΡΠ° ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈΠΌ ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ°, ΠΊΠ°ΠΎ Π½ΠΏΡ. DBpedia-ΡΠΎΠΌ. Π¦ΠΈΡ ΡΠ΅ Π΄Π° ΡΠ΅ ΠΈΡΠΏΠΈΡΠ°
Π΄Π° Π»ΠΈ ΠΎΡΠ²ΠΎΡΠ΅Π½ΠΈ ΠΏΠΎΠ΄Π°ΡΠΈ ΠΈΠ· ΠΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ° ΠΎΠΌΠΎΠ³ΡΡΠ°Π²Π°ΡΡ ΠΊΡΠ°ΡΡΠΈΠΌ ΠΊΠΎΡΠΈΡΠ½ΠΈΡΠΈΠΌΠ° Π΄Π° Π΄ΠΎΠ±ΠΈΡΡ
Π΄ΠΎΠ΄Π°ΡΠ½Π΅ ΠΈΠ½ΡΠΎΡΠΌΠ°ΡΠΈΡΠ΅ ΠΊΠΎΡΠ΅ Π½ΠΈΡΡ Π΄ΠΎΡΡΡΠΏΠ½Π΅ Ρ ΠΏΠΎΡΠ΅Π΄ΠΈΠ½Π°ΡΠ½ΠΈΠΌ ΡΠΊΡΠΏΠΎΠ²ΠΈΠΌΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ°, ΠΏΠΎΡΠ΅Π΄
ΠΈΠ½ΡΠ΅Π³ΡΠ°ΡΠΈΡΠ΅ Ρ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈ WΠ΅Π± ΠΏΡΠΎΡΡΠΎΡ.
ΠΠΎΠΊΡΠΎΡΡΠΊΠ° Π΄ΠΈΡΠ΅ΡΡΠ°ΡΠΈΡΠ° ΠΏΡΠ΅Π΄Π»Π°ΠΆΠ΅ ΠΌΠ΅ΡΠΎΠ΄ΠΎΠ»ΠΎΠ³ΠΈΡΡ Π·Π° ΡΠ°Π·Π²ΠΎΡ Π°ΠΏΠ»ΠΈΠΊΠ°ΡΠΈΡΠ΅ Π·Π° ΡΠ°Π΄ ΡΠ°
ΠΏΠΎΠ²Π΅Π·Π°Π½ΠΈΠΌ (Linked) ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° ΠΈ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠΈΡΠ° ΡΠΎΡΡΠ²Π΅ΡΡΠΊΠΎ ΡΠ΅ΡΠ΅ΡΠ΅ ΠΊΠΎΡΠ΅ ΠΎΠΌΠΎΠ³ΡΡΡΡΠ΅
ΠΏΡΠ΅ΡΡΠ°ΠΆΠΈΠ²Π°ΡΠ΅ ΠΊΠΎΠ½ΡΠΎΠ»ΠΈΠ΄ΠΎΠ²Π°Π½ΠΎΠ³ ΡΠΊΡΠΏΠ° ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΠΎ Π»Π΅ΠΊΠΎΠ²ΠΈΠΌΠ° ΠΈΠ· ΠΈΠ·Π°Π±ΡΠ°Π½ΠΈΡ
Π°ΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ°. ΠΠΎΠ½ΡΠΎΠ»ΠΈΠ΄ΠΎΠ²Π°Π½ΠΈ ΡΠΊΡΠΏ ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° ΡΠ΅ ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠΈΡΠ°Π½ Ρ ΠΎΠ±Π»ΠΈΠΊΡ Π‘Π΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΎΠ³ ΡΠ΅Π·Π΅ΡΠ°
ΠΏΠΎΠ΄Π°ΡΠ°ΠΊΠ° (Π΅Π½Π³. Semantic Data Lake).
ΠΠ²Π° ΡΠ΅Π·Π° ΠΏΠΎΠΊΠ°Π·ΡΡΠ΅ ΠΊΠ°ΠΊΠΎ ΡΠ°ΡΠΌΠ°ΡΠ΅ΡΡΡΠΊΠ° ΠΈΠ½Π΄ΡΡΡΡΠΈΡΠ° ΠΈΠΌΠ° ΠΊΠΎΡΠΈΡΡΠΈ ΠΎΠ΄ ΠΏΡΠΈΠΌΠ΅Π½Π΅
ΠΈΠ½ΠΎΠ²Π°ΡΠΈΠ²Π½ΠΈΡ
ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ° ΠΈ ΠΈΡΡΡΠ°ΠΆΠΈΠ²Π°ΡΠΊΠΈΡ
ΡΡΠ΅Π½Π΄ΠΎΠ²Π° ΠΈΠ· ΠΎΠ±Π»Π°ΡΡΠΈ ΡΠ΅ΠΌΠ°Π½ΡΠΈΡΠΊΠΈΡ
ΡΠ΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΠ°. ΠΠ΅ΡΡΡΠΈΠΌ, ΠΊΠ°ΠΊΠΎ ΡΠ΅ Π΅Π»Π°Π±ΠΎΡΠΈΡΠ°Π½ΠΎ Ρ ΠΎΠ²ΠΎΡ ΡΠ΅Π·ΠΈ, ΠΏΠΎΡΡΠ΅Π±Π½ΠΎ ΡΠ΅ Π±ΠΎΡΠ΅ ΡΠ°Π·ΡΠΌΠ΅Π²Π°ΡΠ΅
ΡΠΏΠ΅ΡΠΈΡΠΈΡΠ½ΠΎΡΡΠΈ Π°ΡΠ°ΠΏΡΠΊΠΎΠ³ ΡΠ΅Π·ΠΈΠΊΠ° Π·Π° ΠΈΠΌΠΏΠ»Π΅ΠΌΠ΅Π½ΡΠ°ΡΠΈΡΡ Linked Data Π°Π»Π°ΡΠ° ΠΈ ΡΡΡ
ΠΎΠ²Ρ ΠΏΡΠΈΠΌΠ΅Π½Ρ
ΡΠ° ΠΏΠΎΠ΄Π°ΡΠΈΠΌΠ° ΠΈΠ· ΠΡΠ°ΠΏΡΠΊΠΈΡ
Π·Π΅ΠΌΠ°ΡΠ°
Extracting and Cleaning RDF Data
The RDF data model has become a prevalent format to represent heterogeneous data because of its versatility. The capability of dismantling information from its native formats and representing it in triple format offers a simple yet powerful way of modelling data that is obtained from multiple sources. In addition, the triple format and schema constraints of the RDF model make the RDF data easy to process as labeled, directed graphs.
This graph representation of RDF data supports higher-level analytics by enabling querying using different techniques and querying languages, e.g., SPARQL. Anlaytics that require structured data are supported by transforming the graph data on-the-fly to populate the target schema that is needed for downstream analysis. These target schemas are defined by downstream applications according to their information need.
The flexibility of RDF data brings two main challenges. First, the extraction of RDF data is a complex task that may involve domain expertise about the information required to be extracted for different applications. Another significant aspect of analyzing RDF data is its quality, which depends on multiple factors including the reliability of data sources and the accuracy of the extraction systems. The quality of the analysis depends mainly on the quality of the underlying data. Therefore, evaluating and improving the quality of RDF data has a direct effect on the correctness of downstream analytics.
This work presents multiple approaches related to the extraction and quality evaluation of RDF data. To cope with the large amounts of data that needs to be extracted, we present DSTLR, a scalable framework to extract RDF triples from semi-structured and unstructured data sources. For rare entities that fall on the long tail of information, there may not be enough signals to support high-confidence extraction. Towards this problem, we present an approach to estimate property values for long tail entities. We also present multiple algorithms and approaches that focus on the quality of RDF data. These include discovering quality constraints from RDF data, and utilizing machine learning techniques to repair errors in RDF data
A Learning Health System for Radiation Oncology
The proposed research aims to address the challenges faced by clinical data science researchers in radiation oncology accessing, integrating, and analyzing heterogeneous data from various sources. The research presents a scalable intelligent infrastructure, called the Health Information Gateway and Exchange (HINGE), which captures and structures data from multiple sources into a knowledge base with semantically interlinked entities. This infrastructure enables researchers to mine novel associations and gather relevant knowledge for personalized clinical outcomes.
The dissertation discusses the design framework and implementation of HINGE, which abstracts structured data from treatment planning systems, treatment management systems, and electronic health records. It utilizes disease-specific smart templates for capturing clinical information in a discrete manner. HINGE performs data extraction, aggregation, and quality and outcome assessment functions automatically, connecting seamlessly with local IT/medical infrastructure.
Furthermore, the research presents a knowledge graph-based approach to map radiotherapy data to an ontology-based data repository using FAIR (Findable, Accessible, Interoperable, Reusable) concepts. This approach ensures that the data is easily discoverable and accessible for clinical decision support systems. The dissertation explores the ETL (Extract, Transform, Load) process, data model frameworks, ontologies, and provides a real-world clinical use case for this data mapping.
To improve the efficiency of retrieving information from large clinical datasets, a search engine based on ontology-based keyword searching and synonym-based term matching tool was developed. The hierarchical nature of ontologies is leveraged to retrieve patient records based on parent and children classes. Additionally, patient similarity analysis is conducted using vector embedding models (Word2Vec, Doc2Vec, GloVe, and FastText) to identify similar patients based on text corpus creation methods. Results from the analysis using these models are presented.
The implementation of a learning health system for predicting radiation pneumonitis following stereotactic body radiotherapy is also discussed. 3D convolutional neural networks (CNNs) are utilized with radiographic and dosimetric datasets to predict the likelihood of radiation pneumonitis. DenseNet-121 and ResNet-50 models are employed for this study, along with integrated gradient techniques to identify salient regions within the input 3D image dataset. The predictive performance of the 3D CNN models is evaluated based on clinical outcomes.
Overall, the proposed Learning Health System provides a comprehensive solution for capturing, integrating, and analyzing heterogeneous data in a knowledge base. It offers researchers the ability to extract valuable insights and associations from diverse sources, ultimately leading to improved clinical outcomes. This work can serve as a model for implementing LHS in other medical specialties, advancing personalized and data-driven medicine
From Text to Knowledge with Graphs: modelling, querying and exploiting textual content
This paper highlights the challenges, current trends, and open issues related
to the representation, querying and analytics of content extracted from texts.
The internet contains vast text-based information on various subjects,
including commercial documents, medical records, scientific experiments,
engineering tests, and events that impact urban and natural environments.
Extracting knowledge from this text involves understanding the nuances of
natural language and accurately representing the content without losing
information. This allows knowledge to be accessed, inferred, or discovered. To
achieve this, combining results from various fields, such as linguistics,
natural language processing, knowledge representation, data storage, querying,
and analytics, is necessary. The vision in this paper is that graphs can be a
well-suited text content representation once annotated and the right querying
and analytics techniques are applied. This paper discusses this hypothesis from
the perspective of linguistics, natural language processing, graph models and
databases and artificial intelligence provided by the panellists of the DOING
session in the MADICS Symposium 2022
Scalable Data Integration for Linked Data
Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster
- β¦