3 research outputs found

    Knowledge extraction from unstructured data and classification through distributed ontologies

    The World Wide Web has changed the way humans use and share information. The Web removed many barriers to accessing published information and became an enormous space where users can easily navigate through heterogeneous resources (such as linked documents) and can easily edit, modify, or produce them. Documents implicitly enclose information and relationships that are accessible only to human beings. Indeed, the Web of documents evolved into a space of data silos, linked to each other only through untyped references (such as hypertext links) that only humans could understand. A growing desire to programmatically access the pieces of data implicitly enclosed in documents has characterized the recent efforts of the Web research community. Direct access means structured data, enabling computing machinery to easily exploit the links between different data sources. It became crucial for the Web community to provide a technology stack that eases data integration at large scale, first structuring the data using standard ontologies and afterwards linking it to external data. Ontologies became the best practice for defining axioms and relationships among classes, and the Resource Description Framework (RDF) became the basic data model chosen to represent ontology instances (i.e. an instance is a value of an axiom, class, or attribute). Data has become the new oil; in particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. In the literature these problems have been addressed by several proposals and standards, which mainly focus on technologies to access the data and on formats to represent the semantics of the data and their relationships.

    With the increasing volume of interconnected and serialized RDF data, RDF repositories may suffer from data overload and may become a single point of failure for the overall Linked Data vision. One goal of this dissertation is to propose a thorough approach to managing large-scale RDF repositories and to distribute them over a redundant and reliable peer-to-peer RDF architecture. The architecture consists of a logic to distribute and mine the knowledge and of a set of physical peer nodes organized in a ring topology based on a Distributed Hash Table (DHT). Each node shares the same logic and provides an entry point that enables clients to query the knowledge base using atomic, disjunctive, and conjunctive SPARQL queries. The consistency of the results is increased by a data redundancy algorithm that replicates each RDF triple on multiple nodes so that, in case of peer failure, other peers can retrieve the data needed to resolve the queries. Additionally, a distributed load-balancing algorithm maintains a uniform distribution of the data among the participating peers by dynamically changing the key space assigned to each node in the DHT (a toy sketch of this placement scheme is given below).

    Recently, the process of data structuring has gained more and more attention when applied to the large volume of text spread across the Web, such as legacy data, newspapers, scientific papers, or (micro-)blog posts. This process mainly consists of three steps: (i) the extraction from the text of atomic pieces of information, called named entities; (ii) the classification of these pieces of information through ontologies; (iii) their disambiguation through Uniform Resource Identifiers (URIs) identifying real-world objects.
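    The placement scheme mentioned above can be made concrete with a short sketch. The replication factor, the choice of hashing the subject URI, and the node positions below are illustrative assumptions; the abstract only states that each triple is replicated on multiple nodes of a DHT ring whose key-space assignment can shift for load balancing.

```python
import hashlib
from bisect import bisect_right

REPLICATION_FACTOR = 3  # assumed value; the abstract only says "multiple nodes"

def triple_key(subject: str) -> int:
    # Hash the subject URI so all triples about one resource land on one arc.
    return int(hashlib.sha1(subject.encode("utf-8")).hexdigest(), 16)

class DHTRing:
    """Toy model of the ring topology: nodes sit at positions in the key space."""

    def __init__(self, node_positions):
        self.positions = sorted(node_positions)

    def replica_positions(self, subject: str):
        # A triple is stored on the successor of its key and on the next
        # REPLICATION_FACTOR - 1 nodes clockwise, so that a failed peer
        # leaves live copies for query resolution.
        key = triple_key(subject)
        n = len(self.positions)
        i = bisect_right(self.positions, key) % n
        return [self.positions[(i + k) % n] for k in range(min(REPLICATION_FACTOR, n))]

# Example: four peers; load balancing would shift these positions over time.
ring = DHTRing([2**158, 2**159, 3 * 2**158, 2**160 - 1])
print(ring.replica_positions("http://example.org/resource/42"))
```

    If the successor peer fails, a query for this subject can fall back to any of the remaining replica positions, which is how the redundancy described above keeps results consistent.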
As a step towards interconnecting the Web to real-world objects via named entities, different techniques have been proposed. The second objective of this work is to compare these approaches in order to highlight their strengths and weaknesses in different scenarios, such as scientific papers, newspapers, or user-generated content. We created the Named Entity Recognition and Disambiguation (NERD) web framework, publicly accessible on the Web (through a REST API and a web User Interface), which unifies several named entity extraction technologies. Moreover, we proposed the NERD ontology, a reference ontology for comparing the results of these technologies. Recently, the NERD ontology has been included in the NIF (Natural Language Processing Interchange Format) specification, part of the Creating Knowledge out of Interlinked Data (LOD2) project.

    Summarizing, this dissertation defines a framework for the extraction of knowledge from unstructured data and its classification via distributed ontologies. A detailed study of the Semantic Web and knowledge extraction fields is proposed to define the issues under investigation in this work. The dissertation then proposes an architecture to tackle the single-point-of-failure issue introduced by the RDF repositories spread across the Web. Although the use of ontologies enables a Web where data is structured and comprehensible by computing machinery, human users may also take advantage of it, especially for the annotation task. Hence, this work describes an annotation tool for web editing and for audio and video annotation, with a web front-end User Interface built on top of a distributed ontology. Furthermore, this dissertation details a thorough comparison of the state of the art in named entity technologies. The NERD framework is presented as a technology that encompasses existing solutions in the named entity extraction field, and the NERD ontology as a reference ontology for the field. Finally, this work highlights three use cases aimed at reducing the number of data silos spread across the Web: a Linked Data approach to augment the automatic classification task in a Systematic Literature Review, an application to lift educational data stored in Sharable Content Object Reference Model (SCORM) data silos to the Web of Data, and a scientific conference venue enhancer plug-in built on top of several live data collectors.

    Significant research efforts have been devoted to combining the efficiency of a reliable data structure with the power of data extraction techniques. This dissertation opens several research directions that join two communities: the Semantic Web and the Natural Language Processing communities. The Web provides a considerable amount of data on which NLP techniques can shed light, and the use of URIs as unique identifiers provides one milestone for the materialization of entities lifted from raw text to real-world objects.
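    As an illustration of the kind of client-side use such a unifying framework enables, the sketch below sends text to one underlying extractor through a single REST call and aligns the extractor-specific entity types to a shared reference type, which is the role the NERD ontology plays. The endpoint URL, parameter names, response shape, and type mappings are placeholders invented for illustration, not the actual NERD API.

```python
import requests

# Placeholder endpoint; the real NERD API and its parameters may differ.
NERD_ENDPOINT = "http://nerd.example.org/api/annotate"

# Illustrative alignment of extractor-specific types onto one reference type,
# mirroring what a shared reference ontology such as NERD makes possible.
TYPE_ALIGNMENT = {
    ("dbpedia_spotlight", "DBpedia:Person"): "nerd:Person",
    ("opencalais", "Person"): "nerd:Person",
}

def annotate(text: str, extractor: str):
    """Send text to one underlying extractor and normalize the result types."""
    resp = requests.post(NERD_ENDPOINT, data={"text": text, "extractor": extractor})
    resp.raise_for_status()
    # Assumed response shape: [{"mention": ..., "type": ..., "uri": ...}, ...]
    # where "uri" carries the disambiguation step (iii) described above.
    entities = resp.json()
    for entity in entities:
        entity["nerdType"] = TYPE_ALIGNMENT.get(
            (extractor, entity["type"]), "nerd:Thing"
        )
    return entities
```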

    Designing Semantic Internet of Things Applications

    According to Cisco's predictions, there will be more than 50 billion devices connected to the Internet by 2020. These devices and the data they produce are mainly exploited to build domain-specific Internet of Things (IoT) applications. From a data-centric perspective, these applications are not interoperable with each other. To assist users, or even machines, in building promising inter-domain IoT applications, the main challenges are to exploit, reuse, interpret, and combine sensor data. To overcome these interoperability issues, we designed the Machine-to-Machine Measurement (M3) framework, which consists of: (1) generating templates to easily build Semantic Web of Things applications, (2) semantically annotating IoT data to infer high-level knowledge, reusing as much of the expert-defined domain knowledge as possible, and (3) a semantic-based security application to assist users in designing secure IoT applications. Regarding the reasoning part, and stemming from the Linked Open Data initiative, we propose an innovative idea called 'Linked Open Rules' to easily share and reuse the rules that infer high-level abstractions from sensor data. The M3 framework has been proposed to standardization bodies and working groups such as ETSI M2M, oneM2M, the W3C SSN ontology, and the W3C Web of Things. Proofs of concept of the flexible M3 framework have been deployed on the cloud (http://www.sensormeasurement.appspot.com/) and embedded on Android-based constrained devices.
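    A minimal sketch, using rdflib, of how step (2) and the Linked Open Rules idea could look in practice: a raw sensor reading is annotated as an RDF observation, and a shareable SPARQL CONSTRUCT rule derives a higher-level abstraction from it. The vocabulary and the 38 °C threshold are invented for illustration; M3 itself reuses expert-defined domain ontologies such as the W3C SSN ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/m3-demo#")  # illustrative vocabulary

g = Graph()
g.bind("ex", EX)

# Semantically annotate a raw sensor reading as an RDF observation.
g.add((EX.reading1, RDF.type, EX.Observation))
g.add((EX.reading1, EX.observedProperty, EX.BodyTemperature))
g.add((EX.reading1, EX.hasValue, Literal(38.7)))

# A "Linked Open Rule": a shareable SPARQL CONSTRUCT query that infers a
# high-level abstraction (fever) from low-level sensor data.
FEVER_RULE = """
PREFIX ex: <http://example.org/m3-demo#>
CONSTRUCT { ?obs ex:indicates ex:Fever . }
WHERE {
  ?obs a ex:Observation ;
       ex:observedProperty ex:BodyTemperature ;
       ex:hasValue ?v .
  FILTER (?v > 38.0)
}
"""

for triple in g.query(FEVER_RULE):
    g.add(triple)  # materialize the inferred knowledge

print(g.serialize(format="turtle"))
```

    Because the rule is itself just a published artifact, another application can fetch and re-run it on its own sensor graph, which is the sharing-and-reuse point of Linked Open Rules.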

    Semantic Selection of Internet Sources through SWRL Enabled OWL Ontologies

    This research examines the problem of Information Overload (IO) and gives an overview of various attempts to resolve it. Furthermore, it argues that instead of fighting IO, it is advisable to learn how to live with it. It is unlikely that in the modern information age, where users are both producers and consumers of information, the amount of data and information generated will decrease. Furthermore, when managing IO, users are confined to the algorithms and policies of commercial Search Engines and Recommender Systems (RSs), whose results also add to IO. This research calls for a change in thinking: giving greater power to users when addressing the relevance and accuracy of internet searches, which helps to manage IO. However powerful search engines are, they do not process enough semantics at the moment search queries are formulated. This research proposes a semantic selection of internet sources through SWRL-enabled OWL ontologies. The research focuses on Semantic Web Technologies (SWT) and their stack because they (a) secure the semantic interpretation of the environments where internet searches take place and (b) guarantee reasoning that results in the selection of suitable internet sources at a particular moment of an internet search. Therefore, it is important to model the behaviour of users through OWL concepts and to reason upon them in order to address IO when searching the internet. Thus, user behaviour is itemized through user preferences, perceptions, and expectations of internet searches. The proposed approach in this research is a Software Engineering (SE) solution which provides computations based on the semantics of the environment stored in the ontological model.
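    A hedged sketch of the kind of modelling described above, using owlready2: user preferences are expressed as OWL individuals, and a SWRL rule selects suitable internet sources. The class and property names are invented for illustration, and evaluating the rule requires a SWRL-capable reasoner (owlready2 bundles Pellet, which needs Java).

```python
from owlready2 import (Thing, ObjectProperty, Imp, get_ontology,
                       sync_reasoner_pellet)

onto = get_ontology("http://example.org/search-demo.owl")  # illustrative IRI

with onto:
    class User(Thing): pass
    class InternetSource(Thing): pass
    class Topic(Thing): pass

    class prefersTopic(ObjectProperty):
        domain = [User]; range = [Topic]

    class coversTopic(ObjectProperty):
        domain = [InternetSource]; range = [Topic]

    class suitableSourceFor(ObjectProperty):
        domain = [InternetSource]; range = [User]

    # SWRL rule: a source covering a topic the user prefers is suitable.
    rule = Imp()
    rule.set_as_rule(
        "User(?u), prefersTopic(?u, ?t), coversTopic(?s, ?t)"
        " -> suitableSourceFor(?s, ?u)"
    )

    # Itemized user behaviour as individuals: one preference, one source.
    alice = User("alice")
    semweb = Topic("semantic_web")
    w3c = InternetSource("w3c_org")
    alice.prefersTopic = [semweb]
    w3c.coversTopic = [semweb]

# Requires Java: owlready2 ships Pellet, which evaluates SWRL rules.
sync_reasoner_pellet(infer_property_values=True)
print(w3c.suitableSourceFor)  # expected: [search-demo.alice]
```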