
    Constitute: The world's constitutions to read, search, and compare

    Constitutional design and redesign is constant. Over the last 200 years, countries have replaced their constitutions on average every 19 years, and some have amended them almost yearly. A basic problem in the drafting of these documents is the search for and analysis of model text deployed in other jurisdictions. Traditionally, this process has been ad hoc and the results suboptimal. As a result, drafters generally lack systematic information about the institutional options and choices available to them. To address this informational need, the investigators developed a web application, Constitute [online at http://www.constituteproject.org], using semantic technologies. Constitute provides searchable access to the world's constitutions using the conceptualization, texts, and data developed by the Comparative Constitutions Project. An OWL ontology represents 330 "topics" (e.g. the right to health) with which the investigators have tagged relevant provisions of nearly all constitutions in force as of September 2013. The tagged texts were then converted to an RDF representation using R2RML mappings and Capsenta's Ultrawrap. The portal implements semantic search features to allow constitutional drafters to read, search, and compare the world's constitutions. The goal of the project is to improve the efficiency and systemization of constitutional design and, thus, to support the independence and self-reliance of constitutional drafters.
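
    To make the tagging-and-search idea above concrete, the following is a minimal sketch using rdflib: a provision is tagged with a topic in a small RDF graph and retrieved with a SPARQL query. The namespace, class, and property names are invented for illustration and are not the actual Constitute/CCP schema.

```python
# Minimal sketch (hypothetical vocabulary, not the project's actual schema):
# a constitution provision tagged with a topic, queried with SPARQL via rdflib.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/constitute#")  # hypothetical namespace

g = Graph()
provision = EX["kenya-2010-art-43"]               # hypothetical identifier
g.add((provision, RDF.type, EX.Provision))
g.add((provision, EX.taggedWith, EX.RightToHealth))
g.add((provision, EX.text, Literal(
    "Every person has the right to the highest attainable standard of health.")))

# Find all provisions tagged with a given topic.
query = """
PREFIX ex: <http://example.org/constitute#>
SELECT ?provision ?text WHERE {
  ?provision ex:taggedWith ex:RightToHealth ;
             ex:text ?text .
}
"""
for row in g.query(query):
    print(row.provision, row.text)
```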

    Knowledge extraction from unstructured data and classification through distributed ontologies

    The World Wide Web has changed the way humans use and share any kind of information. The Web removed several barriers to accessing published information and has become an enormous space where users can easily navigate through heterogeneous resources (such as linked documents) and can easily edit, modify, or produce them. Documents implicitly enclose information and relationships among them that are accessible only to human beings. Indeed, the Web of documents evolved towards a space of data silos, linked to each other only through untyped references (such as hypertext references) that only humans are able to understand. A growing desire to programmatically access the pieces of data implicitly enclosed in documents has characterized the recent efforts of the Web research community. Direct access means structured data, enabling computing machinery to easily exploit the linking of different data sources. It has become crucial for the Web community to provide a technology stack for easing data integration at large scale, first structuring the data using standard ontologies and afterwards linking it to external data. Ontologies became the best practice for defining axioms and relationships among classes, and the Resource Description Framework (RDF) became the basic data model chosen to represent ontology instances (i.e. an instance is a value of an axiom, class, or attribute). Data becomes the new oil; in particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. In the literature these problems have been addressed with several proposals and standards, which mainly focus on technologies to access the data and on formats to represent the semantics of the data and their relationships. With the increasing volume of interconnected and serialized RDF data, RDF repositories may suffer from data overloading and may become a single point of failure for the overall Linked Data vision. One of the goals of this dissertation is to propose a thorough approach to manage large-scale RDF repositories and to distribute them in a redundant and reliable peer-to-peer RDF architecture. The architecture consists of a logic to distribute and mine the knowledge and of a set of physical peer nodes organized in a ring topology based on a Distributed Hash Table (DHT). Each node shares the same logic and provides an entry point that enables clients to query the knowledge base using atomic, disjunctive, and conjunctive SPARQL queries. The consistency of the results is increased using a data redundancy algorithm that replicates each RDF triple in multiple nodes so that, in the case of a peer failure, other peers can retrieve the data needed to resolve the queries. Additionally, a distributed load-balancing algorithm maintains a uniform distribution of the data among the participating peers by dynamically changing the key space assigned to each node in the DHT. Recently, the process of data structuring has gained more and more attention when applied to the large volume of text spread on the Web, such as legacy data, newspapers, scientific papers, or (micro-)blog posts. This process mainly consists of three steps: (i) the extraction from the text of atomic pieces of information, called named entities; (ii) the classification of these pieces of information through ontologies; (iii) the disambiguation of them through Uniform Resource Identifiers (URIs) identifying real-world objects.
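
    To illustrate the DHT-based placement and replication described above, here is a toy sketch: RDF triples are keyed onto a hash ring of peers and replicated on the following successors. The class, peer names, and subject-based placement policy are illustrative assumptions, not the dissertation's actual implementation.

```python
# Illustrative sketch of DHT-style placement of RDF triples on a ring of
# peers with replication; a toy model, not the dissertation's implementation.
import hashlib
from bisect import bisect_right

class TripleRing:
    def __init__(self, peers, replicas=3):
        # Each peer gets a position on the ring derived from its identifier.
        self.replicas = replicas
        self.ring = sorted((self._hash(p), p) for p in peers)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def responsible_peers(self, triple):
        # Key the triple on its subject so all triples about a resource land
        # on the same primary peer (one possible placement policy).
        key = self._hash(triple[0])
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, key) % len(self.ring)
        # Primary peer plus (replicas - 1) successors for redundancy.
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.replicas, len(self.ring)))]

ring = TripleRing(["peer-a", "peer-b", "peer-c", "peer-d"])
triple = ("http://example.org/Alice",
          "http://xmlns.com/foaf/0.1/knows",
          "http://example.org/Bob")
print(ring.responsible_peers(triple))
```
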
As a step towards interconnecting the Web with real-world objects via named entities, different techniques have been proposed. The second objective of this work is to compare these approaches in order to highlight strengths and weaknesses in different scenarios such as scientific papers, newspapers, or user-generated content. We created the Named Entity Recognition and Disambiguation (NERD) web framework, publicly accessible on the Web (through a REST API and a web user interface), which unifies several named entity extraction technologies. Moreover, we proposed the NERD ontology, a reference ontology for comparing the results of these technologies. Recently, the NERD ontology has been included in the NIF (Natural language processing Interchange Format) specification, part of the Creating Knowledge out of Interlinked Data (LOD2) project. Summarizing, this dissertation defines a framework for the extraction of knowledge from unstructured data and its classification via distributed ontologies. A detailed study of the Semantic Web and knowledge extraction fields is proposed to define the issues under investigation in this work. The dissertation then proposes an architecture to tackle the single-point-of-failure issue introduced by the RDF repositories spread across the Web. Although the use of ontologies enables a Web where data is structured and comprehensible by computing machinery, human users may take advantage of it especially for the annotation task. Hence, this work describes an annotation tool for web editing and for audio and video annotation, with a web front-end user interface built on top of a distributed ontology. Furthermore, this dissertation details a thorough comparison of the state of the art of named entity technologies. The NERD framework is presented as a technology that encompasses existing solutions in the named entity extraction field, and the NERD ontology is presented as a reference ontology in the field. Finally, this work highlights three use cases with the purpose of reducing the number of data silos spread across the Web: a Linked Data approach to augment the automatic classification task in a Systematic Literature Review, an application to lift educational data stored in Sharable Content Object Reference Model (SCORM) data silos to the Web of data, and a scientific conference venue enhancer plugged on top of several live data collectors. Significant research efforts have been devoted to combining the efficiency of a reliable data structure with the importance of data extraction techniques. This dissertation opens different research doors that mainly join two research communities: the Semantic Web and the Natural Language Processing communities. The Web provides a considerable amount of data on which NLP techniques may shed light. The use of the URI as a unique identifier may provide one milestone for the materialization of entities lifted from raw text to real-world objects.
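
    As a small illustration of the role of a reference ontology such as the NERD ontology, the sketch below maps extractor-specific entity types onto shared classes so that the output of different extractors can be compared. The extractor names, type labels, and alignments are invented for illustration and do not reproduce the real NERD alignments.

```python
# Toy illustration of aligning extractor-specific entity types to a single
# reference ontology, in the spirit of the NERD ontology; the names and
# mappings below are invented for illustration only.
REFERENCE_ALIGNMENT = {
    ("extractor_a", "PERSON"): "nerd:Person",
    ("extractor_b", "PER"): "nerd:Person",
    ("extractor_a", "GPE"): "nerd:Location",
    ("extractor_b", "LOC"): "nerd:Location",
}

def unify(extractor, raw_annotations):
    """Map (surface form, extractor-specific type) pairs to reference classes."""
    unified = []
    for surface, raw_type in raw_annotations:
        ref_class = REFERENCE_ALIGNMENT.get((extractor, raw_type), "nerd:Thing")
        unified.append({"mention": surface, "type": ref_class})
    return unified

print(unify("extractor_a", [("Tim Berners-Lee", "PERSON"), ("Geneva", "GPE")]))
```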

    Semantic Enrichment for Recommendation of Primary Studies in a Systematic Literature Review

    A Systematic Literature Review (SLR) identifies, evaluates, and synthesizes the literature available for a given topic. This generally requires a significant human workload and carries a subjectivity bias that could affect the results of such a review. Automated document classification can be a valuable tool for recommending the selection of studies. In this article, we propose an automated pre-selection approach based on text mining and semantic enrichment techniques. Each document is first processed by a named entity extractor. The DBpedia URIs coming from the entity-linking process are used as external sources of information. Our system collects the bags of words of those sources and adds them to the initial document. A Multinomial Naive Bayes classifier discriminates whether the enriched document belongs to the positive example set or not. We used an existing manually performed SLR as a benchmark data set. We trained our system with different configurations of relevant documents and evaluated the approach with an empirical assessment. Results show an 18% reduction in the manual workload a human researcher has to spend, while holding a remarkable 95% recall, an important condition given the nature of SLRs. We measured the effect of the enrichment process on the precision of the classifier and observed a gain of up to 5%.
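
    The following is a minimal sketch of the enrichment-plus-classification idea described above, using scikit-learn's CountVectorizer and MultinomialNB. The entity-linking step is stubbed out; in the article's pipeline the appended text would come from the DBpedia resources returned by a named entity extractor, and the training data here is purely illustrative.

```python
# Minimal sketch: each primary study's text is concatenated with the bag of
# words of the (DBpedia) resources linked to it, then classified with
# Multinomial Naive Bayes. The enrichment lookup is a stub.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def enrich(document, linked_abstracts):
    # Append the text of the linked external resources to the document.
    return document + " " + " ".join(linked_abstracts)

# Toy training data: (document, abstracts of linked resources, relevant?)
train = [
    ("systematic review of agile effort estimation",
     ["Agile software development is an iterative approach ..."], 1),
    ("recipe recommendation with deep learning",
     ["Cooking is the practice of preparing food ..."], 0),
]
X = [enrich(doc, abstracts) for doc, abstracts, _ in train]
y = [label for _, _, label in train]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(X, y)
print(clf.predict([enrich("estimating effort in agile projects", [])]))
```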

    Requirements and Use Cases ; Report I on the sub-project Smart Content Enrichment

    In this technical report, we present the results of the first milestone phase of the Corporate Smart Content sub-project "Smart Content Enrichment". We present analyses of the state of the art in the fields covered by the three work packages defined in the sub-project: aspect-oriented ontology development, complex entity recognition, and semantic event pattern mining. We compare the research approaches related to our three research subjects and briefly outline our future work plan.

    Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article

    With the rapid development of the digital humanities (DH) field, demands for historical and cultural heritage data have generated deep interest in the data provided by libraries, archives, and museums (LAMs). In order to enhance the quality and discoverability of LAM data while enabling a self-sustaining ecosystem, "semantic enrichment" has become a strategy increasingly used by LAMs in recent years. This article introduces a number of semantic enrichment methods and efforts that can be applied to LAM data at various levels, aiming to support deeper and wider exploration and use of LAM data in DH research. The real cases, research projects, experiments, and pilot studies shared in this article demonstrate the broad potential of LAM data, whether structured, semi-structured, or unstructured, regardless of what types of original artifacts carry the data. Following their roadmaps would encourage more effective initiatives and strengthen this effort to maximize LAM data's discoverability, usability and reusability, and its value in the mainstream of DH and the Semantic Web.

    Enrichment of the DBpedia NIF dataset

    DBpedia is a crowd-sourced community effort which aims at extracting information from Wikipedia articles and providing this information in a machine-readable format. The DBpedia NIF dataset provides the content of all Wikipedia articles in 128 languages. The ultimate goal of the thesis is to enrich the dataset with additional information, where the main challenge is the size of the dataset. The implementation comprises pre-processing the dataset by segregating the contents of individual Wikipedia articles into separate files, as the NIF dataset contains the contents of all the articles in one huge file. This text pre-processing makes the dataset usable for training different natural language processing tasks. The pre-processing is followed by the NLP tasks themselves, namely sentence splitting, tokenization, and part-of-speech tagging. The work also contributes to the DBpedia community by adding additional links to the Wikipedia articles. Finally, the results are evaluated and their correctness is checked statistically.
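
    The per-article NLP steps mentioned above (sentence splitting, tokenization, part-of-speech tagging) could look roughly like the sketch below, using NLTK as one possible toolkit and assuming the article text has already been segregated from the NIF dump; this is not the thesis's actual code.

```python
# Sketch of the per-article NLP steps: sentence splitting, tokenization,
# and part-of-speech tagging with NLTK. Assumes the article text has already
# been extracted from the NIF dump into its own file/string.
import nltk

# Resource names vary across NLTK versions; unknown ones are simply skipped.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

article_text = ("DBpedia extracts structured content from Wikipedia. "
                "The data is published as Linked Data.")

for sentence in nltk.sent_tokenize(article_text):
    tokens = nltk.word_tokenize(sentence)   # tokenization
    tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
    print(tagged)
```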