Search CORE

366 research outputs found

PaaS Cloud Service for Cost-Effective Harvesting, Processing and Linking of Unstructured Open Government Data

Author: Toompuu Mailis
Publication venue
Publication date: 01/01/2015
Field of study

Selle projekti eesmärk on luua pilveteenus, mis võimaldaks struktueerimata avalike andmete töötlemist, selleks, et luua semantiline andmete (veebis olevatest dokumentidest leitud organisatsioonide, kohanimede ja isikunimede) ressursikirjeldusraamistiku - Resource Description Framework (RDF) - graaf, mis on ka masinloetav. Pilveteenus saab sisendiks veebiroomaja toodetud logifaili üle 3 miljoni reaga. Igal real on veebiaadress avalikule dokumendile, mis avatakse, loetakse ning kasutades - tööriista eestikeelsest tekstist nimeolemite leidmiseks- Estnltk-d, eraldatakse organisatsiooonide ja kohtade nimetused ja inimeste nimed. Seejärel lisatakse leitud nimed/nimetused RDF graafi, kasutades olemasolevat Pythoni teeki RDFlib. RDF graafis nimed/nimetused lingitakse nende veebiaadressidega, kus asub seda nime/nimetust sisaldav avalik dokument. Dokumendid arhiveeritakse lugemise hetkel neis olnud sisuga. Lisaks sisaldab teenus igakuist andmete ülekontrollimist, et tuvastada dokumentide muutusi ja vajadusel värskendada RDF graafe. Genereeritud RDF graafe kasutatakse SPARQL päringute tegemiseks, mida saavad teha kasutajad graafilise kasutajaliidese kaudu või masinad veebiteenust kasutades. Projekti oluline väljakutse on luua arhitektuur, mis töötleks andmeid võimalikult kiiresti, sest sisendfail on suur (test-logifailis on üle 3 miljoni rea, kus igal real olev URL võib viidata mahukale dokumendile). Selleks jooksutab teenus seal kus võimalik, protsesse paralleelselt, kasutades Google’i virtuaalmasinaid (Google Compute Engine) ja iga virtuaalmasina kõiki protsessoreid.The aim of this project is to develop a cloud platform service for transforming Open Government Data to Linked Open Government Data. This service receives log file, created by web crawler, with URLs (over 3000000) to some open document as an input. It then opens the document, reads its content and with using "Open source tools for Estonian natural language processing" (Estnltk), finds names of locations, organizations and people. Using Psython library "RDFlib", these names are added to the Resource Description Framework (RDF) graph, so that the names become linked to the URLs that refer to the documents. In order to archive current state of accessed document, this service downloads all processed documents. The service also enables monthly updates system of the already processed documents in order to generate new RDF relations if some of the documents have changed. Generated RDFs are publicly available and the service includes SPARQL endpoint for userss (graphical user interface) and machines (web services) for cost-effective querying of linked entities from the RDF files. An important challenge of this service is to speed up its performance, because the documents behind these 3+ billion URLs may be large. To achieve that, parallel processes are run where possible: using several virtual machines and all CPUs in a virtual machine. This is tested in Google Compute Engin

DSpace at Tartu University Library

Statistical Extraction of Multilingual Natural Language Patterns for RDF Predicates: Algorithms and Applications

Author: Gerber Daniel
Publication venue
Publication date: 07/06/2016
Field of study

The Data Web has undergone a tremendous growth period. It currently consists of more then 3300 publicly available knowledge bases describing millions of resources from various domains, such as life sciences, government or geography, with over 89 billion facts. In the same way, the Document Web grew to the state where approximately 4.55 billion websites exist, 300 million photos are uploaded on Facebook as well as 3.5 billion Google searches are performed on average every day. However, there is a gap between the Document Web and the Data Web, since for example knowledge bases available on the Data Web are most commonly extracted from structured or semi-structured sources, but the majority of information available on the Web is contained in unstructured sources such as news articles, blog post, photos, forum discussions, etc. As a result, data on the Data Web not only misses a significant fragment of information but also suffers from a lack of actuality since typical extraction methods are time-consuming and can only be carried out periodically. Furthermore, provenance information is rarely taken into consideration and therefore gets lost in the transformation process. In addition, users are accustomed to entering keyword queries to satisfy their information needs. With the availability of machine-readable knowledge bases, lay users could be empowered to issue more specific questions and get more precise answers. In this thesis, we address the problem of Relation Extraction, one of the key challenges pertaining to closing the gap between the Document Web and the Data Web by four means. First, we present a distant supervision approach that allows finding multilingual natural language representations of formal relations already contained in the Data Web. We use these natural language representations to find sentences on the Document Web that contain unseen instances of this relation between two entities. Second, we address the problem of data actuality by presenting a real-time data stream RDF extraction framework and utilize this framework to extract RDF from RSS news feeds. Third, we present a novel fact validation algorithm, based on natural language representations, able to not only verify or falsify a given triple, but also to find trustworthy sources for it on the Web and estimating a time scope in which the triple holds true. The features used by this algorithm to determine if a website is indeed trustworthy are used as provenance information and therewith help to create metadata for facts in the Data Web. Finally, we present a question answering system that uses the natural language representations to map natural language question to formal SPARQL queries, allowing lay users to make use of the large amounts of data available on the Data Web to satisfy their information need

Qucosa - Publikationsserver der Universität Leipzig

Automatic Extraction and Assessment of Entities from the Web

Author: Urbansky David
Publication venue
Publication date: 15/10/2012
Field of study

The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time consuming for a user to ﬁnd all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The ﬁndings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75–90 % (facts and entities respectively) after using assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research ﬁelds, such as question answering, named entity recognition, and information retrieval

Technische Universität Dresden: Qucosa

Knowledge-Based Techniques for Scholarly Data Access: Towards Automatic Curation

Author: De Nart Dario
Publication venue: Università degli Studi di Udine
Publication date: 03/04/2017
Field of study

Accessing up-to-date and quality scientific literature is a critical preliminary step in any research activity. Identifying relevant scholarly literature for the extents of a given task or application is, however a complex and time consuming activity. Despite the large number of tools developed over the years to support scholars in their literature surveying activity, such as Google Scholar, Microsoft Academic search, and others, the best way to access quality papers remains asking a domain expert who is actively involved in the field and knows research trends and directions. State of the art systems, in fact, either do not allow exploratory search activity, such as identifying the active research directions within a given topic, or do not offer proactive features, such as content recommendation, which are both critical to researchers. To overcome these limitations, we strongly advocate a paradigm shift in the development of scholarly data access tools: moving from traditional information retrieval and filtering tools towards automated agents able to make sense of the textual content of published papers and therefore monitor the state of the art. Building such a system is however a complex task that implies tackling non trivial problems in the fields of Natural Language Processing, Big Data Analysis, User Modelling, and Information Filtering. In this work, we introduce the concept of Automatic Curator System and present its fundamental components.openDottorato di ricerca in InformaticaopenDe Nart, Dari

Archivio istituzionale della ricerca - Università degli Studi di Udine

Extracting Temporal Expressions from Unstructured Open Resources

Author: Valizadeh Ardalan Zagros
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2015
Field of study

AETAS is an end-to-end system with SOA approach that retrieves plain text data from web and blog news and represents and stores them in RDF, with a special focus on their temporal dimension. The system allows users to acquire, browse and query Linked Data obtained from unstructured sources

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Report of the Stanford Linked Data Workshop

Author: Calter Mimi
Glaser Hugh
Keller Michael A
Persons Jerry
Publication venue: Council on Library and Information Resources
Publication date: 01/10/2011
Field of study

The Stanford University Libraries and Academic Information Resources (SULAIR) with the Council on Library and Information Resources (CLIR) conducted at week-long workshop on the prospects for a large scale, multi-national, multi-institutional prototype of a Linked Data environment for discovery of and navigation among the rapidly, chaotically expanding array of academic information resources. As preparation for the workshop, CLIR sponsored a survey by Jerry Persons, Chief Information Architect emeritus of SULAIR that was published originally for workshop participants as background to the workshop and is now publicly available. The original intention of the workshop was to devise a plan for such a prototype. However, such was the diversity of knowledge, experience, and views of the potential of Linked Data approaches that the workshop participants turned to two more fundamental goals: building common understanding and enthusiasm on the one hand and identifying opportunities and challenges to be confronted in the preparation of the intended prototype and its operation on the other. In pursuit of those objectives, the workshop participants produced:1. a value statement addressing the question of why a Linked Data approach is worth prototyping;2. a manifesto for Linked Libraries (and Museums and Archives and …);3. an outline of the phases in a life cycle of Linked Data approaches;4. a prioritized list of known issues in generating, harvesting & using Linked Data;5. a workflow with notes for converting library bibliographic records and other academic metadata to URIs;6. examples of potential “killer apps” using Linked Data: and7. a list of next steps and potential projects.This report includes a summary of the workshop agenda, a chart showing the use of Linked Data in cultural heritage venues, and short biographies and statements from each of the participants

Southampton (e-Prints Soton)

Mining Meaning from Wikipedia

Author: Legg Catherine
Medelyan Olena
Milne David
Witten Ian H.
Publication venue
Publication date: 01/01/2008
Field of study

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.Comment: An extensive survey of re-using information in Wikipedia in natural language processing, information retrieval and extraction and ontology building. Accepted for publication in International Journal of Human-Computer Studie

arXiv.org e-Print Archive

CiteSeerX

Deakin Research Online

Research Commons@Waikato

Semantic Systems. The Power of AI and Knowledge Graphs

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies

OAPEN Library

Improving data management through automatic information extraction model in ontology for road asset management

Author: Lei Xiang
Publication venue: Curtin University
Publication date: 01/01/2023
Field of study

lRoads are a critical component of transportation infrastructure, and their effective maintenance is paramount in ensuring their continued functionality and safety. This research proposes a novel information management approach based on state-of-the-art deep learning models and ontologies. The approach can automatically extract, integrate, complete, and search for project knowledge buried in unstructured text documents. The approach on the one hand facilitates implementation of modern management approaches, i.e., advanced working packaging to delivery success road management projects, on the other hand improves information management practices in the construction industry

espace@Curtin

Linked Open Data - Creating Knowledge Out of Interlinked Data: Results of the LOD2 Project

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Database Management; Artificial Intelligence (incl. Robotics); Information Systems and Communication Servic

OAPEN Library