Information Extraction Techniques for the Purposes of Semantic Indexing of Archaeological Resources
The paper describes the use of Information Extraction (IE), a Natural Language Processing (NLP) technique, to assist ‘rich’ semantic indexing of diverse archaeological text resources. Such unpublished online documents are often referred to as ‘Grey Literature’. Established document indexing techniques are not sufficient to satisfy user information needs that extend beyond simple term-matching search. The focus of the research is to direct a semantic-aware ‘rich’ indexing of diverse natural-language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project in the UoG Hypermedia Research Unit.
The study proposes the use of knowledge resources and conceptual models to assist an Information Extraction process able to provide ‘rich’ semantic indexing of archaeological documents capable of resolving linguistic ambiguities of indexed terms. CIDOC CRM-EH, a standard core ontology in cultural heritage, and the English Heritage (EH) Thesauri for archaeological concepts are employed to drive the Information Extraction process and to support the aims of a semantic framework in which indexed terms support semantic-aware access to online resources. The paper describes the process of semantic indexing of archaeological concepts (periods and finds) in a corpus of 535 grey-literature documents using a rule-based Information Extraction technique facilitated by the General Architecture for Text Engineering (GATE) toolkit and expressed as Java Annotation Patterns Engine (JAPE) rules. Illustrative examples demonstrate the different stages of the process.
Initial results suggest that the combination of information extraction with knowledge resources and standard core conceptual models is capable of supporting semantic-aware and linguistically disambiguated term indexing.
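The matching the abstract describes is implemented in GATE with JAPE rules; as a rough illustration of the underlying idea only, a gazetteer-driven matcher can be sketched in Python. The period and find terms below are invented stand-ins, not entries from the EH Thesauri:

```python
import re

# Hypothetical miniature gazetteers standing in for the EH Thesauri;
# the real project drives extraction from CIDOC CRM-EH and the full
# English Heritage vocabularies.
PERIODS = {"roman", "iron age", "bronze age", "medieval"}
FINDS = {"coin", "pottery", "brooch", "flint"}

def annotate(text):
    """Tag period and find terms, roughly as a JAPE gazetteer lookup would."""
    annotations = []
    for label, gazetteer in (("Period", PERIODS), ("Find", FINDS)):
        for term in gazetteer:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                annotations.append((m.start(), m.end(), label, m.group(0)))
    return sorted(annotations)

print(annotate("A Roman coin and Iron Age pottery were recovered."))
```

A real JAPE phase would additionally consult the ontology to disambiguate terms in context; this sketch only shows the lookup step.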
Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration
Building document-grounded dialogue systems has received growing interest, as documents convey a wealth of human knowledge and commonly exist in enterprises. In this setting, how to comprehend and retrieve information from documents is a challenging research problem. Previous work ignores the visual properties of documents and treats them as plain text, resulting in incomplete modality. In this paper, we propose a Layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs), so as to generate accurate responses in dialogue systems. LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents, becoming, to the best of our knowledge, the largest VRD-based information extraction dataset. We also develop benchmark methods that extend the token-based language model to consider layout features as humans do. Empirical results show that layout is critical for VRD-based extraction, and a system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
Comment: Accepted to ACM Multimedia (MM) Industry Track 202
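The paper's benchmark methods are not detailed here, but the core idea of layout-aware extraction, pairing each token with its page coordinates before classification, can be sketched minimally. The feature choices, placeholder "embedding", and page dimensions below are assumptions for illustration, not the authors' method:

```python
# A minimal sketch of the layout-aware idea: each token carries its
# bounding box, and normalised layout features are concatenated with the
# token representation. The lexical features are cheap placeholders; a
# real system would use a pretrained language model's embeddings.

def layout_features(box, page_width, page_height):
    """Normalise an (x0, y0, x1, y1) bounding box to the [0, 1] range."""
    x0, y0, x1, y1 = box
    return (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)

def token_vector(token, box, page_width=612, page_height=792):
    # Placeholder "embedding": a couple of cheap lexical features.
    text_feats = (len(token) / 20.0, float(token.istitle()))
    return text_feats + layout_features(box, page_width, page_height)

vec = token_vector("Price", (306, 40, 360, 52))
print(vec)
```

The point of the sketch is only that position on the page becomes part of the token's representation, so a classifier can learn, for example, that values often sit to the right of their keys.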
Structuring and extracting knowledge for the support of hypothesis generation in molecular biology
Background: Hypothesis generation in molecular and cellular biology is an empirical process in which knowledge derived from prior experiments is distilled into a comprehensible model. The requirement of automated support is exemplified by the difficulty of considering all relevant facts that are contained in the millions of documents available from PubMed. The Semantic Web provides tools for sharing prior knowledge, while information retrieval and information extraction techniques enable its extraction from literature. Their combination makes prior knowledge available for computational analysis and inference. While some tools provide complete solutions that limit the control over the modeling and extraction processes, we seek a methodology that supports control by the experimenter over these critical processes. Results: We describe progress towards automated support for the generation of biomolecular hypotheses. Semantic Web technologies are used to structure and store knowledge, while a workflow extracts knowledge from text. We designed minimal proto-ontologies in OWL for capturing different aspects of a text mining experiment: the biological hypothesis, text and documents, text mining, and workflow provenance. The models fit a methodology that allows focus on the requirements of a single experiment while supporting reuse and posterior analysis of extracted knowledge from multiple experiments. Our workflow is composed of services from the 'Adaptive Information Disclosure Application' (AIDA) toolkit as well as a few others. The output is a semantic model with putative biological relations, with each relation linked to the corresponding evidence. Conclusion: We demonstrated a 'do-it-yourself' approach for structuring and extracting knowledge in the context of experimental research on biomolecular mechanisms. The methodology can be used to bootstrap the construction of semantically rich biological models using the results of knowledge extraction processes.
Models specific to particular experiments can be constructed that, in turn, link with other semantic models, creating a web of knowledge that spans experiments. Mapping mechanisms can link to other knowledge resources such as OBO ontologies or SKOS vocabularies. AIDA Web Services can be used to design personalized knowledge extraction procedures. In our example experiment, we found three proteins (NF-kappaB, p21, and Bax) potentially playing a role in the interplay between nutrients and epigenetic gene regulation.
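The output described, putative relations each linked to their textual evidence, can be illustrated with a minimal in-memory store. The relation and evidence string below are invented examples, not project data, and the real system stores such statements as Semantic Web models rather than Python tuples:

```python
# A toy illustration of the output format: putative biological relations
# stored as (subject, predicate, object) statements, each linked to the
# evidence text that supports it.

class EvidenceStore:
    def __init__(self):
        self.triples = []  # (subject, predicate, object, evidence)

    def add(self, s, p, o, evidence):
        self.triples.append((s, p, o, evidence))

    def evidence_for(self, s, p, o):
        """Return every evidence snippet recorded for one relation."""
        return [ev for (s2, p2, o2, ev) in self.triples
                if (s2, p2, o2) == (s, p, o)]

store = EvidenceStore()
store.add("p21", "regulated_by", "nutrient_X",
          "p21 expression changed after nutrient_X exposure (invented example sentence).")
print(store.evidence_for("p21", "regulated_by", "nutrient_X"))
```

Keeping the evidence alongside each relation is what lets an experimenter audit an extracted model instead of trusting the pipeline blindly.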
Literature-driven Curation for Taxonomic Name Databases
Digitized biodiversity literature provides a wealth of content for the use of biodiversity knowledge by machines. However, identifying taxonomic names and the associated semantic metadata is a difficult and labour-intensive process. We present a system to support human-assisted creation of semantic metadata. Information extraction techniques automatically identify taxonomic names from scanned documents. They are then presented to users for manual correction or verification. The tools that support the curation process include taxonomic name identification and mapping, and community-driven taxonomic name verification. Our research shows the potential for these information extraction techniques to support research and curation in disciplines dependent upon scanned documents.
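In its simplest form, the automatic identification step could propose candidate Latin binomials by pattern matching before handing them to human curators. The sketch below assumes a naive "Genus species" pattern, not the system's actual method, which would draw on curated name lists and fuzzy matching of OCR errors:

```python
import re

# Rough candidate detector for Latin binomials ("Genus species"): a
# capitalised genus word followed by a lowercase epithet of 3+ letters.
# Deliberately over-simple; it will miss abbreviations ("H. sapiens")
# and admit some false positives for human reviewers to reject.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")

def candidate_names(text):
    return [" ".join(m.groups()) for m in BINOMIAL.finditer(text)]

print(candidate_names("Specimens of Homo sapiens and Quercus robur were listed."))
```

The point of such a high-recall first pass is that manual verification, as the abstract describes, cleans up what the pattern cannot decide.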
Content-based video indexing for the support of digital library search
Presents a digital library search engine that combines the efforts of the AMIS and DMW research projects, each covering significant parts of the problem of finding the required information in an enormous mass of data. The most important contributions of our work are the following: (1) We demonstrate a flexible solution for the extraction and querying of metadata from multimedia documents in general. (2) Scalability and efficiency support are illustrated for full-text indexing and retrieval. (3) We show how, for a more limited domain, like an intranet, conceptual modelling can offer additional and more powerful query facilities. (4) In the limited-domain case, we demonstrate how domain knowledge can be used to interpret low-level features into semantic content. In this short description, we focus on the first and fourth items.
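The full-text indexing and retrieval mentioned in item (2) rests on an inverted index mapping terms to the documents containing them. A minimal sketch, with invented clip identifiers standing in for indexed video transcripts:

```python
from collections import defaultdict

# A minimal inverted index: term -> set of documents containing it.
# Real engines add tokenisation, stemming, and ranking on top.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index({
    "clip1": "interview about football results",
    "clip2": "news results and weather",
})
print(sorted(index["results"]))
```

Because lookup is a single dictionary access per query term, this structure is what makes the scalability claims of full-text retrieval plausible.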
Chatbots4Mobile: Feature-oriented knowledge base generation using natural language
Chatbots4Mobile is a research project from the GESSI research group (UPC-BarcelonaTech) which aims at designing and developing a task-oriented, knowledge-based conversational agent to support mobile users in the process of managing and integrating the functionalities exposed by their own application portfolio. To support the design of the required knowledge base, the project focuses on the application of Natural Language Processing (NLP) techniques to infer extended knowledge about the features exposed by a subset of mobile applications, including feature extraction from app-related documents, syntactic and semantic similarity analysis between features, and intent/entity classification focused on functionality identification from user requests. As next steps, we are focusing on the evaluation of embedded linguistic knowledge in large language models, as well as the application of granular sentiment analysis techniques to discern biased documents and process sentiment-based user feedback.
With the support from the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. This paper has been funded by the Spanish Ministerio de Ciencia e Innovación under project / funding scheme PID2020-117191RB-I00 / AEI/10.13039/501100011033.
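The intent classification step could, in its crudest form, be bootstrapped from keyword cues mapped to app features. The intents and cue words below are illustrative assumptions, not drawn from the project's knowledge base, which infers features with far richer NLP:

```python
# A naive intent classifier of the kind a task-oriented assistant might
# start from: count keyword-cue overlaps per intent, pick the best.
# Intents and cue sets are invented for illustration.
INTENT_CUES = {
    "send_message": {"send", "message", "text"},
    "set_alarm": {"alarm", "wake", "remind"},
}

def classify_intent(utterance):
    tokens = set(utterance.lower().split())
    scores = {intent: len(tokens & cues) for intent, cues in INTENT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify_intent("please send a text to Anna"))
```

A production agent would replace the cue sets with a trained intent/entity classifier, but the mapping from utterance to a feature in the knowledge base is the same shape.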
A hybrid NLP & semantic knowledgebase approach for the intelligent exploration of Arabic documents
In the contemporary era, a colossal amount of information is published daily on the Web in the form of articles, documents, reviews, blogs and social media posts. As most of this data is available in the form of unstructured documents, it is challenging and time-consuming to extract non-trivial, previously unknown, and potentially useful knowledge from the published documents. Hence, extracting useful knowledge from unstructured text, i.e., Information Extraction, is becoming an increasingly significant aspect of knowledge discovery.
This work focuses on Information Extraction from Arabic unstructured text, which is an especially challenging task as Arabic is a highly inflectional and derivational language. The problem is compounded by the lack of mature tools and advanced research in Arabic Natural Language Processing (NLP) in comparison to, for instance, European languages.
The principal objective of this research work is to present a comprehensive methodology for integrating domain knowledge with Natural Language Processing techniques that have proven effective in solving most classification problems, in order to improve the Information Extraction process for online unstructured data. The importance of NLP tools lies in the key role they play in allowing semantic concept tagging of unstructured text, and so in realizing the Semantic Web. This work presents a novel rule-based approach that uses linguistic grammar-based techniques to extract Arabic composite names from Arabic text. Our approach uniquely exploits the genitive Arabic grammar rules; in particular, the rules regarding the identification of definite nouns (معرفة) and indefinite nouns (نكرة), to support the process of extracting composite names. Furthermore, this approach does not place any constraints on the length of the Arabic composite name. The results of our experiments show an improvement in recognizing Arabic composite name entities in Arabic text.
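One surface cue related to the definite/indefinite distinction can be sketched very crudely: a noun bearing the definite article "ال" is definite (معرفة), while a bare noun is a candidate indefinite (نكرة). Real extraction requires proper morphological analysis, since "ال" can also be part of a word's root; the prefix check below is only illustrative, not the thesis's method:

```python
# Extremely rough definiteness cue: does the word carry the Arabic
# definite article "ال" as a prefix? A real system would use a
# morphological analyser to avoid words whose root merely begins with
# these letters.
def is_definite(word):
    return word.startswith("ال")

# "الكتاب" = "the book" (definite); "كتاب" = "a book" (indefinite).
words = ["الكتاب", "كتاب"]
print([(w, is_definite(w)) for w in words])
```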
Our research also contributes a novel, knowledge-based approach to relation extraction from unstructured Arabic text, based on the principles of Functional Discourse Grammar (FDG). We further improve the approach by integrating it with Machine Learning relation classification, resulting in a hybrid relation extraction algorithm that can handle especially complex Arabic sentence structures. The accuracy of our relation classification was evaluated extensively through experiments that evidenced the accuracy of the FDG relation extraction approach and the improvement gained by the Machine Learning integration.
The essential NLP algorithms of entity recognition and relation extraction were deployed in a Semantic Knowledge-base that was built from the outset to model the knowledge of the problem domain. The semantic modelling of the knowledge-base aided in improving the accuracy of the NLP algorithms by leveraging relevant domain knowledge published in Linked Open Datasets. Moreover, the extracted information was semantically tagged and inserted into the Semantic Knowledge-base, which facilitated building advanced rules to infer new, interesting information from the extracted knowledge, as well as utilising advanced query mechanisms for intelligently exploring the mined problem-domain knowledge.