3 research outputs found

    The application of ontologies in digital library: a meta-synthesis approach

    Objective: The present study examines the current status of ontology use in the digital library area through an analysis of studies in this field. Methodology: This is a qualitative study using the meta-synthesis method. Data were collected using the library method and analyzed following the seven-step meta-synthesis process of Sandelowski & Barroso. The research population includes related studies (articles and dissertations) on ontology applications in digital libraries retrieved from scientific databases. The CASP evaluation checklist was used to ensure the quality of the studies. Out of 267 retrieved studies, 43 were selected and analyzed. Findings: The analysis of studies on ontology application in digital libraries led to the identification of 4 categories, 8 components, and 48 dimensions in this field. The main categories are the application of ontology in digital library services, the application of ontology in digital library structures, the basis of ontology application in digital libraries, and the application of ontologies in covering the subject domain of digital libraries. Originality: This study, which appears to be the first of its kind, presents a comprehensive analysis of ontology application in digital libraries, its current state, and its dimensions. By highlighting the topics that have received less attention, it also suggests new research subjects for researchers in this field.

    Named entity extraction from Portuguese web text

    In the context of Natural Language Processing, the Named Entity Recognition (NER) task focuses on extracting and classifying named entities from free text, such as news, which usually has a particular phrasing structure. Entity detection supports more complex tasks, such as Relation Extraction or Entity-Oriented Search, for instance in the ANT search engine.

    There are some NER tools focused on the Portuguese language, such as Palavras or NERP-CRF, but their F-measure is still below that obtained by other available tools, for instance tools based on an annotated English corpus and trained with Stanford CoreNLP or OpenNLP.

    ANT is an entity-oriented search engine for the University of Porto (UP). This search system indexes the information available in SIGARRA, the information system of the UP. It currently uses handcrafted selectors, based on XPath or CSS, to extract entities; these are dependent on the structure of the page. Furthermore, they do not work on free text, especially on SIGARRA's news. A machine learning method allows for the automation of the extraction task, making it scalable and structure-independent while lowering the required effort and time.

    In this dissertation, I evaluate existing NER tools in order to select the best approach and configuration for the Portuguese language, particularly in the domain of SIGARRA's news. The evaluation was based on two datasets, the HAREM collection and a manually annotated subset of SIGARRA's news, which were used to assess the tools' performance using precision, recall, and F-measure. Expanding the existing knowledge base will help index SIGARRA's pages, providing a richer entity-oriented search experience with new information, as well as a better ranking scheme based on the additional context made available to the search engine. The scientific community also benefits from this work, with several detailed manuals resulting from the systematic analysis of the available tools, in particular for the Portuguese language.

    First, I carried out an out-of-the-box performance analysis of the selected tools (Stanford CoreNLP, OpenNLP, spaCy, and NLTK) on the HAREM dataset, obtaining the best results with Stanford CoreNLP, followed by OpenNLP. Then, I performed a hyperparameter study to select the best configuration for each tool, achieving better-than-default results in each case, most notably for NLTK's Maximum Entropy classifier, whose F-measure increased from 1.11% to 35.24%. Finally, using the best configuration, I repeated the training with the SIGARRA News Corpus, achieving F-measures as high as 86.64% for Stanford CoreNLP. Given that Stanford CoreNLP was also the out-of-the-box winner, I conclude that it is the best option for this particular context.
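
    As a minimal sketch of the evaluation protocol above, assuming entities are compared as exact (start, end, label) spans, the following Python computes precision, recall, and F-measure for one tool's output against a gold annotation; the function and the toy spans are illustrative, not taken from the dissertation.

        # Span-level NER evaluation: exact-match precision, recall and F-measure.
        # Entity spans are assumed to be (start, end, label) tuples; the example
        # annotations below are hypothetical.

        def evaluate_ner(gold_entities, predicted_entities):
            """Compare predicted entity spans against gold spans."""
            gold = set(gold_entities)
            pred = set(predicted_entities)
            true_positives = len(gold & pred)

            precision = true_positives / len(pred) if pred else 0.0
            recall = true_positives / len(gold) if gold else 0.0
            f_measure = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
            return precision, recall, f_measure

        # Toy example: two gold entities; the tool finds one of them plus a spurious one.
        gold = [(0, 21, "ORG"), (40, 47, "LOC")]
        pred = [(0, 21, "ORG"), (55, 62, "PER")]
        p, r, f = evaluate_ner(gold, pred)
        print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")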

    A framework for automatic population of ontology-based digital libraries

    Maintaining updated ontology-based digital libraries faces two main issues. First, documents are often unstructured and stored in heterogeneous data formats, which makes it difficult to extract information from them and to search through them. Second, manual ontology population is time-consuming, so automatic methods to support this process are needed. In this paper, we present an ontology-based framework for populating ontologies. In particular, we propose an approach for triplet extraction from heterogeneous and unstructured documents in order to automatically populate ontology-based digital libraries. Finally, we evaluate the proposed framework on a real-world case study.
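
    The population step can be sketched as follows, assuming triplets have already been extracted from the documents as (subject, predicate, object) strings. The rdflib library is used here as one possible backend; the namespace, class name, and sample triplets are hypothetical rather than taken from the paper.

        # Populate an ontology-backed digital library from extracted triplets
        # (rdflib >= 6; hypothetical namespace and data).
        from rdflib import Graph, Literal, Namespace, RDF

        EX = Namespace("http://example.org/dl#")  # assumed ontology namespace

        def populate(graph, triplets):
            """Add (subject, predicate, object) string triplets as individuals."""
            for subj, pred, obj in triplets:
                s = EX[subj.replace(" ", "_")]
                graph.add((s, RDF.type, EX.Document))  # assumed ontology class
                graph.add((s, EX[pred], Literal(obj)))

        g = Graph()
        g.bind("ex", EX)
        populate(g, [("Paper 42", "hasAuthor", "A. Author"),
                     ("Paper 42", "hasTopic", "ontology population")])
        print(g.serialize(format="turtle"))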