169 research outputs found

Toponym Disambiguation in Information Retrieval

Author: Buscaldi Davide
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 12/11/2010

In recent years, geography has acquired great importance in the context of Information Retrieval (IR) and, more generally, in the automated processing of information in text. Mobile devices that can surf the web and at the same time report their position are now commonplace, together with applications that exploit this data to provide users with locally customised information, such as directions or advertisements. It is therefore important to deal properly with the geographic information included in electronic texts. Most of this information takes the form of place names, or toponyms. Toponym ambiguity is an important issue in Geographical Information Retrieval (GIR), because queries are geographically constrained. It has proved difficult to find specific geographical IR methods that actually outperform traditional IR techniques. Toponym ambiguity may be a relevant factor in the inability of current GIR systems to take advantage of geographical knowledge. Recently, some Ph.D. theses have dealt with Toponym Disambiguation (TD) from different perspectives, from the development of resources for the evaluation of Toponym Disambiguation (Leidner (2007)) to the use of TD to improve geographical scope resolution (Andogah (2010)). The Ph.D. thesis presented here introduces a TD method based on WordNet and carries out a detailed study of the relationship of Toponym Disambiguation to some IR applications, such as GIR, Question Answering (QA) and Web retrieval. The work presented in this thesis starts with an introduction to the applications in which TD may prove useful, together with an analysis of the ambiguity of toponyms in news collections. It would not be possible to study the ambiguity of toponyms without studying the resources that are used as placename repositories; these resources are the equivalent of language dictionaries, which provide the different meanings of a given word.
Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912

RiuNet

An Ensemble Method Based on the Combination of Transformers with Convolutional Neural Networks to Detect Artificially Generated Text

Author: Buscaldi Davide
Liyanage Vijini
Publication date: 26/10/2023

Thanks to state-of-the-art Large Language Models (LLMs), language generation has reached outstanding levels. These models are capable of generating high-quality content, which makes it challenging to distinguish generated text from human-written content. Despite the advantages provided by Natural Language Generation, the inability to distinguish automatically generated text can raise ethical concerns in terms of authenticity. Consequently, it is important to design and develop methodologies to detect artificial content. In our work, we present some classification models constructed by ensembling transformer models such as SciBERT, DeBERTa and XLNet with Convolutional Neural Networks (CNNs). Our experiments demonstrate that the considered ensemble architectures surpass the performance of the individual transformer models for classification. Furthermore, the proposed SciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared task 2023 data.
Comment: In Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association (ALTA 2023)

arXiv.org e-Print Archive

Passage retrieval in legal texts

Author: Buscaldi Davide
Correa García Santiago
Rosso Paolo
Publication venue: 'Elsevier BV'
Publication date: 17/03/2011

Legal texts usually comprise many kinds of texts, such as contracts, patents and treaties. These texts usually include a huge quantity of unstructured information written in natural language. Thanks to automatic analysis and Information Retrieval (IR) techniques, it is possible to filter out information that is not relevant and, therefore, to reduce the number of documents that users need to browse to find the information they are looking for. In this paper we adapted the JIRS passage retrieval system to work with three kinds of legal texts: treaties, patents and contracts, studying the issues related to the processing of this kind of information. In particular, we studied how a passage retrieval system might be linked up to automated analysis based on logic and algebraic programming for the detection of conflicts in contracts. In our set-up, a contract is translated into formal clauses, which are analysed by means of a model checking tool; then, the passage retrieval system is used to extract conflicting sentences from the original contract text. © 2011 Elsevier Inc. All rights reserved.
We thank the MICINN (Plan I+D+i) TEXT-ENTERPRISE 2.0 (TIN2009-13391-C04-03) research project. The work of the second author has been possible thanks to a scholarship funded by Maat Gknowledge in the framework of the project with the Universidad Politécnica de Valencia, Módulo de servicios semánticos de la plataforma G
Rosso, P.; Correa García, S.; Buscaldi, D. (2011). Passage retrieval in legal texts. Journal of Logic and Algebraic Programming. 80(3-5):139-153. doi:10.1016/j.jlap.2011.02.001

Elsevier - Publisher Connector

RiuNet

Construction d'ontologies à partir d'une collection de pages web structurées

Author: Aussenac-Gilles Nathalie
Buscaldi Davide
Comparot Catherine
Kamel Mouna
Publication venue: HAL CCSD
Publication date: 03/07/2013

Many document collections available on the web describe the characteristics of entities of a single type (e.g. products, plants), with each page presenting one of these entities. These documents are knowledge sources particularly well suited to building ontologies. Although they share a regular layout, they contain less written prose than text files, but their structure is rich in meaning. As a result, classical linguistic methods for identifying concepts and relations are less suited to analysing them. We propose an approach that exploits the various properties of these documents, combining analysis of structure and layout with linguistic analysis, and exploiting their semantic annotation.

Scientific Publications of the University of Toulouse II Le Mirail

HAL-Paris 13

IRADABE: Adapting English Lexicons to the Italian Sentiment Polarity Classification task

Author: Buscaldi Davide
Hernandez-Farias Irazú
Priego-Sánchez Belém
Publication venue: HAL CCSD
Publication date: 09/12/2014

Interest in the Sentiment Analysis task has been growing in recent years due to the importance of applications that may benefit from this kind of information. In this paper we addressed the polarity classification task of Italian tweets by using a supervised machine learning approach. We developed a set of features and used them in a machine learning system in order to decide if a tweet is subjective or objective. The polarity result itself was then used as an additional feature to determine whether a tweet contains ironic content or not. We addressed the lack of resources in Italian by translating (mostly automatically) existing resources for the English language. Our model obtained good results in the SentiPolC 2014 task, being one of the best ranked systems.
Interest in automatic sentiment analysis has grown steadily in recent years owing to the importance of the applications in which this kind of analysis can be used. In this article we describe the experiments carried out on the polarity classification of tweets written in Italian, using an approach based on machine learning. A number of criteria were used as features to assign a polarity and then determine whether or not the tweets contain irony. For these experiments, the lack of resources (in particular specialised dictionaries) was offset by adapting existing resources for the English language, largely by means of machine translation techniques. The resulting model was one of the best in the SentiPolC task at Evalita 2014.

HAL-Paris 13

Hal-Diderot

A semi-automatic approach for building ontologies from a collection of structured web documents

Author: Aussenac-Gilles Nathalie
Buscaldi Davide
Comparot Catherine
Kamel Mouna
Publication venue: HAL CCSD
Publication date: 01/01/2013

Many collections of structured documents are available on the web. Such a collection generally describes the characteristics of entities of a single type, with each page describing one entity. These documents are suitable knowledge sources for building ontologies. Because they share a strong, regular layout, they contain less well-written text than plain text files, but their structure is highly meaningful. Classical linguistic methods for identifying concepts and relations are therefore less appropriate for analysing them. The approach we propose in this paper exploits various properties of such documents, combining layout/formatting analysis and linguistic analysis, and using semantic annotation.

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

HAL Descartes

HAL-Paris 13

Détection et classification non supervisées de relations sémantiques dans des articles scientifiques

Author: Buscaldi Davide
Charnois Thierry
Gábor Kata
Tellier Isabelle
Zargayouna Haïfa
Publication venue: HAL CCSD
Publication date: 04/07/2016

In this article, we address a still little-explored task: automatically extracting the state of the art of a scientific domain from the analysis of articles in that domain. We reduce it to two elementary subtasks: identifying concepts and recognising relations between those concepts. Terminology extraction identifies candidate concepts, which are then aligned with external resources. Next, we seek to recognise and classify the semantic relations between concepts automatically and without supervision, relying on various clustering and biclustering techniques. We apply these two steps to a corpus extracted from the ACL Anthology archive. A manual analysis allowed us to propose a typology of semantic relations and to classify a sample of relation instances. The first evaluations suggest that biclustering is of interest for detecting new types of relations in the corpus.
ABSTRACT. Unsupervised Classification of Semantic Relations in Scientific Papers. In this article, we tackle the still largely unexplored task of automatically building the "state of the art" of a scientific domain from a corpus of research papers. This task is defined as a sequence of two basic steps: finding concepts and recognizing the relations between them. First, candidate concepts are identified using terminology extraction, and subsequently linked to external resources. Second, semantic relations between entities are categorized with different clustering and biclustering algorithms. Experiments were carried out on the ACL Anthology Corpus. Results are evaluated against a hand-crafted typology of semantic relations and manually categorized examples. The first results indicate that biclustering techniques may indeed be useful for detecting new types of relations.
KEYWORDS: scientific literature analysis, relation extraction, clustering, biclustering

HAL-Paris 13

Generating knowledge graphs by employing Natural Language Processing and Machine Learning techniques within the scholarly domain

Author: Buscaldi Davide
Dessì Danilo
Motta Enrico
Osborne Francesco
Reforgiato Recupero Diego
Publication venue: 'Elsevier BV'
Publication date: 28/10/2020

The continuous growth of scientific literature brings innovations and, at the same time, raises new challenges. One of them is that its analysis has become difficult due to the high volume of published papers, for which manual effort for annotation and management is required. Novel technological infrastructures are needed to help researchers, research policy makers, and companies to time-efficiently browse, analyse, and forecast scientific research. Knowledge graphs, i.e., large networks of entities and relationships, have proved to be an effective solution in this space. Scientific knowledge graphs focus on the scholarly domain and typically contain metadata describing research publications such as authors, venues, organizations, research topics, and citations. However, the current generation of knowledge graphs lacks an explicit representation of the knowledge presented in the research papers. As such, in this paper, we present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications and integrates them in a large-scale knowledge graph. Within this research work, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, iii) show the advantage of such a hybrid system over alternative approaches, and iv) as a chosen use case, generate a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain. As our approach is general and can be applied to any domain, we expect that it can facilitate the management, analysis, dissemination, and processing of scientific knowledge.

arXiv.org e-Print Archive

Open Research Online (The Open University)

Archivio istituzionale della ricerca - Università di Cagliari

HAL-Paris 13

Mining Scholarly Publications for Scientific Knowledge Graph Construction

Author: Buscaldi Davide
Dessì Danilo
Motta Enrico
Osborne Francesco
Reforgiato Recupero Diego
Publication date: 01/01/2019

In this paper, we present a preliminary approach that uses a set of NLP and Deep Learning methods for extracting entities and relationships from research publications and then integrates them in a Knowledge Graph. More specifically, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, and iii) analyse an automatically generated Knowledge Graph including 10,425 entities and 25,655 relationships in the field of Semantic Web.

Open Research Online (The Open University)

Archivio istituzionale della ricerca - Università di Cagliari