Toponym Disambiguation in Information Retrieval
In recent years, geography has acquired great importance in the context of
Information Retrieval (IR) and, more generally, of the automated processing of
information in text. Mobile devices that can browse the web and at the
same time report their position are now a common reality, together
with applications that exploit this data to provide users with locally
customised information, such as directions or advertisements. It is therefore
important to deal properly with the geographic information that is
included in electronic texts. Most of this information appears
in the form of place names, or toponyms.
Toponym ambiguity represents an important issue in Geographical Information
Retrieval (GIR), since queries are geographically constrained.
There has been a struggle to find specific geographical IR methods
that actually outperform traditional IR techniques, and toponym ambiguity
may be a relevant factor in the inability of current GIR systems to
take advantage of geographical knowledge. Recently, some Ph.D. theses
have dealt with Toponym Disambiguation (TD) from different perspectives,
from the development of resources for the evaluation of Toponym Disambiguation
(Leidner (2007)) to the use of TD to improve geographical scope
resolution (Andogah (2010)). The Ph.D. thesis presented here introduces
a TD method based on WordNet and carries out a detailed study of the
relationship of Toponym Disambiguation to some IR applications, such as
GIR, Question Answering (QA) and Web retrieval.
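As a rough illustration of the disambiguation setting (not the thesis's actual WordNet-based method), a candidate referent can be chosen by Lesk-style overlap between the sentence and the context words attached to each gazetteer entry; the gazetteer below is a hypothetical toy example:

```python
import re

# Toy gazetteer: each candidate referent of a toponym carries a bag of
# context words (hypothetical entries, for illustration only).
GAZETTEER = {
    "Cambridge": [
        {"country": "UK", "context": {"england", "cam", "university", "uk"}},
        {"country": "USA", "context": {"massachusetts", "boston", "mit", "harvard"}},
    ],
}

def disambiguate(toponym, sentence):
    """Pick the candidate whose context words overlap most with the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    candidates = GAZETTEER.get(toponym, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c["context"] & words))

best = disambiguate("Cambridge", "He studied at Harvard, near Boston")
```

Real systems replace the toy context sets with glosses and holonyms drawn from a resource such as WordNet or a full gazetteer.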
The work presented in this thesis starts with an introduction to the applications
in which TD may prove useful, together with an analysis of the
ambiguity of toponyms in news collections. It would not be possible to
study the ambiguity of toponyms without studying the resources that are
used as placename repositories; these resources are the equivalent of language
dictionaries, which provide the different meanings of a given word.

Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912
An Ensemble Method Based on the Combination of Transformers with Convolutional Neural Networks to Detect Artificially Generated Text
Thanks to state-of-the-art Large Language Models (LLMs), language
generation has reached outstanding levels. These models are capable of
generating high-quality content, making it challenging to distinguish
generated text from human-written content. Despite the advantages provided by
Natural Language Generation, the inability to identify automatically
generated text can raise ethical concerns in terms of authenticity.
Consequently, it is important to design and develop methodologies to detect
artificial content. In our work, we present some classification models
constructed by ensembling transformer models such as SciBERT, DeBERTa and
XLNet with Convolutional Neural Networks (CNNs). Our experiments demonstrate
that the considered ensemble architectures surpass the performance of the
individual transformer models for classification. Furthermore, the proposed
SciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared
task 2023 data.

Comment: In Proceedings of the 21st Annual Workshop of the Australasian
Language Technology Association (ALTA 2023)
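The paper's ensembles stack CNNs on transformer representations; a far simpler way to combine several classifiers, shown here purely for illustration with made-up class probabilities, is soft voting over their predicted distributions:

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models (soft voting)."""
    n = len(prob_lists)
    return [sum(p[i] for p in prob_lists) / n for i in range(len(prob_lists[0]))]

# Hypothetical per-model probabilities for the classes [human, generated]:
scibert = [0.30, 0.70]
deberta = [0.45, 0.55]
xlnet   = [0.20, 0.80]

avg = soft_vote([scibert, deberta, xlnet])
label = "generated" if avg[1] > avg[0] else "human"
```

Averaging smooths out individual models' mistakes, which is one intuition behind why ensembles tend to beat their single members.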
Passage retrieval in legal texts
[EN] Legal texts usually comprise many kinds of documents, such as contracts, patents and treaties. These texts usually include a huge quantity of unstructured information written in natural language. Thanks to automatic analysis and Information Retrieval (IR) techniques, it is possible to filter out information that is not relevant and, therefore, to reduce the number of documents that users need to browse to find the information they are looking for. In this paper we adapted the JIRS passage retrieval system to work with three kinds of legal texts: treaties, patents and contracts, studying the issues related to the processing of this kind of information. In particular, we studied how a passage retrieval system might be linked to automated analysis based on logic and algebraic programming for the detection of conflicts in contracts. In our set-up, a contract is translated into formal clauses, which are analysed by means of a model-checking tool; the passage retrieval system is then used to extract conflicting sentences from the original contract text. © 2011 Elsevier Inc. All rights reserved.

We thank the MICINN (Plan I+D+i) TEXT-ENTERPRISE 2.0 (TIN2009-13391-C04-03) research project. The work of the second author has been possible thanks to a scholarship funded by Maat Gknowledge in the framework of the project with the Universidad Politécnica de Valencia "Módulo de servicios semánticos de la plataforma G".

Rosso, P.; Correa García, S.; Buscaldi, D. (2011). Passage retrieval in legal texts. Journal of Logic and Algebraic Programming. 80(3-5):139-153. doi:10.1016/j.jlap.2011.02.001
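JIRS ranks passages by the n-grams of the query they contain, rewarding longer matches. A minimal sketch of that idea (not the actual JIRS scoring formula), using invented contract sentences:

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def score(query, passage, max_n=3):
    """Score a passage by the query n-grams it shares,
    weighting longer matches more (a rough n-gram density measure)."""
    q = query.lower().split()
    p = passage.lower().split()
    s = 0.0
    for n in range(1, max_n + 1):
        s += n * len(ngrams(q, n) & ngrams(p, n))
    return s

passages = [
    "the supplier shall deliver the goods within thirty days",
    "payment is due upon receipt of the invoice",
]
query = "deliver the goods"
best = max(passages, key=lambda p: score(query, p))
```

The weight `n` on each match length is an assumption of this sketch; the point is only that an exact trigram match counts for more than three scattered unigrams.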
Building ontologies from a collection of structured web pages
Many document collections available on the web describe the characteristics of entities of a single type (e.g. products, plants), with each page presenting one of these entities. These documents are knowledge sources particularly well suited to ontology construction. While they share the same regular layout, they contain less running text than plain text files, but their architecture is rich in meaning. As a result, classical linguistic methods for identifying concepts and relations are less suited to analysing them. We propose an approach that exploits the various properties of these documents, combining analysis of their structure and layout with linguistic analysis, and exploiting their semantic annotation.
IRADABE: Adapting English Lexicons to the Italian Sentiment Polarity Classification task
Interest in the Sentiment Analysis task has been growing in recent years due to the importance of applications that may benefit from such information. In this paper we addressed the polarity classification task for Italian tweets using a supervised machine learning approach. We developed a set of features and used them in a machine learning system in order to decide whether a tweet is subjective or objective. The polarity result itself was then used as an additional feature to determine whether a tweet contains ironic content or not. We faced the lack of resources in Italian by translating (mostly automatically) existing resources for the English language. Our model obtained good results in the SentiPolC 2014 task, being one of the best ranked systems.
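The cascaded design described above, where a polarity score is reused as a feature for irony detection, can be sketched with a hypothetical mini-lexicon (the real system used translated English resources and a trained classifier, not this toy rule):

```python
# Hypothetical mini-lexicon, standing in for a translated English resource.
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"terrible", "hate", "awful"}

def polarity(tweet):
    """First stage: lexicon-based polarity score (positive minus negative hits)."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def irony_features(tweet):
    """Second stage: the polarity score becomes one feature among others,
    here combined with a naive exclamation-count cue."""
    return {"polarity": polarity(tweet), "exclamations": tweet.count("!")}

feats = irony_features("I just love waiting two hours for the bus!!!")
```

A positive polarity score attached to an objectively unpleasant situation is one of the surface cues irony classifiers try to pick up.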
A semi-automatic approach for building ontologies from a collection of structured web documents
Many collections of structured documents are available on the web. Such a collection generally describes the characteristics of entities of a single type, where each page describes one entity. These documents are adequate knowledge sources for building ontologies. As they benefit from a strong, shared layout, they contain less running text than plain text files, but their architecture is very meaningful. Classical linguistic-based methods for identifying concepts and relations are therefore no longer appropriate for analysing them. The approach we propose in this paper exploits various properties of such documents, combining layout/formatting analysis with linguistic analysis, and using semantic annotation.
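A minimal sketch of structure-driven extraction, assuming the shared layout exposes attribute/value pairs as HTML definition lists (a hypothetical layout, not the one analysed in the paper):

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect <dt>/<dd> pairs: here the page layout, not running text,
    carries the concept/attribute structure."""

    def __init__(self):
        super().__init__()
        self.pairs, self._tag, self._key = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "dt":
            self._key = text          # attribute name
        elif self._key is not None:
            self.pairs.append((self._key, text))  # attribute value
            self._key = None

# Invented plant-description page fragment:
page = "<dl><dt>Family</dt><dd>Rosaceae</dd><dt>Height</dt><dd>2 m</dd></dl>"
parser = FieldExtractor()
parser.feed(page)
```

The extracted pairs would then feed the linguistic and semantic-annotation steps that turn raw fields into ontology concepts and relations.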
Unsupervised Detection and Classification of Semantic Relations in Scientific Papers
In this article, we tackle the as yet unexplored task of automatically building the "state of the art" of a scientific domain from a corpus of research papers. The task is defined as a sequence of two basic steps: finding concepts and recognizing the relations between them. First, candidate concepts are identified using terminology extraction and subsequently linked to external resources. Second, semantic relations between entities are categorized with different clustering and biclustering algorithms. Experiments were carried out on the ACL Anthology Corpus. Results are evaluated against a hand-crafted typology of semantic relations and manually categorized examples. The first results indicate that biclustering techniques may indeed be useful for detecting new types of relations.
KEYWORDS: scientific literature analysis, relation extraction, clustering, biclustering
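A drastically simplified stand-in for the unsupervised relation grouping (the paper applies clustering and biclustering algorithms; here instances are merely grouped by their exact infix pattern, with invented examples):

```python
from collections import defaultdict

# Hypothetical relation instances: (concept, infix text, concept)
instances = [
    ("HMM", "is used for", "POS tagging"),
    ("CRF", "is used for", "NER"),
    ("BLEU", "evaluates", "machine translation"),
    ("parser", "is used for", "syntactic analysis"),
]

def cluster_by_infix(instances):
    """Naive unsupervised grouping: instances sharing the same infix pattern
    fall into one candidate relation type."""
    clusters = defaultdict(list)
    for c1, infix, c2 in instances:
        clusters[infix].append((c1, c2))
    return dict(clusters)

clusters = cluster_by_infix(instances)
```

Real clustering would compare distributional representations of the infixes rather than exact strings, so that "is used for" and "is applied to" could land in the same relation type.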
Generating knowledge graphs by employing Natural Language Processing and Machine Learning techniques within the scholarly domain
The continuous growth of scientific literature brings innovations and, at the same time, raises new challenges. One of them is that analysing the literature has become difficult due to the high volume of published papers, for which manual annotation and management effort is required. Novel technological infrastructures are needed to help researchers, research policy makers, and companies to browse, analyse, and forecast scientific research time-efficiently. Knowledge graphs, i.e., large networks of entities and relationships, have proved to be an effective solution in this space. Scientific knowledge graphs focus on the scholarly domain and typically contain metadata describing research publications, such as authors, venues, organizations, research topics, and citations. However, the current generation of knowledge graphs lacks an explicit representation of the knowledge presented in the research papers. In this paper, we therefore present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications and integrates them into a large-scale knowledge graph. Within this research work, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, iii) show the advantage of such a hybrid system over alternative approaches, and iv) as a chosen use case, generate a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain. As our approach is general and can be applied to any domain, we expect that it can facilitate the management, analysis, dissemination, and processing of scientific knowledge.
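The triples such a pipeline produces can be held in the simplest possible store, a list of (subject, relation, object) facts; the triples below are invented for illustration and are not drawn from the paper's graph:

```python
# A tiny in-memory triple store for (subject, relation, object) facts
# extracted from abstracts (hypothetical triples for illustration).
triples = [
    ("BERT", "usedFor", "named entity recognition"),
    ("BERT", "basedOn", "Transformer"),
    ("SPARQL", "queries", "RDF"),
]

def neighbours(graph, subject):
    """Return all (relation, object) edges leaving a subject node."""
    return [(r, o) for s, r, o in graph if s == subject]

edges = neighbours(triples, "BERT")
```

Production systems would back this with an RDF store and resolve entity mentions so that "BERT" and "the BERT model" map to one node, which is exactly the integration step the architecture above addresses.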
Mining Scholarly Publications for Scientific Knowledge Graph Construction
In this paper, we present a preliminary approach that uses a set of NLP and Deep Learning methods for extracting entities and relationships from research publications and then integrates them into a Knowledge Graph. More specifically, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, and iii) analyse an automatically generated Knowledge Graph including 10,425 entities and 25,655 relationships in the field of the Semantic Web.