19 research outputs found

    Adapting Cross-Genre Author Profiling to Language and Corpus: Notebook for PAN at CLEF 2016

    This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration was different for different languages at each stage.
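
The feature representations named in this abstract can be illustrated with a minimal sketch. It assumes plain (untyped) character n-grams rather than the typed n-grams the authors use, omits SOA and the transition point selection step, and the function names `char_ngrams` and `representations` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def representations(docs, n=3):
    """Build binary, raw, normalized and tf-idf weights for each document.

    Assumption: idf is the unsmoothed log(N / df); other variants exist.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = set().union(*counts)
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs.
    df = {g: sum(1 for c in counts if g in c) for g in vocab}
    out = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero on empty docs
        out.append({
            "binary": {g: 1 for g in c},
            "raw": dict(c),
            "normalized": {g: f / total for g, f in c.items()},
            "tfidf": {g: f * math.log(n_docs / df[g]) for g, f in c.items()},
        })
    return out


reps = representations(["abcab", "abd"], n=2)
```

Each dictionary in `reps` could then be vectorized and passed to a classifier such as an SVM or logistic regression; which representation works best is an empirical question, as the abstract notes.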

    Resumen de HOMO-MEX en Iberlef 2023: Detección de discursos de odio en mensajes online dirigidos hacia la población LGBTQ+ hablante de español mexicano

    The detection of hate speech and stereotypes in online platforms has gained significant attention in the field of Natural Language Processing (NLP). Among various forms of discrimination, LGBTQ+ phobia is prevalent on social media, particularly on platforms like Twitter. The objective of the HOMO-MEX task is to encourage the development of NLP systems that can detect and classify LGBTQ+ phobic content in Spanish tweets, regardless of whether it is expressed aggressively or subtly. The task aims to address the lack of dedicated resources for LGBTQ+ phobia detection in Spanish Twitter and encourages participants to employ multi-label classification approaches.
    This paper has been supported by PAPIIT projects IT100822, TA101722, and CONAHCYT CF-2023-G-64. Also, we thank Alejandro Ojeda Trueba for the creation of the HOMO-MEX presentation image. GBE is supported by a grant from the Ministry of Universities of the Government of Spain, financed by the European Union, NextGeneration EU (María Zambrano program).

    Resumen de PAR-MEX en IberLEF 2022: Tarea Compartida para la Detección de Paráfrasis en Español

    Paraphrase detection is an important unresolved task in natural language processing, especially in the Spanish language. In order to address this issue, and to contribute to the creation of high-performance automated paraphrase detection systems, we propose a shared task called PAR-MEX. For this task, we created a corpus in Spanish with topics in the domain of Mexican gastronomy. Afterwards, the participants in this task submitted their classification results on our corpus. In this paper we explain the steps followed for the creation of the corpus, summarize the results obtained by the various participants, and propose some conclusions regarding the paraphrase detection task in Spanish.
    We acknowledge the support of the projects CONACyT CB A1-S-27780, and DGAPA-UNAM PAPIIT references TA400121 and TA101722.

    Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

    We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
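
A dictionary-based preprocessing step of the kind this abstract describes might look like the following minimal sketch. The dictionary entries, the placeholder tokens such as `<smile>`, and the `normalize` function are all hypothetical illustrations; they are not drawn from the published resource.

```python
# Hypothetical miniature dictionaries; the real resource is far larger
# and covers English, Spanish, Dutch, and Italian.
SLANG = {"gr8": "great", "u": "you"}
CONTRACTIONS = {"can't": "cannot", "i'm": "i am"}
EMOTICONS = {":)": "<smile>", ":(": "<sad>"}


def normalize(text):
    """Lowercase the text and replace known tokens with standard forms."""
    out = []
    for tok in text.lower().split():
        # Emoticons are matched before punctuation is stripped,
        # because they are made of punctuation characters.
        if tok in EMOTICONS:
            out.append(EMOTICONS[tok])
            continue
        core = tok.strip(".,!?;")
        repl = CONTRACTIONS.get(core) or SLANG.get(core)
        # Note: a replaced token loses its surrounding punctuation.
        out.append(repl if repl else tok)
    return " ".join(out)


cleaned = normalize("I'm sure u can't lose :)")
```

Normalizing variants to a single form shrinks the vocabulary, which is one plausible reason such preprocessing could help a learned feature representation.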

    Extracción de características de texto basada en grafos sintácticos integrados

    Thesis (Doctorate in Computer Science), Instituto Politécnico Nacional, CIC, 2017, 1 PDF file (79 pages). tesis.ipn.m

    Resumen de FakeDeS en IberLEF 2021: Tarea compartida para la detección de noticias falsas en español

    This paper presents an overview of FakeDeS 2021, the second edition of this lab under the IberLEF conference, although the first under this name. The FakeDeS shared task aims to explore different methodologies and strategies related to fake news detection in Spanish, mainly in its Mexican variant. This year's edition brings two main challenges: thematic and language variation. For this purpose, we introduce a new testing corpus containing news related to COVID-19 and news from other Ibero-American countries.
    This research was funded by CONACyT project CB A1-S-27780, DGAPA-UNAM PAPIIT grants number TA400121 and TA100520. The authors also thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

    A Graph Based Authorship Identification Approach: Notebook for PAN at CLEF 2015

    The paper describes our approach to the Authorship Identification task at PAN at CLEF 2015. We extract textual patterns based on features obtained from shortest path walks over Integrated Syntactic Graphs (ISG). Then we calculate a similarity between the unknown document and the known document with these patterns. The approach uses a predefined threshold to decide whether the unknown document was written by the known author.
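
The threshold decision described here can be sketched as follows. This is only an illustration of the verification step: word bigrams stand in for the ISG shortest-path patterns, cosine similarity is an assumed similarity measure, and the threshold value and function names are hypothetical.

```python
import math
from collections import Counter


def patterns(text):
    """Stand-in pattern extractor: counts of adjacent word pairs.

    The paper derives patterns from shortest path walks over Integrated
    Syntactic Graphs; word bigrams are used here only for illustration.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def same_author(known, unknown, threshold=0.5):
    """Accept the unknown document if similarity reaches the threshold."""
    return cosine(patterns(known), patterns(unknown)) >= threshold


verdict = same_author("the cat sat on the mat", "the cat sat on the rug")
```

In practice the threshold would be tuned on training data, since a fixed value rarely transfers across genres or languages.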

    Sentence-CROBI: A Simple Cross-Bi-Encoder-Based Neural Network Architecture for Paraphrase Identification

    Since the rise of Transformer networks and large language models, cross-encoders have become the dominant architecture for various Natural Language Processing tasks. When dealing with sentence pairs, they can exploit the relationships between those pairs. On the other hand, bi-encoders can obtain a vector given a single sentence and are used in tasks such as textual similarity or information retrieval due to their low computational cost; however, their performance is inferior to that of cross-encoders. In this paper, we present Sentence-CROBI, an architecture that combines cross-encoders and bi-encoders to obtain a global representation of sentence pairs. We evaluated the proposed architecture in the paraphrase identification task using the Microsoft Research Paraphrase Corpus, the Quora Question Pairs dataset, and the PAWS-Wiki dataset. Our model obtains competitive results compared with the state of the art by using model ensembles and a simple model configuration. These results demonstrate that a simple architecture that combines sentence-pair and single-sentence representations without using complex pre-training or fine-tuning algorithms is a viable alternative for sentence pair tasks.