Mapping (Dis-)Information Flow about the MH17 Plane Crash
Digital media enables not only fast sharing of information, but also
disinformation. One prominent case of an event leading to circulation of
disinformation on social media is the MH17 plane crash. Studies analysing the
spread of information about this event on Twitter have focused on small,
manually annotated datasets, or used proxies for data annotation. In this work,
we examine to what extent text classifiers can be used to label data for
subsequent content analysis; in particular, we focus on predicting pro-Russian
and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though
we find that a neural classifier improves over a hashtag-based baseline,
labeling pro-Russian and pro-Ukrainian content with high precision remains a
challenging problem. We provide an error analysis underlining the difficulty of
the task and identify factors that might help improve classification in future
work. Finally, we show how the classifier can facilitate the annotation task
for human annotators.
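As a rough illustration of what a hashtag-based baseline might look like, the sketch below labels a tweet by checking it against two hashtag lists. The hashtags, the label names, and the function `hashtag_label` are hypothetical placeholders, not the lists or code used in the paper.

```python
from typing import Optional

# Hypothetical placeholder hashtags, invented for illustration only.
PRO_RUSSIAN_TAGS = {"#side_a_tag"}
PRO_UKRAINIAN_TAGS = {"#side_b_tag"}


def hashtag_label(tweet: str) -> Optional[str]:
    """Return a label if the tweet carries hashtags from exactly one list."""
    # Collect lower-cased hashtags, dropping trailing punctuation.
    tags = {
        token.lower().rstrip(".,!?")
        for token in tweet.split()
        if token.startswith("#")
    }
    has_ru = bool(tags & PRO_RUSSIAN_TAGS)
    has_ua = bool(tags & PRO_UKRAINIAN_TAGS)
    if has_ru == has_ua:
        # No matching hashtag, or hashtags from both lists: leave unlabelled.
        return None
    return "pro-russian" if has_ru else "pro-ukrainian"
```

A baseline of this kind leaves tweets without a listed hashtag unlabelled, which is one reason a learned classifier may be worth comparing against it.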
CONTRIBUTIONS TO EFFICIENT AUTOMATIC TRANSCRIPTION OF VIDEO LECTURES
Thesis by compendium. In recent years, online multimedia repositories have become key
knowledge assets thanks to the rise of the Internet, especially in the area of
education. Educational institutions around the world have devoted considerable effort
to exploring different teaching methods, to improving the transmission of knowledge
and to reaching a wider audience. As a result, online video lecture repositories
are now available and serve as complementary tools that can boost the learning
experience to better assimilate new concepts. To guarantee the success
of these repositories, the transcription of each lecture plays a very important
role because it constitutes the first step towards the availability of many other
features. Transcription makes learning materials searchable,
enables translation into other languages, supports recommendation
functions, makes content summaries possible, and guarantees
access for people with hearing disabilities. However, the
transcription of these videos is expensive in terms of time and human cost.
To this end, this thesis aims at providing new tools and techniques that
ease the transcription of these repositories. In particular, we address the
development of a complete Automatic Speech Recognition Toolkit with a special
focus on the Deep Learning techniques that contribute to providing accurate
transcriptions in real-world scenarios. This toolkit is tested against many
others in different international competitions, showing comparable transcription
quality. Moreover, a new technique that uses Confidence Measures to improve
recognition accuracy has been proposed; it in turn motivated the proposal of
new Confidence Measure techniques that helped to further improve the
transcription quality. To this end, a new speaker-adapted
confidence measure approach was proposed for models based on Recurrent Neural
Networks.
The contributions proposed herein have been tested in real-life scenarios in
different educational repositories. In fact, the transLectures-UPV toolkit is
part of a set of tools for providing video lecture transcriptions in many
different Spanish and European universities and institutions.
Agua Teba, MÁD. (2019). CONTRIBUTIONS TO EFFICIENT AUTOMATIC TRANSCRIPTION OF VIDEO LECTURES [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/130198
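One common way to use confidence measures for adaptation is to keep only the recognised words whose confidence exceeds a threshold and adapt on those. The sketch below illustrates that general idea only; the `RecognisedWord` type, the function name, and the 0.9 threshold are assumptions made for illustration and are not taken from the thesis or from the transLectures-UPV toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecognisedWord:
    text: str
    confidence: float  # estimated probability that the word is correct, in [0, 1]


def select_adaptation_words(
    hypothesis: List[RecognisedWord], threshold: float = 0.9
) -> List[RecognisedWord]:
    """Keep only the words confident enough to serve as adaptation data."""
    return [word for word in hypothesis if word.confidence >= threshold]
```

The threshold trades off the amount of adaptation data against the risk of adapting on recognition errors.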
Text-detection and -recognition from natural images
Text detection and recognition from images could have numerous functional applications for document analysis, such as assistance for visually impaired people; recognition of vehicle license plates; evaluation of articles containing tables, street signs, maps, and diagrams; keyword-based image exploration; document retrieval; recognition of parts within industrial automation; content-based extraction; object recognition; address block location; and text-based video indexing. This research exploited the advantages of artificial intelligence (AI) to detect and recognise text from natural images. Machine learning and deep learning were used to accomplish this task.
In this research, we conducted an in-depth literature review on the current detection and recognition methods used by researchers to identify the existing challenges, wherein the differences in text resulting from disparity in alignment, style, size, and orientation combined with low image contrast and a complex background make automatic text extraction a considerably challenging and problematic task. Therefore, the state-of-the-art suggested approaches obtain low detection rates (often less than 80%) and recognition rates (often less than 60%). This has led to the development of new approaches. The aim of the study was to develop a robust text detection and recognition method from natural images with high accuracy and recall, which would be used as the target of the experiments. This method could detect all the text in the scene images, despite certain specific features associated with the text pattern.
Furthermore, we aimed to find a solution to the two main problems concerning arbitrarily shaped text (horizontal, multi-oriented, and curved text) detection and recognition in a low-resolution scene and with various scales and of different sizes.
In this research, we propose a methodology to handle the problem of text detection by using novel combination and selection features to deal with the classification algorithms of the text/non-text regions. The text-region candidates were extracted from the grey-scale images by using the MSER technique. A machine learning-based method was then applied to refine and validate the initial detection. The effectiveness of the features based on the aspect ratio, GLCM, LBP, and HOG descriptors was investigated. The text-region classifiers of MLP, SVM, and RF were trained using selections of these features and their combinations. The publicly available datasets ICDAR 2003 and ICDAR 2011 were used to evaluate the proposed method. This method achieved the state-of-the-art performance by using machine learning methodologies on both databases, and the improvements were significant in terms of Precision, Recall, and F-measure. The F-measure for ICDAR 2003 and ICDAR 2011 was 81% and 84%, respectively. The results showed that the use of a suitable feature combination and selection approach could significantly increase the accuracy of the algorithms.
A new dataset has been proposed to fill the gap of character-level annotation and the availability of text in different orientations and of curved text. The proposed dataset was created particularly for deep learning methods which require a massive completed and varying range of training data. The proposed dataset includes 2,100 images annotated at the character and word levels to obtain 38,500 samples of English characters and 12,500 words. Furthermore, an augmentation tool has been proposed to support the proposed dataset.
The lack of an augmentation tool for object detection motivated the proposed tool, which can update the position of bounding boxes after transformations are applied to the images. This technique helps to increase the number of samples in the dataset and reduces annotation time, since no new annotation is required. The final part of the thesis presents a novel approach for text spotting, which is a new framework for an end-to-end character detection and recognition system designed using an improved SSD convolutional neural network, wherein layers are added to the SSD networks and the aspect ratio of the characters is considered because it is different from that of the other objects. Compared with the other methods considered, the proposed method could detect and recognise characters by training the end-to-end model completely. The proposed method performed better on the proposed dataset, where it scored 90.34. Furthermore, the F-measure of the method’s accuracy on ICDAR 2015, ICDAR 2013, and SVT was 84.5, 91.9, and 54.8, respectively. On ICDAR13, the method achieved the second-best accuracy. The proposed method could spot text in arbitrarily shaped (horizontal, oriented, and curved) scene text.
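The idea of updating bounding boxes alongside an image transformation can be sketched for two simple cases, a horizontal flip and a rescale. This is a minimal illustration under assumed conventions (axis-aligned boxes stored as `(x_min, y_min, x_max, y_max)` in pixels); the function names are hypothetical and this is not the augmentation tool described in the thesis.

```python
from typing import List, Tuple

# Axis-aligned box in pixel coordinates: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def flip_boxes_horizontally(boxes: List[Box], image_width: float) -> List[Box]:
    """Mirror boxes around the vertical centre line of the image."""
    return [
        (image_width - x_max, y_min, image_width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]


def scale_boxes(boxes: List[Box], sx: float, sy: float) -> List[Box]:
    """Rescale boxes when the image is resized by factors sx (width), sy (height)."""
    return [
        (x_min * sx, y_min * sy, x_max * sx, y_max * sy)
        for x_min, y_min, x_max, y_max in boxes
    ]
```

Because the new coordinates are computed from the old ones, each augmented image inherits its annotations without further manual labelling.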
Towards Practicality of Sketch-Based Visual Understanding
Sketches have been used to conceptualise and depict visual objects from
pre-historic times. Sketch research has flourished in the past decade,
particularly with the proliferation of touchscreen devices. Much of the
utilisation of sketch has been anchored around the fact that it can be used to
delineate visual concepts universally irrespective of age, race, language, or
demography. The fine-grained interactive nature of sketches facilitates the
application of sketches to various visual understanding tasks, like image
retrieval, image-generation or editing, segmentation, 3D-shape modelling etc.
However, sketches are highly abstract and subjective based on the perception of
individuals. Although most agree that sketches provide fine-grained control to
the user to depict a visual object, many consider sketching a tedious process
due to their limited sketching skills compared to other query/support
modalities like text/tags. Furthermore, collecting fine-grained sketch-photo
association is a significant bottleneck to commercialising sketch applications.
Therefore, this thesis aims to progress sketch-based visual understanding
towards more practicality.
Comment: PhD thesis successfully defended by Ayan Kumar Bhunia, Supervisor:
Prof. Yi-Zhe Song, Thesis Examiners: Prof Stella Yu and Prof Adrian Hilto
Scene text localization and recognition in images and videos
Scene Text Localization and Recognition methods find all areas in an image or a video
that would be considered as text by a human, mark boundaries of the areas and output
a sequence of characters associated with its content. They are used to process images
and videos taken by a digital camera or a mobile phone and to "read" the content of
each text area into a digital format, typically a list of Unicode character sequences, that
can be processed in further applications.
Three different methods for Scene Text Localization and Recognition were proposed
in the course of the research, each one advancing the state of the art and improving the
accuracy. The first method detects individual characters as Extremal Regions (ER),
where the probability of each ER being a character is estimated using novel features
with O(1) complexity and only ERs with locally maximal probability are selected across
several image projections for the second stage, where the classification is improved using
more computationally expensive features. The method was the first published method
to address the complete problem of scene text localization and recognition as a whole
- all previous work in the literature focused solely on different subproblems.
Secondly, a novel easy-to-implement stroke detector was proposed. The detector is
significantly faster and produces significantly fewer false detections than the commonly
used ER detector. The detector efficiently produces character stroke segmentations,
which are exploited in a subsequent classification phase based on features efficiently
calculated as part of the segmentation process. Additionally, an efficient text clustering
algorithm based on text direction voting is proposed, which, like the previous
stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts.
The third method exploits a deep-learning model, which is trained for both text
detection and recognition in a single trainable pipeline. The method localizes and
recognizes text in an image in a single feed-forward pass. It is trained purely on synthetic
data, so it does not require obtaining expensive human annotations for training, and it
achieves state-of-the-art accuracy in end-to-end text recognition on two standard
datasets, whilst being an order of magnitude faster than the previous methods - the
whole pipeline runs at 10 frames per second.
Katedra kybernetik
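The selection of regions "with locally maximal probability" in the first method can be pictured with a small sketch. It assumes the character probabilities of a chain of nested regions are already available as a plain list; the function name and the `min_probability` cut-off are hypothetical, and the sketch does not reproduce the feature computation or the exact selection rule of the thesis.

```python
from typing import List


def locally_maximal(
    probabilities: List[float], min_probability: float = 0.5
) -> List[int]:
    """Return indices whose probability is a local maximum along the chain."""
    selected = []
    for i, p in enumerate(probabilities):
        # Treat positions outside the chain as having the lowest possible score.
        left = probabilities[i - 1] if i > 0 else float("-inf")
        right = probabilities[i + 1] if i + 1 < len(probabilities) else float("-inf")
        # Strict on the left, non-strict on the right: a plateau yields one index.
        if p >= min_probability and p > left and p >= right:
            selected.append(i)
    return selected
```

Only the selected regions would then be passed to the more expensive second-stage classifier, which is what keeps the first stage cheap.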
New frontiers in supervised word sense disambiguation: building multilingual resources and neural models on a large scale
Word Sense Disambiguation is a long-standing task in Natural Language Processing
(NLP), lying at the core of human language understanding. While it has already
been studied from many different angles over the years, ranging from
knowledge-based systems to semi-supervised and fully supervised models, the field seems to
be slowing down with respect to other NLP tasks, e.g., part-of-speech tagging and
dependency parsing. Despite the organization of several international competitions
aimed at evaluating Word Sense Disambiguation systems, the evaluation of automatic
systems has been problematic mainly due to the lack of a reliable evaluation
framework aiming at performing a direct quantitative comparison.
To this end we develop a unified evaluation framework and analyze the performance
of various Word Sense Disambiguation systems in a fair setup. The results
show that supervised systems clearly outperform knowledge-based models. Among
the supervised systems, a linear classifier trained on conventional local features
still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting
neural networks on unlabeled corpora achieve promising results, surpassing this
hard baseline in most test sets. Even though supervised systems tend to perform
best in terms of accuracy, they often lose ground to more flexible knowledge-based
solutions, which do not require training for every disambiguation target. To bridge
this gap we adopt a different perspective and rely on sequence learning to frame
the disambiguation problem: we propose and study in depth a series of end-to-end
neural architectures directly tailored to the task, from bidirectional Long Short-Term
Memory to encoder-decoder models. Our extensive evaluation over standard
benchmarks and in multiple languages shows that sequence learning enables more
versatile all-words models that consistently lead to state-of-the-art results, even
against models trained with engineered features.
However, supervised systems need annotated training corpora and the few available
to date are of limited size: this is mainly due to the expensive and time-consuming
process of annotating a wide variety of word senses at a reasonably high
scale, i.e., the so-called knowledge acquisition bottleneck. To address this issue, we
also present different strategies to automatically acquire high-quality sense-annotated
data in multiple languages, without any manual effort. We assess the quality of the
sense annotations both intrinsically and extrinsically, achieving competitive results
on multiple tasks.
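Framing disambiguation as sequence learning means that a whole sentence is mapped to a label sequence of the same length. One way to build such targets, sketched below, is to emit a sense label for each annotated word and the word itself otherwise. The sense identifier format and the function name are hypothetical, and this shows only the data layout, not the neural architectures studied in the thesis.

```python
from typing import Dict, List


def to_label_sequence(
    tokens: List[str], sense_annotations: Dict[int, str]
) -> List[str]:
    """Build one target label per token for all-words disambiguation.

    sense_annotations maps a token position to its sense identifier;
    tokens without an annotation are labelled with the token itself.
    """
    return [sense_annotations.get(i, token) for i, token in enumerate(tokens)]


# Hypothetical example sentence and sense identifier.
tokens = ["she", "sat", "on", "the", "bank"]
labels = to_label_sequence(tokens, {4: "bank#sense_river"})
```

With targets in this form, a single model can label every word of a sentence in one pass instead of training a separate classifier per target word.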
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.
The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model consists of a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.
The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.
The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
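The pattern-based matching step and the key-value output mentioned above can be illustrated with a small sketch using regular expressions over OCR text. The field names and patterns below are hypothetical examples chosen for illustration; they are not the rules used by the system described in the thesis.

```python
import re
from typing import Dict

# Hypothetical field patterns; a real system would need many more variants.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bdate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"\btotal\s*:?\s*([0-9]+(?:[.,][0-9]{2})?)", re.IGNORECASE),
}


def extract_fields(ocr_text: str) -> Dict[str, str]:
    """Return the fields that could be matched, in key-value form."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            # Fields that cannot be matched are simply left out of the result.
            fields[name] = match.group(1)
    return fields
```

Returning only the matched fields mirrors the behaviour described in the abstract, where whichever fields are extracted successfully are reported as key-value pairs.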
On the Keyword Extraction and Bias Analysis, Graph-based Exploration and Data Augmentation for Abusive Language Detection in Low-Resource Settings
Thesis by compendium. Abusive language detection is a task that has become increasingly important in the modern digital age, where communication takes place via various online platforms. The increase in online interactions has led to an increase in the occurrence of abusive language. Addressing such content is crucial to maintaining a safe and inclusive online environment.
However, this task faces several challenges that make it a complex and ongoing area of research and development. In particular, detecting abusive language in environments with sparse data poses an additional challenge, since the development of accurate automated systems often requires large annotated datasets.
In this thesis we investigate different aspects of abusive language detection, paying particular attention to environments with limited data. First, we study the bias toward abusive keywords in models trained for abusive language detection. To this end, we propose two methods for extracting potentially abusive keywords from datasets. We then evaluate the bias toward the extracted keywords and how this bias can be modified in order to influence abusive language detection performance. The analysis and conclusions of this work reveal evidence that it is possible to mitigate the bias and that such a reduction can positively affect the performance of the models. However, we notice that it is not possible to establish a similar correspondence between bias mitigation and model performance in low-resource settings with the studied bias mitigation techniques.
Second, we investigate the use of models based on graph neural networks to detect abusive language. On the one hand, we propose a text representation framework designed with the aim of obtaining a representation space in which abusive texts can be easily distinguished from other texts. On the other hand, we evaluate the ability of models based on convolutional graph neural networks to classify abusive texts.
The next part of our research focuses on analyzing how data augmentation can influence the performance of abusive language detection. To this end, we investigate two well-known techniques based on the principle of vicinal risk minimization and propose a variant for one of them. In addition, we evaluate simple techniques based on the operations of synonym replacement, random insertion, random swap, and random deletion.
The contributions of this thesis highlight the potential of models based on graph neural networks and data augmentation techniques to improve abusive language detection, especially in low-resource settings.
These contributions have been published in several international conferences and journals.
This research work was partially funded by the Spanish Ministry of Science and Innovation under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). The authors also thank the EU-FEDER Comunitat Valenciana 2014-2020 grant IDIFEDER/2018/025. This work was done in the framework of the FairTransNLP research project (PID2021-124361OB-C31) on Fairness and Transparency for equitable NLP applications in social media, funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU "A way of making Europe". Part of the work presented in this article was performed during the first author’s research visit to the University of Mannheim, supported through a Contact Fellowship awarded by the DAAD scholarship program “STIBET Doktoranden”.
Peña Sarracén, GLDL. (2024). On the Keyword Extraction and Bias Analysis, Graph-based Exploration and Data Augmentation for Abusive Language Detection in Low-Resource Settings [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/203266
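Two of the simple augmentation operations named in the abstract, random swap and random deletion, can be sketched in a few lines. This is an illustrative sketch only: the function signatures, the explicit random generator argument, and the choice to always keep at least one word are assumptions, not the implementation evaluated in the thesis.

```python
import random
from typing import List


def random_swap(words: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap the words at two randomly chosen positions, n_swaps times."""
    words = list(words)  # work on a copy so the input is not modified
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Delete each word independently with probability p."""
    if not words:
        return []
    kept = [word for word in words if rng.random() >= p]
    # Never return an empty text: fall back to one randomly chosen word.
    return kept if kept else [rng.choice(words)]
```

Passing a seeded `random.Random` instance makes the augmented samples reproducible, which helps when comparing models trained on the same augmented data.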