272 research outputs found

    Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review

    Full text link
    Due to the advent and increase in the popularity of the Internet, people have been producing and disseminating textual data in several ways, such as reviews, social media posts, and news articles. As a result, numerous researchers have been working on discovering patterns in textual data, especially because social media posts function as social sensors, indicating peoples' opinions, interests, etc. However, most tasks regarding natural language processing are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, such as an outdated dataset, which may not correspond to reality, and an outdated model, which has its performance degrading over time. Concept drift is another aspect that emphasizes these issues, which corresponds to data distribution and pattern changes. In a text stream scenario, it is even more challenging due to its characteristics, such as the high speed and data arriving sequentially. In addition, models for this type of scenario must adhere to the constraints mentioned above while learning from the stream by storing texts for a limited time and consuming low memory. In this study, we performed a systematic literature review regarding concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 40 papers to unravel aspects such as text drift categories, types of text drift detection, model update mechanism, the addressed stream mining tasks, types of text representations, and text representation update mechanism. In addition, we discussed drift visualization and simulation and listed real-world datasets used in the selected papers. Therefore, this paper comprehensively reviews the concept drift adaptation in text stream mining scenarios.Comment: 49 page

    Semi-supervised learning and fairness-aware learning under class imbalance

    Get PDF
    With the advent of Web 2.0 and the rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information, rather the quality of data (veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms are solely based on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still affects dramatically machine learning algorithms. Class-imbalance causes classification models to learn effectively one particular class (majority) while ignoring other classes (minority). In extend to this issue, machine learning models that are applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented within the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgency of mitigating discriminatory outcomes from machine learning algorithms. In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance, and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions: • Tackling class-imbalance in semi-supervised learning – The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class-imbalance in the semi-supervised learning. We address this problem by employing data augmentation in semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process. • Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention over the last years. A growing body of pre-, in- and postprocessing approaches has been proposed to mitigate algorithmic bias; however, these methods consider error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes

    Human-in-the-Loop Hate Speech Classification in a Multilingual Context

    Full text link
    The shift of public debate to the digital sphere has been accompanied by a rise in online hate speech. While many promising approaches for hate speech classification have been pro- posed, studies often focus only on a single language, usually English, and do not address three key concerns: post-deployment perfor- mance, classifier maintenance and infrastruc- tural limitations. In this paper, we introduce a new human-in-the-loop BERT-based hate speech classification pipeline and trace its de- velopment from initial data collection and an- notation all the way to post-deployment. Our classifier, trained using data from our original corpus of over 422k examples, is specifically developed for the inherently multilingual set- ting of Switzerland and outperforms with its F1 score of 80.5 the currently best-performing BERT-based multilingual classifier by 5.8 F1 points in German and 3.6 F1 points in French. Our systematic evaluations over a 12-month period further highlight the vital importance of continuous, human-in-the-loop classifier main- tenance to ensure robust hate speech classifica- tion post-deployment

    Exploring anomalies in time

    Get PDF

    Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection

    Get PDF
    The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably Health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such corpus with unlabelled posts; and (3) to describe such short text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers to be in favour (904), against (674) or neither (1,223) with a 0.725 Fleiss’ kappa score. Results show that the self-training method with SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro averaged f1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of the trustworthiness and the validation index.Agencia Estatal de Investigación | Ref. PID2020–113673RB-I00Xunta de Galicia | Ref. ED431C2018/55Fundação para a Ciência e a Tecnologia | Ref. UIDB/04469/2020Financiado para publicación en acceso aberto: Universidade de Vigo/CISU

    Automatic annotation for weakly supervised learning of detectors

    Get PDF
    PhDObject detection in images and action detection in videos are among the most widely studied computer vision problems, with applications in consumer photography, surveillance, and automatic media tagging. Typically, these standard detectors are fully supervised, that is they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated. With the emergence of digital media, and the rise of high-speed internet, raw images and video are available for little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive. As a result there has been a great interest in training detectors with weak supervision where only the presence or absence of object/action in image/video is needed, not the location. This thesis presents approaches for weakly supervised learning of object/action detectors with a focus on automatically annotating object and action locations in images/videos using only binary weak labels indicating the presence or absence of object/action in images/videos. First, a framework for weakly supervised learning of object detectors in images is presented. In the proposed approach, a variation of multiple instance learning (MIL) technique for automatically annotating object locations in weakly labelled data is presented which, unlike existing approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial annotation is then used to start an iterative process in which standard object detectors are used to refine the location annotation. Finally, to ensure that the iterative training of detectors do not drift from the object of interest, a scheme for detecting model drift is also presented. Furthermore, unlike most other methods, our weakly supervised approach is evaluated on data without manual pose (object orientation) annotation. Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues, is carried out. From the analysis, a new method based on negative mining (NegMine) is presented for the initial annotation of both object and action data. The NegMine based approach is a much simpler formulation using only inter-class measure and requires no complex combinatorial optimisation but can still meet or outperform existing approaches including the previously pre3 sented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing approaches to boost their performance. Finally, the thesis will take a step back and look at the use of generic object detectors as prior knowledge in weakly supervised learning of object detectors. These generic object detectors are typically based on sampling saliency maps that indicate if a pixel belongs to the background or foreground. A new approach to generating saliency maps is presented that, unlike existing approaches, looks beyond the current image of interest and into images similar to the current image. We show that our generic object proposal method can be used by itself to annotate the weakly labelled object data with surprisingly high accuracy

    Comparative process mining:analyzing variability in process data

    Get PDF

    Comparative process mining:analyzing variability in process data

    Get PDF

    Machine learning: statistical physics based theory and smart industry applications

    Get PDF
    The increasing computational power and the availability of data have made it possible to train ever-bigger artificial neural networks. These so-called deep neural networks have been used for impressive applications, like advanced driver assistance and support in medical diagnoses. However, various vulnerabilities have been revealed and there are many open questions concerning the workings of neural networks. Theoretical analyses are therefore essential for further progress. One current question is: why is it that networks with Rectified Linear Unit (ReLU) activation seemingly perform better than networks with sigmoidal activation?We contribute to the answer to this question by comparing ReLU networks with sigmoidal networks in diverse theoretical learning scenarios. In contrast to analysing specific datasets, we use a theoretical modelling using methods from statistical physics. They give the typical learning behaviour for chosen model scenarios. We analyse both the learning behaviour on a fixed dataset and on a data stream in the presence of a changing task. The emphasis is on the analysis of the network’s transition to a state wherein specific concepts have been learnt. We find significant benefits of ReLU networks: they exhibit continuous increases of their performance and adapt more quickly to changing tasks.In the second part of the thesis we treat applications of machine learning: we design a quick quality control method for material in a production line and study the relationship with product faults. Furthermore, we introduce a methodology for the interpretable classification of time series data
    corecore