
    Offensive language classification in social media: using deep learning

    Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
    As social media usage becomes more integrated into our daily lives, the impact of online abuse also becomes more prevalent. Research efforts in the area of Offensive Language Classification are numerous and often occur in parallel. The Offensive Language Identification Dataset (OLID) schema was introduced with the aim of consolidating related tasks by categorising offense into a three-level hierarchy: detection of offensive posts (Level A), distinguishing between targeted and untargeted offenses (Level B), and identifying the target of the offense (Level C). This thesis presents our contribution to the Offensive Language Classification Task (English SubTask A) of OffensEval 2020, and a follow-up study of Offense Type Classification (SubTask B) and Offense Target Identification (SubTask C) of OffensEval 2019. These tasks follow the OLID schema, where each level corresponds to an individual subtask. For SubTask A, the dataset is examined in detail and the most uncertain partitions are removed by under-sampling the training set. We improved model performance by increasing data quality, taking advantage of further offensive language classification datasets. We fine-tuned separate BERT models on the individual datasets and experimented with different ensemble approaches, including SVMs, gradient boosting, AdaBoost and logistic regression, to arrive at a final ensemble classification model with an improved macro-F1 score. Our best model, an average ensemble of four different BERT models, achieved 11th place out of 82 participants with a macro-F1 score of 0.91344 in the English SubTask A. The datasets for SubTasks B and C are highly imbalanced, and modifying the classification thresholds improved classifier performance on the minority classes, which in turn improved overall performance. Again using the BERT architecture, the models achieved macro-F1 scores of 0.71367 for SubTask B and 0.643352 for SubTask C, equivalent to 5th and 2nd place in the respective tasks. We showed that BERT is an effective architecture for offensive language classification and propose that further performance gains are possible by improving data quality.
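    As an illustration of the ensembling and threshold-modification ideas described in this abstract (a minimal sketch only, not the thesis code; the model probabilities, validation labels and threshold grid below are hypothetical stand-ins), one could average the class probabilities of several fine-tuned classifiers and then sweep the decision threshold on a validation set to maximise macro-F1:

```python
# Sketch: average-probability ensembling plus threshold tuning for macro-F1.
# All data here is random and stands in for real model outputs.
import numpy as np
from sklearn.metrics import f1_score

def average_ensemble(prob_list):
    """Average the [n_samples, n_classes] probability matrices of several models."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def tune_threshold(probs_positive, y_true, grid=np.linspace(0.1, 0.9, 81)):
    """Pick the positive-class threshold that maximises macro-F1 on a validation set."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (probs_positive >= t).astype(int)
        f1 = f1_score(y_true, preds, average="macro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Stand-in probabilities from four hypothetical fine-tuned BERT models.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
model_probs = [rng.dirichlet([2, 2], size=200) for _ in range(4)]
ens = average_ensemble(model_probs)
t, f1 = tune_threshold(ens[:, 1], y_val)
print(f"chosen threshold={t:.2f}, macro-F1={f1:.3f}")
```

    Tuning the threshold on held-out data rather than fixing it at 0.5 is a common way to trade precision for recall on minority classes, which is why it tends to help macro-averaged scores on imbalanced sets.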

    New perspectives and methods for stream learning in the presence of concept drift.

    153 p.
    Applications that generate data in the form of fast streams from non-stationary environments, that is, those where the underlying phenomena change over time, are becoming increasingly prevalent. In this kind of environment the probability density function of the data-generating process may change over time, producing a drift. As a result, predictive models trained over these stream data become obsolete and do not adapt suitably to the new distribution. Especially in online learning scenarios, there is a pressing need for new algorithms that adapt to this change as fast as possible while maintaining good performance scores. Examples of these applications include making inferences or predictions based on financial data, energy demand and climate data analysis, web usage or sensor network monitoring, and malware/spam detection, among many others. Online learning and concept drift are two of the hottest topics in the recent literature due to their relevance to the so-called Big Data paradigm, where we now find an increasing number of applications based on continuously available training data, known as data streams. Thus, learning in non-stationary environments requires adaptive or evolving approaches that can monitor and track the underlying changes and adapt a model to accommodate those changes accordingly. In this effort, I provide in this thesis a comprehensive review of state-of-the-art approaches and identify the most relevant open challenges in the literature, while focusing on addressing three of them through innovative perspectives and methods. This thesis provides a complete overview of several related fields and tackles several open challenges that have been identified in the very recent state of the art. Concretely, it presents an innovative way to generate artificial diversity in ensembles, a set of necessary adaptations and improvements for spiking neural networks so that they can be used in online learning scenarios, and, finally, a drift detector based on this latter algorithm. Together, these approaches constitute an innovative work aimed at presenting new perspectives and methods for the field.
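    As a rough illustration of the kind of drift monitoring this abstract describes (a minimal sketch, not the spiking-neural-network-based detector developed in the thesis; the window size, threshold and synthetic stream are arbitrary assumptions), a simple detector can compare a reference window of the stream against the most recent window and flag a drift when their means diverge:

```python
# Sketch: windowed mean-shift drift detector for a numeric data stream.
from collections import deque
import numpy as np

class WindowDriftDetector:
    def __init__(self, window_size=100, threshold=4.0):
        self.reference = deque(maxlen=window_size)  # older data
        self.recent = deque(maxlen=window_size)     # most recent data
        self.threshold = threshold                  # drift flagged when |z| exceeds this

    def update(self, x):
        """Feed one value from the stream; return True if drift is detected."""
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(x)
            return False
        self.recent.append(x)
        if len(self.recent) < self.recent.maxlen:
            return False
        ref, rec = np.array(self.reference), np.array(self.recent)
        pooled_std = np.sqrt((ref.var() + rec.var()) / 2) + 1e-12
        z = abs(rec.mean() - ref.mean()) / (pooled_std * np.sqrt(2.0 / len(rec)))
        if z > self.threshold:
            # Adapt: the recent window becomes the new reference after a drift.
            self.reference = deque(rec, maxlen=self.reference.maxlen)
            self.recent.clear()
            return True
        return False

# Synthetic stream whose mean shifts halfway through.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)])
detector = WindowDriftDetector()
drifts = [i for i, x in enumerate(stream) if detector.update(x)]
print("first drift flags at indices:", drifts[:5])
```

    Real stream-learning detectors (e.g. those monitoring a model's error rate rather than raw values) refine this basic idea with adaptive windows and statistical guarantees, but the detect-then-adapt loop is the same.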

    Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons

    The anterior inferotemporal cortex (IT) is the highest stage along the hierarchy of visual areas that, in primates, processes visual objects. Although several lines of evidence suggest that IT primarily represents visual shape information, some recent studies have argued that neuronal ensembles in IT code the semantic membership of visual objects (i.e., represent conceptual classes such as animate and inanimate objects). In this study, we investigated to what extent semantic, rather than purely visual, information is represented in IT by performing a multivariate analysis of IT responses to a set of visual objects. By relying on a variety of machine-learning approaches (including a cutting-edge clustering algorithm recently developed in the domain of statistical physics), we found that, in most instances, the IT representation of visual objects is accounted for by their similarity at the level of shape or, more surprisingly, low-level visual properties. Only in a few cases did we observe IT representations of semantic classes that were not explainable by the visual similarity of their members. Overall, these findings reassert the primary function of IT as a conveyor of explicit visual shape information, and reveal that low-level visual properties are represented in IT to a greater extent than previously appreciated. In addition, our work demonstrates how combining a variety of state-of-the-art multivariate approaches, and carefully estimating the contribution of shape similarity to the representation of object categories, can substantially advance our understanding of neuronal coding of visual objects in cortex.
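    As a sketch of the general style of multivariate analysis this abstract refers to (illustrative only, not the study's actual pipeline; the response matrix, shape descriptors and clustering parameters are random stand-ins), one can build representational dissimilarity matrices (RDMs) from population responses and from shape features, correlate them, and cluster the neuronal RDM to inspect its categorical structure:

```python
# Sketch: representational dissimilarity analysis with stand-in data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
responses = rng.normal(size=(40, 120))   # 40 objects x 120 neurons (hypothetical)
shape_feats = rng.normal(size=(40, 10))  # 40 objects x 10 shape descriptors (hypothetical)

# Neuronal RDM: 1 - Pearson correlation between population response patterns.
neural_rdm = pdist(responses, metric="correlation")
# Model RDM built from shape features with the same dissimilarity measure.
shape_rdm = pdist(shape_feats, metric="correlation")

# How well does shape similarity account for the neuronal representation?
rho, p = spearmanr(neural_rdm, shape_rdm)
print(f"Spearman correlation between RDMs: rho={rho:.3f}, p={p:.3f}")

# Cluster objects from the neuronal RDM to inspect its categorical structure.
clusters = fcluster(linkage(neural_rdm, method="average"), t=4, criterion="maxclust")
print("cluster assignment per object:", clusters)
```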