65 research outputs found

GAMBL, genetic algorithm optimization of memory-based WSD

Author: Antal Van Den Bosch
Bart Decadt
Véronique Hoste
Walter Daelemans
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2004

GAMBL is a word expert approach to WSD in which each word expert is trained using memory-based learning. Joint feature selection and algorithm parameter optimization are achieved with a genetic algorithm (GA). We use a cascaded classifier approach in which the GA optimizes local context features and the output of a separate keyword classifier (rather than also optimizing the keyword features together with the local context features). A further innovation on earlier versions of memory-based WSD is the use of grammatical relation and chunk features. This paper briefly presents the architecture of the system and discusses its performance on the English lexical sample and all-words tasks in SENSEVAL-3.

CiteSeerX

Ghent University Academic Bibliography

Tilburg University Repository
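
The GA-based feature selection mentioned in the GAMBL abstract above can be illustrated with a minimal sketch. This is not the GAMBL implementation: the bit-mask encoding, the `toy_fitness` function, the `USEFUL` set, and all parameter values are assumptions made for illustration. A real system would score each mask by the cross-validated accuracy of the memory-based classifier and would also encode algorithm parameters in the chromosome.

```python
import random

# Hypothetical toy task: only features 0, 2 and 5 are informative.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    """Stand-in for classifier accuracy: +1 per useful feature kept, -1 per useless one."""
    return sum(1 if i in USEFUL else -1 for i, bit in enumerate(mask) if bit)

def tournament(population, fitness, rng):
    """Return the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(fitness, n_bits, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve feature masks (lists of 0/1) and return the best one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

best_mask = evolve(toy_fitness, n_bits=8)
print(best_mask, toy_fitness(best_mask))
```

Because the best individual is copied into every new generation, the best fitness never decreases; the fixed seed makes the run repeatable.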

LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit

Author: Bart Desmet
Els Lefever
Geert Coorman
Lieve Macken
Marjan Van De Kauter
Véronique Hoste
Publication date: 01/01/2013

This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (among others automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools.

CiteSeerX

Ghent University Academic Bibliography
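
The ten-fold cross-validation mentioned in the LeTs Preprocess abstract above can be sketched as follows. The function name `k_fold_indices` and the contiguous, unshuffled split are assumptions for illustration; the abstract does not say how the folds were built.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each item lands in exactly one test fold and in the training set of the other nine.
folds = list(k_fold_indices(25, k=10))
for train, test in folds:
    print(len(train), len(test))
```

In practice the data would usually be shuffled (or split per document) before folding, so that each fold is representative.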

Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity

Author: Walter Daelemans
Guy De Pauw
Bart Desmet
Chris Emmery
Véronique Hoste
Gilles Jacobs
Els Lefever
Cynthia Van Hee
Ben Verhoeven
Publication date: 25/10/2019

The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of the recent research uses small, heterogeneous datasets, without a thorough evaluation of applicability. In this paper, we further illustrate these issues, as we (i) evaluate many publicly available resources for this task and demonstrate difficulties with data collection. These predominantly yield small datasets that fail to capture the required complex social dynamics and impede direct comparison of progress. We (ii) conduct an extensive set of experiments that indicate a general lack of cross-domain generalization of classifiers trained on these sources, and openly provide this framework to replicate and extend our evaluation criteria. Finally, we (iii) present an effective crowdsourcing method: simulating real-life bullying scenarios in a lab setting generates plausible data that can be effectively used to enrich real data. This largely circumvents the restrictions on data that can be collected, and increases classifier performance. We believe these contributions can aid in improving the empirical practices of future research in the field.

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Institutional Repository Universiteit Antwerpen

Tilburg University Repository

The good, the bad and the implicit: a comprehensive approach to annotating explicit and implicit sentiment

Author: Bart Desmet
Marjan Van de Kauter
Véronique Hoste
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

We present a fine-grained scheme for the annotation of polar sentiment in text that accounts for explicit sentiment (so-called private states) as well as implicit expressions of sentiment (polar facts). Polar expressions are annotated below sentence level and classified according to their subjectivity status. Additionally, they are linked to one or more targets with a specific polar orientation and intensity. Other components of the annotation scheme include source attribution and the identification and classification of expressions that modify polarity. In previous research, little attention has been given to implicit sentiment, which represents a substantial amount of the polar expressions encountered in our data. An English and Dutch corpus of financial newswire, consisting of over 45,000 words each, was annotated using our scheme. A subset of this corpus was used to conduct an inter-annotator agreement study, which demonstrated that the proposed scheme can be used to reliably annotate explicit and implicit sentiment in real-world textual data, making the created corpora a useful resource for sentiment analysis.

Crossref

Ghent University Academic Bibliography
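
The abstract above reports an inter-annotator agreement study without naming the statistic used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two annotators; the choice of kappa, the function name `cohen_kappa`, and the example labels are assumptions, not details taken from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotated at random with their own label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical polarity labels from two annotators for four expressions.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.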

Automatic Detection of Cyberbullying in Social Media Text

Author: Bart Desmet
Ben Verhoeven
Chris Emmery
Cynthia Van Hee
Els Lefever
Gilles Jacobs
Guy De Pauw
Véronique Hoste
Walter Daelemans
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2018

While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a training corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for this particular task. Experiments on a holdout test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1-score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems based on keywords and word unigrams.

arXiv.org e-Print Archive

Crossref

Ghent University Academic Bibliography

Directory of Open Access Journals

Institutional Repository Universiteit Antwerpen

Tilburg University Repository
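
The F1-score reported in the abstract above is the harmonic mean of precision and recall on the positive class. The sketch below shows the computation for a binary task; the function name `f1_score` and the example labels are illustrative assumptions and do not reproduce the paper's data or results.

```python
def f1_score(gold, predicted, positive=1):
    """F1 for the positive class, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = bullying-related post, 0 = other.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(gold, predicted)
print(round(score, 3))
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so F1 is also 2/3. Accuracy on the same example would be 6/8, which shows why F1 is preferred when the positive class is rare.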

Learning Dutch coreference resolution

Author: Véronique Hoste
Walter Daelemans
Publication date: 01/01/2004

This paper presents a machine learning approach to the resolution of coreferential relations between nominal constituents in Dutch. It is the first significant automatic approach to this task for Dutch. The corpus-based strategy was enabled by the annotation of a substantial corpus (ca. 12,500 noun phrases) of Dutch news magazine text with coreferential links for pronominal, proper noun and common noun coreferences. Based on the hypothesis that different types of information sources contribute to a correct resolution of different types of coreferential links, we propose a modular approach in which a separate module is trained per NP type.

1 The task of coreference resolution

Although largely unexplored for Dutch, automatic coreference resolution is a research area which is becoming increasingly popular in natural language processing (NLP) research. It is a weak point in, and therefore a key task for, applications such as machine translation, automatic summarization and information extraction, for which text understanding is of crucial importance.

CiteSeerX

Ghent University Academic Bibliography

Institutional Repository Universiteit Antwerpen

Utrecht University Repository

Tilburg University Repository

Combining Lexico-semantic Features for Emotion Classification in Suicide Notes

Author: Bart Desmet
Véronique Hoste
Publication venue: 'SAGE Publications'
Publication date: 01/01/2012

This paper describes a system for automatic emotion classification, developed for the 2011 i2b2 Natural Language Processing Challenge, Track 2. The objective of the shared task was to label suicide notes with 15 relevant emotions on the sentence level. Our system uses 15 SVM models (one for each emotion), each using the combination of features that was found to perform best on that emotion. Features included lemmas and trigram bag of words, and information from semantic resources such as WordNet, SentiWordNet and subjectivity clues. The best-performing system labeled 7 of the 15 emotions and achieved an F-score of 53.31% on the test data.

Directory of Open Access Journals

Ghent University Academic Bibliography

PubMed Central