GAMBL, genetic algorithm optimization of memory-based WSD
GAMBL is a word-expert approach to WSD in which each word expert is trained using memory-based learning. Joint feature selection and algorithm parameter optimization are achieved with a genetic algorithm (GA). We use a cascaded classifier approach in which the GA optimizes local context features and the output of a separate keyword classifier (rather than also optimizing the keyword features together with the local context features). A further innovation on earlier versions of memory-based WSD is the use of grammatical relation and chunk features. This paper briefly presents the architecture of the system and discusses its performance on the English lexical sample and all-words tasks in SENSEVAL-3.
LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit
This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (among others automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools.
Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity
The detection of online cyberbullying has seen an increase in societal
importance, popularity in research, and available open data. Nevertheless,
while computational power and affordability of resources continue to increase,
the access restrictions on high-quality data limit the applicability of
state-of-the-art techniques. Consequently, much of the recent research uses
small, heterogeneous datasets, without a thorough evaluation of applicability.
In this paper, we further illustrate these issues, as we (i) evaluate many
publicly available resources for this task and demonstrate difficulties with
data collection. These predominantly yield small datasets that fail to capture
the required complex social dynamics and impede direct comparison of progress.
We (ii) conduct an extensive set of experiments that indicate a general lack of
cross-domain generalization of classifiers trained on these sources, and openly
provide this framework to replicate and extend our evaluation criteria.
Finally, we (iii) present an effective crowdsourcing method: simulating
real-life bullying scenarios in a lab setting generates plausible data that can
be effectively used to enrich real data. This largely circumvents the
restrictions on data that can be collected, and increases classifier
performance. We believe these contributions can aid in improving the empirical
practices of future research in the field.
The good, the bad and the implicit: a comprehensive approach to annotating explicit and implicit sentiment
We present a fine-grained scheme for the annotation of polar sentiment in text that accounts for explicit sentiment (so-called private states) as well as implicit expressions of sentiment (polar facts). Polar expressions are annotated below sentence level and classified according to their subjectivity status. Additionally, they are linked to one or more targets with a specific polar orientation and intensity. Other components of the annotation scheme include source attribution and the identification and classification of expressions that modify polarity. In previous research, little attention has been given to implicit sentiment, which represents a substantial amount of the polar expressions encountered in our data. An English and Dutch corpus of financial newswire, consisting of over 45,000 words each, was annotated using our scheme. A subset of this corpus was used to conduct an inter-annotator agreement study, which demonstrated that the proposed scheme can be used to reliably annotate explicit and implicit sentiment in real-world textual data, making the created corpora a useful resource for sentiment analysis.
Automatic Detection of Cyberbullying in Social Media Text
While social media offer great communication opportunities, they also
increase the vulnerability of young people to threatening situations online.
Recent studies report that cyberbullying constitutes a growing problem among
youngsters. Successful prevention depends on the adequate detection of
potentially harmful messages, and the information overload on the Web requires
intelligent systems to identify potential risks automatically. The focus of
this paper is on automatic cyberbullying detection in social media text by
modelling posts written by bullies, victims, and bystanders of online bullying.
We describe the collection and fine-grained annotation of a training corpus for
English and Dutch and perform a series of binary classification experiments to
determine the feasibility of automatic cyberbullying detection. We make use of
linear support vector machines exploiting a rich feature set and investigate
which information sources contribute most to this particular task.
Experiments on a holdout test set reveal promising results for the detection of
cyberbullying-related posts. After optimisation of the hyperparameters, the
classifier yields an F1-score of 64% and 61% for English and Dutch
respectively, and considerably outperforms baseline systems based on keywords
and word unigrams.
Learning Dutch coreference resolution
This paper presents a machine learning approach to the resolution of coreferential relations between nominal constituents in Dutch. It is the first significant automatic approach to the resolution of coreferential relations between nominal constituents for this language. The corpus-based strategy was enabled by the annotation of a substantial corpus (ca. 12,500 noun phrases) of Dutch news magazine text with coreferential links for pronominal, proper noun and common noun coreferences. Based on the hypothesis that different types of information sources contribute to a correct resolution of different types of coreferential links, we propose a modular approach in which a separate module is trained per NP type. Although largely unexplored for Dutch, automatic coreference resolution is a research area which is becoming increasingly popular in natural language processing (NLP) research. It remains a weakness, and therefore a key task, in applications such as machine translation, automatic summarization and information extraction, for which text understanding is of crucial importance.
Combining Lexico-semantic Features for Emotion Classification in Suicide Notes
This paper describes a system for automatic emotion classification, developed for the 2011 i2b2 Natural Language Processing Challenge, Track 2. The objective of the shared task was to label suicide notes with 15 relevant emotions on the sentence level. Our system uses 15 SVM models (one for each emotion), each using the combination of features that was found to perform best on a given emotion. Features included lemmas and trigram bag of words, and information from semantic resources such as WordNet, SentiWordNet and subjectivity clues. The best-performing system labeled 7 of the 15 emotions and achieved an F-score of 53.31% on the test data.