Towards a corpus for credibility assessment in software practitioner blog articles
Blogs are a source of grey literature which is widely adopted by software
practitioners for disseminating opinion and experience. Analysing such articles
can provide useful insights into the state-of-practice for software engineering
research. However, there are challenges in identifying higher-quality content
from the large quantity of articles available. Credibility assessment can help
in identifying quality content, though there is a lack of existing corpora.
Credibility is typically measured through a series of conceptual criteria, with
'argumentation' and 'evidence' being two important criteria.
We create a corpus labelled for argumentation and evidence that can aid the
credibility community. The corpus consists of articles from the blog of a
single software practitioner and is publicly available.
Three annotators label the corpus with a series of conceptual credibility
criteria, reaching an agreement of 0.82 (Fleiss' Kappa). We present preliminary
analysis of the corpus by using it to investigate the identification of claim
sentences (one of our ten labels).
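For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' Kappa can be computed from a table of per-item label counts. The function name `fleiss_kappa` and the toy counts are illustrative assumptions, not the authors' code or data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Hypothetical helper for illustration; every item must be rated by
    the same number of annotators (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example (made-up counts): 4 sentences, 3 annotators,
# 2 categories ("claim", "not claim").
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is the scale on which the reported 0.82 should be read.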
We train four systems (BERT, KNN, Decision Tree and SVM) using three feature
sets (Bag of Words, Topic Modelling and InferSent), achieving an F1 score of
0.64 using InferSent and a Linear SVM.
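As a reminder of what the reported metric measures, the sketch below computes precision, recall and F1 for a binary claim / not-claim labelling. The abstract does not say which F1 variant was used (for example binary or macro-averaged), so the binary form, the function name `precision_recall_f1` and the toy labels here are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float, float]:
    """Binary precision, recall and F1, treating label 1 as 'claim sentence'.

    Hypothetical helper for illustration; returns 0.0 for any score whose
    denominator would be zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy example (made-up labels): 1 = claim sentence, 0 = other sentence.
gold_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(gold_labels, predicted_labels)
```

Because F1 balances missed claims against false alarms, it is a more informative summary than accuracy when claim sentences are a minority of the corpus.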
Our preliminary results are promising, indicating that the corpus can help
future studies in detecting the credibility of grey literature. Future research
will investigate the degree to which the sentence level annotations can infer
the credibility of the overall document.