Evaluating Rank-Coherence of Crowd Rating in Customer Satisfaction
Crowd rating is a continuous and public process of data gathering that allows the display of general quantitative opinions on a topic from anonymous online networks acting as crowds. Online platforms have leveraged these technologies to improve predictive tasks in marketing. However, we argue for a different employment of crowd rating: as a tool of public utility to support social contexts suffering from adverse selection, such as tourism. This aim requires dealing with issues in both the method of measurement and the analysis of data, as well as with common biases associated with the public disclosure of rating information. We propose an evaluative method to investigate the fairness of common measures of rating procedures, with the particular perspective of assessing the linearity of the ranked outcomes. This is tested on a longitudinal observational case of 7 years of customer satisfaction ratings, for a total of 26,888 reviews. According to the results obtained from the sampled dataset, analysed with the proposed evaluative method, there is a trade-off between the loss of (potentially) biased information on ratings and the fairness of the resulting rankings. However, when an ad hoc unbiased ranking is computed, the ranking obtained through the time-weighted measure is not significantly different from the ad hoc unbiased case.
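A minimal sketch of the kind of computation involved, assuming a simple exponential time-weighting of ratings and a Kendall-tau comparison between the resulting rankings; the decay scheme, column names, and the coherence measure are assumptions of this illustration, not the paper's actual measures.

```python
# Illustrative sketch: time-weighted mean rating per item and rank comparison.
# The exponential decay and the column names are assumptions, not the paper's measure.
import pandas as pd
from scipy.stats import kendalltau

def time_weighted_means(reviews: pd.DataFrame, half_life_days: float = 365.0) -> pd.Series:
    """Mean rating per item, weighting recent reviews more heavily."""
    age_days = (reviews["date"].max() - reviews["date"]).dt.days
    weights = 0.5 ** (age_days / half_life_days)          # exponential decay with age
    weighted = reviews.assign(w=weights, wr=weights * reviews["rating"])
    grouped = weighted.groupby("item")
    return grouped["wr"].sum() / grouped["w"].sum()

def rank_coherence(scores_a: pd.Series, scores_b: pd.Series) -> float:
    """Kendall's tau between two rankings of the same items."""
    common = scores_a.index.intersection(scores_b.index)
    tau, _ = kendalltau(scores_a[common], scores_b[common])
    return tau

# Example usage (hypothetical `reviews` frame with 'item', 'rating', 'date' columns):
# plain = reviews.groupby("item")["rating"].mean()
# weighted = time_weighted_means(reviews)
# print(rank_coherence(plain, weighted))
```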
Chapter Multipoint vs slider: a protocol for experiments
Since the broad diffusion of computer-assisted survey tools (i.e. web surveys), a lively debate about innovative scales of measurement has arisen among social scientists and practitioners. The implications are relevant for applied statistics and evaluation research since, while traditional scales collect ordinal observations, data from sliders can be interpreted as continuous. The literature, however, reports excessive times of completion for slider tasks in web surveys. This experimental protocol is aimed at testing hypotheses on the accuracy of prediction and the dispersion of estimates from anonymous participants who are recruited online and randomly assigned to tasks involving the recognition of shades of colour. The treatment variable is the scale: a traditional 0-10 multipoint scale vs a 0-100 slider. Shades have a unique parametrisation (true value) and participants have to guess the true value through the scale. These tasks are designed to recreate situations of uncertainty among participants while minimizing the subjective component of a perceptual assessment and maximizing information about scale-driven differences and biases. We propose to test statistical differences across the treatment variable in: (i) mean absolute error from the true value; (ii) time of completion of the task. To correct biases due to the variance in the number of completed tasks among participants, data about participants can be collected through both pre-task acceptance of web cookies and post-task explicit questions.
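As a rough sketch of the proposed comparison, the following assumes one record per completed task with the scale condition, the absolute error (responses rescaled to a common range), and the completion time, and applies a nonparametric two-sample test; the variable names and the choice of the Mann-Whitney U test are assumptions of this illustration, not the protocol's prescribed analysis.

```python
# Illustrative sketch of the between-scale comparison (multipoint vs slider).
# Column names and the Mann-Whitney U test are assumptions for illustration.
import pandas as pd
from scipy.stats import mannwhitneyu

def compare_scales(tasks: pd.DataFrame) -> dict:
    """Compare absolute error and completion time between the two scale conditions.

    `tasks` is expected to hold one row per completed task with columns:
    'scale' ('multipoint' or 'slider'), 'abs_error' (on a common range),
    and 'seconds' (time of completion).
    """
    multipoint = tasks[tasks["scale"] == "multipoint"]
    slider = tasks[tasks["scale"] == "slider"]
    results = {}
    for outcome in ("abs_error", "seconds"):
        stat, p = mannwhitneyu(multipoint[outcome], slider[outcome],
                               alternative="two-sided")
        results[outcome] = {"U": stat, "p_value": p}
    return results
```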
PROTOCOL: HOW TO CORRECT THE CLASSIFICATION ERROR BY ASKING LARGE LANGUAGE MODELS THE SIMILARITY AMONG CATEGORIES
The similarity between two categories is a number between 0 and 1 that abstractly represents how much the two categories overlap, objectively or subjectively. When two categories overlap, the error of classifying one as the other is less severe. For example, misclassifying a wolf as a dog is a less severe error than misclassifying a wolf as a cat, because wolves are more similar to dogs than to cats.
Nevertheless, the canonical estimation of similarity matrices for taxonomies of categories is expensive. This protocol suggests why and how to estimate a similarity matrix from one or multiple Large Language Models.
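A minimal sketch of how such a matrix could be assembled and used to discount classification errors; the LLM query is reduced to a hypothetical `ask_llm_similarity` function, and the error weighting by `1 - similarity` is an assumption of this illustration rather than the protocol's definitive procedure.

```python
# Illustrative sketch: once a similarity matrix S (values in [0, 1]) is available,
# a misclassification can be weighted by 1 - S[true, predicted], so that confusing
# similar categories (e.g. wolf vs dog) costs less than confusing dissimilar ones.
# Obtaining S from a Large Language Model is reduced here to a hypothetical
# `ask_llm_similarity` function; the prompt and its parsing are assumptions.
import numpy as np

categories = ["wolf", "dog", "cat"]

def ask_llm_similarity(cat_a: str, cat_b: str) -> float:
    """Hypothetical call that asks an LLM for a similarity in [0, 1]."""
    raise NotImplementedError("replace with an actual LLM query and answer parsing")

def similarity_matrix(cats, query=ask_llm_similarity) -> np.ndarray:
    """Build a symmetric similarity matrix over the given categories."""
    n = len(cats)
    S = np.eye(n)                      # a category is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = query(cats[i], cats[j])   # symmetric by construction
    return S

def weighted_error(true_idx: int, pred_idx: int, S: np.ndarray) -> float:
    """Severity of a misclassification, discounted by category similarity."""
    return 1.0 - S[true_idx, pred_idx]
```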
Supplementary files for Characterisation and Calibration of Multiversal Models
The polarising effect of Review Bomb
This study discusses the Review Bomb, a phenomenon consisting of a massive attack by groups of Internet users on a website that displays users' reviews of products. It has gained attention especially on websites that aggregate numerical ratings. Although this phenomenon can be considered an example of online misinformation, it differs from conventional spam reviewing, which happens over larger time spans. In particular, the Bomb occurs suddenly and for a short time, because in this way it leverages the notorious cold-start problem: if reviews are submitted by a lot of fresh new accounts, it becomes hard to justify preventative measures. The present research work focuses on the case of The Last of Us Part II, a video game published by Sony that was the target of the widest Review Bomb phenomenon, occurring in June 2020. By performing an observational analysis of a linguistic corpus of English reviews and the features of their users, this study confirms that the Bomb was an ideological attack aimed at breaking down the rating system of the platform Metacritic. Evidence supports that the bombing had the unintended consequence of inducing a reaction from users, ending in a consistent polarisation of ratings towards extreme values. The results not only illustrate the theory of polarity in online reviews, but they also provide insights for research on the problem of cold-start detection of spam reviews. In particular, they illustrate the relevance of detecting users who discuss contextual elements instead of the product, and users with anomalous features.