Detecting Check-Worthy Claims in Political Debates, Speeches, and Interviews Using Audio Data

Author: Hardalov Momchil
Ivanov Petar
Koychev Ivan
Nakov Preslav
Publication date: 24/05/2023

A large portion of society united around the same vision and ideas carries enormous energy. That is precisely what political figures would like to accumulate for their cause. With this goal in mind, they can sometimes resort to distorting or hiding the truth, unintentionally or on purpose, which opens the door for misinformation and disinformation. Tools for automatic detection of check-worthy claims would be of great help to moderators of debates, journalists, and fact-checking organizations. While previous work on detecting check-worthy claims has focused on text, here we explore the utility of the audio signal as an additional information source. We create a new multimodal dataset (text and audio in English) containing 48 hours of speech. Our evaluation results show that the audio modality together with text yields improvements over text alone in the case of multiple speakers. Moreover, an audio-only model could outperform a text-only one for a single speaker. Comment: check-worthy claims, fake news, political debates, audi

arXiv.org e-Print Archive

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Author: Afzal Osama Mohammed
Aji Alham Fikri
Ivanov Petar
Mahmoud Tarek
Mansurov Jonibek
Nakov Preslav
Shelmanov Artem
Su Jinyan
Tsvigun Akim
Wang Yuxia
Whitehouse Chenxi
Publication date: 24/05/2023

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalistic, educational, and academic contexts. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark, M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and show that it is challenging for detectors to generalize well to unseen examples if they are either from different domains or generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains, and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4. Comment: 11 page

arXiv.org e-Print Archive

US News and Social Media Framing around Vaping

Author: Aanegola Rohan
Altice Frederick L.
Babaeianjelodar Marzieh
Bancroft Angus
Chen Keyu
Cheung Lam Yin
De Choudhury Munmun
KhudaBukhsh Ashiqur R.
Kumar Navin
Nakov Preslav Ivanov
Shi Yiwen
Yadav Shweta
Publication date: 22/07/2022

In this paper, we investigate how vaping is framed differently (2008-2021) between US news and social media. We analyze 15,711 news articles and 1,231,379 Facebook posts about vaping to study the differences in framing between media varieties. We use word embeddings to provide two-dimensional visualizations of the semantic changes around vaping for news and for social media. We show that news media framing of vaping shifted over time in line with emergent regulatory trends, such as flavored vaping bans, with little discussion around vaping as a smoking cessation tool. We found that social media discussions were far more varied, with transitions toward vaping both as a public health harm and as a smoking cessation tool. Our cloze test, dynamic topic model, and question answering showed similar patterns, where social media, but not news media, characterizes vaping as a combustible cigarette substitute. We use n-grams to show that social media data first centered on vaping as a smoking cessation tool, and in 2019 moved toward narratives around vaping regulation, similar to news media frames. Overall, social media tracks the evolution of vaping as a social practice, while news media reflects more risk-based concerns. A strength of our work is how the different techniques we have applied validate each other. Stakeholders may utilize our findings to intervene around the framing of vaping, and may design communications campaigns that improve the way society sees vaping, thus possibly aiding smoking cessation and reducing youth vaping.

arXiv.org e-Print Archive

Edinburgh Research Explorer

Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics

Author: Preslav Ivanov Nakov

CiteSeerX

Investigating the Degree of Adequacy of the Relations in the Concept Structure of Students using the Method of Latent Semantic Analysis

Author: Preslav Ivanov Nakov
Senia Petrova Terzieva
Sneja H
Publication venue: CompSysTech’2001 – Bulgarian Computer Science Conference, 21-22.06.2001, Sofia, Bulgaria

Abstract. Research on the effects of study is hindered by the limited capabilities of the techniques and methods for registering, measuring, and assessing the knowledge actually formed, that is, information represented in memory with the appropriate correlations among its units. The problem is addressed by using latent semantic analysis to compare and assess scientific texts and the knowledge expressed by students in the form of free verbal statements. Key words: latent semantic analysis; notional structures; content analysis

CiteSeerX