Fake news detection and analysis

Author: Stefan Elena-Ramona
Publication venue: Universitat Politècnica de Catalunya
Publication date: 28/06/2022

The evolution of technology has led to the development of environments that allow instantaneous communication and dissemination of information. As a result, false news, article manipulation, lack of trust in media and information bubbles have become high-impact issues. In this context, the need for automatic tools that can classify content as reliable or not, and that can help create a trustworthy environment, is continually increasing. Current solutions do not entirely solve this problem, as the task is difficult and depends on factors such as the type of language, the type of news and the volatility of the subject. The main objective of this thesis is to explore this crucial problem of Natural Language Processing, namely false content detection, and how it can be solved as a classification problem with machine learning. A linguistic approach is taken, experimenting with different types of features and models to build accurate fake news detectors. The experiments are structured in three main steps: text pre-processing, feature extraction and classification. They are conducted on a real-world dataset, LIAR, to offer a good overview of which model best handles day-to-day situations. Two approaches are chosen: multi-class and binary classification. In both cases, we show that, out of all the experiments, a simple feed-forward network combined with fine-tuned DistilBERT embeddings reports the highest accuracy: 27.30% on 6-label classification and 63.61% on 2-label classification. These results emphasize that transfer learning brings important improvements in this task. In addition, we demonstrate that classic machine learning algorithms such as Decision Tree, Naïve Bayes and Support Vector Machine perform similarly to the state-of-the-art solutions, even outperforming some recurrent neural networks such as LSTM and BiLSTM. This indicates that more complex solutions do not guarantee higher performance.
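The three steps named above (pre-processing, feature extraction, classification) can be illustrated with a minimal sketch in plain Python. This is not the thesis's implementation: the tokenizer, the hand-written Naïve Bayes classifier, the toy statements and the mapping from the six LIAR labels to two classes are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical merge of the six LIAR labels into two classes (an assumption;
# the abstract does not say how the binary labels were derived).
BINARY_LABEL = {
    "pants-fire": "fake", "false": "fake", "barely-true": "fake",
    "half-true": "real", "mostly-true": "real", "true": "real",
}

def preprocess(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extract_features(tokens):
    """Step 2: bag-of-words term frequencies (token order is discarded)."""
    return Counter(tokens)

class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = extract_features(preprocess(text))
            self.term_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = extract_features(preprocess(text))
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            n_terms = sum(self.term_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for term, freq in feats.items():
                p = (self.term_counts[label][term] + 1) / (n_terms + len(self.vocab))
                score += freq * math.log(p)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy statements invented for illustration; these are not LIAR records.
train = [
    ("The senator voted for the budget bill last year", "true"),
    ("Official figures show unemployment fell last quarter", "mostly-true"),
    ("Aliens secretly control the national budget", "pants-fire"),
    ("The mayor secretly banned all bicycles", "false"),
]
texts = [statement for statement, _ in train]
labels = [BINARY_LABEL[label] for _, label in train]

model = NaiveBayes().fit(texts, labels)
prediction = model.predict("Aliens secretly banned the budget")
print(prediction)
```

Dropping the `BINARY_LABEL` lookup and training on the original labels turns the same sketch into the 6-label setting.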
Regarding features, we confirm that the degree of veracity of a text is connected to the frequency of its terms, and that this connection is stronger than the one with their position or order. Yet context proves to be the most powerful aspect in the feature extraction process. Also, indices that describe the author's style must be carefully selected to provide relevant information.
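The distinction between term frequency and term order can be made concrete with a small sketch. The two sentences and both helper functions are hypothetical and are not taken from the thesis: unigram counts capture frequency only, while bigram counts are sensitive to word order.

```python
from collections import Counter

def unigram_counts(text):
    """Term-frequency features: how often each word occurs."""
    return Counter(text.lower().split())

def bigram_counts(text):
    """Order-sensitive features: how often each adjacent word pair occurs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

# Two invented sentences with the same words in a different order.
a = "the bill cuts taxes but raises spending"
b = "the bill raises taxes but cuts spending"

same_frequencies = unigram_counts(a) == unigram_counts(b)
same_order = bigram_counts(a) == bigram_counts(b)
print(same_frequencies, same_order)
```

A frequency-based model sees these two sentences as identical, whereas an order-aware or contextual representation can tell them apart.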