    Detecting Misleading Headlines Through the Automatic Recognition of Contradiction in Spanish

    Misleading headlines are part of the disinformation problem. A headline should give a concise summary of the news story that helps the reader decide whether to read the body of the article, which is why headline accuracy is a crucial element of a news story. This work focuses on detecting misleading headlines through the automatic identification of contradiction between the headline and the body text of a news item. When a contradiction is detected, the reader is alerted to the lack of precision or trustworthiness of the headline in relation to the body text. To facilitate the automatic detection of misleading headlines, a new Spanish dataset (ES_Headline_Contradiction) was created for the purpose of identifying contradictory information between a headline and its body text. The dataset annotates the semantic relationship between headline and body text by categorising the relation as compatible, contradictory, or unrelated. Another novel aspect of the dataset is that it distinguishes between different types of contradiction, enabling a more fine-grained identification of them. The dataset was built via a novel semi-automatic methodology, which resulted in a more cost-efficient development process. The experimental results show that pre-trained language models can be fine-tuned on this dataset, producing very encouraging results for detecting incongruence or non-relation between headline and body text. This research work is funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union” or by the “European Union NextGenerationEU/PRTR” through the project TRIVIAL: Technological Resources for Intelligent VIral AnaLysis through NLP (PID2021-122263OB-C22) and the project SOCIALTRUST: Assessing trustworthiness in digital media (PDC2022-133146-C22). It is also funded by Generalitat Valenciana through the project NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation (CIPROM/2021/21), and by grant ACIF/2020/177.
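    As a rough illustration of the fine-tuning step described above, the sketch below pairs each headline with its body text and trains a three-way classifier; the Spanish checkpoint, column names, placeholder rows, and hyperparameters are assumptions for illustration, not the authors' choices.

```python
# A minimal sketch, not the authors' code: fine-tuning a pre-trained Spanish
# encoder on headline/body pairs labelled compatible / contradictory / unrelated.
# The checkpoint name, column names, and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "dccuchile/bert-base-spanish-wwm-cased"  # assumed Spanish BERT checkpoint
LABELS = {"compatible": 0, "contradictory": 1, "unrelated": 2}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def encode(batch):
    # Headline and body text are encoded together as a sentence pair.
    return tokenizer(batch["headline"], batch["body"],
                     truncation=True, padding="max_length", max_length=512)

# Placeholder rows standing in for the ES_Headline_Contradiction dataset.
train = Dataset.from_dict({
    "headline": ["titular de ejemplo"],
    "body": ["cuerpo de la noticia de ejemplo"],
    "label": [LABELS["compatible"]],
}).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="headline_contradiction",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```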

    Predicting Startup Success Using Publicly Available Data

    Predicting the success of an early-stage startup has always been a major effort for investors and venture funds. Statistically, about 305 million startups are created in a year, but less than 10% of them succeed in becoming profitable businesses. Accurately identifying the signs of startup growth is the work of countless investors, and in recent years research has turned to machine learning in the hope of improving the accuracy and speed of startup success prediction. To learn about a startup, investors have to navigate many different internet sources and often rely on personal intuition to determine the startup’s potential and likelihood of success. This thesis explores whether online data about a company, particularly general company data, previous funding events, published news articles, internet presence, and social media activity, can be used to identify fast-growing startups. Data collected from Crunchbase, the Google Search API, and Twitter were used to predict whether a company will raise a round of funding within a fixed time horizon. A total of ten machine learning models were evaluated, and the CatBoost ensemble method achieved the best performance, with precision, recall, and F1 scores of 0.663, 0.827, and 0.736 respectively for predicting funding within 3 years. The same ensemble method achieved F1 scores of 0.528, 0.683, 0.736, 0.763, and 0.777 at predicting funding 1-5 years into the future. The final objective was to predict whether a startup that had already raised an angel or seed round would raise another investment within a one-year horizon. The CatBoost model with a 0.75 cutoff achieved precision and F0.1 scores of 0.790 and 0.774, beating the results of previous work in this field.
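    The sketch below illustrates the kind of thresholded CatBoost classifier the abstract describes; the feature file, column names, and training settings are assumptions, while the 0.75 probability cutoff and the F0.1 metric are taken from the abstract.

```python
# A minimal sketch, not the thesis code: a CatBoost classifier over startup features
# (Crunchbase/Twitter/search-derived columns are assumed), with the 0.75 probability
# cutoff used for the follow-on-funding task.
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import fbeta_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("startup_features.csv")                 # assumed feature table
X = df.drop(columns=["raised_within_1y"])                 # assumed label column
y = df["raised_within_1y"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostClassifier(iterations=500, verbose=False)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.75).astype(int)                        # the 0.75 cutoff from the abstract
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("F0.1:     ", fbeta_score(y_te, pred, beta=0.1))    # precision-weighted F-score
```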

    Erosion Of Credibility: A Mixed Methods Evaluation of Twitter News Headlines from The New York Times, Washington Post, Wall Street Journal, Los Angeles Times, And USA Today

    To entice and commodify social media news consumers, contemporary news organizations have increasingly relied on data analytics to boost audience engagement. Clicks, likes, and shares are the metrics that now guide the editorial process and shape decisions about content and coverage. As such, news headlines are regularly manipulated to attract the attention of those who quickly scroll through social media networks on computers and smartphones. However, few studies have examined the typologies of news content most likely to be manipulated in social media news headlines or the impact of news headline manipulation on news source credibility. For this research, source credibility theory has been updated for practical application to today’s social media news landscape and used as a lens to examine the phenomenon, its impact on audience engagement, and its association with traditional standards of journalism and credibility. A mixed methods content analysis was conducted of news headlines published on Twitter compared to headlines and content published on the websites of five traditional newspapers: the New York Times, Washington Post, Wall Street Journal, Los Angeles Times, and USA Today. The results indicated that the typologies of news most likely to be manipulated for Twitter publication (opinion, politics, health/medical) were also the least credible. Conversely, the typologies of news least likely to be manipulated for Twitter publication (international, consumer, disaster) were rated the most credible.

    Two-Level Text Classification Using Hybrid Machine Learning Techniques

    Nowadays, documents are increasingly associated with multi-level category hierarchies rather than a flat category scheme. To access these documents in real time, we need fast automatic methods to navigate these hierarchies. Today’s vast data repositories, such as the web, also contain many broad domains of data that are quite distinct from each other, e.g. medicine, education, sports, and politics. Each domain constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within these domains is frequently further divided into many subcategories. Subspace learning is a technique popular in non-text domains such as image recognition for increasing speed and accuracy. Subspace analysis lends itself naturally to the idea of hybrid classifiers: each subspace can be processed by a classifier best suited to the characteristics of that particular subspace, and instead of using the complete set of full-space feature dimensions, classifier performance can be boosted by using only a subset of the dimensions. This thesis presents a novel hybrid parallel architecture that uses separate classifiers trained on separate subspaces to improve two-level text classification. The classifier to be used on a particular input, and the relevant feature subset to be extracted, are determined dynamically by a novel method based on the maximum significance value. A novel vector representation that enhances the distinction between classes within the subspace is also developed. This novel system, the Hybrid Parallel Classifier, was compared against baselines of several single classifiers, such as the Multilayer Perceptron, and was found to be faster and to achieve higher two-level classification accuracies. The improvement in performance was even greater when dealing with more complex category hierarchies.
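    A rough sketch of the two-level routing idea follows; chi-squared feature selection and logistic regression stand in for the thesis's maximum-significance-based selection and hybrid classifiers, and the toy corpus is purely illustrative.

```python
# A minimal sketch of the two-level idea, not the thesis architecture: a top-level
# model routes a document to a broad domain, then a per-domain classifier trained
# on its own feature subspace predicts the subcategory.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus: two broad domains, each with two subcategories.
docs = [
    "the striker scored a late goal",       "the goalkeeper saved a penalty kick",
    "the point guard hit a three pointer",  "the centre grabbed twelve rebounds",
    "the kernel patch fixes a memory leak", "the compiler update speeds up builds",
    "the new chip doubles battery life",    "the laptop ships with a faster gpu",
]
domains = ["sports"] * 4 + ["tech"] * 4
subcats = ["football", "football", "basketball", "basketball",
           "software", "software", "hardware", "hardware"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Level 1: route each document to a broad domain.
router = LogisticRegression(max_iter=1000).fit(X, domains)

# Level 2: one classifier per domain, each trained on a reduced feature subspace.
subspace_models = {}
for d in set(domains):
    rows = [i for i, lab in enumerate(domains) if lab == d]
    y_d = [subcats[i] for i in rows]
    selector = SelectKBest(chi2, k=min(20, X.shape[1])).fit(X[rows], y_d)
    clf = LogisticRegression(max_iter=1000).fit(selector.transform(X[rows]), y_d)
    subspace_models[d] = (selector, clf)

def predict(text):
    x = vectorizer.transform([text])
    domain = router.predict(x)[0]
    selector, clf = subspace_models[domain]
    return domain, clf.predict(selector.transform(x))[0]

print(predict("the defender scored from a free kick"))
```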

    Analysis of S&P500 using News Headlines Applying Machine Learning Algorithms

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Financial risk is now part of everyone’s life, directly or indirectly affecting people’s daily lives and shaping their decisions and the consequences of those decisions. The financial system comprises all the companies that produce and sell, making them an essential factor. This study addresses the impact that people, through the news headlines they write, can have on companies’ stock prices. The S&P 500, the index studied in this research, compiles the 500 largest companies in the USA, and the study examines how the index can be affected by news articles written by humans at distinct and powerful newspapers. Many people worldwide “play the game” of investing in stocks, winning or losing a great deal of money. This study also tries to understand how strongly this news and the index can be correlated. With the increasing amount of data available, computational power is needed to process it all, and this is where machine learning methods can play a crucial role. For this, it is necessary to understand how these methods can be applied and how they influence the final decision of the human who always asks the same question: can stock prices be predicted? To answer that, it is first necessary to understand the correlation between news articles, one of the elements able to affect stock prices, and the stock prices themselves. This study will focus on the correlation between news and the S&P 500.
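    A minimal sketch of the correlation step the abstract points to is shown below; the input files, column names, and the VADER sentiment scorer are assumptions for illustration, not the dissertation's actual pipeline.

```python
# A minimal sketch: aggregate headline sentiment per day and correlate it with
# same-day S&P 500 returns. File names, column names, and the VADER scorer are
# assumptions made for illustration.
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

headlines = pd.read_csv("headlines.csv", parse_dates=["date"])  # columns: date, headline
prices = pd.read_csv("sp500.csv", parse_dates=["date"])         # columns: date, close

sia = SentimentIntensityAnalyzer()
headlines["sentiment"] = headlines["headline"].map(
    lambda h: sia.polarity_scores(h)["compound"])

daily_sentiment = headlines.groupby("date")["sentiment"].mean()
daily_returns = prices.set_index("date")["close"].pct_change()

# Align the two daily series and report the Pearson correlation matrix.
aligned = pd.concat([daily_sentiment, daily_returns], axis=1, join="inner").dropna()
print(aligned.corr(method="pearson"))
```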

    Towards Misleading Connection Mining

    This study introduces a new Natural Language Generation (NLG) task – Unit Claim Identification. The task aims to extract every piece of verifiable information from a headline. Unit claim identification has applications in other domains, such as fact-checking, where identifying each piece of verifiable information in a check-worthy statement can lead to an effective fact-check. Moreover, extracting the unit claims from headlines can help identify a misleading news article by mapping evidence from its contents. To address the unit claim identification problem, we outlined a set of guidelines for data annotation, arranged in-house training for the annotators, and obtained a small dataset. We explored two potential approaches, 1) a rule-based approach and 2) a deep learning-based approach, and compared their performance. Although the performance of the deep learning-based approach was not very effective due to the small number of training instances, the rule-based approach showed a promising result in terms of precision (65.85%).
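    The sketch below gives one possible rule-based pass in the spirit of the task, keeping finite clauses of a headline that contain a named entity or a number as candidate unit claims; these rules are illustrative and are not the paper's annotation guidelines.

```python
# An illustrative rule-based sketch, not the paper's actual rules: keep clauses of a
# headline that have an explicit subject and carry something checkable (a named
# entity or a number) as candidate unit claims.
import spacy

nlp = spacy.load("en_core_web_sm")

def unit_claims(headline):
    doc = nlp(headline)
    claims = []
    for sent in doc.sents:
        for token in sent:
            # Treat a verb with an explicit subject as the head of a candidate clause.
            if token.pos_ == "VERB" and any(c.dep_ in ("nsubj", "nsubjpass")
                                            for c in token.children):
                span = doc[token.left_edge.i: token.right_edge.i + 1]
                # Keep only clauses that contain a named entity or a number.
                if any(t.ent_type_ or t.like_num for t in span):
                    claims.append(span.text)
    return claims

print(unit_claims("City cuts bus fares by 20% as ridership falls to a 10-year low"))
```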

    Prediction of Stock Market Volatility Utilizing Sentiment from News and Social Media Texts: A study on the practical implementation of sentiment analysis and deep learning models for predicting day-ahead volatility

    This thesis studies the impact of sentiment on the prediction of volatility for 100 of the largest stocks in the S&P500 index. The purpose is to find out whether sentiment can improve the forecast of day-ahead volatility, where volatility is measured as the realized volatility of intraday returns. The textual data has been gathered from three different sources: Eikon, Twitter, and Reddit. The data consists of 397 564 headlines from Eikon, 35 811 098 tweets, and 4 109 008 comments from Reddit; these numbers represent the uncleaned data before filtration. The data has been collected for the period between 01.08.2021 and 31.08.2022. Sentiment is calculated with the FinBERT model, an NLP model created by further pre-training the BERT model on financial text. To predict volatility with the sentiment from FinBERT, three different deep learning models have been applied: a feed-forward neural network, a recurrent neural network, and a long short-term memory model. They are used to solve both regression and classification problems. The inference analysis shows significant effects from the computed sentiment variables, and it implies that there is a correlation between the number of text items and volatility. This is in line with previous literature on sentiment and volatility. The results from the deep learning models show that sentiment has an impact on the prediction of volatility, both in terms of lower MSE and MAE for the regression problem and higher accuracy for the classification problem. Moreover, this thesis looks at potential weaknesses that could influence the validity of the results, including how sentiment is represented, noise in the data, and the fact that the FinBERT model is not trained on finance-oriented text from social media.
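    A compact sketch of the pipeline described above follows; the FinBERT checkpoint, file and column names, window length, and network size are assumptions rather than the thesis's configuration.

```python
# A minimal sketch, not the thesis implementation: FinBERT scores each headline,
# scores are aggregated per day, and a small LSTM maps lagged sentiment plus lagged
# realized volatility to day-ahead realized volatility.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")  # assumed checkpoint

news = pd.read_csv("headlines.csv", parse_dates=["date"])    # columns: date, headline
vol = pd.read_csv("realized_vol.csv", parse_dates=["date"])  # columns: date, rv

def signed_score(text):
    out = finbert(text[:512])[0]                              # {"label": ..., "score": ...}
    sign = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}[out["label"]]
    return sign * out["score"]

news["sentiment"] = news["headline"].map(signed_score)
daily = (news.groupby("date")["sentiment"].mean().to_frame()
         .join(vol.set_index("date"), how="inner").dropna())

# Build 5-day windows of (sentiment, realized volatility) to predict the next day's rv.
window = 5
feats = daily[["sentiment", "rv"]].to_numpy(dtype="float32")
X = np.stack([feats[i:i + window] for i in range(len(feats) - window)])
y = feats[window:, 1]

class VolLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 1)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)

model, loss_fn = VolLSTM(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
X_t, y_t = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(100):                                          # short illustrative training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()
```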