
    NEED4Tweet: a Twitterbot for tweets named entity extraction and disambiguation

    In this demo paper, we present NEED4Tweet, a Twitterbot for named entity extraction (NEE) and disambiguation (NED) in tweets. Applying state-of-the-art extraction and disambiguation approaches directly to the informal text typical of tweets results in significantly degraded performance, owing to the lack of formal structure, the lack of sufficient context, and the rarity of many of the entities involved. In this paper, we introduce a framework that copes with these challenges. We rely on contextual and semantic features rather than on syntactic features, which are less informative, and we exploit the idea that disambiguation can help improve the extraction process, mimicking the way humans understand language.
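
    The following is a minimal Python sketch (not the authors' implementation) of the idea that disambiguation feedback can filter extraction candidates: every n-gram that matches a knowledge-base surface form becomes a candidate mention, and a candidate is kept only when one of its senses overlaps sufficiently with the rest of the tweet. The toy knowledge base and the overlap threshold are illustrative assumptions.

```python
# Sketch: extraction candidates are kept only if they can be disambiguated
# against the tweet's context (toy KB and threshold are assumptions).

TOY_KB = {  # surface form -> {entity: context words}, illustrative only
    "paris": {
        "Paris_(France)": {"france", "eiffel", "city", "seine"},
        "Paris_Hilton": {"celebrity", "hotel", "show"},
    },
    "texas": {"Texas": {"state", "austin", "usa"}},
}

def ngrams(tokens, max_n=3):
    """Yield (start index, surface form) for all n-grams up to max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])

def extract_and_disambiguate(tweet, kb=TOY_KB, min_overlap=1):
    tokens = tweet.lower().split()
    context = set(tokens)
    results = []
    for _, surface in ngrams(tokens):
        senses = kb.get(surface)
        if not senses:
            continue
        # score each candidate sense by contextual overlap with the tweet
        entity, ctx_words = max(senses.items(), key=lambda kv: len(kv[1] & context))
        if len(ctx_words & context) >= min_overlap:
            results.append((surface, entity))
    return results

if __name__ == "__main__":
    print(extract_and_disambiguate("Loving the Eiffel Tower, Paris is a beautiful city"))
```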

    A generic open world named entity disambiguation approach for tweets

    Social media is a rich source of information. To make use of this information, it is sometimes necessary to extract and disambiguate named entities. In this paper we focus on named entity disambiguation (NED) in Twitter messages. NED in tweets is challenging in two ways. First, the limited length of a tweet makes it hard to gather enough context, while many disambiguation techniques depend on it. Second, many named entities in tweets do not exist in a knowledge base (KB). In this paper we combine ideas from information retrieval (IR) and NED to propose solutions for both challenges. For the first problem, we exploit the gregarious nature of tweets to obtain the context needed for disambiguation. For the second problem, we look for an alternative home page when no Wikipedia page represents the entity. Given a mention, we obtain a list of Wikipedia candidates from the YAGO KB together with top-ranked pages from the Google search engine, and use a Support Vector Machine (SVM) to rank the candidate pages and find the best representative entity. Experiments conducted on two datasets show better disambiguation results than the baselines and a competitor.
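
    Below is a small Python sketch of SVM-based candidate ranking in the spirit described above: each (mention, candidate page) pair is turned into a feature vector, a binary SVM is trained on correct/incorrect pairs, and at query time candidates are ordered by the SVM decision score. The three features and the toy data are illustrative assumptions, not the paper's actual feature set.

```python
# Sketch: rank entity candidates (Wikipedia pages or alternative home pages)
# by the decision score of a linear SVM trained on labelled pairs.

import numpy as np
from sklearn.svm import SVC

# Hypothetical features per (mention, candidate):
# [string similarity, candidate popularity, context overlap]; label 1 = correct link.
X_train = np.array([
    [0.9, 0.8, 0.7], [0.4, 0.9, 0.1], [0.8, 0.2, 0.6],
    [0.3, 0.1, 0.2], [0.7, 0.6, 0.9], [0.2, 0.5, 0.1],
])
y_train = np.array([1, 0, 1, 0, 1, 0])

ranker = SVC(kernel="linear")
ranker.fit(X_train, y_train)

def rank_candidates(candidates):
    """candidates: list of (name, feature_vector); returns best-first order."""
    feats = np.array([f for _, f in candidates])
    scores = ranker.decision_function(feats)
    order = np.argsort(-scores)
    return [(candidates[i][0], float(scores[i])) for i in order]

if __name__ == "__main__":
    cands = [
        ("en.wikipedia.org/wiki/Apple_Inc.", [0.85, 0.9, 0.8]),
        ("en.wikipedia.org/wiki/Apple", [0.95, 0.7, 0.2]),
        ("apple.com", [0.6, 0.5, 0.4]),  # non-Wikipedia "home page" candidate
    ]
    for name, score in rank_candidates(cands):
        print(f"{score:+.3f}  {name}")
```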

    A Framework to compare text annotators and its applications

    Text in human languages has little logical structure and is inherently ambiguous. For this reason, the typical Information Retrieval approach to text documents has been based on the bag-of-words model, in which documents are analyzed only by the occurrence of terms, discarding any possible structure. A recently developing line of research, however, is devoted to adding structure to unstructured text by recognizing the topics contained in a text and annotating them. Topic annotators are systems whose purpose is to link a natural language document to the topics that are relevant for describing its content. These systems can be applied to many classic Information Retrieval problems: the categorization of a document can be based on its topics; the clustering of a set of documents can use their topics to find similarities; and a search engine could find relevant pages more easily if it knew the topics that a query expresses and searched for them in the cached web pages. In this thesis, we present a formal framework that describes the problems related to topic retrieval, the algorithms that solve those problems, and the way they can be benchmarked.
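
    As a rough illustration of how topic annotators can be benchmarked under a framework of this kind, the Python sketch below compares each system's predicted (span, topic) annotations against gold annotations and reports precision, recall and F1. The exact-match criterion and the toy annotations are assumptions; the thesis itself defines several stricter and weaker match relations.

```python
# Sketch: micro precision/recall/F1 of predicted topic annotations
# against a gold standard, using exact (start, end, topic) matching.

def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Annotations as (start, end, topic) triples for one document (toy data).
gold = {(0, 5, "Barack_Obama"), (10, 15, "White_House")}
system_a = {(0, 5, "Barack_Obama"), (20, 25, "Chicago")}

print("P=%.2f R=%.2f F1=%.2f" % prf(gold, system_a))
```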

    Sentiment analysis on Twitter for the Portuguese language

    Dissertation submitted for the Master's degree in Computer Engineering. With the growth and popularity of the internet, and more specifically of social networks, users can more easily share their thoughts, insights and experiences with others. Messages shared via social networks provide useful information for several applications, such as monitoring specific targets for sentiment or comparing public sentiment towards several targets, avoiding the traditional market-research method of using surveys to explicitly obtain public opinion. To extract information from the large volume of messages that are shared, it is best to use an automated program to process them. Sentiment analysis is an automated process for determining the sentiment expressed in natural language text. Sentiment is a broad term, but here we focus on the opinions and emotions expressed in text. Among existing social network websites, Twitter is currently considered the best suited for this kind of analysis. Twitter allows users to share their opinions on many topics and entities by means of short messages. These messages may be malformed and contain spelling errors, so some treatment of the text, such as spell checking, may be necessary before the analysis. To know what a message is about, it is necessary to find the entities in the text, such as people, locations, organizations and products, and then analyse the rest of the text to determine what is said about each specific entity. By analysing many messages, we can form a general idea of what the public thinks about many different entities. Our goal is to extract as much information as possible about different entities from tweets in the Portuguese language. We present the different techniques that may be used, together with examples and results from state-of-the-art related work. Using a semantic approach, we were able to find and extract named entities from these messages and assign a sentiment value to each entity found, producing a complete tool that is competitive with existing solutions (see the sketch after this abstract). The sentiment classification and its assignment to entities are based on the grammatical construction of the message. The results can be viewed by the user in real time or stored to be viewed later. This analysis provides ways to view and compare public sentiment regarding these entities, showing the favourite brands, companies and people, as well as the evolution of sentiment over time.
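
    The Python sketch below illustrates, in a much simplified form, the entity-level sentiment step described above: named entities are located in the (already spell-checked) message and each entity receives the polarity of nearby opinion words. Here the grammatical-construction analysis is approximated by a fixed word window with crude negation handling; the tiny Portuguese lexicon and entity list are illustrative assumptions.

```python
# Sketch: assign a sentiment score to each known entity mentioned in a
# Portuguese message, using a small polarity lexicon and a word window.

LEXICON = {"ótimo": 1, "excelente": 1, "bom": 1, "péssimo": -1, "mau": -1}
NEGATORS = {"não", "nunca"}
KNOWN_ENTITIES = {"samsung", "nokia"}   # illustrative entity gazetteer

def entity_sentiment(message, window=3):
    tokens = message.lower().split()
    scores = {}
    for i, tok in enumerate(tokens):
        if tok not in KNOWN_ENTITIES:
            continue
        neighbourhood = tokens[max(0, i - window): i + window + 1]
        score = sum(LEXICON.get(w, 0) for w in neighbourhood)
        if any(w in NEGATORS for w in neighbourhood):
            score = -score          # crude negation handling
        scores[tok] = score
    return scores

print(entity_sentiment("O ecrã da Samsung é excelente mas o da Nokia não é bom"))
```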

    AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

    Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, little NLP research is conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the related challenges in curating each dataset. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).
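
    A small sketch of loading one AfriSenti language with the Hugging Face `datasets` library is shown below, using the dataset id from the URL above. The language configuration name ("hau" for Hausa) and the column names are assumptions; consult the dataset card for the exact configurations.

```python
# Sketch: load one AfriSenti language split and inspect it.
# Requires: pip install datasets

from collections import Counter
from datasets import load_dataset

# Dataset id taken from the paper's URL; the config name "hau" is assumed.
afrisenti_hausa = load_dataset("shmuhammad/AfriSenti", "hau")

train = afrisenti_hausa["train"]
print(train)                    # splits, columns, sizes
print(Counter(train["label"]))  # class distribution (column name assumed)
print(train[0])                 # one annotated tweet
```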

    A Reverse Approach to Named Entity Extraction and Linking in Microposts

    In this paper, we present a pipeline for named entity extraction and linking designed specifically for noisy, grammatically inconsistent domains where traditional named entity techniques perform poorly. Our approach leverages a large knowledge base to improve entity recognition, while retaining traditional NER to identify mentions that are not co-referent with any entity in the knowledge base.
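
    The Python sketch below illustrates the "reverse" idea described above: surface forms are looked up in a large knowledge base first, and a conventional NER component is used only as a fallback for mentions the KB does not cover. The toy KB and the capitalisation heuristic standing in for traditional NER are assumptions.

```python
# Sketch: KB-first mention detection with a traditional-NER fallback for
# out-of-KB mentions (toy KB; capitalisation heuristic as NER stand-in).

KB_SURFACE_FORMS = {"world cup", "new york", "fifa"}  # illustrative only

def kb_matches(tokens, kb=KB_SURFACE_FORMS, max_n=3):
    found = []
    for n in range(max_n, 0, -1):          # prefer longer matches
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in kb:
                found.append((i, i + n, span))
    return found

def fallback_ner(tokens, covered):
    """Stand-in for a traditional NER model: capitalised, non-initial tokens
    not already covered by a KB match become candidate out-of-KB mentions."""
    return [(i, i + 1, tok) for i, tok in enumerate(tokens)
            if tok[:1].isupper() and i not in covered and i != 0]

def extract(text):
    tokens = text.split()
    kb_hits = kb_matches(tokens)
    covered = {i for s, e, _ in kb_hits for i in range(s, e)}
    return kb_hits + fallback_ner(tokens, covered)

print(extract("Watching the World Cup with Sarah in New York"))
```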

    Sentiment Analysis for micro-blogging platforms in Arabic

    Sentiment Analysis (SA) concerns the automatic extraction and classification of the sentiment conveyed in a given text, i.e. labelling a text instance as positive, negative or neutral. SA research has attracted increasing interest in the past few years due to its numerous real-world applications. The recent interest in SA is also fuelled by the growing popularity of social media platforms (e.g. Twitter), as they provide large amounts of freely available and highly subjective content that can be readily crawled. Most previous SA work has focused on English, with considerable success. In this work, we study SA in Arabic, a less-resourced language. The thesis reports a wide set of investigations of SA in Arabic tweets, systematically comparing three approaches that have been shown to be successful for English. Specifically, we report experiments evaluating fully-supervised (SL), distant-supervision-based (DS), and machine-translation-based (MT) approaches to SA. The investigations cover training SA models on manually labelled (SL) and automatically labelled (DS) datasets. In addition, we explore an MT-based approach that utilises existing off-the-shelf SA systems for English with no need for training data, assessing the impact of translation errors on the performance of SA models, which has not previously been addressed for Arabic tweets. Unlike previous work, we benchmark the trained models against an independent test set of more than 3.5k instances collected at different points in time, to account for topic-shift issues in the Twitter stream. Despite the challenging, noisy medium of Twitter and the mixed use of dialectal and standard forms of Arabic, we show that our SA systems attain performance on Arabic tweets comparable to state-of-the-art SA systems for English tweets. The thesis also investigates the role of a wide set of features, including syntactic, semantic, morphological, language-style and Twitter-specific features. We introduce a set of affective-cue/social-signal features that capture the presence of contextual cues (e.g. prayers, laughter, etc.) and correlate them with the sentiment conveyed in an instance. Our investigations reveal a generally positive impact of these features on SA in Arabic. Specifically, we show that a rich set of morphological features, not previously used and extracted with a publicly available morphological analyser for Arabic, can significantly improve the performance of SA classifiers. We also demonstrate the usefulness of language-independent (e.g. Twitter-specific) features for SA. Our feature sets outperform the results reported in previous work on a previously built dataset.
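
    As a compact illustration of the fully-supervised (SL) setting discussed above, the Python sketch below trains a linear classifier over word n-gram features; this is the point where Twitter-specific, morphological and affective-cue features would be appended as extra columns. The toy Arabic tweets and their labels are illustrative assumptions, not data from the thesis.

```python
# Sketch: a fully-supervised sentiment classifier for Arabic tweets
# using TF-IDF word uni/bi-grams and a linear SVM.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_tweets = [
    "الخدمة ممتازة والتطبيق سريع جدا",   # "excellent service, very fast app"
    "تجربة سيئة ولن أكررها أبدا",         # "bad experience, never again"
    "الفريق لعب بشكل رائع اليوم",         # "the team played wonderfully today"
    "أسوأ منتج اشتريته هذا العام",        # "worst product I bought this year"
]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # word uni/bi-grams
    LinearSVC(),
)
model.fit(train_tweets, train_labels)

print(model.predict(["منتج رائع وخدمة ممتازة"]))  # "great product and excellent service"
```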