
    OntoDSumm : Ontology based Tweet Summarization for Disaster Events

    The huge popularity of social media platforms like Twitter attracts a large fraction of users to share real-time information and short situational messages during disasters. Government organizations, agencies, and volunteers need a summary of these tweets for efficient and quick disaster response. However, the huge influx of tweets makes it difficult to manually get a precise overview of ongoing events. To handle this challenge, several tweet summarization approaches have been proposed. In most of the existing literature, tweet summarization is broken into a two-step process: the first step categorizes tweets, and the second step chooses representative tweets from each category. Both supervised and unsupervised approaches to the first step are found in the literature. Supervised approaches require a huge amount of labelled data, which is costly and time-consuming to obtain. Unsupervised approaches, on the other hand, cannot cluster tweets properly because of overlapping keywords, vocabulary size, a lack of understanding of semantic meaning, etc. For the second step, existing approaches apply different ranking methods that are very generic and fail to compute the proper importance of a tweet with respect to a disaster. Both problems can be handled far better with proper domain knowledge. In this paper, we exploit existing domain knowledge by means of an ontology in both steps and propose a novel disaster summarization method, OntoDSumm. We evaluate the proposed method against 4 state-of-the-art methods using 10 disaster datasets. Evaluation results reveal that OntoDSumm outperforms existing methods by approximately 2-66% in terms of ROUGE-1 F1 score.
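
    The ROUGE-1 F1 score named in the abstract above is the harmonic mean of unigram precision and recall between a candidate summary and a reference summary. The following is a minimal sketch, assuming lowercasing and whitespace tokenization; the function name `rouge1_f1` and the sample strings are hypothetical, and the paper's actual evaluation setup (stemming, stop-word handling, tooling) is not stated here and may differ.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a unigram counts only as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from any dataset.
    print(rouge1_f1("bridge collapsed near river", "bridge collapsed"))
```

    In this sketch a candidate that repeats a word gains no extra credit, because the overlap is clipped to the count in the reference.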

    Sentiment analysis and real-time microblog search

    This thesis sets out to examine the role played by sentiment in real-time microblog search. The recent prominence of the real-time web is proving both challenging and disruptive for a number of areas of research, notably information retrieval and web data mining. User-generated content on the real-time web is perhaps best epitomised by content on microblogging platforms, such as Twitter. Given the substantial quantity of microblog posts that may be relevant to a user query at a given point in time, automated methods are required to enable users to sift through this information. As an area of research reaching maturity, sentiment analysis offers a promising direction for modelling the text content in microblog streams. In this thesis we review the real-time web as a new area of focus for sentiment analysis, with a specific focus on microblogging. We propose a system and method for evaluating the effect of sentiment on perceived search quality in real-time microblog search scenarios. Initially we provide an evaluation of sentiment analysis using supervised learning for classifying the short, informal content in microblog posts. We then evaluate our sentiment-based filtering system for microblog search in a user study with simulated real-time scenarios. Lastly, we conduct real-time user studies for the live broadcast of the popular television programme, the X Factor, and for the Leaders Debate during the Irish General Election. We find that we are able to satisfactorily classify positive, negative and neutral sentiment in microblog posts. We also find that sentiment plays a significant role in many microblog search scenarios, and observe some detrimental effects of filtering out certain sentiment types. We make a series of observations regarding associations between document-level sentiment and user feedback, including associations with user profile attributes and users’ prior topic sentiment.

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate a large amount of data. As a result, the amount of available information on any given topic is far beyond what humans can properly process, causing what is known as information overload. To cope efficiently with large amounts of information and generate content with significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data. Comment: PhD thesis

    Event detection and user interest discovering in social media data streams

    Social media plays an increasingly important role in people’s lives. Microblogging is a form of social media which allows people to share and disseminate real-life events. Broadcasting events in microblogging networks can be an effective method of creating awareness, divulging important information and so on. However, many existing approaches to dissecting the information content primarily discuss the event detection model and ignore the user interest which can be discovered during event evolution. This makes it difficult to track the most important events as they evolve, including identifying the influential spreaders. A further complication is that the influential spreaders' interests also change during event evolution. The influential spreaders play a key role in event evolution, and this has been largely ignored in traditional event detection methods. To this end, we propose an event evolution model based on a user-interest model, named the HEE (Hot Event Evolution) model. This model not only considers the user interest distribution, but also uses the short text data in the social network to model the posts and recommendation methods to discover the user interests. This can resolve the problem of data sparsity that affects many existing event detection methods, and improve the accuracy of event detection. A hot event automatic filtering algorithm is first applied to remove the influence of general events, improving the quality and efficiency of event mining. Then an automatic topic clustering algorithm is applied to arrange the short texts into clusters with similar topics. An improved user-interest model is proposed to combine the short texts of each cluster into a long text document, simplifying the determination of the overall topic in relation to the interest distribution of each user during the evolution of important events. Finally, a novel cosine measure based event similarity detection method is used to assess correlation between events, thereby detecting the process of event evolution. The experimental results on a real Twitter dataset demonstrate the efficiency and accuracy of our proposed model for both event detection and user interest discovery during the evolution of hot events.
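
    The abstract above mentions a cosine measure for assessing similarity between events but does not give its details. As a rough illustration only, the sketch below computes plain cosine similarity between bag-of-words term-frequency vectors of two event texts; it is not the authors' "novel" variant. The function name `cosine_similarity`, the whitespace tokenization and the sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors.

    Assumption: each event is represented by the lowercased,
    whitespace-separated words of its (merged) short texts.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two vectors share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical event texts, not taken from the paper's dataset.
    print(cosine_similarity("flood warning issued", "flood warning lifted"))
```

    A tracking system built on such a score would typically link two events when the value exceeds some threshold; the abstract does not say what threshold, weighting or vector representation the HEE model uses.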

    Mining Social Media to Understand Consumers' Health Concerns and the Public's Opinion on Controversial Health Topics.

    Social media websites are increasingly used by the general public as a venue to express health concerns and discuss controversial medical and public health issues. This information could be utilized for public health surveillance as well as for soliciting public opinion. In this thesis, I developed methods to extract health-related information from multiple sources of social media data, and conducted studies to generate insights from the extracted information using text-mining techniques. To understand the availability and characteristics of health-related information in social media, I first identified the users who seek health information online and participate in online health communities, and analyzed their motivations and behavior through two case studies: user-created groups on MedHelp and a diabetes online community on Twitter. Through a review of tweets mentioning eye-related medical concepts identified by MetaMap, I diagnosed the common reasons why tweets are mislabeled by natural language processing tools tuned for biomedical texts, and trained a classifier to exclude tweets that are not medically relevant, increasing the precision of the extracted data. Furthermore, I conducted two studies to evaluate how effectively public opinion on controversial medical and public health issues can be understood from social media information using text-mining techniques. The first study applied topic modeling and text summarization to automatically distill users' key concerns about the purported link between autism and vaccines. The outputs of the two methods cover most of the public concerns about MMR vaccines reported in previous survey studies. In the second study, I estimated the public's view on the Affordable Care Act (ACA) by applying sentiment analysis to four years of Twitter data, and demonstrated that the rates of positive/negative responses measured by tweet sentiment are in general agreement with the results of the Kaiser Family Foundation Poll. Finally, I designed and implemented a system which can automatically collect and analyze online news comments to help researchers, public health workers, and policy makers better monitor and understand the public's opinion on issues such as controversial health-related topics. PhD, Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120714/1/owenliu_1.pd

    Approaches to implement and evaluate aggregated search

    Aggregated search or aggregated retrieval can be seen as a third paradigm for information retrieval, following the Boolean retrieval paradigm and the ranked retrieval paradigm. In the first two, we are returned sets and ranked lists of search results, respectively. 
It is up to the time-poor user to scroll through this set/list, scan within different documents and assemble the information he/she needs. Alternatively, aggregated search aims not only at the identification of relevant information nuggets, but also at the assembly of these nuggets into a coherent answer. In this work, we first present an analysis of work related to aggregated search, organized within a general framework composed of three steps: query dispatching, nugget retrieval and result aggregation. Existing work is grouped by related domain, such as relational search, federated search, question answering, natural language generation, etc. Among the possible research directions, we then focus on the two we believe to be the most promising: relational aggregated search and cross-vertical aggregated search. * Relational aggregated search targets relevant information, but also the relations between relevant information nuggets, which are used to assemble the final answer sensibly. In particular, three types of queries would easily benefit from this paradigm: attribute queries (e.g. president of France, GDP of Italy, mayor of Glasgow, ...), instance queries (e.g. France, Italy, Glasgow, Nokia e72, ...) and class queries (countries, French cities, Nokia mobile phones, ...). We call these relational queries, and we tackle three important problems concerning information retrieval and aggregation for these types of queries. First, we propose an attribute retrieval approach, after arguing that attribute retrieval is one of the crucial problems to be solved. Our approach relies on HTML tables on the Web. It can identify useful and relevant tables, which are used to extract relevant attributes for arbitrary queries. The experimental results show that our approach is effective: it can answer many queries with high coverage and it outperforms state-of-the-art techniques. Second, we deal with result aggregation, where we are given relevant instances and attributes for a given query. The problem is particularly interesting for class queries, where the final answer is a table with many instances and attributes. To guarantee the quality of the aggregated result, we propose the use of different weights on instances and attributes to promote the most representative and important ones. The third problem we deal with concerns instances of the same class (e.g. France, Germany, Italy, ... are all instances of the same class). Here, we propose an approach that can extract instances of the same class on a massive scale from HTML lists on the Web. All proposed approaches are applicable at Web scale and can play an important role in relational aggregated search. Finally, we propose 4 different prototype applications for relational aggregated search. They can answer different types of queries with relevant and relational information. More precisely, we retrieve not only attributes and their values, but also passages and images, which are assembled into a final focused answer. An example is the query "Nokia e72", which is answered with attributes (e.g. price, weight, battery life, ...), passages (e.g. description, reviews, ...) and images. Results are encouraging and illustrate the utility of relational aggregated search. * The second research direction that we pursued concerns cross-vertical aggregated search, which consists of assembling results from different vertical search engines (e.g. image search, video search, traditional Web search, ...) into one single interface. Here, different approaches exist in both research and industry. Our contribution mostly concerns the evaluation and the interest (advantages) of this paradigm. We propose 4 different studies which simulate different search situations. Each study is tested with 100 different queries and 9 vertical sources. Here, we could clearly identify new advantages of this paradigm as well as different issues with evaluation setups. In particular, we observe that traditional information retrieval evaluation is not the fastest but remains the most realistic. To conclude, we propose different studies with respect to two promising research directions. On the one hand, we deal with three important problems of relational aggregated search, followed by real prototype applications with encouraging results. On the other hand, we have investigated the interest and evaluation of cross-vertical aggregated search. Here, we could clearly identify some of the advantages and evaluation issues. In the long term, we foresee a possible combination of these two kinds of approaches to provide relational and cross-vertical information retrieval that incorporates more focus, structure and multimedia in search results.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.

    A Survey on Automated Fact-Checking

    Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.

    Finding Microblog Posts of User Interest

    Microblogging is an increasingly popular form of social media. One of the most popular microblogging services is Twitter. The number of messages posted to Twitter on a daily basis is extremely large. Accordingly, it becomes hard for users to sort through these messages and find ones that interest them. Twitter offers search mechanisms, but they are relatively simple and the results can accordingly be lacklustre. Through participation in the 2011 Text Retrieval Conference's Microblog Track, this thesis examines real-time ad hoc search using standard information retrieval approaches without microblog or Twitter specific modifications. It was found that pseudo-relevance feedback based upon a language model derived from Twitter posts, called tweets, in conjunction with standard ranking methods is able to perform competitively with advanced retrieval systems as well as microblog and Twitter specific retrieval systems. Furthermore, possible modifications, both Twitter specific and otherwise, that would potentially increase retrieval performance are discussed. Twitter has also spawned an interesting phenomenon called hashtags. Hashtags are used by Twitter users to denote that their message belongs to a particular topic or conversation. Unfortunately, tweets have a 140-character limit, and accordingly not all relevant hashtags can always be present in a tweet. Thus, Twitter users cannot easily find tweets that do not contain the hashtags they are interested in but should contain them. This problem is investigated in this thesis in three ways using learning methods. First, learning methods are used to determine whether it is possible to discriminate between two topically different sets of tweets. This thesis then investigates whether tweets that lack a particular hashtag but discuss the same topic as the hashtag can be separated from random tweets. This case mimics the real-world scenario of users having to sift through random tweets to find tweets that are related to a topic they are interested in. This investigation is performed by removing hashtags from tweets and attempting to distinguish those tweets from random tweets. Finally, this thesis investigates whether topically similar tweets can also be distinguished based upon a sub-topic. This was investigated in a manner almost identical to the second case. This thesis finds that topically distinct tweets can be distinguished and, more importantly, that standard learning methods are able to determine that a tweet with a hashtag removed should have that hashtag. In addition, this hashtag reconstruction can be performed well with very few examples of what a tweet with and without the particular hashtag should look like. This provides evidence that it may be possible to separate tweets a user may be interested in from random tweets using only the hashtags they are interested in. Furthermore, the success of the hashtag reconstruction also provides evidence that users do not misuse or abuse hashtags, since hashtag presence was taken to be the ground truth in all experiments. Finally, the applicability of the hashtag reconstruction results to the TREC Microblog Track and a mobile application is presented.