
    A weakly supervised Bayesian model for violence detection in social media

    Social streams have proven to be the most up-to-date and inclusive source of information on current events. In this paper we propose a novel probabilistic modelling framework, called the violence detection model (VDM), which enables the identification of text containing violent content and the extraction of violence-related topics from social media data. VDM does not require any labeled corpora for training; instead, it only needs word prior knowledge that captures whether a word indicates violence or not. We propose a novel approach to deriving word prior knowledge using the relative entropy of words, based on the intuition that low-entropy words are indicative of semantically coherent topics and therefore more informative, while high-entropy words are those whose usage is more topically diverse and therefore less informative. VDM has been evaluated on the TREC Microblog 2011 dataset to identify topics related to violence. Experimental results show that deriving word priors using our proposed relative entropy method is more effective than the widely used information gain method. Moreover, VDM gives better violence classification results and produces more coherent violence-related topics than several competitive baselines.
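The abstract above does not give the exact formula used to derive word priors, so the following is only a minimal sketch of one plausible reading: each word is scored by its contribution to the relative entropy (KL divergence) between a violence-related word distribution and a background one. The function name, the add-one smoothing, and the toy documents are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def relative_entropy_scores(target_docs, background_docs, smoothing=1.0):
    """Score each word by p(w) * log(p(w) / q(w)).

    p is the smoothed word distribution in target_docs and q the one in
    background_docs. Summing the scores over the vocabulary gives
    KL(p || q). This is an illustrative assumption, not the paper's formula.
    """
    target = Counter(w for doc in target_docs for w in doc.split())
    background = Counter(w for doc in background_docs for w in doc.split())
    vocab = set(target) | set(background)

    # Additive smoothing keeps every probability strictly positive.
    target_total = sum(target.values()) + smoothing * len(vocab)
    background_total = sum(background.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p = (target[word] + smoothing) / target_total
        q = (background[word] + smoothing) / background_total
        scores[word] = p * math.log(p / q)
    return scores


# Hypothetical toy corpora, for illustration only.
violent = ["riot attack police", "attack bomb"]
neutral = ["coffee morning police", "morning news"]
scores = relative_entropy_scores(violent, neutral)
```

Words that are over-represented in the target set receive positive scores, words that are over-represented in the background receive negative scores, and words that are equally likely in both score zero. A threshold on these scores could then be used to pick words for a prior lexicon.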

    A Hotspot Discovery Method Based on Improved FIHC Clustering Algorithm

    Microblog hotspots are difficult to find because microblog posts are short, fast-moving, and constantly changing. To address this problem, a microblog hotspot detection method based on MFIHC and TOPSIS is proposed. First, HowNet similarity is incorporated into the score function of FIHC so that the semantic links between frequent words are considered and the initial clusters based on frequent words are produced more accurately. Then the text repetition in the initial microblog clusters is reduced, and Single-Pass clustering is applied to the reduced topic clusters to obtain the hotspots. Finally, an improved TOPSIS model is used to rank the hot topics. Compared with other text clustering algorithms and hotspot detection methods, the method performs well and reflects current hot topics more comprehensively.
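The ranking step can be illustrated with the standard TOPSIS procedure. The abstract refers to an improved variant whose details are not given, so this sketch shows only the textbook version; the criteria (reposts, comments, likes), the weights, and the numbers are hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Rank alternatives with standard TOPSIS.

    matrix:  one row per alternative, one column per criterion
    weights: importance of each criterion
    benefit: True where larger is better, False where smaller is better
    Returns a closeness score in [0, 1] per alternative (higher is better).
    """
    n_crit = len(weights)

    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]

    # Ideal and anti-ideal points, column by column.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)   # distance to the ideal point
        d_neg = math.dist(row, anti)    # distance to the anti-ideal point
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.0)
    return scores


# Hypothetical topics described by (reposts, comments, likes).
topics = [[500, 120, 30], [50, 10, 2], [300, 200, 10]]
scores = topsis(topics, weights=[0.5, 0.3, 0.2], benefit=[True, True, True])
ranking = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
```

A topic that is worst on every criterion coincides with the anti-ideal point and therefore scores zero, which is why the second topic above ends up last in `ranking`.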

    News Text Classification Based on an Improved Convolutional Neural Network

    With the explosive growth in Internet news media and the disorganized status of news texts, this paper puts forward an automatic classification model for news based on a Convolutional Neural Network (CNN). In the model, Word2vec is first merged with Latent Dirichlet Allocation (LDA) to generate an effective text feature representation. An attention mechanism is then combined with the proposed model so that higher attention probability values are given to key features, yielding a more accurate judgment. The results show that the precision, recall, and F1 value of the model reach 96.4%, 95.9%, and 96.2% respectively, which indicates that the improved CNN, through its framework, can extract deep semantic features of the text and provide strong support for establishing an efficient and accurate news text classification model.

    Profiling Users and Knowledge Graphs on the Web

    Profiling refers to the process of collecting useful information or patterns about something. Due to the growth of the web, profiling methods play an important role in different applications such as recommender systems. In this thesis, we first demonstrate how knowledge graphs (KGs) enhance profiling methods. KGs are databases of entities and their relations. Since KGs have been developed with the objective of information discovery, we assume that they can assist profiling methods. To this end, we develop a novel profiling method using KGs called Hierarchical Concept Frequency-Inverse Document Frequency (HCF-IDF), which combines the strength of a traditional term weighting method with the semantics in a KG. HCF-IDF represents documents as a set of entities and their weights. We apply HCF-IDF to two applications that recommend researchers and scientific publications. Both applications show that HCF-IDF captures the topics of documents. As a key result, the method can make competitive recommendations based only on the titles of scientific publications, because it reveals relevant entities using the structure of KGs. While KGs assist profiling methods, we also present how profiling methods can improve KGs. We show two methods that enhance the integrity of KGs. The first method is a crawling strategy that keeps local copies of KGs up-to-date. We profile the dynamics of KGs using a linear regression model. The experiment shows that our novel crawling strategy based on the linear regression model performs better than the state of the art. The second method is a change verification method for KGs. The method classifies each incoming change as correct or incorrect to reduce the burden on administrators who check the validity of a change. We profile how topological features influence the dynamics of a KG. The experiment demonstrates that the novel method using the topological features can improve change verification. Therefore, profiling the dynamics contributes to the integrity of KGs.

    Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach

    The ever-growing number of people using Twitter makes it a valuable source of timely information. However, detecting events in Twitter is a difficult task, because tweets that report interesting events are overwhelmed by a large volume of tweets on unrelated topics. Existing methods focus on the textual content of tweets and ignore the social aspect of Twitter. In this paper we propose MABED (i.e. mention-anomaly-based event detection), a novel statistical method that relies solely on tweets and leverages the creation frequency of dynamic links (i.e. mentions) that users insert in tweets to detect significant events and estimate the magnitude of their impact over the crowd. MABED also differs from the literature in that it dynamically estimates the period of time during which each event is discussed, rather than assuming a predefined fixed duration for all events. The experiments we conducted on both English and French Twitter data show that the mention-anomaly-based approach leads to more accurate event detection and improved robustness in the presence of noisy Twitter content. Qualitatively speaking, we find that MABED helps with the interpretation of detected events by providing clear textual descriptions and precise temporal descriptions. We also show how MABED can help in understanding users' interests. Furthermore, we describe three visualizations designed to favor an efficient exploration of the detected events.
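The idea of a mention anomaly can be sketched as the gap between the number of mention-bearing tweets that contain a word in a time slice and the number expected if such tweets were spread proportionally across slices. The abstract does not spell out the statistic, so the function below is a simplified assumption for illustration and not the MABED implementation; all names and counts are hypothetical.

```python
def mention_anomaly(mention_counts, slice_totals):
    """Return observed-minus-expected mention counts per time slice.

    mention_counts[i]: tweets in slice i that contain the word and a mention
    slice_totals[i]:   all tweets in slice i
    The expected count assumes mention-bearing tweets with the word occur
    at a constant rate across the whole corpus (a simplifying assumption).
    """
    total_tweets = sum(slice_totals)
    if total_tweets == 0:
        return [0.0 for _ in mention_counts]
    rate = sum(mention_counts) / total_tweets
    return [observed - n * rate
            for observed, n in zip(mention_counts, slice_totals)]


# Hypothetical counts for one word over four equally sized time slices.
anomalies = mention_anomaly([2, 3, 20, 4], [100, 100, 100, 100])
peak_slice = max(range(len(anomalies)), key=lambda i: anomalies[i])
```

Slices with a large positive value are candidates for the start of an event. A contiguous run of slices that maximises the summed anomaly would give a data-driven estimate of the event's duration, in the spirit of the dynamic time-period estimation described above.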

    Enhanced Heartbeat Graph for emerging event detection on Twitter using time series networks

    © 2019 Elsevier Ltd. With the increasing popularity of social media, Twitter has become one of the leading platforms for reporting events in real time. Detecting events from the Twitter stream requires complex techniques. Event-related trending topics consist of a group of words that together detect and identify events. Event detection techniques must be scalable and robust so that they can deal with the huge volume and noise associated with social media. Existing event detection methods mostly rely on burstiness, mainly the frequency of words and their co-occurrences. However, burstiness sometimes dominates other relevant details in the data that could be equally significant. Besides, the topological and temporal relationships in the data are often ignored. In this work, we propose a novel graph-based approach, called the Enhanced Heartbeat Graph (EHG), which detects events efficiently. EHG suppresses dominating topics in the subsequent data stream after their first detection. Experimental results on three real-world datasets (i.e., Football Association Challenge Cup Final, Super Tuesday, and the US Election 2012) show superior performance of the proposed approach in comparison to state-of-the-art techniques.