4,398 research outputs found

    A Hotspot Discovery Method Based on Improved FIHC Clustering Algorithm

    Microblog hotspots are difficult to detect because microblog posts are short, fast-moving, and quickly changing. To address this problem, a microblog hotspot detection method based on MFIHC and TOPSIS is proposed. First, HowNet similarity is incorporated into the score function of FIHC, so that the semantic links between frequent words are taken into account and the initial clusters based on frequent words are produced more accurately. Next, repeated microblog texts in the initial clusters are reduced, and Single-Pass clustering is applied to the reduced topic clusters to obtain the hotspots. Finally, an improved TOPSIS model is used to rank the hot topics. Compared with other text clustering algorithms and hotspot detection methods, the method performs well and reflects current hot topics more comprehensively.
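
The ranking step can be illustrated with standard TOPSIS. The abstract does not describe the improved variant, so the sketch below is plain TOPSIS, and the criteria (reposts, comments, distinct users), the weights, and the function name `topsis` are all hypothetical.

```python
import math

def topsis(matrix, weights, benefit):
    """Standard TOPSIS: return one closeness score in [0, 1] per row.

    matrix  : one row per topic, one column per criterion
    weights : importance of each criterion (assumed values)
    benefit : True if larger is better for that criterion
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
         for row in matrix]
    # Ideal best and ideal worst value for every criterion.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the anti-ideal solution
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores

# Hypothetical topics described by (reposts, comments, distinct users).
topics = [[120, 30, 5], [40, 10, 2], [300, 80, 20]]
scores = topsis(topics, weights=[0.5, 0.3, 0.2], benefit=[True, True, True])
ranking = sorted(range(len(topics)), key=lambda i: -scores[i])
```

A topic that is best on every criterion gets a score of 1 and one that is worst on every criterion gets 0; the rest fall in between according to their weighted distances.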

    Design and Implementation of Network Public Opinion Analysis System

    Network public opinion analysis is an important form of information analysis and processing. Based on a study of the related technologies, this paper designs and implements a new network public opinion analysis system. The system consists of four parts: network data fetching, processing of the fetched data, analysis of the processed data, and display of the public opinion analysis results. The document extraction part uses the Larbin web crawler to collect web content. In the public opinion analysis part, new topics are detected with an improved Single-Pass clustering algorithm. This algorithm uses multiple centers and performs a two-way comparison using the title and body vectors, which better reflects the dynamics of public opinion topics. Finally, the system was tested repeatedly in the network environment of a university. The results show that the new public opinion analysis system runs stably and efficiently. The work has some value for the development of other Internet information analysis systems.
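
For orientation, basic Single-Pass clustering can be sketched as follows. This is the plain single-center form, not the multi-center, title-and-body variant the abstract describes; the bag-of-words vectors, the `threshold` value, and the sample posts are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold=0.3):
    """Assign each document, in arrival order, to the most similar cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid kept as summed term counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]

posts = [
    "campus exam schedule released",
    "exam schedule changed on campus",
    "cafeteria food prices rise",
    "food prices in cafeteria rise again",
]
groups = single_pass(posts)
```

Because every document is seen only once, the result depends on arrival order and on the threshold.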

    From past to present: spam detection and identifying opinion leaders in social networks

    On microblogging sites, which are gaining more and more users every day, a wide range of ideas are quickly emerging, spreading, and creating interactive environments. In some cases, in Turkey as well as in the rest of the world, it was noticed that events were published on microblogging sites before appearing in visual, audio and printed news sources. Thanks to the rapid flow of information in social networks, it can reach millions of people in seconds. In this context, social media can be seen as one of the most important sources of information affecting public opinion. Since the information in social networks became accessible, research started to be conducted using the information on the social networks. While the studies about spam detection and identification of opinion leaders gained popularity, surveys about these topics began to be published. This study also shows the importance of spam detection and identification of opinion leaders in social networks. It is seen that the data collected from social platforms, especially in recent years, has sourced many state-of-art applications. There are independent surveys that focus on filtering the spam content and detecting influencers on social networks. This survey analyzes both spam detection studies and opinion leader identification and categorizes these studies by their methodologies. As far as we know there is no survey that contains approaches for both spam detection and opinion leader identification in social networks. This survey contains an overview of the past and recent advances in both spam detection and opinion leader identification studies in social networks. 
Furthermore, readers of this survey have the opportunity of understanding general aspects of different studies about spam detection and opinion leader identification while observing key points and comparisons of these studies. This work is supported in part by the Scientific and Technological Research Council of Turkey (TUBITAK) through grant number 118E315 and grant number 120E187. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TUBITAK.
Publisher's Version; Emerging Sources Citation Index (ESCI); Q4; WOS:00080858480001

    Hot Topic Discovery in Online Community using Topic Labels and Hot Features

    With huge volumes of information on the Internet, extracting the hot topics that users care about quickly and effectively has become a fundamental task in Internet information processing. Generally, hot topic detection includes two tasks: topic discovery and hotness evaluation. In this paper, we propose a hot topic detection method. For topic discovery, topics are identified by clustering based on extracted topic labels. For hotness evaluation, the proposed model considers both internal and external features and combines them. Experimental results on TianYa BBS demonstrate the efficiency of the proposed method: compared with topic discovery based on latent semantic indexing, the improved vector space model based on topic labels gets better results and the identified topics are more accurate. Moreover, the proposed hotness features reflect the popularity of a topic and therefore yield better hot topic results.

    Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

    The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge to organizations trying to extract valuable information from it. In this thesis, we address this problem according to the requirements of DNB by building an automatic information extraction system that retrieves only the key information the company is interested in from the PDF documents. This is achieved by comparing the performance of named entity recognition (NER) models for automatic text extraction, built using a Bi-directional Long Short Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF) in combination with three word vectorisation techniques. The techniques compared in this thesis are randomly generated word embeddings from the Keras embedding layer, pre-trained static word embeddings (100-dimensional GloVe embeddings) and, finally, deep contextual ELMo word embeddings. Comparing these models helps us identify the advantages and disadvantages of the different word embeddings by analysing their effect on NER performance. This study was performed on a data set provided by DNB. The comparative study showed that the NER system built using Bi-LSTM-CRF with GloVe embeddings gave the best results, with a micro F1 score of 0.868 and a macro F1 score of 0.872 on unseen data, compared with Bi-LSTM-CRF based NER using the Keras embedding layer and ELMo embeddings, which gave micro F1 scores of 0.858 and 0.796 and macro F1 scores of 0.848 and 0.776, respectively. This result is contrary to our assumption that NER using deep contextualised word embeddings performs better than NER using other word embeddings. We hypothesised that this contradictory performance is due to the high dimensionality, and we analysed it by using a lower-dimensional word embedding. 
It was found that using 50-dimensional GloVe embeddings instead of 100-dimensional GloVe embeddings improved the overall micro and macro F1 scores from 0.87 to 0.88. Additionally, optimising the best model, the Bi-LSTM-CRF using 100-dimensional GloVe embeddings, by tuning over a small hyperparameter search space did not improve on the micro F1 score of 0.87 and macro F1 score of 0.87.
M30-DV; Master's Thesis; M-D
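
The abstract reports both micro and macro F1, and the two can diverge when classes are imbalanced. The sketch below computes both from token-level labels. The labels are made up, and the thesis's actual evaluation setup (for example entity-level scoring, or excluding a non-entity label) is not stated in the abstract, so this is an assumption-laden illustration only.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(*c) for c in counts) / len(counts) if counts else 0.0
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(*(sum(col) for col in zip(*counts))) if counts else 0.0
    return micro, macro

# Hypothetical labels: the rare PER label is always missed.
y_true = ["ORG", "ORG", "ORG", "PER"]
y_pred = ["ORG", "ORG", "ORG", "ORG"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this toy case micro F1 is 0.75 while macro F1 is about 0.43, because the missed rare label pulls the macro average down.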

    How to Create an Innovation Accelerator

    Too many policy failures are fundamentally failures of knowledge. This has become particularly apparent during the recent financial and economic crisis, which is questioning the validity of mainstream scholarly paradigms. We propose to pursue a multi-disciplinary approach and to establish new institutional settings which remove or reduce obstacles impeding efficient knowledge creation. We provide suggestions on (i) how to modernize and improve the academic publication system, and (ii) how to support scientific coordination, communication, and co-creation in large-scale multi-disciplinary projects. Both constitute important elements of what we envision to be a novel ICT infrastructure called "Innovation Accelerator" or "Knowledge Accelerator". Comment: 32 pages, Visioneer White Paper, see http://www.visioneer.ethz.c

    Sub-story detection in Twitter with hierarchical Dirichlet processes

    Social media has now become the de facto information source on real world events. The challenge, however, due to the high volume and velocity nature of social media streams, is in how to follow all posts pertaining to a given event over time – a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories and the corresponding task of their automatic detection – as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall the sub-stories with high precision. This has resulted in an improvement of up to 60% in the F-score performance of HDP based sub-story detection approach compared to standard story detection approaches. A similar performance improvement is also seen using an information theoretic evaluation measure proposed for the sub-story detection task. Another contribution of this paper is in demonstrating that considering the conversational structures within the Twitter stream can bring up to 200% improvement in sub-story detection performance

    Deep Learning Methods for Register Classification

    The data used in this project were collected by Biber and Egbert (2018) from various language articles on the internet. I use BERT (Bidirectional Encoder Representations from Transformers), a deep neural network, and FastText, a shallow neural network, as a baseline for text classification. I also use deep learning models such as XLNet to see whether classification accuracy improves. Biber and Egbert (2018) describe what a register is; we can think of register as genre. According to Biber (1988), registers are varieties defined in terms of general situational parameters. Hence, there is a close relation between language and the context of the situation in which it is used. This work attempts register classification using deep learning methods that use attention mechanisms. The work involved working with the models, dealing with the imbalanced datasets of real-life problems, and tuning the hyperparameters for training the models. Appropriate evaluation metrics for different kinds of data (balanced vs. imbalanced) were also determined, and overfitting was addressed. The background study shows how cumbersome classical machine learning approaches used to be; deep learning, on the other hand, can accomplish the task with ease.