113 research outputs found

    Event detection from click-through data via query clustering

    The web is an index of real-world events, and a lot of knowledge can be mined from web resources and their derivatives. Event detection is a recent research topic in web data mining, driven by the increasing popularity of search engines. In the visitor-centric approach, the click-through data generated by web search engines is the starting resource, on the intuition that such data is often event-driven. In this thesis, a retrospective algorithm is proposed to detect such real-world events from click-through data. This approach differs from existing work in that it (i) treats the click-through data as collaborative query sessions rather than mere web logs and tries to understand user behavior, (ii) tries to integrate the semantics, structure, and content of queries and pages, and (iii) aims to achieve the overall objective via query clustering. The problem of event detection is transformed into query clustering by generating clusters called hybrid cover graphs; each hybrid cover graph corresponds to a real-world event. For quality purposes, an evolutionary pattern is imposed on the co-occurrences of query-page pairs in a hybrid cover graph over a moving window period. The approach is experimentally evaluated on a commercial search engine's data collected over 3 months, with about 20 million web queries and page clicks from 650,000 users. The results outperform the most recent work in this domain in terms of number of events detected, F-measure, entropy, recall, etc.
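
The reduction from event detection to query clustering can be illustrated with a much simpler stand-in than the thesis's hybrid cover graphs. The sketch below assumes a click log of (query, clicked page) pairs and groups queries that are connected through shared clicked pages, using a union-find structure. The function name, the log format, and the example URLs are all hypothetical, and the sketch ignores the semantics, content, and temporal evolution that the abstract describes.

```python
from collections import defaultdict


def cluster_queries(click_log):
    """Group queries that are linked through shared clicked pages.

    click_log: iterable of (query, clicked_page) pairs (assumed format).
    Returns a list of sets of queries, one set per connected component
    of the bipartite query-page graph.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    queries = set()
    for query, page in click_log:
        queries.add(query)
        # Tag nodes so a query string never collides with a page URL.
        union(("q", query), ("p", page))

    clusters = defaultdict(set)
    for query in queries:
        clusters[find(("q", query))].add(query)
    return list(clusters.values())
```

With a log in which two queries lead to the same page, those two queries fall into one cluster, which in this simplified picture would be a candidate event.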

    Event Detection in Twitter Using Multi Timing Chained Windows

    Twitter is a popular microblogging and social networking service. Twitter posts are generated continuously and are well suited to knowledge discovery using different data mining techniques. We present a novel near-real-time approach for processing tweets and detecting events. The proposed method, Multi Timing Chained Windows (MTCW), is independent of the language of the tweets. MTCW defines several Timing Windows and links them to each other like a chain: the output of each smaller window becomes the input of the next, larger one. Using MTCW, events can be detected within a few minutes. To evaluate this idea, the required dataset was collected using the Twitter API. The evaluation results show the accuracy and effectiveness of our approach compared with other state-of-the-art methods for event detection in Twitter.
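
The chaining idea can be sketched as follows. This is only an illustration of how the output of a smaller window might feed a larger one; it is not the MTCW algorithm itself. It assumes windows that close after a fixed number of inputs rather than after a wall-clock duration, and it aggregates simple term counts. The class and function names are hypothetical.

```python
from collections import Counter


class TimingWindow:
    """Accumulates term counts for `size` inputs, then emits the aggregate."""

    def __init__(self, size, downstream=None):
        self.size = size              # inputs needed before this window closes
        self.downstream = downstream  # next (larger) window in the chain, if any
        self.counts = Counter()
        self.received = 0
        self.emitted = []             # aggregates produced by this window

    def push(self, counts):
        self.counts.update(counts)
        self.received += 1
        if self.received == self.size:
            aggregate = self.counts
            self.emitted.append(aggregate)
            if self.downstream is not None:
                # The output of this window is the input of the larger one.
                self.downstream.push(aggregate)
            self.counts = Counter()
            self.received = 0


def build_chain(sizes):
    """Link windows so each one feeds the next; return them smallest first."""
    windows = []
    downstream = None
    for size in reversed(sizes):
        downstream = TimingWindow(size, downstream)
        windows.append(downstream)
    return list(reversed(windows))
```

For example, `build_chain([2, 3])` yields a first window that closes after every 2 tweets and a second window that closes after every 3 outputs of the first, that is, after 6 tweets. A detector would then inspect each emitted aggregate for unusual term frequencies.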

    EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction

    Social platforms have emerged as crucial venues for disseminating information and discussing real-life social events, which offers an excellent opportunity for researchers to design and implement novel event detection frameworks. However, most existing approaches merely exploit keyword burstiness or network structures to detect unspecified events. They therefore often fail to identify unspecified events, given the challenging nature of events and social data. Social data, e.g., tweets, is characterized by misspellings, incompleteness, word sense ambiguity, and irregular language, as well as variation in aspects of opinions. Moreover, extracting discriminative features and patterns for evolving events by exploiting the limited structural knowledge is almost infeasible. To address these challenges, in this thesis we propose a novel framework, EnrichEvent, that leverages the lexical and contextual representations of streaming social data. In particular, we leverage contextual knowledge, as well as lexical knowledge, to detect semantically related tweets and enhance the effectiveness of event detection approaches. Finally, our proposed framework produces cluster chains for each event to show how the event evolves over time. We conducted extensive experiments to evaluate our framework, validating its high performance and effectiveness in detecting and distinguishing unspecified social events.

    Analyzing feature trajectories for event detection


    Analyzing Granger causality in climate data with time series classification methods

    Attribution studies in climate science aim to scientifically ascertain the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.

    Event Detection and Tracking Detection of Dangerous Events on Social Media

    Online social media platforms have become essential tools for communication and information exchange in our lives. They are used for connecting with people and sharing information. This phenomenon has been intensively studied in the past decade to investigate users' sentiments for different scenarios and purposes. As the technology advanced and its popularity increased, different terms came to be used for similar topics, which often results in confusion. We study such trends and propose a uniform solution that deals with the subject clearly. We gather all these ambiguous terms under the umbrella of the most recent and popular terms to reach a concise verdict. Recent works have addressed many events but cover only specific types and domains of events. To keep things simple and practical, events that are extreme, negative, and dangerous are grouped under the name Dangerous Events (DE). These dangerous events are further divided into three main categories, action-based, scenario-based, and sentiment-based dangerous events, to specify their characteristics. We then propose deep-learning-based models to detect events that are dangerous in nature. The deep-learning models, which include BERT, RoBERTa, and XLNet, provide valuable results that can effectively help solve the issue of detecting dangerous events along various dimensions. Even though the models perform well, their performance is constrained by the small number of available event datasets and the lower quality of some event data, a constraint that can be tackled by addressing these data issues.
Online social media platforms have become essential tools for communication, connecting with others, and exchanging information in our lives. This phenomenon has been intensively studied over the past decade to investigate users' sentiments in different scenarios and for various purposes. However, the use of social media has become more complex and a broader phenomenon owing to the involvement of multiple stakeholders, such as companies, groups, and other organizations. As technology advanced and popularity grew, the use of different terms for similar topics created confusion. In other words, models are trained on information from specific terms and scopes. Standardization is therefore imperative. The aim of this work is to unite the different terms in use under broader, standardized terms. Danger can be a threat such as social violence, natural disasters, intellectual or community harm, contagion, social unrest, economic loss, or simply the spread of hateful and violent ideologies. We study these different events and classify them into topics so that a topic-based detection technique can be designed and integrated under the term Dangerous Events (DE). Consequently, we define the proposed term "Dangerous Events" and divide it into three main categories in order to specify its characteristics. These are named Dangerous Events, top-level Dangerous Events, and lower-level Dangerous Events. The MAVEN dataset was used to derive the datasets for the experiment. These datasets are manually filtered by event type to separate dangerous events from general events. The transformer models BERT, RoBERTa, and XLNet were used to classify text data according to the respective Dangerous Events category. The results showed that BERT outperforms the other models and can be used effectively for the task of detecting Dangerous Events. Notably, the approach of splitting the datasets significantly increased the models' performance.
Several methods have been proposed for event detection. Event detection (ED) methods are mostly classified as supervised or unsupervised. Supervised methods include support vector machines (SVM), conditional random fields (CRF), decision trees (DT), and Naive Bayes (NB), among others, while the unsupervised category includes query-based, statistical-based, probabilistic-based, clustering-based, and graph-based methods. Two approaches are in use in event detection, called document-pivot and feature-pivot. The difference between these approaches lies mostly in the clustering approach, the way documents are used to build feature vectors, and the similarity metric used to decide whether two documents correspond to the same event. Besides event detection, event prediction is an important but complicated problem that spans several dimensions. Many such events are hard to predict before they become visible and occur. For example, it is impossible to anticipate natural catastrophes, which are detectable only after they happen. Event datasets are a limited resource. ACE 2005, MAVEN, and EVIN are some examples of datasets available for event detection. Recent work has shown that Transformer-based pre-trained models (PTMs) can achieve state-of-the-art performance on various NLP tasks. These models are pre-trained on large amounts of text. They learn word embeddings, or vector representations, such that related words cluster together in the vector space. A total of three different transformers, namely BERT, RoBERTa, and XLNet, are used to run the experiment and draw conclusions by comparing these models.
The Transformer-based models are fine-tuned using a 70/30 split of the datasets for training and testing/validation. The hyperparameter settings are 10 epochs, a batch size of 16, and the AdamW optimizer with a learning rate of 2e-5 for BERT and RoBERTa and 3e-5 for XLNet. For dangerous events, BERT achieves 60%, RoBERTa 59%, and XLNet only 54% overall accuracy. In the top-level event configuration, BERT and XLNet achieve 71% and 70%, while RoBERTa outperforms the other models with 74% accuracy. For action-based, scenario-based, and sentiment-based DE, BERT achieves 62%, 85%, and 81% respectively; RoBERTa 61%, 83%, and 71%; and XLNet 52%, 81%, and 77% accuracy. There is a need to resolve the ambiguity between different works that address similar problems using different terms. The proposed idea of referring to specific occurrences as dangerous events makes the problem easier to approach. However, the scarcity of event datasets limits the models' performance and progress on detection tasks. The availability of more data on dangerous events could improve the performance of the existing models. It is evident that deep-learning models such as BERT, RoBERTa, and XLNet can help detect and classify dangerous events efficiently. Overall, BERT performs better than RoBERTa and XLNet in detecting dangerous events. It is equally important to track events after they are detected.
Therefore, for future work, we propose implementing techniques that handle space and time in order to monitor how events emerge over time.
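
The fine-tuning settings reported in this abstract can be collected into a small configuration sketch. The numbers (70/30 split, 10 epochs, batch size 16, AdamW, learning rates 2e-5 and 3e-5) come from the abstract; the class name, the field names, the shuffling step, and the seed are assumptions made for illustration. The actual training loop, which would need a deep-learning library, is not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings reported in the abstract."""
    model_name: str
    learning_rate: float
    epochs: int = 10
    batch_size: int = 16
    optimizer: str = "AdamW"
    train_fraction: float = 0.7  # 70/30 split for training vs. test/validation


# Learning rates as reported: 2e-5 for BERT and RoBERTa, 3e-5 for XLNet.
CONFIGS = {
    "BERT": FineTuneConfig("BERT", 2e-5),
    "RoBERTa": FineTuneConfig("RoBERTa", 2e-5),
    "XLNet": FineTuneConfig("XLNet", 3e-5),
}


def split_dataset(examples, train_fraction, seed=0):
    """Shuffle a copy of `examples` and split it into train and held-out parts.

    Shuffling with a fixed seed is an assumption; the abstract does not say
    how the split was drawn.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_dataset(examples, CONFIGS["BERT"].train_fraction)` on 100 examples returns 70 training examples and 30 held-out examples.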

    Multi-dimensional mining of unstructured data with limited supervision

    As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to healthcare and scientific research. In many emerging applications, people's information needs from text data are becoming multi-dimensional: they demand useful insights for multiple aspects from the given text corpus. However, turning massive text data into multi-dimensional knowledge remains a challenge that cannot be readily addressed by existing data mining techniques. In this thesis, we propose algorithms that turn unstructured text data into multi-dimensional knowledge with limited supervision. We investigate two core questions: (1) how to identify task-relevant data with declarative queries in multiple dimensions, and (2) how to distill knowledge from data in a multi-dimensional space. To address these questions, we propose an integrated cube construction and exploitation framework. First, we develop a cube construction module that organizes unstructured data into a cube structure, by discovering latent multi-dimensional and multi-granular structure from the unstructured text corpus and allocating documents into the structure. Second, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multi-dimensional knowledge from data to provide insights along multiple dimensions. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multi-dimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can make accurate cross-dimension predictions or extract multi-dimensional patterns for decision making. The proposed framework has two distinctive advantages when turning text data into multi-dimensional knowledge: flexibility and label-efficiency. 
First, it enables acquiring multi-dimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and further distill multi-dimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain

    Context-Aware Message-Level Rumour Detection with Weak Supervision

    Beyond serving as a communication medium, social media has become the main source of all sorts of information. Its intrinsic nature can allow a continuous and massive flow of misinformation to make a severe impact worldwide. In particular, rumours emerge unexpectedly and spread quickly. It is challenging to track down their origins and stop their propagation. An ideal solution is to identify rumour-mongering messages as early as possible, which is commonly referred to as "Early Rumour Detection (ERD)". This dissertation focuses on researching ERD on social media by exploiting weak supervision and contextual information. Weak supervision is a branch of ML in which noisy and less precise sources (e.g. data patterns) are leveraged to learn from limited high-quality labelled data (Ratner et al., 2017). This is intended to reduce the cost and increase the efficiency of hand-labelling large-scale data. This thesis aims to study whether identifying rumours before they go viral is possible and to develop an architecture for ERD at the individual post level. To this end, it first explores major bottlenecks of current ERD. It also uncovers a research gap between system design and its applications in the real world, which has received less attention from the ERD research community. One bottleneck is limited labelled data. Weakly supervised methods to augment limited labelled training data for ERD are introduced. The other bottleneck is the enormous amount of noisy data. A framework unifying burst detection based on temporal signals and burst summarisation is investigated to identify potential rumours (i.e. input to rumour detection models) by filtering out uninformative messages. Finally, a novel method which jointly learns rumour sources and their contexts (i.e. conversational threads) for ERD is proposed. An extensive evaluation setting for ERD systems is also introduced.

    VIRAL TOPIC PREDICTION AND DESCRIPTION IN MICROBLOG SOCIAL NETWORKS

    Ph.D (Doctor of Philosophy)

    An enhanced binary bat and Markov clustering algorithms to improve event detection for heterogeneous news text documents

    Event Detection (ED) works on identifying events from various types of data. Building an ED model for news text documents greatly helps decision-makers in various disciplines improve their strategies. However, identifying and summarizing events from such data is a non-trivial task due to the large volume of published heterogeneous news text documents. Such documents create a high-dimensional feature space that influences the overall performance of the baseline methods in an ED model. To address this problem, this research presents an enhanced ED model that includes improved methods for the crucial phases of the ED model, such as Feature Selection (FS), ED, and summarization. This work focuses on the FS problem by automatically detecting events through a novel wrapper FS method based on the Adapted Binary Bat Algorithm (ABBA) and the Adapted Markov Clustering Algorithm (AMCL), termed ABBA-AMCL. These adaptive techniques were developed to overcome the premature convergence in BBA and the fast convergence rate in MCL. Furthermore, this study proposes four summarizing methods to generate informative summaries. The enhanced ED model was tested on 10 benchmark datasets and 2 Facebook news datasets. The effectiveness of ABBA-AMCL was compared to 8 FS methods based on meta-heuristic algorithms and 6 graph-based ED methods. The empirical and statistical results showed that ABBA-AMCL surpassed other methods on most datasets. The key representative features demonstrated that the ABBA-AMCL method successfully detects real-world events from Facebook news datasets with 0.96 Precision and 1 Recall for dataset 11, while for dataset 12 the Precision is 1 and the Recall is 0.76. To conclude, the novel ABBA-AMCL presented in this research has successfully bridged the research gap and resolved the curse of high dimensionality in the feature space for heterogeneous news text documents. 
Hence, the enhanced ED model can organize news documents into distinct events and provide policymakers with valuable information for decision making
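
The precision and recall figures quoted for the two Facebook news datasets can be combined into a single balanced F-measure (F1), the harmonic mean of the two. The abstract does not report F1, so the values computed below are derived here for illustration only; the function and variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract.
reported = {
    "dataset 11": (0.96, 1.0),
    "dataset 12": (1.0, 0.76),
}

f1_by_dataset = {name: f1_score(p, r) for name, (p, r) in reported.items()}
```

This gives an F1 of roughly 0.98 for dataset 11 and roughly 0.86 for dataset 12, which shows that the lower recall on dataset 12 costs more than the slightly lower precision on dataset 11.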