11 research outputs found

    Sub-story detection in Twitter with hierarchical Dirichlet processes

    Get PDF
    Social media has become the de facto information source for real-world events. The challenge, however, given the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time, a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories; the corresponding task of automatically detecting them is sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn the sub-topics associated with sub-stories, which enables it to handle subtle variations among them. It is compared with state-of-the-art story detection approaches based on locality-sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real-world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall sub-stories with high precision, resulting in an improvement of up to 60% in F-score for the HDP-based sub-story detection approach compared to standard story detection approaches. A similar improvement is also seen under an information-theoretic evaluation measure proposed for the sub-story detection task. Another contribution of this paper is in demonstrating that considering the conversational structures within the Twitter stream can bring up to a 200% improvement in sub-story detection performance.
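
    A minimal sketch of the core idea, using gensim's HdpModel on tokenized tweets. The toy tweets and the sub-story labeling loop are hypothetical stand-ins; the paper's conversational-structure grouping and evaluation are not reproduced:

```python
# Hedged sketch: HDP infers the number of (sub-)topics nonparametrically,
# which is what lets it pick up sub-stories without fixing a topic count.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

tweets = [                                   # toy tokenized tweets (hypothetical)
    ["goal", "messi", "barcelona"],
    ["red", "card", "referee", "barcelona"],
    ["goal", "ronaldo", "madrid"],
]

dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

hdp = HdpModel(corpus=corpus, id2word=dictionary)

# Assign each tweet to its dominant inferred sub-topic as a sub-story label.
for tweet, bow in zip(tweets, corpus):
    topics = hdp[bow]                        # list of (topic_id, weight) pairs
    label = max(topics, key=lambda x: x[1])[0] if topics else None
    print(label, tweet)
```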

    Eliminasi Non-Topic Menggunakan Pemodelan Topik untuk Peringkasan Otomatis Data Tweet dengan Konteks Covid-19

    Get PDF
    Twitter accounts, such as Suara Surabaya, can help spread information about COVID-19 even though they also cover other topics such as accidents and traffic jams. Text summarization can be applied to reading Twitter data because of the large number of tweets available, making it easier to obtain the latest important information related to COVID-19. The variety of discussion topics in tweet texts leads to poor summaries, so tweets unrelated to the context must be eliminated before summarization. The contribution of this research is a topic modeling method as one stage in a series of data-elimination steps. Topic modeling as a data-elimination technique can be used in various cases, but this research focuses on COVID-19. The aim is to make it easier for the public to obtain current information concisely. The steps taken in this study were pre-processing, data elimination using topic modeling, and automatic summarization. This study compares combinations of several word embedding, topic modeling, and automatic summarization methods. The summaries produced by each combination are evaluated with ROUGE to find the best combination. The test results show that the combination of Word2Vec, LSI, and TextRank has the best ROUGE value, 0.67, while the combination of TF-IDF, LDA, and Okapi BM25 has the lowest, 0.35.
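
    A rough sketch of the best-scoring combination reported above (Word2Vec + LSI + TextRank), assuming toy tweets and a simplified topic-match criterion; the actual elimination thresholds and ROUGE evaluation are not reproduced:

```python
# Hedged sketch of the pipeline: (1) LSI topic modeling eliminates off-topic
# tweets, (2) Word2Vec builds sentence vectors, (3) TextRank (PageRank over a
# sentence-similarity graph) picks the summary sentence.
import numpy as np
import networkx as nx
from gensim.corpora import Dictionary
from gensim.models import LsiModel, Word2Vec

tweets = [                                            # hypothetical toy tweets
    "kasus covid-19 di surabaya bertambah hari ini".split(),
    "vaksinasi covid-19 dibuka untuk lansia".split(),
    "kemacetan panjang di jalan ahmad yani".split(),
]

# 1) Topic-based elimination: keep tweets whose dominant LSI topic matches
#    the dominant topic of a COVID-19 reference query.
dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]
lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

def dominant_topic(bow):
    vec = lsi[bow]
    return max(vec, key=lambda x: abs(x[1]))[0] if vec else None

query = dictionary.doc2bow("covid-19 vaksinasi kasus".split())
on_topic = [t for t, b in zip(tweets, corpus)
            if dominant_topic(b) == dominant_topic(query)]

# 2) Word2Vec sentence vectors (mean of word vectors).
w2v = Word2Vec(sentences=tweets, vector_size=50, min_count=1, epochs=50)

def sent_vec(tokens):
    return np.mean([w2v.wv[w] for w in tokens if w in w2v.wv], axis=0)

# 3) TextRank: PageRank over the cosine-similarity graph of the kept tweets.
vecs = [sent_vec(t) for t in on_topic]
sim = np.array([[float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
                 for b in vecs] for a in vecs])
np.fill_diagonal(sim, 0.0)
scores = nx.pagerank(nx.from_numpy_array(sim))
best = max(scores, key=scores.get)                    # top-ranked tweet index
print(" ".join(on_topic[best]))
```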

    Modeling Islamist Extremist Communications on Social Media using Contextual Dimensions: Religion, Ideology, and Hate

    Full text link
    Terror attacks have been linked in part to online extremist content. Although tens of thousands of Islamist extremism supporters consume such content, they are a small fraction relative to peaceful Muslims. Efforts to contain the ever-evolving extremism on social media platforms have remained inadequate and mostly ineffective. Divergent extremist and mainstream contexts challenge machine interpretation, posing a particular threat to the precision of classification algorithms. Our context-aware computational approach to the analysis of extremist content on Twitter breaks down this persuasion process into building blocks that acknowledge the inherent ambiguity and sparsity that likely challenge both manual and automated classification. We model this process using a combination of three contextual dimensions -- religion, ideology, and hate -- each elucidating a degree of radicalization and highlighting independent features to render them computationally accessible. We utilize domain-specific knowledge resources for each of these contextual dimensions: the Qur'an for religion, the books of extremist ideologues and preachers for political ideology, and a social media hate speech corpus for hate. Our study makes three contributions to reliable analysis: (i) development of a computational approach rooted in the contextual dimensions of religion, ideology, and hate that reflects strategies employed by online Islamist extremist groups, (ii) an in-depth analysis of relevant tweet datasets with respect to these dimensions to exclude likely mislabeled users, and (iii) a framework for understanding online radicalization as a process to assist counter-programming. Given the potentially significant social impact, we evaluate the performance of our algorithms to minimize mislabeling; our approach outperforms a competitive baseline by 10.2% in precision.
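
    As an illustration only (not the paper's actual algorithm), one simple way to render such dimensions computationally accessible is to score a tweet by TF-IDF cosine similarity against one reference corpus per dimension; the placeholder corpora below are hypothetical stand-ins for the Qur'an, ideologue writings, and a hate speech corpus:

```python
# Hedged sketch: per-dimension similarity scores for a tweet against
# hypothetical placeholder corpora (the real knowledge resources are richer).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dimension_corpora = {
    "religion": "placeholder text standing in for religious scripture ...",
    "ideology": "placeholder text standing in for ideologue writings ...",
    "hate":     "placeholder text standing in for a hate speech corpus ...",
}

def contextual_scores(tweet: str) -> dict:
    names = list(dimension_corpora)
    docs = [dimension_corpora[n] for n in names] + [tweet]
    matrix = TfidfVectorizer().fit_transform(docs)
    # Compare the tweet (last row) against each dimension corpus.
    sims = cosine_similarity(matrix[-1], matrix[:-1])[0]
    return dict(zip(names, sims))

print(contextual_scores("example tweet text"))
```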

    Twitter and Research: A Systematic Literature Review Through Text Mining

    Get PDF
    Researchers have collected Twitter data to study a wide range of topics. This growing body of literature, however, has not yet been reviewed systematically to synthesize Twitter-related papers. Existing literature reviews have been limited by the constraints of traditional methods, which manually select and analyze samples of topically related papers. The goals of this retrospective study are to identify the dominant topics of Twitter-based research, summarize the temporal trend of topics, and interpret the evolution of topics within the last ten years. This study systematically mines a large number of Twitter-based studies to characterize the relevant literature with an efficient and effective approach. It collected relevant papers from three databases and applied text mining and trend analysis to detect semantic patterns and explore the yearly development of research themes across a decade. We found 38 topics in more than 18,000 manuscripts published between 2006 and 2019. By quantifying temporal trends, this study found that while 23.7% of topics did not show a significant trend (P ≥ 0.05), 21% of topics had increasing trends and 55.3% had decreasing trends; these hot and cold topics represent three categories: application, methodology, and technology. The contributions of this paper can be utilized in the growing field of Twitter-based research and are beneficial to researchers, educators, and publishers.
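
    A minimal sketch of the hot/cold trend-classification step described above: fit a linear trend to each topic's yearly prevalence and apply the P ≥ 0.05 significance threshold. The yearly proportions below are hypothetical; the paper derives them from topic modeling over the collected manuscripts:

```python
# Hedged sketch: label each topic hot, cold, or flat from its yearly trend.
from scipy.stats import linregress

years = list(range(2006, 2020))
topic_prevalence = {                        # hypothetical yearly proportions
    "sentiment analysis": [0.01 * (y - 2005) for y in years],        # rising
    "sms interfaces":     [0.15 - 0.008 * (y - 2005) for y in years],  # falling
}

for topic, series in topic_prevalence.items():
    fit = linregress(years, series)
    if fit.pvalue >= 0.05:
        label = "no significant trend"
    else:
        label = "hot (increasing)" if fit.slope > 0 else "cold (decreasing)"
    print(f"{topic}: {label} (slope={fit.slope:.4f}, p={fit.pvalue:.3g})")
```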

    Temu Kembali Informasi Berbasis Pemodelan Topik Menggunakan Kombinasi LSI dan VSM Pada Sistem Tanya-Jawab

    Get PDF
    To achieve good governance through the implementation of e-government, central and local governments provide question-answering services in online systems. These services are essential because they make information requests easier and are accessible at any time, without waiting for office hours. Since the services are still handled manually, a computerized question-answering system (QAS) needs to be developed. A QAS is formed by several elements/modules. One important element is information retrieval (IR), which is responsible for retrieving documents relevant to the user's question (query). Widely used methods for building information retrieval are the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), both of which represent documents as vectors in a vector space. However, both methods have their respective limitations, so this research proposes a combination of VSM and LSI to address some of the limitations of both. In searching for documents relevant to a query, the combination model first retrieves documents that share a topic with the query using topic modeling (here, the LSI method), and then sorts them by term similarity using VSM to take the documents with the highest similarity values. To evaluate the performance of the combination model in retrieving relevant documents for a question-answering system, this research uses question-answer data from the Electronic Procurement System (SPSE) as experimental data. The experimental results show that the proposed model improves the precision of its stand-alone base methods, LSI and VSM. The combination model (LSI+VSM) obtained precision at 1 (P@1) = 0.7 with Mean Average Precision (MAP) = 0.579, whereas the base methods obtained P@1 = 0.5 with MAP = 0.237 for LSI, P@1 = 0.38 with MAP = 0.247 for plain VSM, and P@1 = 0.44 with MAP = 0.258 for VSM with professional weighting (VSM+PP).
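
    A compact sketch of the two-stage retrieval described above, with hypothetical documents and query: TruncatedSVD over TF-IDF serves as the LSI stage to shortlist topically similar documents, and plain TF-IDF cosine similarity serves as the VSM stage to re-rank them:

```python
# Hedged sketch: LSI shortlists by topic, VSM re-ranks by term similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [                                    # hypothetical Q&A documents
    "how to register a vendor account in the procurement system",
    "resetting a forgotten password for the SPSE portal",
    "uploading bid documents before the tender deadline",
    "vendor account activation and email verification steps",
]
query = ["vendor account registration"]

tfidf = TfidfVectorizer()
doc_vsm = tfidf.fit_transform(docs)
query_vsm = tfidf.transform(query)

# Stage 1 (LSI): shortlist the top-k documents in latent topic space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_lsi = svd.fit_transform(doc_vsm)
query_lsi = svd.transform(query_vsm)
topic_sims = cosine_similarity(query_lsi, doc_lsi)[0]
k = 3
shortlist = topic_sims.argsort()[::-1][:k]

# Stage 2 (VSM): re-rank the shortlist by raw TF-IDF cosine similarity.
term_sims = cosine_similarity(query_vsm, doc_vsm)[0]
ranked = sorted(shortlist, key=lambda i: term_sims[i], reverse=True)
for i in ranked:
    print(f"{term_sims[i]:.3f}  {docs[i]}")
```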

    Neural approaches to sequence labeling for information extraction

    Get PDF
    An important aspect of artificial intelligence (AI) is the interpretation of human language expressed in textual (written) form: natural language processing (NLP) matters because textual information is useful for many applications. Yet understanding it (so-called natural language understanding, NLU) is challenging, given the unstructured form of text, whose meaning is often ambiguous and context-dependent. In this dissertation we introduce solutions to shortcomings of related work on fundamental tasks in natural language processing, such as named entity recognition (i.e., identifying the entities that occur in a sentence) and relation extraction (identifying relations between entities). Starting from a specific problem (namely, identifying the structure of a house from a textual listing), we incrementally build a complete (automated) solution for the above tasks, based on neural network architectures. Our solutions are generally applicable across application domains and languages. We also consider the task of identifying relevant sub-events during an event (e.g., a goal during a football match) in Twitter information streams. Specifically, we formulate this problem as labeling word sequences (comparable to named entity recognition), exploiting the chronological relation between consecutive tweets.
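
    A minimal sketch of a neural sequence labeler in the spirit described above: a BiLSTM that emits one label per token (e.g., BIO tags for named entity recognition). Vocabulary size, tag count, and dimensions are hypothetical, and the dissertation's actual architectures are richer:

```python
# Hedged sketch: per-token tagging with a bidirectional LSTM in PyTorch.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)   # forward + backward states

    def forward(self, token_ids):                    # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        return self.out(h)                           # per-token tag logits

model = BiLSTMTagger(vocab_size=5000, num_tags=5)    # hypothetical sizes
tokens = torch.randint(0, 5000, (2, 12))             # toy batch of 2 sentences
logits = model(tokens)                               # (2, 12, 5)
predicted_tags = logits.argmax(dim=-1)               # one tag id per token
print(predicted_tags.shape)
```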