323 research outputs found

    A survey on extremism analysis using natural language processing: definitions, literature review, trends and challenges

    Extremism has grown as a global problem for society in recent years, especially after the emergence of movements such as jihadism. This and other extremist groups have taken advantage of different approaches, such as the use of social media, to spread their ideology, promote their acts and recruit followers. The extremist discourse, therefore, is reflected in the language used by these groups. Natural language processing (NLP) provides a way of detecting this type of content, and several authors make use of it to describe and discriminate the discourse held by these groups, with the final objective of detecting and preventing its spread. Following this approach, this survey aims to review the contributions of NLP to the field of extremism research, providing the reader with a comprehensive picture of the state of the art of this research area. The content includes a first conceptualization of the term extremism, the elements that compose an extremist discourse and the differences with other terms. After that, a review, description and comparison of the frequently used NLP techniques is presented, including how they were applied, the insights they provided, the most frequently used NLP software tools, descriptive and classification applications, and the availability of datasets and data sources for research. Finally, research questions are approached and answered with highlights from the review, while future trends, challenges and directions derived from these highlights are suggested to stimulate further research in this area. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

    Exploring Cyberterrorism, Topic Models and Social Networks of Jihadists Dark Web Forums: A Computational Social Science Approach

    This three-article dissertation focuses on cyber-related topics on terrorist groups, specifically Jihadists’ use of technology, the application of natural language processing, and social networks in analyzing text data derived from terrorists' Dark Web forums. The first article explores cybercrime and cyberterrorism. As technology progresses, it facilitates new forms of behavior, including tech-related crimes known as cybercrime and cyberterrorism. In this article, I provide an analysis of the problems of cybercrime and cyberterrorism within the field of criminology by reviewing existing literature focusing on (a) the issues in defining terrorism, cybercrime, and cyberterrorism, (b) ways that cybercriminals commit a crime in cyberspace, and (c) ways that cyberterrorists attack critical infrastructure, including computer systems, data, websites, and servers. The second article is a methodological study examining the application of natural language processing computational techniques, specifically latent Dirichlet allocation (LDA) topic models and topic network analysis of text data. I demonstrate the potential of topic models by inductively analyzing large-scale textual data of Jihadist groups and supporters from three Dark Web forums to uncover underlying topics. The Dark Web forums are dedicated to discussions of Islam and the Islamic world. Some members of these forums sympathize with and support terrorist organizations. Results indicate that topic modeling can be applied to analyze text data automatically; the most prevalent topic in all forums was religion. Forum members also discussed terrorism and terrorist attacks, supporting the Mujahideen fighters. A few of the discussions were related to relationships and marriages, advice, seeking help, health, food, selling electronics, and identity cards. LDA topic modeling is significant for finding topics from larger corpora such as the Dark Web forums. Implications for counterterrorism include the use of topic modeling in real-time classification and removal of online terrorist content and the monitoring of religious forums, as terrorist groups use religion to justify their goals and recruit supporters in such forums. The third article builds on the second article, exploring the network structures of terrorist groups on the Dark Web forums. The two Dark Web forums' interaction networks were created, and network properties were measured using social network analysis. A member is considered connected and interacting with other forum members when they post in the same threads, forming an interaction network. Results reveal that the network structure is decentralized, sparse, and divided based on topics (religion, terrorism, current events, and relationships) and the members' interests in participating in the threads. As participation in forums is an active process, users tend to select platforms most compatible with their views, forming a subgroup or community. However, some members are essential and influential in the flow of information and resources within the networks. The key members frequently posted about religion, terrorism, and relationships in multiple threads. Identifying key members is significant for counterterrorism, as mapping network structures and key users is essential for removing and destabilizing terrorist networks. Taken together, this dissertation applies a computational social science approach to the analysis of cyberterrorism and the use of Dark Web forums by jihadists.

    An Empirical Approach for Extreme Behavior Identification through Tweets Using Machine Learning

    This research was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10063130), the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A2C1006159), the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2019-2016-0-00313) supervised by the IITP (Institute for Information & communications Technology Promotion), and the 2018 Yeungnam University Research Grant. Peer reviewed.

    Automatic Extraction of Narrative Structure from Long Form Text

    Automatic understanding of stories is a long-time goal of artificial intelligence and natural language processing research communities. Stories literally explain the human experience. Understanding our stories promotes the understanding of both individuals and groups of people: various cultures, societies, families, organizations, governments, and corporations, to name a few. People use stories to share information. Stories are told, by narrators, in linguistic bundles of words called narratives. My work has given computers awareness of narrative structure, specifically of where the boundaries of a narrative lie in a text. This is the task of determining where a narrative begins and ends, a non-trivial task, because people rarely tell one story at a time. People don’t specifically announce when they are starting or stopping their stories: we interrupt each other, and we tell stories within stories. Before my work, computers had no awareness of narrative boundaries, essentially where stories begin and end. My programs can extract narrative boundaries from novels and short stories with an F1 of 0.65. Before this I worked on teaching computers to identify which paragraphs of text have story content, with an F1 of 0.75 (which is state of the art). Additionally, I have taught computers to identify the narrative point of view (POV; how the narrator identifies themselves) and diegesis (how involved the narrator is in the story’s action) with an F1 of over 0.90 for both narrative characteristics. For the narrative POV, diegesis, and narrative level extractors I ran annotation studies, with high agreement, that allowed me to teach computational models to identify structural elements of narrative through supervised machine learning. My work has given computers the ability to find where stories begin and end in raw text. This allows for further automatic analysis, such as extraction of plot, intent, event causality, and event coreference. These tasks are impossible when the computer can’t distinguish which stories are told in which spans of text. There are two key contributions in my work: 1) my identification of features that accurately extract elements of narrative structure and 2) the gold-standard data and reports generated from running annotation studies on identifying narrative structure.

    Native language identification of fluent and advanced non-native writers

    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202 The accepted version of the publication may differ from the final published version. Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC) where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not be generalizable to other domains and datasets, (ii) are unable to capture the variations of the language-usage patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since there is a sizable number of people who have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of the language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k-nearest neighbors classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora where each corpus is written in a different language, namely, English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages. Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (# MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175). Published version.

    State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism

    Overview: This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 paper on the same subject, State of the Art. Since 2013, there have been significant changes in social media, how it is used by terrorist groups, and the methods being developed to make sense of it. The paper is structured as follows: Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism. This includes new sections on trends in social media platforms and a new section on Islamic State (IS). Part 2 provides an introduction to the key approaches of social media intelligence (henceforth ‘SOCMINT’) for counter-terrorism. Part 3 sets out a series of SOCMINT techniques. For each technique, a series of capabilities and insights is considered, the validity and reliability of the method are assessed, and its possible application to counter-terrorism work is explored. Part 4 outlines a number of important legal, ethical and practical considerations when undertaking SOCMINT work.

    “Russians are very sweet and nice”: a corpus-assisted multimodal discourse analysis of the representation of people in online travel reviews about Moscow

    The paper explores how guests and hosts are represented in online travel reviews about Moscow. Tourism provides an opportunity to get acquainted with the sociocultural background of other nations and potentially to improve international relations. Moscow, the capital of Russia, is sometimes viewed as an unfriendly or unsafe destination and the Russian Government aims to increase the popularity of the city. However, there are concerns that modern tourism discourse contributes to the maintenance of asymmetrical guest-host power relations. Guests are often accused of consumerism while hosts are frequently backgrounded or represented as servants or cultural markers. Such representation can lead to a client-servant attitude and even cause discrimination against hosts. While online travel reviews are considered an important genre of tourism discourse, most studies analyse the representation of people in promotional or media discourse. Considering that multimodality is an integral feature of tourism discourse and that the analysis of discourse patterns allows exploring the meanings widely shared by society, the study utilizes a corpus-assisted multimodal approach by analysing the representation of people in headlines, texts, images and image captions of a corpus of online travel reviews. The analysis corroborates previous conclusions that guests tend to be represented as consumers enjoying themselves while hosts are perceived as friendly servants. However, the study provides evidence that tourists can background not only hosts but also themselves or other tourists. Moreover, the results reveal that in contrast to promotional and media discourse, guests can also portray themselves as active, solving problems while sometimes representing guests as rude or unwelcoming. The results also show that the representation of people can vary across the modes of the same document. The study concludes that user-generated tourism discourse reveals a complex picture and can express resistance to the dominant institutional imagery.

    Saudis in the eyes of the other: A corpus-driven critical discourse study of the representation of Saudis on Twitter

    Despite an abundance of research on the representation of groups and minorities in traditional (mass) media, little work has focused on the representation of others on social media platforms, especially Twitter. More specifically, to the best of my knowledge, no study has yet approached the representation of Saudis on Twitter from a Critical-discourse and Corpus Linguistics perspective. Hence, the overall aim of this thesis is to investigate how Saudis are represented in tweets in English from Australia, Canada, Great Britain, the United States and the rest of the world during two tragic events at Mecca in 2015 (the crane collapse at The Holy Mosque and the stampede at Mina). Unlike studies of media representation which focus on a one-to-many text context, the current study investigates the bottom-up discursive practices on social media, namely, the user-generated microblogging service, Twitter. The data comprise 89,928 tweets (1.9 million tokens) collected during the tragic events at Mecca starting from 10 September 2015 over a one-month period and including all English tweets mentioning Saudis. Drawing on theories from Critical Discourse Studies, the thesis deploys concepts and tools from the Discourse-historical approach and Systemic Functional Grammar. These are also supported by corpus-assisted methodologies to unravel the linguistic patterns associated with Saudis across five corpora. Integrating both quantitative and qualitative approaches substantiates the findings of the current study and enhances the synergy between Critical Discourse Studies and Corpus Linguistics approaches in examining social media texts, particularly Twitter data. The analysis revealed a hegemonic negative representation of Saudis across the corpora. Themes relating Saudis to war, terrorism and corruption are more prevalent than others. Constructing Saudis in relation to Islam and wealth (oil) triggers negative discourse prosody of extremism and corruption. Tweets about the tragic events at Mecca were generally condemning and reproachful. Additionally, comparing each corpus with the others did not produce contradictory results, but rather triangulated the hegemonic, negative discourse recurring across the corpora, which sustains online racist and Saudiphobic discourse. These findings correspond remarkably to earlier findings identified in the analyses of representations of Muslims in Western media. The findings contribute to the ongoing academic discussion on the relationship between traditional media and social media regarding whether social media represent a largely safe space for maintaining and developing alternative discourses, or whether they can mirror and reproduce existing hegemonic discourses, which may result in even stronger polarising effects on public discourse. In light of these findings, Twitter seems to serve as an online amplifier that mirrors and reinforces existing discourses in traditional media that are likely to have even stronger polarising effects on public discourse.