120 research outputs found
A Survey on Cybercrime Using Social Media
The increased use of social media for victimization and criminal activity has driven growing interest in automating crime detection and prevention for large populations. The area is frequently researched because social media enables criminals to reach a large audience. While several studies have investigated specific crimes on social media, a comprehensive review that examines all types of social media crime, their similarities, and detection methods is still lacking. Identifying similarities among crimes and detection methods can facilitate knowledge and data transfer across domains. The goal of this study is to collect a library of social media crimes and establish their connections using a crime taxonomy. The survey also identifies publicly accessible datasets and suggests directions for further study
Location Reference Recognition from Texts: A Survey and Comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs
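The gazetteer matching-based category mentioned above can be illustrated with a minimal sketch: scan text for names listed in a gazetteer and return their coordinates. The gazetteer below is a hypothetical toy example, not one of the surveyed resources.

```python
# Minimal gazetteer-matching sketch: look up known place names in text.
# The gazetteer is a toy example; real systems use resources like GeoNames.
GAZETTEER = {
    "porto": (41.1579, -8.6291),
    "london": (51.5074, -0.1278),
    "san francisco": (37.7749, -122.4194),
}

def recognize_locations(text):
    """Return (name, coordinates) pairs for gazetteer entries found in text."""
    lowered = text.lower()
    hits = []
    for name, coords in GAZETTEER.items():
        if name in lowered:
            hits.append((name, coords))
    return hits

hits = recognize_locations("Flooding reported near Porto today")
```

Real gazetteer matchers must additionally handle ambiguity (multiple places sharing a name) and token boundaries, which this substring-based sketch ignores.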
Crime prediction and monitoring in Porto, Portugal, using machine learning, spatial and text analytics
Crimes are a common societal concern impacting quality of life and economic growth.
Despite the global decrease in crime statistics, specific types of crime and feelings of insecurity have
often increased, leaving safety and security agencies with the need to apply novel approaches and
advanced systems to better predict and prevent occurrences. The use of geospatial technologies,
combined with data mining and machine learning techniques allows for significant advances in the
criminology of place. In this study, official police data from Porto, in Portugal, between 2016 and 2018,
was georeferenced and treated using spatial analysis methods, which allowed the identification of
spatial patterns and relevant hotspots. Then, machine learning processes were applied for space-time
pattern mining. Using lasso regression analysis, significant crime variables were identified, with
random forest and decision tree supporting the important variable selection. Lastly, tweets related to
insecurity were collected, and topic modeling and sentiment analysis were performed. Together, these
methods assist the interpretation of patterns, prediction and, ultimately, the performance of both
police and planning professionals
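The hotspot-identification step described above can be sketched in a minimal form: bin georeferenced crime points into grid cells and rank cells by count. The cell size and points below are hypothetical, not values from the study.

```python
# Toy hotspot sketch: grid-bin (lat, lon) points and count occurrences.
# Cell size and coordinates are illustrative, not the study's parameters.
from collections import Counter

def hotspot_counts(points, cell_size=0.01):
    """Bin (lat, lon) points into grid cells and count points per cell."""
    counts = Counter()
    for lat, lon in points:
        cell = (round(lat / cell_size), round(lon / cell_size))
        counts[cell] += 1
    return counts

def top_hotspots(points, k=1, cell_size=0.01):
    """Return the k densest grid cells as (cell, count) pairs."""
    return hotspot_counts(points, cell_size).most_common(k)
```

Proper hotspot analysis (e.g., kernel density estimation or Getis-Ord Gi*) accounts for spatial autocorrelation; simple grid counting only conveys the basic idea.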
A combined classification-clustering framework for identifying disruptive events
Twitter is a popular micro-blogging web application serving hundreds of millions of users. Users publish short messages to communicate with friends and families, express their opinions, and broadcast news and information about a variety of topics, all in real time. User-generated content can be utilized as a rich source for real-world event identification, as well as for extracting useful knowledge about disruptive events in a given region. In this paper, we propose a novel detection framework for identifying real-time events, including a main event and associated disruptive events, from Twitter data. The approach is based on five steps: data collection, pre-processing, classification, online clustering, and summarization. We use a Naïve Bayes classification model and an Online Clustering method to validate our model on a major real-world event (Formula 1 Abu Dhabi Grand Prix 2013)
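The classification step uses Naïve Bayes; a minimal multinomial Naïve Bayes over bag-of-words tweets can be sketched as below. The labels and training tweets are hypothetical toy data, not the paper's dataset.

```python
# Minimal multinomial Naive Bayes sketch with add-one smoothing.
# Training examples and labels are toy data for illustration only.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        best_label, best_score = None, float("-inf")
        total_docs = sum(self.label_counts.values())
        for label in self.label_counts:
            # log prior plus smoothed log likelihood of each token
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in doc.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes().fit(
    ["traffic jam near the circuit", "great race today", "road closed heavy traffic"],
    ["disruption", "event", "disruption"],
)
```

The online-clustering step would then group the classified posts into evolving event clusters; that component is omitted here.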
Predictive Analysis on Twitter: Techniques and Applications
Predictive analysis of social media data has attracted considerable attention
from the research community as well as the business world because of the
essential and actionable information it can provide. Over the years, extensive
experimentation and analysis for insights have been carried out using Twitter
data in various domains such as healthcare, public health, politics, social
sciences, and demographics. In this chapter, we discuss techniques, approaches
and state-of-the-art applications of predictive analysis of Twitter data.
Specifically, we present fine-grained analysis involving aspects such as
sentiment, emotion, and the use of domain knowledge in the coarse-grained
analysis of Twitter data for making decisions and taking actions, and relate a
few success stories
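Among the fine-grained techniques discussed, lexicon-based sentiment scoring is the simplest to sketch: count positive and negative words from a lexicon. The word lists below are hypothetical toy examples, not a published sentiment lexicon.

```python
# Toy lexicon-based sentiment sketch; word lists are illustrative only.
POSITIVE = {"great", "good", "love", "safe"}
NEGATIVE = {"bad", "terrible", "unsafe", "fear"}

def sentiment(text):
    """Score a post: positive (>0), negative (<0), or neutral (0)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

State-of-the-art approaches instead learn sentiment from labelled data, but lexicon counting remains a common, interpretable baseline.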
Robust part-of-speech tagging of social media text
Part-of-speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which has led to a variety of taggers for tackling this task.
Recent work in this field showed that tagging accuracy on informal text domains is poor in comparison to formal text domains.
In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate.
These arising challenges originate in a lack of robustness of taggers towards domain transfers.
This increased error rate has an impact on NLP applications that depend on PoS information.
The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long tail robustness.
Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging.
Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on the text of several text domains.
We find that the tagging of informal text is poorly supported by available taggers.
A review and analysis of currently used methods to adapt taggers to informal text showed that these methods improve tagging accuracy but offer no satisfactory solution.
We propose an alternative tagging approach that achieves increased multi-domain tagging robustness.
This approach is based on tagging in two steps.
The first step tags on a coarse-grained level and the second step refines the tags to the fine-grained tags.
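The two-step idea of coarse tagging followed by refinement can be sketched minimally as below. The tiny lexicon, tagsets, and suffix rules are hypothetical toy examples, not the thesis's actual models.

```python
# Sketch of two-step tagging: coarse tags first, then fine-grained refinement.
# Lexicon, tagsets, and refinement rules are toy examples for illustration.
COARSE_LEXICON = {"the": "DET", "cat": "NOUN", "cats": "NOUN",
                  "sleeps": "VERB", "slept": "VERB"}

def refine(word, coarse_tag):
    """Step 2: refine a coarse tag into a finer tag using surface cues."""
    if coarse_tag == "NOUN":
        return "NOUN-PL" if word.endswith("s") else "NOUN-SG"
    if coarse_tag == "VERB":
        return "VERB-PAST" if word.endswith("ed") or word == "slept" else "VERB-PRES"
    return coarse_tag

def tag(sentence):
    tags = []
    for word in sentence.lower().split():
        coarse = COARSE_LEXICON.get(word, "X")   # step 1: coarse-grained tag
        tags.append(refine(word, coarse))        # step 2: fine-grained tag
    return tags
```

The appeal of the two-step design is that the coarse step can be trained robustly across domains, while the refinement step handles the fine distinctions.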
Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or if the construction of a competitive language independent tagger is feasible.
We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms to learn models of 21 languages.
We find that language robustness is a less severe issue and that the impact of the tagger choice depends more on the granularity of the tagset to be learned than on the language.
Regarding (iii), we investigate methods to improve tagging of infrequent phenomena of which no sufficient amount of annotated training data is available, which is a common challenge in the social media domain.
We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data.
In a field study, we show that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena.
Furthermore, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis.
These tools provide the necessary flexibility for conducting all the experiments in this thesis and ensure their reproducibility
Media’s influence on the 21st century society: A global criminological systematic review
This investigation assumes that the media can reduce or spread criminal activities and tendencies based on how the concerned parties apply the policies and community standards that guide these platforms' use. In total, 254 materials were gathered across several search systems between October 2021 and September 2022. Qualitative data from the selected materials were used to synthesise and summarise the content on the examined 21st-century events and the media's influence on crime.
It is not possible to reject the premise that the media influences opinions on crime and the legal system. Nevertheless, the data reveal that no causal media effect can be directly established. The same data do show, however, that how the media portrays an activity affects how people perceive it. Advances in technology, media, and criminology may have affected the analysis of records, including the time and quality of resources.
More accurate and fair media coverage of crime would lead to a more informed and aware population. Media houses that promote and reward good behaviour should be applauded. These two steps ensure the media cannot be ignored when assessing crime and how the public perceives it, as it can both encourage crime and shift perceptions. Therefore, further research, stricter laws and policies, and community education on crime prevention and media screening are needed. The fact that unfavourable media coverage of crime can ruin a business, either directly (criticism of the establishment) or indirectly (consumer behaviour changes due to crime), makes this paper of utmost importance for businessmen, politicians, and local agencies
Identifying and Processing Crisis Information from Social Media
Social media platforms play a crucial role in how people communicate, particularly during crisis situations such as natural disasters. People share and disseminate information on social media platforms relating to updates, alerts, and rescue and relief requests, among other crisis-relevant information. During Hurricane Harvey and Hurricane Sandy, tens of millions of posts were generated on Twitter in a short span of time. Such posts range widely, from personal and official communications to citizen sensing. This makes social media platforms a source of vital information for different stakeholders in crisis situations, such as impacted communities, relief agencies, and civic authorities. However, the overwhelming volume of data generated during such times makes it impossible to manually identify crisis-relevant information. Additionally, a large portion of posts in these voluminous streams is not relevant, or bears minimal relevance, to crisis situations.
This has steered much research towards exploring methods that can automatically identify crisis-relevant information from voluminous streams of data during such scenarios. However, the problem of identifying crisis-relevant information from social media platforms, such as Twitter, is not trivial, given the nature of unstructured text, with short text lengths and syntactic variations among other challenges. A key objective, while creating automatic crisis relevancy classification systems, is to make them adaptable to a wide range of crisis types and languages. Many related approaches rely on statistical features, i.e., quantifiable and linguistic properties of the text. A general approach is to train the classification model on labelled data acquired from crisis events and evaluate it on other crisis events. A key aspect missing from the explored literature is the validity of crisis relevancy classification models when applied to data from unseen types of crisis events and languages. For instance, how would the accuracy of a crisis relevancy classification model trained on earthquake events change when applied to flood events? Or how would a model trained on crisis data in English perform when applied to data in Italian?
This thesis investigates these problems from a semantics perspective, where the challenges posed by diverse types of crisis and language variations are seen as problems that can be tackled by enriching the data semantically. The use of knowledge bases such as DBpedia, BabelNet, and Wikipedia for semantic enrichment of data in text classification problems has often been studied. Semantic enrichment of data through entity linking and expansion of context via knowledge bases can take advantage of connections between different concepts and thus enhance contextual coherency across crisis types and languages. Several previous works have focused on similar problems and proposed approaches using statistical features and/or non-semantic features. The use of semantics extracted through knowledge graphs has remained unexplored in building crisis relevancy classifiers that are adaptive to varying crisis types and multilingual data. Experiments conducted in this thesis consider data from Twitter, a micro-blogging social media platform, and analyse multiple aspects of crisis data classification. The results obtained through various analyses in this thesis demonstrate the value of semantic enrichment of text through knowledge graphs in improving the adaptability of crisis relevancy classifiers across crisis types and languages, in comparison to statistical features as often used in much of the related work
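The semantic-enrichment idea can be sketched in miniature: expand a post's tokens with related concepts before classification. The concept map below is a hypothetical toy stand-in for links that would come from a knowledge base such as DBpedia or BabelNet.

```python
# Toy semantic-enrichment sketch: expand tokens with linked concepts.
# The concept map is illustrative; real systems link entities against
# knowledge bases such as DBpedia or BabelNet.
CONCEPT_MAP = {
    "harvey": ["hurricane", "storm", "flood"],
    "earthquake": ["disaster", "tremor"],
    "flood": ["disaster", "water"],
}

def enrich(text):
    """Append knowledge-base concepts related to each token of a post."""
    tokens = text.lower().split()
    enriched = list(tokens)
    for token in tokens:
        enriched.extend(CONCEPT_MAP.get(token, []))
    return enriched
```

A classifier trained on such enriched token lists can match a flood-related test post against earthquake-domain training data through the shared concept "disaster", which is the adaptability effect the thesis studies.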
A Deep Multi-View Learning Framework for City Event Extraction from Twitter Data Streams
Cities have been a thriving place for citizens over the centuries due to
their complex infrastructure. The emergence of the Cyber-Physical-Social
Systems (CPSS) and context-aware technologies boost a growing interest in
analysing, extracting and eventually understanding city events which
subsequently can be utilised to leverage the citizen observations of their
cities. In this paper, we investigate the feasibility of using Twitter textual
streams for extracting city events. We propose a hierarchical multi-view deep
learning approach to contextualise citizen observations of various city systems
and services. Our goal has been to build a flexible architecture that can learn
representations useful for tasks, thus avoiding excessive task-specific feature
engineering. We apply our approach to a real-world dataset consisting of event
reports and tweets covering over four months from the San Francisco Bay Area, plus
additional datasets collected from London. The results of our evaluations show
that our proposed solution outperforms the existing models and can be used for
extracting city-related events with an average accuracy of 81% over all
classes. To further evaluate the impact of our Twitter event extraction model,
we used two sources of authorised reports: road traffic disruption data
collected from the Transport for London API, and sociocultural events parsed
from the Time Out London website. The analysis showed that 49.5% of the Twitter
traffic comments are reported approximately five hours prior to the authorities'
official records. Moreover, we discovered that, amongst the scheduled
sociocultural event topics, tweets reporting transportation, cultural, and
social events are 31.75% more likely to influence the distribution of
Twitter comments than sport, weather, and crime topics
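The multi-view aspect of the framework can be illustrated with a minimal late-fusion sketch: average per-class scores produced by separate views (e.g., a textual view and a location/time view) and pick the top class. The view scores below are hypothetical; the paper's actual model learns deep representations per view.

```python
# Toy late-fusion sketch for multi-view classification: average per-class
# scores across views. Scores are illustrative, not model outputs.
def combine_views(view_scores):
    """Average class scores across views and return the top class."""
    totals = {}
    for scores in view_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    averaged = {label: total / len(view_scores) for label, total in totals.items()}
    return max(averaged, key=averaged.get)

top = combine_views([
    {"traffic": 0.9, "sport": 0.1},   # hypothetical textual view
    {"traffic": 0.6, "sport": 0.4},   # hypothetical location/time view
])
```

Deep multi-view models typically learn the fusion jointly rather than averaging fixed scores, but simple averaging conveys the idea of combining complementary views.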