30 research outputs found
Strings and things: a semantic search engine for news quotes using named entity recognition
Abstract
Emerging methods for content delivery, such as quote-searching and entity-searching, enable users to quickly identify novel and relevant information from unstructured texts, news articles, and media sources. These methods have widespread applications in web surveillance and crime informatics, and can help improve intention disambiguation, character evaluation, threat analysis, and bias detection. Furthermore, quote-based and entity-based searching is an empowering information retrieval tool that can enable non-technical users to gauge the quality of public discourse, allowing for more fine-grained analysis of core sociological questions. The paper presents a prototype search engine that allows users to search a news database containing quotes using a combination of strings and things. The ingestion pipeline, which forms the backend of the service, comprises the following modules: (i) a crawler that ingests data from the GDELT Global Quotation Graph; (ii) a named entity recognition (NER) filter that labels data on the fly; (iii) an indexing mechanism that serves the data to an Elasticsearch cluster; and (iv) a user interface that allows users to formulate queries. The paper presents the high-level configuration of the pipeline and reports basic metrics and aggregations.
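A query that combines strings and things could, for instance, be expressed as an Elasticsearch bool query that pairs a full-text match with an exact entity filter. The following is a minimal sketch only: the field names `quote` and `entities` and the helper `build_query` are hypothetical and are not taken from the paper.

```python
import json


def build_query(text, entity):
    """Build a bool query body: a full-text match on the quote (the "string")
    plus an exact-term filter on an NER label (the "thing")."""
    return {
        "query": {
            "bool": {
                # Hypothetical field holding the analysed quote text.
                "must": [{"match": {"quote": text}}],
                # Hypothetical field holding entity labels; a term filter
                # assumes it is mapped as a keyword (non-analysed) field.
                "filter": [{"term": {"entities": entity}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("climate change", "United Nations"), indent=2))
```

The resulting dictionary is only the request body; how it is sent to the cluster depends on the client in use.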
Public perceptions on organised crime, mafia, and terrorism: a big data analysis based on Twitter and Google trends
Abstract
Public perceptions enable crime and motivate government policy on law and order; however, there has been limited empirical research on serious crime perceptions in social media. Recently, open source data, and "big data", have enabled researchers from different fields to develop cost-effective methods for opinion mining and sentiment analysis. Against this backdrop, the aim of this paper is to apply state-of-the-art tools and techniques for assembly and analysis of open source data. We set out to explore how non-discursive behavioural data can be used as a proxy for studying public perceptions of serious crime. The data collection focused on the following three conversational topics: organised crime, the mafia, and terrorism. Specifically, time series data of users' online search habits (over a ten-year period) were gathered from Google Trends, and cross-sectional network data (N=178,513) were collected from Twitter. The collected data contained a significant amount of structure. Marked similarities and differences in people's habits and perceptions were observable, and these were recorded. The results indicated that "big data" is a cost-effective method for exploring theoretical and empirical issues vis-à-vis public perceptions of serious crime.
On web based sentence similarity for paraphrasing detection
Abstract
Semantic similarity measures play vital roles in information retrieval, natural language processing and paraphrasing detection. With the growing number of plagiarism cases in both the commercial and research communities, designing efficient tools and approaches for paraphrasing detection becomes crucial. This paper contrasts a web-based approach, which analyses the snippets returned by a search engine, with a WordNet-based measure. Several refinements of the web-based approach will be investigated and compared. Evaluations of the approaches with respect to the Microsoft paraphrasing dataset will be performed and discussed.
On the use of distributed semantics of tweet metadata for user age prediction
Abstract
Social media data represent an important resource for behavioral analysis of the aging population. This paper addresses the problem of age prediction from a Twitter dataset, where the prediction issue is viewed as a classification task. For this purpose, an innovative model based on a Convolutional Neural Network (CNN) is devised. To this end, we rely on language-related features and social media specific metadata. More specifically, we introduce two features that have not been previously considered in the literature: the content of URLs and hashtags appearing in tweets. We also employ distributed representations of words and phrases present in tweets, hashtags and URLs, pre-trained on appropriate corpora in order to exploit their semantic information in age prediction. We show that our CNN-based classifier, when compared with baseline models, yields an improvement of up to 12.3% for the Dutch dataset, 9.8% for the English1 dataset, and 6.6% for the English2 dataset in the micro-averaged F1 score.
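For readers unfamiliar with the reported metric, the micro-averaged F1 score pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below is illustrative only; the function name `micro_f1` and the toy age-group labels are assumptions, not material from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every class, then
    compute a single precision, recall and F1 from the pooled counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical age-group labels, for illustration only.
    truth = ["young", "old", "young", "middle"]
    guess = ["young", "young", "young", "middle"]
    print(micro_f1(truth, guess))
```

In a single-label classification task such as this one, every misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with plain accuracy.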
VR ethnography: a pilot study on the use of virtual reality "go-along" interviews in Google Street View
Abstract
Go-along interviewing is an emerging qualitative research method for eliciting contextualised perspectives in which informants and observers conduct mobile interviews while navigating in real or imagined sites. This paper describes the results of a pilot study that uses virtual reality (VR) go-along interviews to explore university community members' (N=6) contextualized perceptions of urban habitat fragmentation due to new transportation infrastructure. Participants were immersed in the popular Google Street View service and asked to navigate from the University campus to the city center. Along that route, construction sites featured in 360° images acted as prompts for discussing ecological change. Preliminary results indicate that VR go-along interviews are able to evoke emotions and inform a broad range of research questions with regard to both verbal and non-verbal feedback received from the informants.
Crime prediction using hotel reviews?
Abstract
Can hotel reviews be used as a proxy for predicting crime hotspots? Domain knowledge indicates that hotels are crime attractors, and therefore, hotel guests might be reliable "human crime sensors". In order to assess this heuristic, we propose a novel method that maps actual crime events onto hotel reviews from London, using spatial clustering and sentiment feedback. Preliminary findings indicate that sentiment scores from hotel reviews are inversely correlated with crime intensity. Hotels with positive reviews are more likely to be adjacent to crime hotspots, and vice versa. One possible explanation for this counterintuitive finding is that the review data are not mapped against specific crime types, and thus the crime data capture mostly police visibility on the site. More research and domain knowledge are needed to establish the strength of hotel reviews as a proxy for crime prediction.
Inferring demographic data of marginalized users in Twitter with computer vision APIs
Abstract
Inferring demographic intelligence from unlabeled social media data is an actively growing area of research, challenged by low availability of ground truth annotated training corpora. High-accuracy approaches for labeling demographic traits of social media users employ various heuristics that do not scale up and often discount non-English texts and marginalized users. First, we present a framework for inferring the demographic attributes of Twitter users from their profile pictures (avatars) using the Microsoft Azure Face API. Second, we measure the inter-rater agreement between annotations made using our framework and two pre-labeled samples of Twitter users (N1=1163; N2=659) whose age labels were manually annotated. Our results indicate that the strength of the inter-rater agreement (Gwet's AC1=0.89; 0.90) between the gold standard and our approach is "very good" for labelling the age group of users. The paper provides a use case of Computer Vision for enabling the development of large cross-sectional labeled datasets, and further advances novel solutions in the field of demographic inference from short social media texts.
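Gwet's AC1 for two raters is defined as (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e is a chance-agreement term built from the average marginal proportion of each category. The sketch below illustrates that formula under stated assumptions: the function name `gwet_ac1`, its optional `categories` argument, and the toy labels are hypothetical, and nothing here reproduces the paper's own computation.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 agreement coefficient for two raters and nominal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Use the full rating scale if given, otherwise the labels observed.
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 requires at least two categories")

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum of pi_k * (1 - pi_k) over categories, divided by
    # (q - 1), where pi_k averages the two raters' proportions for category k.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(
        pi * (1 - pi)
        for pi in ((count_a[c] + count_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical age-group labels from a manual annotator and an API.
    manual = ["young", "young", "old", "old"]
    automatic = ["young", "old", "old", "young"]
    print(gwet_ac1(manual, automatic))
```

Passing `categories` explicitly matters when some age groups on the scale never occur in a sample, because the number of categories enters the chance-agreement term.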
SemanPhone: combining semantic and phonetic word association in verbal learning context
Abstract
This paper proposes an effective way to discover and memorize new English vocabulary based on both semantic and phonetic associations. The proposed method aims to automatically find the words most strongly associated with a given target word. Semantic association was measured by calculating the cosine similarity of two word vectors, and phonetic association was measured by calculating the longest common subsequence of the phonetic symbol strings of two words. Finally, the method was implemented as a web application.
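The two measures named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and how the paper normalises the subsequence length or combines the two scores is not stated here, so neither step is attempted.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lcs_length(s, t):
    """Length of the longest common subsequence of two symbol strings,
    computed row by row with dynamic programming."""
    prev = [0] * (len(t) + 1)
    for symbol_s in s:
        curr = [0]
        for j, symbol_t in enumerate(t, 1):
            if symbol_s == symbol_t:
                curr.append(prev[j - 1] + 1)  # extend a shared subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry the best so far
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Both functions accept any sequences, so phonetic transcriptions can be passed either as strings or as lists of phoneme symbols.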
Catchem: a browser plugin for the Panama Papers using approximate string matching
Abstract
The Panama Papers is a collection of 11.5 million leaked records that contain information for more than 214,488 offshore entities. This collection is growing rapidly as more leaked records become available online. In this paper, we present a work in progress on a web browser plugin that detects company names from the Panama Papers and alerts the user by means of unobtrusive visual cues. We matched a random sample of company names from the Public Works and Government Services Canada registry against the Panama Papers using three different string matching techniques. Monge-Elkan is found to provide the best matching results but at increased computational cost. A Levenshtein-based approach is found to provide the best tradeoff between matching and computational cost, while a Jaccard index-like approach is found to be less sensitive to slight textual changes.
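The three families of string matching named above can be illustrated with the short sketch below. It rests on several assumptions that are not from the paper: the function names, whitespace tokenisation, lower-casing, the use of token sets for the Jaccard variant, and normalised Levenshtein similarity as the inner measure for Monge-Elkan.

```python
def levenshtein(s, t):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def jaccard(s, t):
    """Jaccard index over lower-cased token sets (assumed tokenisation)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def monge_elkan(s, t, inner=levenshtein_similarity):
    """Monge-Elkan: average, over tokens of s, of the best inner-similarity
    match among tokens of t. Note that the measure is not symmetric."""
    a, b = s.lower().split(), t.lower().split()
    if not a or not b:
        return 0.0
    return sum(max(inner(x, y) for y in b) for x in a) / len(a)


if __name__ == "__main__":
    # Hypothetical company names, for illustration only.
    left, right = "Acme Holdings Ltd", "Acme Holding Limited"
    print(levenshtein_similarity(left.lower(), right.lower()))
    print(jaccard(left, right))
    print(monge_elkan(left, right))
```

Monge-Elkan runs the inner measure on every token pair, which is consistent with the higher computational cost the abstract reports for it.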