Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amounts of structured data continuously generated
and disseminated by Web 2.0, Social Media and Online Social Network users,
which offers unprecedented opportunities to analyze human behavior at a very
large scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed to
work in a given domain in other domains.
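The survey itself contains no code, but the hand-written wrapper idea at the heart of many Web Data Extraction systems is easy to sketch. The snippet below is only an illustrative sketch in Python: the page layout (the CSS classes "product", "name", and "price") is hypothetical, and the third-party BeautifulSoup library is assumed.

    # Minimal hand-written wrapper: turn repeated HTML structure into records.
    # The markup and class names below are invented for illustration only.
    from bs4 import BeautifulSoup

    html = """
    <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
    <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
    """

    soup = BeautifulSoup(html, "html.parser")
    records = [
        {
            "name": item.select_one(".name").get_text(strip=True),
            "price": float(item.select_one(".price").get_text(strip=True)),
        }
        for item in soup.select(".product")
    ]
    print(records)  # [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.5}]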
An Analysis of the Consequences of the General Data Protection Regulation on Social Network Research
This article examines the principles outlined in the General Data Protection Regulation in the context of social network data. We provide both a practical guide to General Data Protection Regulation-compliant social network data processing, covering aspects such as data collection, consent, anonymization, and data analysis, and a broader discussion of the problems that emerge when the general principles on which the regulation is based are instantiated for this research area.
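The article discusses principles rather than code, but one aspect it covers, the anonymization of social network data, can be made concrete. The following is a minimal sketch of one common pseudonymization technique, not the authors' method: user identifiers in an edge list are replaced with keyed HMAC-SHA256 digests, with the (hypothetical) secret key stored separately from the data. Note that under the GDPR, pseudonymized data still counts as personal data.

    # Keyed pseudonymization of a social network edge list (illustrative only).
    # The secret key must be stored apart from the dataset itself.
    import hashlib
    import hmac

    SECRET_KEY = b"keep-me-separate-from-the-dataset"  # hypothetical key

    def pseudonymize(user_id: str) -> str:
        return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

    edges = [("alice", "bob"), ("bob", "carol")]  # toy social graph
    pseudo_edges = [(pseudonymize(u), pseudonymize(v)) for u, v in edges]
    print(pseudo_edges)  # same user always maps to the same pseudonym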
Social Bots: Human-Like by Means of Human Control?
Social bots are currently regarded as an influential but also somewhat
mysterious factor in public discourse and opinion making. They are considered
to be capable of massively distributing propaganda in social and online media
and their application is even suspected to be partly responsible for recent
election results. Astonishingly, the term 'Social Bot' is not well defined and
different scientific disciplines use divergent definitions. This work starts
with a balanced definition attempt, before providing an overview of how social
bots actually work (taking the example of Twitter) and what their current
technical limitations are. Despite recent research progress in Deep Learning
and Big Data, there are many activities bots cannot handle well. We then
discuss how bot capabilities can be extended and controlled by integrating
humans into the process, and argue that this is currently the most promising
way to realize effective interactions with other humans.
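The human-in-the-loop idea the paper argues for can be sketched briefly. In the code below, the bot drafts a reply but only posts it after a human operator approves; it uses Tweepy's real v2 Client API, while the credentials and draft_reply() are hypothetical placeholders for whatever generation model a bot might use.

    # Human-controlled bot: the machine drafts, a human approves, then it posts.
    # Credentials and draft_reply() are illustrative placeholders.
    import tweepy

    client = tweepy.Client(
        consumer_key="...", consumer_secret="...",
        access_token="...", access_token_secret="...",
    )

    def draft_reply(tweet_text: str) -> str:
        # Stand-in for a Deep Learning text generator.
        return "Interesting point. Do you have a source for that?"

    def reply_with_human_approval(tweet_id: int, tweet_text: str) -> None:
        draft = draft_reply(tweet_text)
        print(f"Draft reply to tweet {tweet_id}: {draft!r}")
        if input("Post it? [y/N] ").strip().lower() == "y":  # the human control step
            client.create_tweet(text=draft, in_reply_to_tweet_id=tweet_id)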
TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild
The extraction of cyber threat intelligence (CTI) from open sources is a
rapidly expanding defensive strategy that enhances the resilience of both
Information Technology (IT) and Operational Technology (OT) environments
against large-scale cyber-attacks. While previous research has focused on
improving individual components of the extraction process, the community lacks
open-source platforms for deploying streaming CTI data pipelines in the wild.
To address this gap, this study describes the implementation of an efficient,
well-performing platform capable of processing compute-intensive data
pipelines, based on the cloud computing paradigm, for the real-time detection,
collection, and sharing of CTI from different online sources. We developed a
prototype platform
(TSTEM), a containerized microservice architecture that uses Tweepy, Scrapy,
Terraform, ELK, Kafka, and MLOps to autonomously search, extract, and index
IOCs in the wild. Moreover, the provisioning, monitoring, and management of the
TSTEM platform are achieved through infrastructure as code (IaC). Custom
focused crawlers collect web content, which is then processed by a first-level
classifier to identify potential indicators of compromise (IOCs). If deemed
relevant, the content advances to a second level of extraction for further
examination. Throughout this process, state-of-the-art NLP models are utilized
for classification and entity extraction, enhancing the overall IOC extraction
methodology. Our experimental results indicate that these models exhibit high
accuracy (exceeding 98%) on the classification and extraction tasks, achieving
this performance in under a minute. The effectiveness
of our system can be attributed to a finely-tuned IOC extraction method that
operates at multiple stages, ensuring precise identification of relevant
information with a low false-positive rate.
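The two-stage design the abstract describes is easy to illustrate. The sketch below is not TSTEM's actual code: a keyword gate stands in for the first-level relevance classifier and a few regular expressions stand in for the second-level NLP entity extractor, but the control flow, in which only relevant content reaches extraction, mirrors the described pipeline.

    # Two-stage IOC extraction: a cheap first-level gate, then extraction.
    # The keyword gate and regexes are simplified stand-ins for the NLP models.
    import re

    IOC_PATTERNS = {
        "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
        "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|io)\b"),
    }

    def first_level_relevant(text: str) -> bool:
        # Stand-in for the first-level relevance classifier.
        return any(kw in text.lower() for kw in ("malware", "phishing", "c2", "exploit"))

    def second_level_extract(text: str) -> dict:
        # Stand-in for the second-level entity-extraction model.
        return {name: pat.findall(text) for name, pat in IOC_PATTERNS.items()}

    def process(document: str) -> dict:
        # Irrelevant documents never reach the (expensive) extraction stage.
        return second_level_extract(document) if first_level_relevant(document) else {}

    print(process("Phishing campaign: C2 at 203.0.113.7, payload hosted on evil-cdn.net"))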