31,951 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
Automated Website Fingerprinting through Deep Learning
Several studies have shown that the network traffic that is generated by a
visit to a website over Tor reveals information specific to the website through
the timing and sizes of network packets. By capturing traffic traces between
users and their Tor entry guard, a network eavesdropper can leverage this
meta-data to reveal which website Tor users are visiting. The success of such
attacks heavily depends on the particular set of traffic features that are used
to construct the fingerprint. Typically, these features are manually engineered
and, as such, any change introduced to the Tor network can render these
carefully constructed features ineffective. In this paper, we show that an
adversary can automate the feature engineering process, and thus automatically
deanonymize Tor traffic by applying our novel method based on deep learning. We
collect a dataset comprised of more than three million network traces, which is
the largest dataset of web traffic ever used for website fingerprinting, and
find that the performance achieved by our deep learning approaches is
comparable to known methods which include various research efforts spanning
over multiple years. The obtained success rate exceeds 96% for a closed world
of 100 websites and 94% for our biggest closed world of 900 classes. In our
open world evaluation, the most performant deep learning model is 2% more
accurate than the state-of-the-art attack. Furthermore, we show that the
implicit features automatically learned by our approach are far more resilient
to dynamic changes of web content over time. We conclude that the ability to
automatically construct the most relevant traffic features and perform accurate
traffic recognition makes our deep learning based approach an efficient,
flexible and robust technique for website fingerprinting.Comment: To appear in the 25th Symposium on Network and Distributed System
Security (NDSS 2018
Procedures and tools for acquisition and analysis of volatile memory on android smartphones
Mobile phone forensics have become more prominent since mobile phones have become ubiquitous both for personal and business practice. Android smartphones show tremendous growth in the global market share. Many researchers and works show the procedures and techniques for the acquisition and analysis the non-volatile memory inmobile phones. On the other hand, the physical memory (RAM) on the smartphone might retain incriminating evidence that could be acquired and analysed by the examiner. This study reveals the proper procedure for acquiring the volatile memory inthe Android smartphone and discusses the use of Linux Memory Extraction (LiME) for dumping the volatile memory. The study also discusses the analysis process of the memory image with Volatility 2.3, especially how the application shows its capability analysis. Despite its advancement there are two major concerns for both applications. First, the examiners have to gain root privileges before executing LiME. Second, both applications have no generic solution or approach. On the other hand, currently there is no other tool or option that might give the same result as LiME and Volatility 2.3
XML content warehousing: Improving sociological studies of mailing lists and web data
In this paper, we present the guidelines for an XML-based approach for the
sociological study of Web data such as the analysis of mailing lists or
databases available online. The use of an XML warehouse is a flexible solution
for storing and processing this kind of data. We propose an implemented
solution and show possible applications with our case study of profiles of
experts involved in W3C standard-setting activity. We illustrate the
sociological use of semi-structured databases by presenting our XML Schema for
mailing-list warehousing. An XML Schema allows many adjunctions or crossings of
data sources, without modifying existing data sets, while allowing possible
structural evolution. We also show that the existence of hidden data implies
increased complexity for traditional SQL users. XML content warehousing allows
altogether exhaustive warehousing and recursive queries through contents, with
far less dependence on the initial storage. We finally present the possibility
of exporting the data stored in the warehouse to commonly-used advanced
software devoted to sociological analysis
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
Recently, substantial research effort has focused on how to apply CNNs or
RNNs to better extract temporal patterns from videos, so as to improve the
accuracy of video classification. In this paper, however, we show that temporal
information, especially longer-term patterns, may not be necessary to achieve
competitive results on common video classification datasets. We investigate the
potential of a purely attention based local feature integration. Accounting for
the characteristics of such features in video classification, we propose a
local feature integration framework based on attention clusters, and introduce
a shifting operation to capture more diverse signals. We carefully analyze and
compare the effect of different attention mechanisms, cluster sizes, and the
use of the shifting operation, and also investigate the combination of
attention clusters for multimodal integration. We demonstrate the effectiveness
of our framework on three real-world video classification datasets. Our model
achieves competitive results across all of these. In particular, on the
large-scale Kinetics dataset, our framework obtains an excellent single model
accuracy of 79.4% in terms of the top-1 and 94.0% in terms of the top-5
accuracy on the validation set. The attention clusters are the backbone of our
winner solution at ActivityNet Kinetics Challenge 2017. Code and models will be
released soon.Comment: The backbone of the winner solution at ActivityNet Kinetics Challenge
201
- …