Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems in ad-hoc domains, while others heavily reuse techniques and algorithms developed in the field of Information Extraction.
This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains.
Comment: Knowledge-based System
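To ground the kind of technique surveyed here, the snippet below is a minimal wrapper-style extractor over a hypothetical page layout (the CSS classes and record structure are invented for illustration), using only the Python standard library:

```python
# Minimal wrapper-style extractor sketch. Assumes a hypothetical layout:
# each record is a <div class="product"> with <span class="name"> and
# <span class="price"> children.
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.records = []      # extracted (name, price) pairs
        self._field = None     # field currently being read, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and cls == "product":
            self._current = {}
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and {"name", "price"} <= self._current.keys():
            self.records.append((self._current["name"], self._current["price"]))
            self._current = {}

page = '<div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>'
p = ProductExtractor()
p.feed(page)
print(p.records)  # [('Widget', '9.99')]
```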
Lost in translation: Exposing hidden compiler optimization opportunities
Existing iterative compilation and machine-learning-based optimization techniques have proven very successful at achieving better optimizations than a compiler's standard optimization levels. However, they were not
engineered to support the tuning of a compiler's optimizer as part of the
compiler's daily development cycle. In this paper, we first establish the
required properties which a technique must exhibit to enable such tuning. We
then introduce an enhancement to the classic nightly routine testing of
compilers which exhibits all the required properties, and thus, is capable of
driving the improvement and tuning of the compiler's common optimizer. This is
achieved by leveraging resource usage and compilation information collected
while systematically exploiting prefixes of the transformations applied at
standard optimization levels. Experimental evaluation using the LLVM v6.0.1
compiler demonstrated that the new approach was able to reveal hidden
cross-architecture and architecture-dependent potential optimizations on two
popular processors: the Intel i5-6300U and the Arm Cortex-A53-based Broadcom
BCM2837 used in the Raspberry Pi 3B+. As a case study, we demonstrate how the
insights from our approach enabled us to identify and remove a significant
shortcoming of the CFG simplification pass of the LLVM v6.0.1 compiler.
Comment: 31 pages, 7 figures, 2 tables. arXiv admin note: text overlap with arXiv:1802.0984
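The core idea, as we read it, can be sketched in a few lines: build the program with every prefix of the standard pipeline's pass sequence and compare resource usage; a prefix that beats the full pipeline exposes the next pass as a suspect. The pass names and the cost model below are toy placeholders, not LLVM's real -O pipeline or a real build:

```python
# Sketch of prefix-based exploration over a standard optimization pipeline.
O2_PASSES = ["inline", "gvn", "licm", "simplifycfg"]  # illustrative names

# Toy stand-in for "compile with exactly these passes, then run and measure";
# in the real setting this would invoke the compiler and time the binary.
TOY_EFFECT = {"inline": 0.9, "gvn": 0.95, "licm": 0.9, "simplifycfg": 1.1}

def measure_prefix(prefix, base_runtime=10.0):
    runtime = base_runtime
    for p in prefix:
        runtime *= TOY_EFFECT[p]  # <1 helps, >1 hurts this program
    return runtime

results = {}
for k in range(len(O2_PASSES) + 1):
    prefix = tuple(O2_PASSES[:k])   # first k transformations only
    results[prefix] = measure_prefix(prefix)

best = min(results, key=results.get)
full = tuple(O2_PASSES)
if results[best] < results[full]:
    # The pass applied right after the best prefix degrades this program
    # on this (toy) architecture and is flagged for inspection.
    culprit = O2_PASSES[len(best)]
    print(f"prefix {best} beats full pipeline; suspect pass: {culprit!r}")
```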
Segmentation and semantic labelling of RGBD data with convolutional neural networks and surface fitting
We present an approach for segmentation and semantic labelling of RGBD data that exploits geometrical cues together with deep learning techniques. An initial over-segmentation is performed using spectral clustering, and a set of non-uniform rational B-spline (NURBS) surfaces is fitted to the extracted segments. A convolutional neural network (CNN) then receives as input the colour and geometry data together with the surface fitting parameters. The network is made of nine convolutional stages followed by a softmax classifier and produces a vector of descriptors for each sample. In the next step, an iterative merging algorithm recombines the output of the over-segmentation into larger regions matching the various elements of the scene. Pairs of adjacent segments with high similarity according to the CNN features are candidates for merging, and the surface fitting accuracy is used to detect which pairs of segments belong to the same surface. Finally, a set of labelled segments is obtained by combining the segmentation output with the descriptors from the CNN. Experimental results show how the proposed approach outperforms state-of-the-art methods and provides accurate segmentation and labelling.
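A minimal sketch of the iterative merging step as described in the abstract, with the similarity threshold and the surface-fit test as hypothetical placeholders:

```python
# Adjacent segments whose CNN descriptors are most similar are merged first,
# gated by a surface-fit check; regions are tracked with a union-find forest.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_surface(seg_a, seg_b):
    """Placeholder for the NURBS fitting-accuracy test on the merged region."""
    return True  # assume the fit residual stays below threshold

def merge_segments(descriptors, adjacency, sim_threshold=0.9):
    """descriptors: {seg_id: np.ndarray}; adjacency: set of frozenset pairs."""
    parent = {s: s for s in descriptors}          # union-find forest
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]         # path compression
            s = parent[s]
        return s
    while True:
        candidates = []
        for a, b in map(tuple, adjacency):
            ra, rb = find(a), find(b)
            if ra == rb:
                continue
            sim = cosine(descriptors[ra], descriptors[rb])
            if sim >= sim_threshold and same_surface(ra, rb):
                candidates.append((sim, ra, rb))
        if not candidates:
            break
        _, ra, rb = max(candidates)               # most similar pair first
        parent[rb] = ra
        descriptors[ra] = (descriptors[ra] + descriptors[rb]) / 2
    return {s: find(s) for s in descriptors}

desc = {1: np.array([1.0, 0.0]), 2: np.array([0.9, 0.1]), 3: np.array([0.0, 1.0])}
adj = {frozenset({1, 2}), frozenset({2, 3})}
print(merge_segments(desc, adj))  # 1 and 2 merge; 3 stays separate
```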
Integrated Support for Handoff Management and Context-Awareness in Heterogeneous Wireless Networks
The overwhelming success of mobile devices and wireless
communications is stressing the need for the development of
mobility-aware services. Device mobility requires services to adapt their behavior to sudden context changes and to be aware of handoffs, which introduce unpredictable delays and
intermittent discontinuities. Heterogeneity of wireless
technologies (Wi-Fi, Bluetooth, 3G) complicates the situation,
since a different treatment of context-awareness and handoffs is
required for each solution. This paper presents a middleware
architecture designed to ease mobility-aware service
development. The architecture hides technology-specific
mechanisms and offers a set of facilities for context awareness
and handoff management. The architecture prototype works with
Bluetooth and Wi-Fi, which today represent two of the most
widespread wireless technologies. In addition, the paper discusses
motivations and design details in the challenging context of
mobile multimedia streaming applications.
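The abstract does not spell out the API, but a facility layer of this kind might look like the following sketch, where all class and method names are invented for illustration and may differ from the paper's actual architecture:

```python
# The middleware hides whether the underlying link is Bluetooth or Wi-Fi
# and notifies registered services of handoffs.
from abc import ABC, abstractmethod

class LinkMonitor(ABC):
    """Technology-specific part: one subclass per wireless technology."""
    @abstractmethod
    def signal_strength(self) -> float: ...

class WiFiMonitor(LinkMonitor):
    def signal_strength(self) -> float:
        return -55.0  # stub: would query the Wi-Fi driver

class BluetoothMonitor(LinkMonitor):
    def signal_strength(self) -> float:
        return -70.0  # stub: would query the Bluetooth stack

class MobilityMiddleware:
    """Technology-independent facade offered to mobility-aware services."""
    def __init__(self, monitor: LinkMonitor):
        self.monitor = monitor
        self.handoff_listeners = []

    def on_handoff(self, callback):
        self.handoff_listeners.append(callback)

    def _handoff_detected(self, old_ap, new_ap):
        for cb in self.handoff_listeners:   # e.g. a streaming service reacting
            cb(old_ap, new_ap)

mw = MobilityMiddleware(WiFiMonitor())
mw.on_handoff(lambda old, new: print(f"handoff {old} -> {new}: rebuffer stream"))
mw._handoff_detected("AP-1", "AP-2")
```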
Automatic Model Based Dataset Generation for Fast and Accurate Crop and Weeds Detection
Selective weeding is one of the key challenges in the field of agricultural robotics. To accomplish this task, a farm robot should be able to accurately detect plants and to distinguish between crops and weeds. Most of the promising state-of-the-art approaches make use of appearance-based models trained on large annotated datasets. Unfortunately, creating large agricultural datasets with pixel-level annotations is an extremely time-consuming task, which penalizes the usage of data-driven techniques. In this paper, we address this problem by proposing a novel and effective approach that aims to dramatically minimize the human intervention needed to train the detection and classification algorithms. The idea is to procedurally generate large synthetic training datasets by randomizing the key features of the target environment (i.e., crop and weed species, type of soil, light conditions). More specifically, by tuning these model parameters and exploiting a few real-world textures, it is possible to render a large number of realistic views of an artificial agricultural scenario with no effort. The generated data can be used directly to train the model or to supplement real-world images. We validate the proposed methodology by using a modern deep-learning-based image segmentation architecture as a testbed. We compare the classification results obtained using both real and synthetic images as training data. The reported results confirm the effectiveness and the potential of our approach.
Comment: To appear in IEEE/RSJ IROS 201
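A minimal sketch of the procedural-generation idea under stated assumptions: the class set, parameter ranges and flat soil colour below are placeholders for the paper's real textures and renderer, but the key point carries over, namely that the pixel-level annotation comes for free from the generator:

```python
# Sample randomized scene parameters and composite plant "sprites" onto a
# soil background, emitting the image together with its pixel-level mask.
import random
import numpy as np

H, W = 256, 256
SOIL, CROP, WEED = 0, 1, 2

def render_scene(rng):
    params = {
        "n_crops": rng.randint(2, 6),
        "n_weeds": rng.randint(0, 10),
        "light":   rng.uniform(0.6, 1.4),   # global illumination factor
    }
    image = np.ones((H, W, 3)) * 0.4        # stand-in for a real soil texture
    mask = np.full((H, W), SOIL, np.uint8)
    for label, n in ((CROP, params["n_crops"]), (WEED, params["n_weeds"])):
        for _ in range(n):
            cy, cx = rng.randrange(H), rng.randrange(W)
            r = rng.randint(4, 12)          # plant "footprint" radius
            yy, xx = np.ogrid[:H, :W]
            blob = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
            image[blob] = (0.1, 0.6 if label == CROP else 0.4, 0.1)
            mask[blob] = label              # annotation is known by construction
    image = np.clip(image * params["light"], 0, 1)
    return image, mask

rng = random.Random(0)
dataset = [render_scene(rng) for _ in range(100)]  # 100 labelled samples, no human effort
```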
XML Schema-based Minification for Communication of Security Information and Event Management (SIEM) Systems in Cloud Environments
XML-based communication governs most of today's systems communication, due to its capability of representing complex structural and hierarchical data. However, the XML document structure is bulky, and reducing it can minimize bandwidth usage and transmission time while maximizing performance. This contributes to more efficient resource utilization and, in cloud environments, directly affects the amount of money the consumer pays. Several techniques are used to achieve this goal. This paper discusses these techniques and proposes a new XML Schema-based minification technique. The proposed technique reduces the XML structure using minification and provides a separation between the meaningful names and the underlying minified names, which preserves software/code readability. The technique is applied to Intrusion Detection Message Exchange Format (IDMEF) messages, as part of the communication of a Security Information and Event Management (SIEM) system hosted on the Microsoft Azure Cloud. Test results show message size reductions ranging from 8.15% to 50.34% on the raw messages, without using time-consuming compression techniques. Adding GZip compression to the proposed technique produces message sizes 66.1% shorter than the original XML messages.
Comment: XML, JSON, Minification, XML Schema, Cloud, Log, Communication, Compression, XMill, GZip, Code Generation, Code Readability, 9 pages, 12 figures, 5 tables, Journal Article
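A minimal sketch of the schema-based minification idea, assuming a toy name map and an IDMEF-like snippet; the paper derives the map from the XML Schema, whereas the mapping and element names here are invented for illustration:

```python
# A map from meaningful tag names to short ones is applied before
# transmission; its inverse restores readability on the receiving side.
import gzip
import xml.etree.ElementTree as ET

NAME_MAP = {"Alert": "a", "Analyzer": "n", "Classification": "c"}
INVERSE = {v: k for k, v in NAME_MAP.items()}

def rename_tags(xml_text, mapping):
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.tag = mapping.get(el.tag, el.tag)  # leave unmapped tags untouched
    return ET.tostring(root, encoding="unicode")

original = (
    '<Alert><Analyzer analyzerid="hq" />'
    '<Classification text="port-scan" /></Alert>'
)
minified = rename_tags(original, NAME_MAP)
restored = rename_tags(minified, INVERSE)
assert restored == original                   # round-trips exactly

print(len(original), len(minified))           # raw vs. minified size
print(len(gzip.compress(minified.encode())))  # minified + GZip, as in the paper
```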
Stateful Testing: Finding More Errors in Code and Contracts
Automated random testing has been shown to be an effective approach to finding faults, but it still faces a major unsolved issue: how to generate test inputs diverse enough to find many faults and to find them quickly. Stateful testing, the
automated testing technique introduced in this article, generates new test
cases that improve an existing test suite. The generated test cases are
designed to violate the dynamically inferred contracts (invariants)
characterizing the existing test suite. As a consequence, they are in a good
position to detect new errors, and also to improve the accuracy of the inferred
contracts by discovering those that are unsound. Experiments on 13 data
structure classes totalling over 28,000 lines of code demonstrate the
effectiveness of stateful testing in improving over the results of long
sessions of random testing: stateful testing found 68.4% new errors and
improved the accuracy of automatically inferred contracts to over 99%, with
just a 7% time overhead.
Comment: 11 pages, 3 figures
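A toy sketch of the stateful-testing loop described above, with a trivial unit under test and a range invariant standing in for real dynamically inferred contracts (both are stand-ins, not the article's actual tooling):

```python
# Infer a simple "contract" (an argument range) from an existing passing
# suite, then deliberately generate inputs that violate it.
import random

def sqrt_floor(x):                  # toy unit under test
    if x < 0:
        raise ValueError("negative input")
    return int(x ** 0.5)

existing_suite = [0, 1, 4, 9, 100]

# Phase 1: dynamically infer a contract from the existing tests.
lo, hi = min(existing_suite), max(existing_suite)   # inferred: lo <= x <= hi

# Phase 2: generate tests designed to violate the inferred contract.
rng = random.Random(0)
for _ in range(20):
    x = rng.choice([rng.randint(-1000, lo - 1), rng.randint(hi + 1, 10**6)])
    try:
        sqrt_floor(x)
        # The call survived outside the inferred envelope: the contract was
        # unsound, so it is refined rather than the code blamed.
        lo, hi = min(lo, x), max(hi, x)
    except ValueError:
        # A failure outside the envelope: in general this is where new
        # errors (or genuine preconditions) are discovered.
        pass

print(f"refined contract: {lo} <= x <= {hi}")
```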