On Extracting Data from Tables that are Encoded using HTML
Tables are a common means of displaying data in human-friendly formats. Many
authors have worked on proposals to extract these data back into structured
form, since this has many interesting applications. In this article, we
summarise and compare many of the proposals, published between 2000 and 2018,
to extract data from tables that are encoded using HTML. We first present a
vocabulary that homogenises the terminology used in this field; next, we use it to summarise
the proposals; finally, we compare them side by side. Our analysis highlights
several challenges to which no proposal provides a conclusive solution and a
few more that have not been addressed sufficiently; simply put, no proposal
provides a complete solution to the problem, which suggests that this
research field will remain active in the near future. We have also realised that
there is no consensus regarding the datasets and the methods used to evaluate
the proposals, which hampers comparing the experimental results.
Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-
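To make the extraction problem concrete, here is a minimal sketch, not taken from the article, of the basic step that the surveyed proposals automate and refine: mapping an HTML table onto a list of records with BeautifulSoup. The single-header-row assumption is illustrative only; real proposals must also locate data tables among layout tables and handle spanned cells, nested tables, and multi-row headers.

```python
# A naive mapping from an HTML table to a list of records (a sketch only;
# it assumes exactly one header row and no rowspan/colspan cells).
from bs4 import BeautifulSoup

def extract_table(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Assume the first row holds the column labels.
    headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    return [
        dict(zip(headers, [c.get_text(strip=True) for c in row.find_all("td")]))
        for row in rows[1:]
    ]

html = """<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>"""
print(extract_table(html))  # [{'Country': 'France', 'Capital': 'Paris'}]
```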
Statically Checking Web API Requests in JavaScript
Many JavaScript applications perform HTTP requests to web APIs, relying on
the request URL, HTTP method, and request data to be constructed correctly by
string operations. Traditional compile-time error checking, such as detecting
a call to a non-existent method in Java, is not available for checking whether such
requests comply with the requirements of a web API. In this paper, we propose
an approach to statically check web API requests in JavaScript. Our approach
first extracts a request's URL string, HTTP method, and the corresponding
request data using an inter-procedural string analysis, and then checks whether
the request conforms to given web API specifications. We evaluated our approach
by checking whether web API requests in JavaScript files mined from GitHub are
consistent or inconsistent with publicly available API specifications. From the
6575 requests in scope, our approach determined whether the request's URL and
HTTP method were consistent or inconsistent with web API specifications with a
precision of 96.0%. Our approach also correctly determined whether extracted
request data was consistent or inconsistent with the data requirements with a
precision of 87.9% for payload data and 99.9% for query data. In a systematic
analysis of the inconsistent cases, we found that many of them were due to
errors in the client code. The checker proposed here can be integrated with
code editors or with continuous integration tools to warn programmers about
code containing potentially erroneous requests.
Comment: International Conference on Software Engineering, 201
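As a rough illustration of the checking step, the sketch below compares an already-extracted URL and HTTP method against a toy specification. The spec layout and the path-template matching are simplified assumptions made here for illustration; the paper's approach works against full web API specifications and also validates payload and query data.

```python
# Check an extracted (URL, method) pair against a toy API specification.
# SPEC is a hypothetical structure invented for this sketch.
from urllib.parse import urlparse

SPEC = {  # path template -> set of allowed HTTP methods
    "/users": {"GET", "POST"},
    "/users/{id}": {"GET", "DELETE"},
}

def path_matches(template: str, path: str) -> bool:
    t = template.strip("/").split("/")
    p = path.strip("/").split("/")
    # A '{...}' segment matches any value; other segments must match exactly.
    return len(t) == len(p) and all(
        seg.startswith("{") or seg == actual for seg, actual in zip(t, p)
    )

def check_request(url: str, method: str) -> bool:
    path = urlparse(url).path
    return any(path_matches(tmpl, path) and method in methods
               for tmpl, methods in SPEC.items())

print(check_request("https://api.example.com/users/42", "GET"))    # True
print(check_request("https://api.example.com/users/42", "PATCH"))  # False
```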
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
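The sketch below conveys the spirit of the verification step under heavily simplified assumptions: it learns coarse structural features of correctly extracted values from positive examples alone and flags new output that deviates from them. The two features and the tolerance are stand-ins invented for this sketch; the actual system learns richer structural patterns over the data.

```python
# Toy wrapper verification: compare structural features of freshly extracted
# values against features learned from known-good (positive) examples.
import statistics

def features(values: list[str]) -> tuple[float, float]:
    mean_tokens = statistics.mean(len(v.split()) for v in values)
    digit_frac = sum(any(c.isdigit() for c in v) for v in values) / len(values)
    return mean_tokens, digit_frac

def verify(trained: list[str], current: list[str], tol: float = 0.5) -> bool:
    ref_len, ref_dig = features(trained)
    cur_len, cur_dig = features(current)
    return abs(cur_len - ref_len) <= tol * ref_len and abs(cur_dig - ref_dig) <= tol

prices = ["$12.99", "$8.50", "$103.00"]
print(verify(prices, ["$15.75", "$9.99"]))              # True: still looks like prices
print(verify(prices, ["Page not found", "Error 404"]))  # False: source format changed
```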
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather large amounts of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, which
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques, originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System
Tight and simple Web graph compression
Analysing Web graphs has applications in determining page ranks, fighting Web
spam, detecting communities and mirror sites, and more. This study is, however,
hampered by the necessity of storing a major part of huge graphs in
external memory, which prevents efficient random access to edge (hyperlink)
lists. A number of algorithms involving compression techniques have thus been
presented that represent Web graphs succinctly while also providing random access.
Those techniques are usually based on differential encodings of the adjacency
lists, finding repeating nodes or node regions in the successive lists, more
general grammar-based transformations or 2-dimensional representations of the
binary matrix of the graph. In this paper we present two Web graph compression
algorithms. The first can be seen as a reengineering of the Boldi and Vigna (2004)
method. We extend the notion of similarity between link lists and use a more
compact encoding of residuals. The algorithm works on blocks of varying size
(in the number of input lines) and sacrifices access time for a better
compression ratio, achieving a more succinct graph representation than other
algorithms reported in the literature. The second algorithm works on blocks of
the same size (in the number of input lines), and its key mechanism is merging
each block into a single ordered list. This method achieves much more attractive
space-time tradeoffs.
Comment: 15 page
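For background, here is a minimal sketch of the differential (gap) encoding of adjacency lists on which such compression schemes build: since each successor list is sorted, storing gaps between consecutive node IDs yields small integers that a variable-length code can store compactly. This is only the shared baseline; the two algorithms above add similarity references between lists, residual encoding, and block-level merging on top of it.

```python
# Differential (gap) encoding of a sorted adjacency list: large node IDs
# become small gap values, which variable-length codes compress well.
def encode_gaps(successors: list[int]) -> list[int]:
    gaps, prev = [], 0
    for node in sorted(successors):
        gaps.append(node - prev)
        prev = node
    return gaps

def decode_gaps(gaps: list[int]) -> list[int]:
    nodes, acc = [], 0
    for g in gaps:
        acc += g
        nodes.append(acc)
    return nodes

adj = [100001, 100002, 100005, 100010]
print(encode_gaps(adj))                      # [100001, 1, 3, 5]
assert decode_gaps(encode_gaps(adj)) == adj  # round-trips losslessly
```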
Technical Report: CSVM Ecosystem
The CSVM format is derived from the CSV format and allows the storage of
tabular-like data with a limited but extensible amount of metadata. This
approach could help computer scientists because all the information needed to
use the data subsequently is included in the CSVM file; it is particularly well
suited to handling RAW data in many scientific fields and to being used as a
canonical format. The use of CSVM has shown that it greatly facilitates: data
management independently of databases; data exchange; the integration of RAW
data in dataflows or calculation pipelines; and the search for best practices
in RAW data management. The efficiency of this format is closely related to its
plasticity: a generic frame is given for all kinds of data, and CSVM parsers
make no interpretation of data types. This task is done by the application
layer, so it is possible to use the same format and the same parser code for
many purposes. In this document, implementations of the CSVM format, developed
over ten years in different laboratories, are presented. Some programming
examples are also shown: a Python toolkit for using, manipulating, and querying
the format is available. A first specification of this format (CSVM-1) is now
defined, as well as some derivatives such as CSVM dictionaries used for data
interchange. CSVM is an Open Format and could serve as a support for Open
Data and for long-term conservation of RAW or unpublished data.
Comment: 31 pages including 2p of Anne
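To give a feel for the approach, here is a sketch of a parser for a CSVM-like file. The concrete syntax assumed below (metadata lines beginning with '#' that hold key;value pairs, followed by a semicolon-separated data block) is an invention for this example, not the CSVM-1 specification. What it does preserve from the description above is the division of labour: the parser returns raw strings and leaves all interpretation of data types to the application layer.

```python
# Illustrative parser for a CSVM-like file: metadata lines (here, '#key;value')
# are collected into a dict; the data block is read as semicolon-separated
# rows of raw strings, with type interpretation left to the application.
import csv, io

def parse_csvm_like(text: str) -> tuple[dict, list[list[str]]]:
    meta, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition(";")
            meta[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.reader(io.StringIO("\n".join(data_lines)), delimiter=";"))
    return meta, rows

sample = "#instrument;spectrometer-A\n#date;2014-03-01\nwavelength;intensity\n400;0.12\n410;0.15\n"
meta, rows = parse_csvm_like(sample)
print(meta)  # {'instrument': 'spectrometer-A', 'date': '2014-03-01'}
print(rows)  # [['wavelength', 'intensity'], ['400', '0.12'], ['410', '0.15']]
```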