BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Extrema-weighted feature extraction for functional data
Motivation: Although there is a rich literature on methods for assessing the
impact of functional predictors, the focus has been on approaches for dimension
reduction that can fail dramatically in certain applications. Examples of
standard approaches include functional linear models, functional principal
components regression, and cluster-based approaches, such as latent trajectory
analysis. This article is motivated by applications in which the dynamics in a
predictor, across times when the value is relatively extreme, are particularly
informative about the response. For example, physicians are interested in
relating the dynamics of blood pressure changes during surgery to post-surgery
adverse outcomes, and it is thought that the dynamics are more important when
blood pressure is significantly elevated or lowered.
Methods: We propose a novel class of extrema-weighted feature (XWF)
extraction models. Key components in defining XWFs include the marginal density
of the predictor, a function up-weighting values at high quantiles of this
marginal, and functionals characterizing local dynamics. Algorithms are
proposed for fitting of XWF-based regression and classification models, and are
compared with current methods for functional predictors in simulations and a
blood pressure during surgery application.
Results: XWFs find features of intraoperative blood pressure trajectories
that are predictive of postoperative mortality. By their nature, most of these
features cannot be found by previous methods.
Comment: 16 pages, 9 figures
The Slitless Spectroscopy Data Extraction Software aXe
The methods and techniques for the slitless spectroscopy software aXe, which
was designed to reduce data from the various slitless spectroscopy modes of
Hubble Space Telescope instruments, are described. aXe can treat slitless
spectra from different instruments such as ACS, NICMOS and WFC3 through the use
of a configuration file which contains all the instrument dependent parameters.
The spectral extraction within aXe is based on the position, morphology
and photometry of the objects on a companion direct image. Several aspects of
slitless spectroscopy, such as the overlap of spectra, an extraction dependent
on object shape and the provision of flat-field cubes, motivate a dedicated
software package, and the solutions offered within aXe are discussed in detail.
The effect of the mutual contamination of spectra can be quantitatively
assessed in aXe, using spectral and morphological information from the
companion direct image(s). A new method named 'aXedrizzle' for 2D rebinning and
co-adding spectral data, taken with small shifts or dithers, is described. The
extraction of slitless spectra with optimal weighting is outlined and the
correction of spectra for detector fringing for the ACS CCDs is presented.
Auxiliary software for simulating slitless data and for visualizing the results
of an aXe extraction is outlined.
Comment: 18 pages, 10 figures, accepted for publication in PASP. A high
resolution version is available at
http://www.stecf.org/software/slitless_software/axe/axe_PASP.pd
Data-driven Extraction of Intonation Contour Classes
In this paper we introduce the first steps towards a new data-driven method for the extraction of intonation events that does not require any prerequisite prosodic labelling. Provided with data segmented at the syllable constituent level, it derives local and global contour classes by stylisation and subsequent clustering of the stylisation parameter vectors. Local contour classes correspond to pitch movements connected to one or several syllables and determine the local f0 shape. Global classes are connected to intonation phrases and determine the f0 register. Local classes are initially derived for syllabic segments, which are then concatenated incrementally by means of statistical language modelling of co-occurrence patterns. Due to its generality the method is in principle language independent and potentially capable of dealing with aspects of prosody other than intonation.
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, which
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System
Developing a comprehensive framework for multimodal feature extraction
Feature extraction is a critical component of many applied data science
workflows. In recent years, rapid advances in artificial intelligence and
machine learning have led to an explosion of feature extraction tools and
services that allow data scientists to cheaply and effectively annotate their
data along a vast array of dimensions---ranging from detecting faces in images
to analyzing the sentiment expressed in coherent text. Unfortunately, the
proliferation of powerful feature extraction services has been mirrored by a
corresponding expansion in the number of distinct interfaces to feature
extraction services. In a world where nearly every new service has its own API,
documentation, and/or client library, data scientists who need to combine
diverse features obtained from multiple sources are often forced to write and
maintain ever more elaborate feature extraction pipelines. To address this
challenge, we introduce a new open-source framework for comprehensive
multimodal feature extraction. Pliers is an open-source Python package that
supports standardized annotation of diverse data types (video, images, audio,
and text), and is expressly designed with both ease-of-use and extensibility in mind.
Users can apply a wide range of pre-existing feature extraction tools to their
data in just a few lines of Python code, and can also easily add their own
custom extractors by writing modular classes. A graph-based API enables rapid
development of complex feature extraction pipelines that output results in a
single, standardized format. We describe the package's architecture, detail its
major advantages over previous feature extraction toolboxes, and use a sample
application to a large functional MRI dataset to illustrate how pliers can
significantly reduce the time and effort required to construct sophisticated
feature extraction workflows while increasing code clarity and maintainability.
Relational Data Mining Through Extraction of Representative Exemplars
With the growing interest in Network Analysis, Relational Data Mining is
becoming a prominent domain of Data Mining. This paper addresses the problem
of extracting representative elements from a relational dataset. After defining
the notion of degree of representativeness, computed using the Borda
aggregation procedure, we present the extraction of exemplars which are the
representative elements of the dataset. We use these concepts to build a
network on the dataset. We present the main properties of these notions and
propose two typical applications of our framework. The first application
consists of summarizing and structuring a set of binary images and the second of
mining co-authorship relations in a research team.
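As a rough illustration of how a Borda aggregation can yield a degree of representativeness, the sketch below lets every element rank all other elements by dissimilarity and sums the resulting rank points. This is a plain Borda count under stated assumptions, not necessarily the paper's exact definition: the function name `borda_scores`, the use of a dissimilarity matrix as input, and the toy matrix `D` are all hypothetical.

```python
def borda_scores(dissimilarity):
    """Sum Borda points over all voters (hypothetical helper).

    dissimilarity[i][j] is assumed to be the dissimilarity between
    elements i and j. Each element i ranks the other n - 1 elements
    from closest to farthest; the closest receives n - 2 points and
    the farthest receives 0.
    """
    n = len(dissimilarity)
    scores = [0] * n
    for voter in range(n):
        # Candidates are all other elements, ordered from closest to
        # farthest (ties keep index order because sorted() is stable).
        candidates = sorted(
            (j for j in range(n) if j != voter),
            key=lambda j: dissimilarity[voter][j],
        )
        for rank, candidate in enumerate(candidates):
            scores[candidate] += (n - 2) - rank
    return scores


# Toy dissimilarity matrix: elements 0-2 are close, element 3 is an outlier.
D = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 7],
    [9, 8, 7, 0],
]
scores = borda_scores(D)
exemplar = scores.index(max(scores))  # element with the highest score
```

In this toy case the element with the highest score sits in the middle of the tight group, which matches the intuition of an exemplar as the element its neighbours rank closest most often.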
