2,101,823 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Extrema-weighted feature extraction for functional data

    Full text link
    Motivation: Although there is a rich literature on methods for assessing the impact of functional predictors, the focus has been on approaches for dimension reduction that can fail dramatically in certain applications. Examples of standard approaches include functional linear models, functional principal components regression, and cluster-based approaches, such as latent trajectory analysis. This article is motivated by applications in which the dynamics in a predictor, across times when the value is relatively extreme, are particularly informative about the response. For example, physicians are interested in relating the dynamics of blood pressure changes during surgery to post-surgery adverse outcomes, and it is thought that the dynamics are more important when blood pressure is significantly elevated or lowered. Methods: We propose a novel class of extrema-weighted feature (XWF) extraction models. Key components in defining XWFs include the marginal density of the predictor, a function up-weighting values at high quantiles of this marginal, and functionals characterizing local dynamics. Algorithms are proposed for fitting of XWF-based regression and classification models, and are compared with current methods for functional predictors in simulations and a blood pressure during surgery application. Results: XWFs find features of intraoperative blood pressure trajectories that are predictive of postoperative mortality. By their nature, most of these features cannot be found by previous methods.Comment: 16 pages, 9 figure

    The Slitless Spectroscopy Data Extraction Software aXe

    Full text link
    The methods and techniques for the slitless spectroscopy software aXe, which was designed to reduce data from the various slitless spectroscopy modes of Hubble Space Telescope instruments, are described. aXe can treat slitless spectra from different instruments such as ACS, NICMOS and WFC3 through the use of a configuration file which contains all the instrument dependent parameters. The basis of the spectral extraction within aXe are the position, morphology and photometry of the objects on a companion direct image. Several aspects of slitless spectroscopy, such as the overlap of spectra, an extraction dependent on object shape and the provision of flat-field cubes, motivate a dedicated software package, and the solutions offered within aXe are discussed in detail. The effect of the mutual contamination of spectra can be quantitatively assessed in aXe, using spectral and morphological information from the companion direct image(s). A new method named 'aXedrizzle' for 2D rebinning and co-adding spectral data, taken with small shifts or dithers, is described. The extraction of slitless spectra with optimal weighting is outlined and the correction of spectra for detector fringing for the ACS CCD's is presented. Auxiliary software for simulating slitless data and for visualizing the results of an aXe extraction is outlined.Comment: 18 pages, 10 figures, accepted for publication in PASP. A high resolution version is available at http://www.stecf.org/software/slitless_software/axe/axe_PASP.pd

    Data-driven Extraction of Intonation Contour Classes

    Get PDF
    In this paper we introduce the first steps towards a new datadriven method for extraction of intonation events that does not require any prerequisite prosodic labelling. Provided with data segmented on the syllable constituent level it derives local and global contour classes by stylisation and subsequent clustering of the stylisation parameter vectors. Local contour classes correspond to pitch movements connected to one or several syllables and determine the local f0 shape. Global classes are connected to intonation phrases and determine the f0 register. Local classes initially are derived for syllabic segments, which are then concatenated incrementally by means of statistical language modelling of co-occurrence patterns. Due to its generality the method is in principal language independent and potentially capable to deal also with other aspects of prosody than intonation. 1

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    Developing a comprehensive framework for multimodal feature extraction

    Full text link
    Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions---ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (video, images, audio, and text), and is expressly with both ease-of-use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of complex feature extraction pipelines that output results in a single, standardized format. We describe the package's architecture, detail its major advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct sophisticated feature extraction workflows while increasing code clarity and maintainability

    Relational Data Mining Through Extraction of Representative Exemplars

    Full text link
    With the growing interest on Network Analysis, Relational Data Mining is becoming an emphasized domain of Data Mining. This paper addresses the problem of extracting representative elements from a relational dataset. After defining the notion of degree of representativeness, computed using the Borda aggregation procedure, we present the extraction of exemplars which are the representative elements of the dataset. We use these concepts to build a network on the dataset. We expose the main properties of these notions and we propose two typical applications of our framework. The first application consists in resuming and structuring a set of binary images and the second in mining co-authoring relation in a research team
    corecore