    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
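
    The benefit of microdata mentioned above can be illustrated with a minimal sketch. It assumes a blog post marked up with schema.org `itemprop` attributes; the sample HTML, the class `ItempropCollector` and the function `extract_itemprops` are hypothetical and are not taken from the report. The sketch handles only `meta`, `time` and plain text elements, and it ignores nested items.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect itemprop name/value pairs from an HTML fragment (simplified)."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = []  # stack of [tag, property name, text chunks]

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("itemprop")
        if name is None:
            return
        if tag == "meta":
            # <meta itemprop=...> carries its value in the content attribute.
            self.props[name] = attr.get("content") or ""
        elif tag == "time" and attr.get("datetime"):
            # <time> prefers the machine-readable datetime attribute.
            self.props[name] = attr["datetime"]
        else:
            # Otherwise the value is the element's text content.
            self._open.append([tag, name, []])

    def handle_data(self, data):
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        # Limitation: a same-named tag nested inside a property closes it early.
        if self._open and self._open[-1][0] == tag:
            _, name, chunks = self._open.pop()
            self.props[name] = " ".join("".join(chunks).split())


def extract_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.props


# Made-up blog post fragment, for illustration only.
SAMPLE = """
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="headline">Hello</h1>
  <time itemprop="datePublished" datetime="2012-05-01">1 May 2012</time>
  <span itemprop="author">A. Blogger</span>
</article>
"""

print(extract_itemprops(SAMPLE))
```

    When such markup is present, the title, date and author can be read directly from the page instead of being inferred from its layout.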

    Optimal Extraction of Fibre Optic Spectroscopy

    We report an optimal extraction methodology for the reduction of multi-object fibre spectroscopy data, operating in the regime of tightly packed (and hence significantly overlapping) fibre profiles. The routine minimises crosstalk between adjacent fibres and statistically weights the extraction to reduce noise. As an example of the process, we use simulations of the numerous modes of operation of the AAOmega fibre spectrograph and observational data from the SPIRAL Integral Field Unit at the Anglo-Australian Telescope. Comment: Accepted for publication in PAS
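
    One way to picture the idea of weighted extraction from overlapping profiles is a simultaneous weighted least-squares fit of two neighbouring fibres along a single detector column. The sketch below is an assumption made for illustration, not the routine described in the paper: the Gaussian profile shape, the function names and all numbers are hypothetical.

```python
import math


def gaussian_profile(pixels, centre, sigma):
    """Unit-sum Gaussian profile sampled at the given pixel positions."""
    vals = [math.exp(-0.5 * ((p - centre) / sigma) ** 2) for p in pixels]
    total = sum(vals)
    return [v / total for v in vals]


def extract_two_fibres(data, variance, prof_a, prof_b):
    """Fit data = flux_a * prof_a + flux_b * prof_b with weights 1/variance.

    Solving for both fluxes at once lets the fit account for the overlap
    between the two profiles instead of summing pixels per fibre.
    """
    w = [1.0 / v for v in variance]
    saa = sum(a * a * wi for a, wi in zip(prof_a, w))
    sbb = sum(b * b * wi for b, wi in zip(prof_b, w))
    sab = sum(a * b * wi for a, b, wi in zip(prof_a, prof_b, w))
    sad = sum(a * d * wi for a, d, wi in zip(prof_a, data, w))
    sbd = sum(b * d * wi for b, d, wi in zip(prof_b, data, w))
    det = saa * sbb - sab * sab  # non-zero when the profiles differ
    flux_a = (sad * sbb - sbd * sab) / det
    flux_b = (sbd * saa - sad * sab) / det
    return flux_a, flux_b


# Hypothetical noise-free column: two fibres 3 pixels apart, sigma 1.5 pixels.
pixels = list(range(12))
prof_a = gaussian_profile(pixels, 4.0, 1.5)
prof_b = gaussian_profile(pixels, 7.0, 1.5)
data = [100.0 * a + 40.0 * b for a, b in zip(prof_a, prof_b)]
variance = [1.0] * len(pixels)

flux_a, flux_b = extract_two_fibres(data, variance, prof_a, prof_b)
print(flux_a, flux_b)
```

    With real data the variance would differ from pixel to pixel, which is where the statistical weighting matters.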

    CLaSPS: a new methodology for Knowledge extraction from complex astronomical dataset

    In this paper we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for the determination of correlations among astronomical observables in complex datasets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for the selection of the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. In this paper we discuss the applications of CLaSPS to two simple astronomical datasets, both composed of extragalactic sources with photometric observations at different wavelengths from large area surveys. The first dataset, CSC+, is composed of optical quasars spectroscopically selected in the SDSS data, observed in the X-rays by Chandra and with multi-wavelength observations in the near-infrared, optical and ultraviolet spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification of a well-known correlation between the alphaOX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other dataset consists of a sample of blazars for which photometric observations in the optical, mid- and near-infrared are available, complemented, for a subset of the sources, by Fermi gamma-ray data. The main results of the application of CLaSPS to this dataset have been the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification into BL Lacs and Flat Spectrum Radio Quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared color space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors. Comment: 18 pages, 9 figures, accepted for publication in Ap

    Optimizing a sustainable ultrasound assisted extraction method for the recovery of polyphenols from lemon by-products: comparison with hot water and organic solvent extractions

    Response surface methodology (RSM) based on a three-factor and three-level Box–Behnken design was employed for optimizing the aqueous ultrasound-assisted extraction (AUAE) conditions, including extraction time (35–45 min), extraction temperature (45–55 °C) and ultrasonic power (150–250 W), for the recovery of total phenolic content (TPC) and rutin from lemon by-products. The independent variables and their values were selected on the basis of preliminary experiments, where the effects of five extraction parameters (particle size, extraction time and temperature, ultrasonic power and sample-to-solvent ratio) on TPC and rutin extraction yields were investigated. The yields of TPC and rutin were modelled using a second-order polynomial equation. The optimum AUAE conditions for TPC were extraction time of 45 min, extraction temperature of 50 °C and ultrasonic power of 250 W with a predicted value of 18.10 ± 0.24 mg GAE/g dw, while the optimum AUAE conditions for rutin were extraction time of 35 min, extraction temperature of 48 °C and ultrasonic power of 150 W with a predicted value of 3.20 ± 0.12 mg/g dw. The extracts obtained at the optimum AUAE conditions were compared with those obtained by a hot water and an organic solvent conventional extraction in terms of TPC, total flavonoid content (TF) and antioxidant capacity. The extracts obtained by AUAE had the same TPC, TF and ferric reducing antioxidant power as those achieved by organic solvent conventional extraction. However, hot water extraction led to extracts with the highest flavonoid content and antioxidant capacity. Scanning electron microscopy analysis showed that all the extraction methods led to cell damage to varying extents.
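
    As a rough illustration of the design named above, the sketch below enumerates the runs of a three-factor Box–Behnken design in coded units (-1, 0, +1), maps them onto the factor ranges quoted in the abstract, and lists the terms of a second-order polynomial. The number of centre runs is an assumption (the abstract does not state it), the names are hypothetical, and no yield data are included.

```python
from itertools import combinations, product

# Factor ranges quoted in the abstract: (low, high).
FACTORS = {
    "time_min": (35.0, 45.0),
    "temp_c": (45.0, 55.0),
    "power_w": (150.0, 250.0),
}


def box_behnken_3(n_center=3):
    """Coded runs of a three-factor Box-Behnken design.

    For each pair of factors, take the four (+/-1, +/-1) combinations while
    the third factor stays at 0, then append the centre runs.
    n_center is an assumption; the abstract does not give it.
    """
    runs = []
    for i, j in combinations(range(3), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0, 0, 0]
            run[i], run[j] = a, b
            runs.append(tuple(run))
    runs.extend([(0, 0, 0)] * n_center)
    return runs


def decode(run, ranges=FACTORS):
    """Map coded levels (-1, 0, +1) to physical factor values."""
    out = {}
    for level, (name, (low, high)) in zip(run, ranges.items()):
        centre = (low + high) / 2
        half_range = (high - low) / 2
        out[name] = centre + level * half_range
    return out


def quadratic_terms(x):
    """Regressors of a second-order model: 1, x_i, x_i^2, x_i * x_j."""
    terms = [1.0]
    terms += [float(v) for v in x]
    terms += [float(v * v) for v in x]
    terms += [float(x[i] * x[j]) for i, j in combinations(range(len(x)), 2)]
    return terms


design = box_behnken_3()
for run in design:
    print(run, decode(run))
```

    Fitting the ten coefficients of the second-order model to measured yields at these runs would then give the response surface from which optima are predicted.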

    Crowdsourcing Semantic Label Propagation in Relation Classification

    Distant supervision is a popular method for performing relation extraction from text that is known to produce noisy labels. Most progress in relation extraction and classification has been made with crowdsourced corrections to distant-supervised labels, and there is evidence that indicates still more would be better. In this paper, we explore the problem of propagating human annotation signals gathered for open-domain relation classification through the CrowdTruth methodology for crowdsourcing, which captures ambiguity in annotations by measuring inter-annotator disagreement. Our approach propagates annotations to sentences that are similar in a low-dimensional embedding space, expanding the number of labels by two orders of magnitude. Our experiments show significant improvement in a sentence-level multi-class relation classifier. Comment: In publication at the First Workshop on Fact Extraction and Verification (FeVer) at EMNLP 201
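
    The propagation step can be pictured with a minimal sketch: each unlabelled sentence takes the label of its most similar annotated sentence, provided the cosine similarity of their embeddings passes a threshold. The nearest-neighbour rule, the threshold value, the two-dimensional vectors and the relation names are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(labelled, unlabelled, threshold=0.9):
    """Copy labels from annotated sentences to sufficiently similar ones.

    labelled:   list of (embedding, label) pairs with human annotations
    unlabelled: dict mapping sentence id to embedding
    Returns a dict of sentence id -> (label, similarity) for matches only.
    """
    out = {}
    for sid, vec in unlabelled.items():
        best_vec, best_label = max(labelled, key=lambda item: cosine(vec, item[0]))
        sim = cosine(vec, best_vec)
        if sim >= threshold:
            out[sid] = (best_label, sim)
    return out


# Hypothetical toy embeddings and relation names.
labelled = [((1.0, 0.0), "spouse"), ((0.0, 1.0), "place_of_birth")]
unlabelled = {"s1": (0.9, 0.1), "s2": (0.5, 0.5)}

print(propagate(labelled, unlabelled))
```

    Here `s1` inherits a label because it lies close to an annotated sentence, while `s2` sits between the two and is left unlabelled.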