BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Optimal Extraction of Fibre Optic Spectroscopy
We report an optimal extraction methodology, for the reduction of
multi-object fibre spectroscopy data, operating in the regime of tightly packed
(and hence significantly overlapping) fibre profiles. The routine minimises
crosstalk between adjacent fibres and statistically weights the extraction to
reduce noise. As an example of the process we use simulations of the numerous
modes of operation of the AAOmega fibre spectrograph and observational data
from the SPIRAL Integral Field Unit at the Anglo-Australian Telescope.
Comment: Accepted for publication in PAS
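The idea of fitting overlapping fibre profiles simultaneously, with statistical weighting, can be illustrated with a minimal sketch. This is not the authors' routine: it assumes only two fibres, known normalised profiles, and a per-pixel variance estimate, and the function name `extract_two_fibres` is hypothetical. It solves the inverse-variance weighted least-squares problem for the two fluxes, which is one common way to reduce crosstalk between neighbours.

```python
def extract_two_fibres(data, variance, profile_a, profile_b):
    """Weighted least-squares fluxes for two overlapping fibre profiles.

    Assumed model (illustrative only): data[i] = flux_a * profile_a[i]
    + flux_b * profile_b[i] + noise with the given per-pixel variance.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for d, v, pa, pb in zip(data, variance, profile_a, profile_b):
        w = 1.0 / v  # inverse-variance weight: noisy pixels count less
        a11 += w * pa * pa
        a12 += w * pa * pb  # overlap term that couples the two fibres
        a22 += w * pb * pb
        b1 += w * pa * d
        b2 += w * pb * d

    # Solve the 2x2 normal equations with Cramer's rule.
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("profiles are degenerate; fluxes cannot be separated")
    flux_a = (b1 * a22 - a12 * b2) / det
    flux_b = (a11 * b2 - a12 * b1) / det
    return flux_a, flux_b
```

With more than two fibres the same normal equations become a banded matrix system, since each profile overlaps only its near neighbours.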
CLaSPS: a new methodology for Knowledge extraction from complex astronomical dataset
In this paper we present the Clustering-Labels-Score Patterns Spotter
(CLaSPS), a new methodology for the determination of correlations among
astronomical observables in complex datasets, based on the application of
distinct unsupervised clustering techniques. The novelty in CLaSPS is the
criterion used for the selection of the optimal clusterings, based on a
quantitative measure of the degree of correlation between the cluster
memberships and the distribution of a set of observables, the labels, not
employed for the clustering. In this paper we discuss the applications of
CLaSPS to two simple astronomical datasets, both composed of extragalactic
sources with photometric observations at different wavelengths from large area
surveys. The first dataset, CSC+, is composed of optical quasars
spectroscopically selected in the SDSS data, observed in the X-rays by Chandra
and with multi-wavelength observations in the near-infrared, optical and
ultraviolet spectral intervals. One of the results of the application of CLaSPS
to the CSC+ is the re-identification of a well-known correlation between the
alphaOX parameter and the near ultraviolet color, in a subset of CSC+ sources
with relatively small values of the near-ultraviolet colors. The other dataset
consists of a sample of blazars for which photometric observations in the
optical, mid and near infrared are available, complemented for a subset of the
sources, by Fermi gamma-ray data. The main results of the application of CLaSPS
to such datasets have been the discovery of a strong correlation between the
multi-wavelength color distribution of blazars and their optical spectral
classification in BL Lacs and Flat Spectrum Radio Quasars and a peculiar
pattern followed by blazars in the WISE mid-infrared colors space. This pattern
and its physical interpretation have been discussed in detail in other papers
by one of the authors.
Comment: 18 pages, 9 figures, accepted for publication in Ap
Optimizing a sustainable ultrasound assisted extraction method for the recovery of polyphenols from lemon by-products: comparison with hot water and organic solvent extractions
Response surface methodology (RSM) based on a three-factor and three-level Box–Behnken design was employed for optimizing the aqueous ultrasound-assisted extraction (AUAE) conditions, including extraction time (35–45 min), extraction temperature (45–55 °C) and ultrasonic power (150–250 W), for the recovery of total phenolic content (TPC) and rutin from lemon by-products. The independent variables and their values were selected on the basis of preliminary experiments, where the effects of five extraction parameters (particle size, extraction time and temperature, ultrasonic power and sample-to-solvent ratio) on TPC and rutin extraction yields were investigated. The yields of TPC and rutin were studied using a second-order polynomial equation. The optimum AUAE conditions for TPC were extraction time of 45 min, extraction temperature of 50 °C and ultrasonic power of 250 W with a predicted value of 18.10 ± 0.24 mg GAE/g dw, while the optimum AUAE conditions for rutin were extraction time of 35 min, extraction temperature of 48 °C and ultrasonic power of 150 W with a predicted value of 3.20 ± 0.12 mg/g dw. The extracts obtained at the optimum AUAE conditions were compared with those obtained by a hot water and an organic solvent conventional extraction in terms of TPC, total flavonoid content (TF) and antioxidant capacity. The extracts obtained by AUAE had the same TPC, TF and ferric reducing antioxidant power as those achieved by organic solvent conventional extraction. However, hot water extraction led to extracts with the highest flavonoid content and antioxidant capacity. Scanning electron microscopy analysis showed that all the extraction methods led to cell damage to varying extents.
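For reference, a second-order polynomial response-surface model for three factors generally takes the form below. The abstract does not give the fitted equation or its coefficients, so the symbols are assumptions for illustration: $Y$ stands for a response (TPC or rutin yield), and $X_1$, $X_2$, $X_3$ for extraction time, extraction temperature and ultrasonic power.

```latex
Y = \beta_0
  + \sum_{i=1}^{3} \beta_i X_i
  + \sum_{i=1}^{3} \beta_{ii} X_i^2
  + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \beta_{ij} X_i X_j
```

Here $\beta_0$ is the intercept, and $\beta_i$, $\beta_{ii}$ and $\beta_{ij}$ are the linear, quadratic and interaction coefficients, giving ten coefficients in total for three factors.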
Crowdsourcing Semantic Label Propagation in Relation Classification
Distant supervision is a popular method for performing relation extraction
from text that is known to produce noisy labels. Most progress in relation
extraction and classification has been made with crowdsourced corrections to
distant-supervised labels, and there is evidence that indicates still more
would be better. In this paper, we explore the problem of propagating human
annotation signals gathered for open-domain relation classification through the
CrowdTruth methodology for crowdsourcing, which captures ambiguity in
annotations by measuring inter-annotator disagreement. Our approach propagates
annotations to sentences that are similar in a low dimensional embedding space,
expanding the number of labels by two orders of magnitude. Our experiments show
significant improvement in a sentence-level multi-class relation classifier.
Comment: In publication at the First Workshop on Fact Extraction and
Verification (FeVer) at EMNLP 201
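Propagating annotations to nearby sentences in an embedding space can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes sentence embeddings are already available as plain lists of floats, copies the score of the single most similar annotated sentence, and uses a hypothetical cosine-similarity threshold of 0.9. The names `cosine` and `propagate_labels` are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def propagate_labels(labeled, unlabeled, threshold=0.9):
    """Copy annotation scores to sufficiently similar unlabeled sentences.

    labeled:   list of (embedding, score) pairs from crowd annotation
    unlabeled: list of embeddings without annotation
    Returns one score per unlabeled embedding, or None where no annotated
    sentence is similar enough (threshold is an assumed value).
    """
    if not labeled:
        return [None] * len(unlabeled)
    results = []
    for vec in unlabeled:
        # Find the nearest annotated sentence by cosine similarity.
        best_sim, best_score = max(
            ((cosine(vec, emb), score) for emb, score in labeled),
            key=lambda pair: pair[0],
        )
        results.append(best_score if best_sim >= threshold else None)
    return results
```

Returning `None` below the threshold keeps uncertain sentences unlabeled rather than forcing a possibly wrong annotation onto them.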
