Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches fall short because they rely on large numbers
of hand-crafted features for classification. This results in models that are
tailored to a specific distribution of web pages, e.g. from a certain time
frame, but lack generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model. Comment: WWW20 Demo paper.
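As a rough illustration of the approach described above, the following minimal PyTorch sketch labels each token of an HTML token stream (tags and words) as main content or boilerplate. The architecture and all names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a neural sequence labeler over an HTML token stream.
# Hypothetical names; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class BoilerplateTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):          # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                 # (batch, seq_len, num_labels)

# Toy usage: tokens are HTML tags and words mapped to integer ids.
model = BoilerplateTagger(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (1, 20))   # e.g. ["<div>", "Breaking", ...]
logits = model(tokens)
labels = logits.argmax(-1)                   # 1 = main content, 0 = boilerplate
```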
Revising Knowledge Discovery for Object Representation with Spatio-Semantic Feature Integration
In large social networks, web objects are becoming increasingly popular. Multimedia object classification and representation is a necessary step in multimedia information retrieval. Indexing and organizing these web objects enables convenient browsing and search and helps to reveal interesting patterns in the objects. For all these tasks, classifying the web objects into manipulable semantic categories is an essential procedure. One important issue in the classification of objects is the representation of images. To support supervised classification tasks, knowledge is first extracted from unlabeled objects through unsupervised learning. In order to represent the images in a more meaningful and effective way than the basic Bag-of-Words (BoW) model, a novel image representation model called Bag-of-Visual-Phrases (BoP) is used. In this model, visual words are obtained using hierarchical clustering, and visual phrases are generated by a vector classifier over the visual words. To obtain spatio-semantic correlation knowledge, the frequently co-occurring pairs are calculated from the visual vocabulary. After the objects have been represented, the tags, comments, and descriptions of web objects are separated using a maximum likelihood method. The spatial and semantic differentiation power of image features can be enhanced via this BoP model and the likelihood method.
DOI: 10.17762/ijritcc2321-8169.15065
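A minimal sketch of the BoP pipeline described above, assuming local image descriptors have already been extracted; the descriptor dimensionality, cluster count, and co-occurrence threshold are illustrative assumptions, not values from the paper.

```python
# Sketch of the BoP idea: visual words via hierarchical clustering of local
# descriptors, visual phrases as frequently co-occurring word pairs.
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
descriptors = rng.random((200, 32))        # stand-in for local descriptors
image_of = rng.integers(0, 10, size=200)   # which image each descriptor is from

# Visual words: cluster the descriptors hierarchically.
words = AgglomerativeClustering(n_clusters=20).fit_predict(descriptors)

# Visual phrases: word pairs that co-occur in the same image unusually often.
pair_counts = Counter()
for img in np.unique(image_of):
    present = sorted(set(words[image_of == img]))
    pair_counts.update(combinations(present, 2))

phrases = [pair for pair, c in pair_counts.items() if c >= 3]  # assumed threshold
```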
Retrieval Models for Genre Classification
Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means of genre classification remains an open and controversial issue. This paper gives an overview of and presents new results on the automatic genre classification of text documents. We present a comprehensive survey which contrasts the genre retrieval models that have been developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies, the paper provides an original contribution related to computational aspects and the classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing utilization of vocabulary-centered, genre-revealing features and open new possibilities for the construction of genre classifiers that operate in real time.
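One plausible form of such a lightweight model is sketched below: each genre carries a small core vocabulary, and a document is scored by how concentrated its tokens are in each vocabulary. The vocabularies and the concentration measure here are illustrative assumptions, not the paper's actual ones.

```python
# Sketch of a lightweight genre retrieval model: score a document by how
# concentrated its tokens are in each genre's core vocabulary.
from collections import Counter

core_vocab = {  # assumed toy vocabularies
    "faq":  {"question", "answer", "how", "why"},
    "shop": {"price", "cart", "buy", "shipping"},
}

def genre_scores(text: str) -> dict[str, float]:
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # Fraction of tokens covered by each genre's core vocabulary.
    return {g: sum(counts[w] for w in vocab) / total
            for g, vocab in core_vocab.items()}

print(genre_scores("How do I buy this item if the price includes shipping"))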
Library of medium-resolution fiber optic echelle spectra of F, G, K, and M field dwarfs to giant stars
We present a library of Penn State Fiber Optic Echelle (FOE) observations of
a sample of field stars with spectral types F to M and luminosity classes V to
I. The spectral coverage is from 3800 AA to 10000 AA with a nominal resolving
power of 12,000. These spectra include many of the spectral lines most widely used
as optical and near-infrared indicators of chromospheric activity such as the
Balmer lines (H_alpha, H_beta), Ca II H & K, Mg I b triplet, Na I D_{1} and
D_{2}, He I D_{3}, and Ca II IRT lines. There are also a large number of
photospheric lines, which can also be affected by chromospheric activity, and
temperature-sensitive photospheric features such as TiO bands. The spectra have
been compiled with the goal of providing a set of standards observed at medium
resolution. We have extensively used such data for the study of active
chromosphere stars by applying a spectral subtraction technique. However, the
data set presented here can also be utilized in a wide variety of ways ranging
from radial velocity templates to the study of variable stars and stellar
population synthesis. This library can also be used for spectral classification
purposes and determination of atmospheric parameters (T_eff, log{g}, [Fe/H]). A
digital version of all the fully reduced spectra is available via FTP and the
World Wide Web (WWW) in FITS format. Comment: LaTeX file with 17 pages, 4
figures. Full postscript (text and figures) available at
http://www.ucm.es/info/Astrof/fgkmsl/FOEfgkmsl.html. To be published in ApJ.
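Since the spectra are distributed as FITS files, the following is a minimal sketch of how one might load them with astropy and apply the spectral subtraction technique mentioned above (removing a scaled inactive-star template to isolate chromospheric excess emission). The file names and WCS keywords are assumptions about the data layout, not documented properties of this library.

```python
# Sketch: load two FITS spectra and subtract an inactive-star template from
# an active star to isolate the chromospheric excess. Hypothetical file names.
from astropy.io import fits
import numpy as np

def load_spectrum(path):
    with fits.open(path) as hdul:
        flux = hdul[0].data
        hdr = hdul[0].header
        # Wavelength axis reconstructed from standard FITS WCS keywords
        # (assumed; the actual files may use a different convention).
        wave = hdr["CRVAL1"] + hdr["CDELT1"] * np.arange(flux.size)
    return wave, flux

wave_a, flux_a = load_spectrum("active_K2V.fits")     # active star
wave_t, flux_t = load_spectrum("template_K2V.fits")   # inactive template
excess = flux_a - np.interp(wave_a, wave_t, flux_t)   # chromospheric excess
```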
Satellite Image Spoofing: Creating Remote Sensing Dataset with Generative Adversarial Networks (Short Paper)
The rise of Artificial Intelligence (AI) has brought both opportunities and challenges to today's evolving GIScience. Its ability in image classification, object detection and feature extraction has been frequently praised. However, it may also be applied to falsify geospatial data. To demonstrate the thrilling power of AI, this research explored the potential of deep learning algorithms to capture geographic features and create fake satellite images according to the learned 'sense'. Specifically, Generative Adversarial Networks (GANs) are used to capture geographic features of a certain place from a group of web maps and satellite images and to transfer those features to another place. Corvallis is selected as the study area, and fake datasets with 'learned' style from three big cities (i.e., New York City, Seattle, and Beijing) are generated through CycleGAN. The empirical results show that GANs can 'remember' a certain 'sense of place' and further apply that 'sense' to another place. With this paper, we would like to raise both the public's and GIScientists' awareness of the potential occurrence of fake satellite images and their impacts on various geospatial applications, such as environmental monitoring, urban planning, and land use development.
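At the core of the CycleGAN approach mentioned above is a cycle-consistency objective: a tile translated to the target style and back again should reconstruct the original. A toy PyTorch sketch of that objective with stand-in generators, not the authors' training code:

```python
# Sketch of CycleGAN's cycle-consistency idea: map tiles from domain A
# (e.g. Corvallis imagery) to domain B (another city's style) and back,
# penalizing the reconstruction error. Single-layer stand-in generators.
import torch
import torch.nn as nn

G_ab = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # A -> B generator stand-in
G_ba = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # B -> A generator stand-in

real_a = torch.rand(1, 3, 64, 64)                  # a satellite tile from A
fake_b = G_ab(real_a)                              # A rendered in B's style
recon_a = G_ba(fake_b)                             # cycled back to A

cycle_loss = nn.functional.l1_loss(recon_a, real_a)
cycle_loss.backward()            # trained jointly with the adversarial losses
```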
Malware distributions and graph structure of the Web
Knowledge about the graph structure of the Web is important for understanding
this complex socio-technical system and for devising proper policies supporting
its future development. Knowledge about the differences between clean and
malicious parts of the Web is important for understanding potential threats to
its users and for devising protection mechanisms. In this study, we apply
data science methods to a large crawl of surface and deep Web pages with the
aim to increase such knowledge. To accomplish this, we answer the following
questions. Which theoretical distributions explain important local
characteristics and network properties of websites? How are these
characteristics and properties different between clean and malicious
(malware-affected) websites? What is the predictive power of local
characteristics and network properties to classify malware websites? To the
best of our knowledge, this is the first large-scale study describing the
differences in global properties between malicious and clean parts of the Web.
In other words, our work is building on and bridging the gap between
\textit{Web science} that tackles large-scale graph representations and
\textit{Web cyber security} that is concerned with malicious activities on the
Web. The results presented herein can also help antivirus vendors in devising
approaches to improve their detection algorithms.
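A sketch of the first kind of analysis the abstract describes: fitting candidate theoretical distributions to a per-site characteristic and comparing the fits with a Kolmogorov-Smirnov statistic. The synthetic data and the candidate distributions are illustrative assumptions, not the study's actual measurements.

```python
# Sketch: which theoretical distribution best explains a per-site
# characteristic (here, synthetic in-degrees)? Compare KS fit statistics.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
indegrees = rng.lognormal(mean=2.0, sigma=1.0, size=5000)  # stand-in data

for name, dist in [("lognormal", stats.lognorm), ("pareto", stats.pareto)]:
    params = dist.fit(indegrees)                  # maximum-likelihood fit
    ks = stats.kstest(indegrees, dist.cdf, args=params)
    print(f"{name}: KS statistic = {ks.statistic:.3f}")  # lower = better fit
```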
Feature Extraction from Degree Distribution for Comparison and Analysis of Complex Networks
The degree distribution is an important characteristic of complex networks.
In many data analysis applications, the networks should be represented as
fixed-length feature vectors and therefore the feature extraction from the
degree distribution is a necessary step. Moreover, many applications need a
similarity function for comparison of complex networks based on their degree
distributions. Such a similarity measure has many applications including
classification and clustering of network instances, evaluation of network
sampling methods, anomaly detection, and study of epidemic dynamics. The
existing methods are unable to effectively capture the similarity of degree
distributions, particularly when the corresponding networks have different
sizes. Based on our observations about the structure of the degree
distributions in networks over time, we propose a feature extraction and a
similarity function for the degree distributions in complex networks. We
propose to calculate the feature values based on the mean and standard
deviation of the node degrees in order to decrease the effect of the network
size on the extracted features. The proposed method is evaluated using
different artificial and real network datasets, and it outperforms the
state-of-the-art methods with respect to the accuracy of the distance function
and the effectiveness of the extracted features. Comment: arXiv admin note:
substantial text overlap with arXiv:1307.362
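A minimal sketch of the proposed idea as stated: standardize node degrees by their mean and standard deviation, bin them into a fixed-length histogram feature vector, and compare networks via a distance between those vectors. The bin edges and the L1 distance are assumptions for illustration.

```python
# Sketch: size-robust degree-distribution features via mean/std
# standardization, compared across networks of different sizes.
import numpy as np
import networkx as nx

def degree_features(g: nx.Graph, bins=np.linspace(-2, 4, 13)) -> np.ndarray:
    d = np.array([deg for _, deg in g.degree()], dtype=float)
    z = (d - d.mean()) / (d.std() or 1.0)   # standardize to reduce size effects
    hist, _ = np.histogram(z, bins=bins)
    return hist / hist.sum()                # fixed-length feature vector

g1 = nx.barabasi_albert_graph(1000, 3, seed=0)
g2 = nx.barabasi_albert_graph(5000, 3, seed=0)  # same model, different size
dist = np.abs(degree_features(g1) - degree_features(g2)).sum()  # L1 distance
print(f"L1 distance: {dist:.3f}")           # small despite the 5x size gap
```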